
Docker and the OCI container ecosystem


July 26, 2022

This article was contributed by Jordan Webb

Docker has transformed the way many people develop and deploy software. It wasn't the first implementation of containers on Linux, but Docker's ideas about how containers should be structured and managed were different from its predecessors. Those ideas matured into industry standards, and an ecosystem of software has grown around them. Docker continues to be a major player in the ecosystem, but it is no longer the only whale in the sea; Red Hat has also done a lot of work on container tools, and alternative implementations are now available for many of Docker's offerings.

Anatomy of a container

A container is somewhat like a lightweight virtual machine; it shares a kernel with the host, but in most other ways it appears to be an independent machine to the software running inside of it. The Linux kernel itself has no concept of containers; instead, they are created by using a combination of several kernel features:

  • Bind mounts and overlayfs may be used to construct the root filesystem of the container.
  • Control groups may be used to partition the host's CPU, memory, and I/O resources among containers.
  • Namespaces are used to create an isolated view of the system for processes running inside the container.

Linux's namespaces are the key feature that allow the creation of containers. Linux supports namespaces for multiple different aspects of the system, including user namespaces for separate views of user and group IDs, PID namespaces for distinct sets of process IDs, network namespaces for distinct sets of network interfaces, and several others. When a container is started, a runtime creates the appropriate control groups, namespaces, and filesystem mounts for the container; then it launches a process inside the environment it has created.
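The kernel exposes a process's namespaces directly under /proc, which offers a quick way to see the isolation a runtime has set up (Linux-only):

```shell
# Each entry in /proc/<pid>/ns is a symlink whose target names a namespace
# type and instance (an inode number); two processes are in the same
# namespace exactly when their links resolve to the same value.
ls -l /proc/self/ns
readlink /proc/self/ns/pid
```

Comparing these links for two processes reveals whether they share a given namespace; a process inside a container will show different inode numbers than one on the host.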

There is some level of disagreement about what that process should be. Some prefer to start an init process like systemd and run a full Linux system inside the container. This is referred to as a "system container"; it was the most common type of container before Docker. System containers continue to be supported by software like LXC and OpenVZ.

Docker's developers had a different idea. Instead of running an entire system inside a container, Docker says that each container should only run a single application. This style of container is known as an "application container." An application container is started using a container image, which bundles the application together with its dependencies and just enough of a Linux root filesystem to run it.

A container image generally does not include an init system, and may not even include a package manager; container images are usually replaced with updated versions rather than updated in place. An image for a statically-compiled application may be as minimal as a single binary and a handful of support files in /etc. Application containers usually don't have a persistent root filesystem; instead, overlayfs is used to create a temporary layer on top of the container image. This is thrown away when the container is destroyed. Any persistent data outside of the container image is grafted onto the container's filesystem via a bind mount to another location on the host.
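As a sketch of how small such an image can be (file names hypothetical), the Dockerfile for a statically linked binary might be little more than:

```dockerfile
# Start from an empty image: no shell, no package manager, nothing.
FROM scratch
# The static binary and the minimal /etc files it needs.
COPY myapp /usr/bin/myapp
COPY passwd group /etc/
ENTRYPOINT ["/usr/bin/myapp"]
```

Persistent data is then attached at run time with a bind mount, e.g. `docker run -v /srv/data:/data myimage`.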

The OCI ecosystem

These days, when people talk about containers, they are likely to be talking about the style of application containers popularized by Docker. In fact, unless otherwise specified, they are probably talking about the specific container image format, run-time environment, and registry API implemented by Docker's software. Those have all been standardized by the Open Container Initiative (OCI), which is an industry body that was formed in 2015 by Docker and the Linux Foundation. Docker refactored its software into a number of smaller components; some of those components, along with their specifications, were placed under the care of the OCI. The software and specifications published by the OCI formed the seed for what is now a robust ecosystem of container-related software.

The OCI image specification defines a format for container images that consists of a JSON configuration (containing environment variables, the path to execute, and so on) and a series of tarballs called "layers". The contents of each layer are stacked on top of each other, in series, to construct the root filesystem for the container image. Layers can be shared between images; if a server is running several containers that refer to the same layer, they can potentially share the same copy of that layer. Docker provides minimal images for several popular Linux distributions that can be used as the base layer for application containers.
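The stacking rule can be sketched in a few lines of Python; here each layer is a simple dict mapping paths to contents rather than a real tarball, and a file carrying the OCI ".wh." whiteout prefix marks a deletion from lower layers (the file names and contents below are made up for illustration):

```python
import posixpath

def apply_layers(layers):
    """Merge OCI-style layers in order: later entries win, and a file
    named '.wh.<name>' deletes '<name>' from the layers below."""
    rootfs = {}
    for layer in layers:
        for path, data in layer.items():
            head, _, name = path.rpartition("/")
            if name.startswith(".wh."):
                # Whiteout: remove the shadowed file from the merged view.
                rootfs.pop(posixpath.join(head, name[len(".wh."):]), None)
            else:
                rootfs[path] = data
    return rootfs

base = {"bin/sh": b"#!...", "etc/motd": b"welcome"}        # distro base layer
app = {"app/server": b"\x7fELF...", "etc/.wh.motd": b""}   # application layer
print(sorted(apply_layers([base, app])))  # ['app/server', 'bin/sh']
```

Real implementations unpack the layer tarballs onto overlayfs mounts, but the merge semantics are the same.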

The OCI also publishes a distribution specification. In this context, "distribution" does not refer to a Linux distribution; it is used in a more general sense. This specification defines an HTTP API for pushing and pulling container images to and from a server; servers that implement this API are called container registries. Docker maintains a large public registry called Docker Hub as well as a reference implementation (called "Distribution", perhaps confusingly) that can be self-hosted. Other implementations of the specification include Red Hat's Quay and VMware's Harbor, as well as hosted offerings from Amazon, GitHub, GitLab, and Google.
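The pull side of the API boils down to two endpoint families; this sketch only builds the URLs (real clients also perform token authentication, omitted here, and the registry/repository names are examples):

```python
def manifest_url(registry: str, repository: str, reference: str) -> str:
    # GET /v2/<name>/manifests/<reference> returns the image manifest,
    # which lists the digests of the config and layer blobs.
    return f"https://{registry}/v2/{repository}/manifests/{reference}"

def blob_url(registry: str, repository: str, digest: str) -> str:
    # GET /v2/<name>/blobs/<digest> fetches a config or layer blob.
    return f"https://{registry}/v2/{repository}/blobs/{digest}"

print(manifest_url("registry-1.docker.io", "library/alpine", "latest"))
```

Pushing works through the same path space with POST/PUT requests, which is what makes a self-hosted registry interchangeable with a hosted one.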

A program that implements the OCI runtime specification is responsible for everything pertaining to actually running a container. It sets up any necessary mounts, control groups, and kernel namespaces, executes processes inside the container, and tears down any container-related resources once all the processes inside of it have exited. The reference implementation of the runtime specification is runc, which was created by Docker for the OCI.
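A runtime consumes an OCI "bundle": a directory holding the container's root filesystem plus a config.json describing the process, mounts, and namespaces to create. A heavily trimmed sketch (real files, such as those generated by `runc spec`, carry many more fields):

```json
{
  "ociVersion": "1.0.2",
  "process": {
    "args": ["/bin/sh"],
    "cwd": "/",
    "user": {"uid": 0, "gid": 0}
  },
  "root": {"path": "rootfs", "readonly": true},
  "linux": {
    "namespaces": [
      {"type": "pid"},
      {"type": "mount"},
      {"type": "network"}
    ]
  }
}
```

Given such a bundle, running `runc run <container-id>` in that directory creates the requested namespaces and starts /bin/sh inside them.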

There are a number of other OCI runtimes to choose from. For example, crun offers an OCI runtime written in C that has the goal of being faster and more lightweight than runc, which, like most of the rest of the OCI ecosystem, is written in Go. Google's gVisor includes runsc, which provides greater isolation from the host by running applications on top of a user-mode kernel. Amazon's Firecracker is a minimal hypervisor written in Rust that can use KVM to give each container its own virtual machine; Intel's Kata Containers works similarly but supports multiple hypervisors (including Firecracker).

A container engine is a program that ties these three specifications together. It implements the client side of the distribution specification to retrieve container images from registries, interprets the images it has retrieved according to the image specification, and launches containers using a program that implements the runtime specification. A container engine provides tools and/or APIs for users to manage container images, processes, and storage.

Kubernetes is a container orchestrator, capable of scheduling and running containers across hundreds or even thousands of servers. Kubernetes does not implement any of the OCI specifications itself. It needs to be used in combination with a container engine, which manages containers on behalf of Kubernetes. The interface that it uses to communicate with container engines is called the Container Runtime Interface (CRI).

Docker

Docker is the original OCI container engine. It consists of two main user-visible components: a command-line interface (CLI) client named docker, and a server. The server is named dockerd in Docker's own packages, but the repository was renamed moby when Docker created the Moby Project in 2017. The Moby Project is an umbrella organization that develops open-source components used by Docker and other container engines. When Moby was announced, many found the relationship between Docker and the Moby project to be confusing; it has been described as being similar to the relationship between Fedora and Red Hat.

dockerd provides an HTTP API; it usually listens on a Unix socket named /var/run/docker.sock, but can be made to listen on a TCP socket as well. The docker command is merely a client to this API; the server is responsible for downloading images and starting container processes. The client supports starting containers in the foreground, so that running a container at the command-line behaves similarly to running any other program, but this is only a simulation. In this mode, the container processes are still started by the server, and input and output are streamed over the API socket; when the process exits, the server reports that to the client, and then the client sets its own exit status to match.

This design does not play well with systemd or other process supervision tools, because the CLI never has any child processes of its own. Running the docker CLI under a process supervisor only results in supervising the CLI process. This has a variety of consequences for users of these tools. For example, any attempt to limit a container's memory usage by running the CLI as a systemd service will fail; the limits will only apply to the CLI and its non-existent children. In addition, attempts to terminate a client process may not result in terminating all of the processes in the container.

Failure to limit access to Docker's socket can be a significant security hazard. By default dockerd runs as root. Anyone who is able to connect to the Docker socket has complete access to the API. Since the API allows things like running a container as a specific UID and binding arbitrary filesystem locations, it is trivial for someone with access to the socket to become root on the host. Support for running in rootless mode was added in 2019 and stabilized in 2020, but is still not the default mode of operation.

Docker can be used by Kubernetes to run containers, but it doesn't directly support the CRI specification. Originally, Kubernetes included a component called dockershim that provided a bridge between the CRI and the Docker API, but it was deprecated in 2020. The code was spun out of the Kubernetes repository and is now maintained separately as cri-dockerd.

containerd & nerdctl

Docker refactored its software into independent components in 2015; containerd is one of the fruits of that effort. In 2017, Docker donated containerd to the Cloud Native Computing Foundation (CNCF), which stewards the development of Kubernetes and other tools. It is still included in Docker, but it can also be used as a standalone container engine, or with Kubernetes via an included CRI plugin. The architecture of containerd is highly modular. This flexibility helps it to serve as a proving ground for experimental features. Plugins may provide support for different ways of storing container images and additional image formats, for example.

Without any additional plugins, containerd is effectively a subset of Docker; its core features map closely to the OCI specifications. Tools designed to work with Docker's API cannot be used with containerd. Instead, it provides an API based on Google's gRPC. Unfortunately, concerned system administrators looking for access control won't find it here; despite being incompatible with Docker's API, containerd's API appears to carry all of the same security implications.

The documentation for containerd notes that it follows a smart client model (as opposed to Docker's "dumb client"). Among other differences, this means that containerd does not communicate with container registries; instead, (smart) clients are required to download any images they need themselves. Despite the difference in client models, containerd still has a process model similar to that of Docker; container processes are forked from the containerd process. In general, without additional software, containerd doesn't do anything differently from Docker, it just does less.

When containerd is bundled with Docker, dockerd serves as the smart client, accepting Docker API calls from its own dumb client and doing any additional work needed before calling the containerd API; when used with Kubernetes, these things are handled by the CRI plugin. Other than that, containerd didn't really have its own client until relatively recently. It includes a bare-bones CLI called ctr, but this is only intended for debugging purposes.

This changed in December 2020 with the release of nerdctl. Since its release, running containerd on its own has become much more practical; nerdctl features a user interface designed to be compatible with the Docker CLI and provides much of the functionality Docker users would find missing from a standalone containerd installation. Users who don't need compatibility with the Docker API might find themselves quite happy with containerd and nerdctl.

Podman

Podman is a Red-Hat-sponsored alternative that aims to be a drop-in replacement for Docker. Like Docker and containerd, it is written in Go and released under the Apache 2.0 License, but it is not a fork; it is an independent reimplementation. Red Hat's sponsorship of Podman is likely to be at least partially motivated by the difficulties it encountered during its efforts to make Docker's software interoperate with systemd.

On a superficial level, Podman appears nearly identical to Docker. It can use the same container images, and talk to the same registries. The podman CLI is a clone of docker, with the intention that users migrating from Docker can alias docker to podman and mostly continue with their lives as if nothing had changed.

Originally, Podman provided an API based on the varlink protocol. This meant that while Podman was compatible with Docker on a CLI level, tools that used the Docker API directly could not be used with Podman. In version 3.0, the varlink API was scrapped in favor of an HTTP API, which aims to be compatible with the one provided by Docker while also adding some Podman-specific endpoints. This new API is maturing rapidly, but users of tools designed for Docker would be well-advised to test for compatibility before committing to switch to Podman.

As it is largely a copy of Docker's API, Podman's API doesn't feature any sort of access control, but Podman has some architectural differences that may make that less important. Podman gained support for running in rootless mode early on in its development. In this mode, containers can be created without root or any other special privileges, aside from a small bit of help from the setuid newuidmap and newgidmap helpers. Unlike Docker, when Podman is invoked by a non-root user, rootless mode is used by default.
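The help from newuidmap and newgidmap relies on subordinate ID ranges: blocks of otherwise-unused host IDs that /etc/subuid and /etc/subgid delegate to an ordinary user (the user name and range here are illustrative):

```
# /etc/subuid and /etc/subgid: <user>:<first subordinate ID>:<count>
alice:100000:65536
```

With such an entry, Podman can map the container's user and group IDs into the delegated range, so IDs inside a rootless container correspond only to unprivileged IDs on the host.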

Users of Podman can also dodge security concerns about its API socket by simply disabling it. Though its interface is largely identical to the Docker CLI, podman is no mere API client. It creates containers for itself without any help from a daemon. As a result, Podman plays nicely with tools like systemd; using podman run with a process supervisor works as expected, because the processes inside the container are children of podman run. The developers of Podman encourage people to use it in this way by providing a command to generate systemd units for Podman containers.

Aside from its process model, Podman caters to systemd users in other ways. While running an init system such as systemd inside of a container is antithetical to the Docker philosophy of one application per container, Podman goes out of its way to make it easy. If the container is configured to run an init system, Podman will automatically mount all the kernel filesystems needed for systemd to function. It also supports reporting the status of containers to systemd via sd_notify(), or handing the notification socket off to the application inside of the container for it to use directly.

Podman also has some features designed to appeal to Kubernetes users. Like Kubernetes, it supports the notion of a "pod", which is a group of containers that share a common network namespace. It can run containers using Kubernetes configuration files and also generate Kubernetes configurations. However, unlike Docker and containerd, there is no way for Podman to be used by Kubernetes to run containers. This is a deliberate omission. Instead of adding CRI support to Podman, which is a general-purpose container engine, Red Hat chose to sponsor the development of a more specialized alternative in the form of CRI-O.

CRI-O

CRI-O is based on many of the same underpinnings as Podman, so the relationship between CRI-O and Podman could be said to be similar to the one between containerd and Docker: CRI-O delivers much of the same technology as Podman, with fewer frills. This analogy doesn't stretch far, though. Unlike containerd and Docker, CRI-O and Podman are completely separate projects; one is not embedded in the other.

As might be suggested by its name, CRI-O implements the Kubernetes CRI. In fact, that's all that it implements; CRI-O is built specifically and only for use with Kubernetes. It is developed in lockstep with the Kubernetes release cycle, and anything that is not required by the CRI is explicitly declared to be out of scope. CRI-O cannot be used without Kubernetes and includes no CLI of its own; based on the stated goals of the project, any attempt to make CRI-O suitable for standalone use would likely be viewed as an unwelcome distraction by its developers.

Like Podman, the development of CRI-O was initially sponsored by Red Hat; like containerd, it was later donated to the CNCF in 2019. Although they are now both under the aegis of the same organization, the narrow focus of CRI-O may make it more appealing to Kubernetes administrators than containerd. The developers of CRI-O are free to make decisions solely on the basis of maximizing the benefit to users of Kubernetes, whereas the developers of containerd and other container engines have many other types of users and use cases to consider.

Conclusion

These are just a few of the most popular container engines; other projects like Apptainer and Pouch cater to different ecological niches. There are also a number of tools available for creating and manipulating container images, like Buildah, Buildpacks, skopeo, and umoci. Docker deserves a great deal of credit for the Open Container Initiative; the standards and the software that have resulted from this effort have provided the foundation for a wide array of projects. The ecosystem is robust; should one project shut down, there are multiple alternatives ready and available to take its place. As a result, the future of this technology is no longer tied to one particular company or project; the style of containers that Docker pioneered seems likely to be with us for a long time to come.



Docker and the OCI container ecosystem

Posted Jul 26, 2022 20:29 UTC (Tue) by amarao (guest, #87073) [Link]

Excellent overview of the ecosystem.

I found one small mistake. Files on the overlayfs for a container are not dropped when the container stops, only if it's rebuilt, that is, deleted and recreated. If you stop a container and then start it again, all newly created files will still be there.

Docker and the OCI container ecosystem

Posted Jul 26, 2022 20:39 UTC (Tue) by jordan (subscriber, #110573) [Link]

Hm, maybe s/stopped/destroyed/ would be slightly more accurate. The intention I'm trying to convey there is that the files don't stick around after you're done with the container.

Docker and the OCI container ecosystem

Posted Jul 26, 2022 21:29 UTC (Tue) by amarao (guest, #87073) [Link]

Yes. And the main reason people confuse that (assuming that 'stop' is a 'drop') is because it's not possible to change a container (without low-level hacks). If you want to update the image or tweak the manifest, you need to rebuild the image, and a new image means a new container.

Docker and the OCI container ecosystem

Posted Jul 26, 2022 20:31 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

One important part of containers is _building_ the images.

I personally am completely disgusted by Dockerfiles, but it doesn't look like there are any mature alternatives out there. Has this changed recently?

Docker and the OCI container ecosystem

Posted Jul 26, 2022 20:49 UTC (Tue) by jordan (subscriber, #110573) [Link]

Lately I've been getting into Earthly but if you are disgusted by Dockerfiles you probably won't like it; it's somewhat like a cross between a Dockerfile and a Makefile.

Buildah offers a more shell-driven workflow - it's kind of like an "exploded" Dockerfile. Instead of writing your build steps in a Dockerfile, each step is a separate buildah command, which you can run interactively, or via a shell script, or via whatever else.

I've played a bit with ansible-bender, which builds on top of Buildah and seems like an entirely more reasonable way to build a container, but it needs to start with a base image that has a Python capable of running Ansible.

Buildpacks is higher-level and works kind of like Heroku; it detects what language your application is written in, and then picks appropriate builder and base images. It can rebase application images on top of newer base images without rebuilding the whole application.

Docker and the OCI container ecosystem

Posted Jul 26, 2022 21:03 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

Thanks for Earthly!

I actually like it a lot from reading the docs. It seems like a small, but practical improvement over Docker. At the very least they fixed the moronic behavior of Docker's COPY command. For those who don't know, "COPY src1 src2 target" in Docker means "cp src1/* src2/* target", making it impossible to copy multiple paths as one layer.

Yeah, their examples still have the "RUN apk --upgrade" nonsense that makes builds unreproducible. But I don't think it's possible to work around this without way more complex infrastructure.

The last time I looked, buildah was not working well on macOS and Windows. Has this changed?

> It can rebase application images on top of newer base images without rebuilding the whole application.

I have my own bag of tricks for that. For example, I copy go.mod/go.sum or Conanfile in a separate layer before the application and make sure all deps are built before copying in the source code.

Docker and the OCI container ecosystem

Posted Jul 26, 2022 21:11 UTC (Tue) by jordan (subscriber, #110573) [Link]

> Yeah, their examples still have the "RUN apk --upgrade" nonsense that makes builds unreproducible. But I don't think it's possible to work around this without way more complex infrastructure.

I've seen a project that implements package version lockfiles with apt, but the name escapes me at the moment. At a previous employer, we had internal Debian "mirrors" maintained with aptly that held on to every version of the package they'd ever seen, and explicitly specified the version of each package in a manifest file. As you say, more complex infrastructure.

> The last time I looked, buildah was not working well on macOS and Windows. Has this changed?

I don't think it has built in "remote client to a managed VM" stuff like Podman and Docker have gone out of their way to add to their stuff. I bet you could run it in a privileged container on a Podman/Docker on Windows/macOS setup, though.

Docker and the OCI container ecosystem

Posted Jul 26, 2022 21:21 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

> At a previous employer, we had internal Debian "mirrors" maintained with aptly that held on to every version of the package they'd ever seen, and explicitly specified the version of each package in a manifest file. As you say, more complex infrastructure.

There is a Debian snapshot service ( https://snapshot.debian.org/ ) that preserves all the Debian history. So you can just refer to timestamped sources in your APT config. E.g.: https://snapshot.debian.org/archive/debian/20220720T151841Z/

I'm not sure this service can survive if more people start using it, though. It's pretty slow as it is :( We pipe it through our own proxy that basically stores every artifact, to make sure we don't bring it down.
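For illustration, pinning a Debian-based image to one of those snapshots is a matter of pointing APT at the timestamped archive (timestamp taken from the URL above; a matching debian-security line works the same way):

```
# /etc/apt/sources.list pinned to a single snapshot timestamp
deb [check-valid-until=no] https://snapshot.debian.org/archive/debian/20220720T151841Z/ bullseye main
```

The check-valid-until=no option is needed because the Release files in older snapshots have expired.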

Docker and the OCI container ecosystem

Posted Jul 26, 2022 21:24 UTC (Tue) by jordan (subscriber, #110573) [Link]

Sure, but using Debian snapshots would mean that you'd have to take all the updates in the snapshot that you moved to at once, and that you'd have to take updates in the order they were supplied upstream. Having a packrat mirror that holds on to all the versions gives you more flexibility in deciding what you want to update and when.

Docker and the OCI container ecosystem

Posted Jul 26, 2022 21:34 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

> Sure, but using Debian snapshots would mean that you'd have to take all the updates in the snapshot that you moved to at once

Yeah, but this is usually OK. It also makes it easier to audit dockerfiles to check if they cover all CVEs in the base Debian image.

We also have a script that checks if an image contains packages that are different between two snapshots, this helps to automate "empty" version bumps. Not perfect, but it helps.

We also tried Nix, which gives strong reproducibility guarantees, but it wastes way too much time on rebuilding everything.

Docker and the OCI container ecosystem

Posted Jul 27, 2022 9:00 UTC (Wed) by cortana (subscriber, #24596) [Link]

Buildpacks is higher-level and works kind of like Heroku; it detects what language your application is written in, and then picks appropriate builder and base images. It can rebase application images on top of newer base images without rebuilding the whole application.

For another take on this approach, see Red Hat's source-to-image which invokes an assemble script at a well known location in the base container image; this script embeds all the build logic, so that the developer doesn't need to write a Dockerfile, only comply with the conventions of the base image.

Docker and the OCI container ecosystem

Posted Jul 26, 2022 23:17 UTC (Tue) by anguslees (subscriber, #7131) [Link]

Nix, bazel, and yocto/openembedded are some mature systems that can build container images without actually using docker or Dockerfiles - they just produce the layer tar files (OCI image) directly.

I agree there should be more tools like this - it's not that hard. The thing you lose with these approaches is any easy "bootstrap" step to set up the build system in the first place. If you put the build system in a docker container, then you get back to something like a multistage Dockerfile anyway (for simple projects - complex projects can still benefit from these more advanced systems).

Also be aware of cross compiling. In the FOSS community, we would like to support our non-intel brethren and this requires building images for platforms you don't have by default. Some systems (yocto) are good at this, others (bazel) are still maturing. Docker buildx makes this very easy through transparent emulation, again if you're willing to settle for Dockerfiles.

For me, the combination of cross compiling and easy bootstrap means I use docker buildx for all my simpler projects, and just put up with the awkward Dockerfile syntax :-(

Docker and the OCI container ecosystem

Posted Jul 26, 2022 23:39 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

> Nix, bazel, and yocto/openembeded are some mature systems that can build container images without actually using docker or Dockerfiles - they just produce the layer tar files (OCI image) directly.

Do they support layering? This is one feature of Docker that makes it bearable, especially for iterative development.

Docker and the OCI container ecosystem

Posted Jul 31, 2022 19:07 UTC (Sun) by robert_s (subscriber, #42402) [Link]

> Do they support layering?

Answering for Nix and its dockerTools: short answer yes (https://nixos.org/manual/nixpkgs/stable/#ssec-pkgs-docker...), though half of the benefits of using "layers" in docker-land are already solved by Nix's build-caching system. So the advantages gained by "layers" are mostly around image size.

Docker and the OCI container ecosystem

Posted Jul 27, 2022 18:21 UTC (Wed) by rjones (subscriber, #159862) [Link]

Docker files are gross, but there are ways to mitigate the grossness.

The big one I do is to simply have a shell script that does all the work of setting up the docker image. The dockerfile only copies in the script and required assets and the script does all the work of actually setting it up. Of course you can do the same thing with makefiles or scons or whatever build tool you like using. This reduces things to a single COPY and single RUN command.

Along with that it's useful to do multistage builds.

https://docs.docker.com/develop/develop-images/multistage...

This is especially nice if you want to do statically compiled binaries. You end up with a build container and a runtime container. You only then need to distribute the runtime container. This results in a very clean OCI image with all the build-time and setup portions stripped out. It is usually very easy to keep things well under 100MB in size.
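A minimal sketch of that two-stage pattern (Go toolchain chosen arbitrarily; names illustrative):

```dockerfile
# Build stage: full toolchain, discarded after the build.
FROM golang:1.18 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /out/server .

# Runtime stage: ships only the static binary.
FROM scratch
COPY --from=build /out/server /server
ENTRYPOINT ["/server"]
```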

As far as getting rid of the Dockerfile itself... It is probably not worth it at this stage. Everybody and their mom supports it, which makes it easy to integrate containers into CI/CD workflows, among other things. Self-hosting build farms in Kubernetes is extremely easy with kaniko. You can use buildah and other associated tools related to podman. GitLab, GitHub, and most everything else supports building and testing against OCI images.

Docker and the OCI container ecosystem

Posted Jul 27, 2022 20:07 UTC (Wed) by mathstuf (subscriber, #69389) [Link]

I agree. I don't really like Dockerfiles either, but the `COPY` and `RUN` with shell scripts really helps. It also keeps artifacts of what happened in the image itself (no point in deleting since it's in the previous layer anyways).

Other than that, I've found `buildah` to be useful for building minimal images where I can do some logic outside of the container and tell it afterwards. Much better than manual flattening after the fact IMO.

Docker and the OCI container ecosystem

Posted Jul 28, 2022 5:28 UTC (Thu) by bartoc (subscriber, #124262) [Link]

I don't find dockerfiles all that horrid as long as everyone on the team writing them actually understands how the layering works. They are reasonably simple and good enough.

I do prefer buildah's approach though.

Actually I would prefer a scheme that used a content addressable datastore instead of the layering thing, something like ostree or casync (or git, really that would be most convenient, but git has some missing features when dealing with large binary files)

Docker and the OCI container ecosystem

Posted Aug 5, 2022 6:01 UTC (Fri) by hsiangkao (subscriber, #123981) [Link]

You could refer to the chunk-based content-addressable Nydus image service [1] and the erofs filesystem's fscache support for container images [2].

[1] https://github.com/dragonflyoss/image-service/
[2] https://d7y.io/blog/2022/06/06/evolution-of-nydus/

Docker and the OCI container ecosystem

Posted Jul 28, 2022 5:49 UTC (Thu) by pabs (subscriber, #43278) [Link]

Where does your disgust stem from? Mine is that they are mixed-language files. I dislike Makefiles and even printf/regex/SQL for this reason too.

Docker and the OCI container ecosystem

Posted Jul 28, 2022 5:54 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

Ohhh.... So many reasons.

1. Dumb syntax. You have \
to write \
very long \
lines that blow up logs because they are actually run as one long line.

2. The way the layering system works. All of it is kinda crappy.

3. COPY command. There's no way to copy multiple source directories in one layer.

And finally, Dockerfiles pretend to be purely functional and reproducible, but most projects have lines like "RUN apt-get blah" that immediately blow that up.

A true content-addressable system with a better build language than Dockerfiles would be great. Earthly seems to be a good incremental improvement on Dockerfiles.
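(Point 1 in practice: the whole command must collapse to a single logical line, so you get chains like the following, with illustrative package names. Recent BuildKit Dockerfile frontends add a heredoc form of RUN that mitigates this.)

```
RUN apt-get update && \
    apt-get install -y --no-install-recommends curl ca-certificates && \
    rm -rf /var/lib/apt/lists/*
```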

Docker and the OCI container ecosystem

Posted Aug 19, 2022 7:36 UTC (Fri) by daenzer (subscriber, #7050) [Link]

FWIW, the freedesktop.org GitLab CI templates (https://gitlab.freedesktop.org/freedesktop/ci-templates) use buildah, seems pretty nice.

The templates currently offer two modes of operation for generating images: A simple one where one can just specify the base system packages to be installed, and a generic one which allows running a script to set up the filesystem arbitrarily.

The templates produce a single layer, but it can be based on any other image (by default a standard image of one of the supported base distros), which is included as separate layers. This allows efficient sharing of common base contents.

Docker and the OCI container ecosystem

Posted Sep 3, 2022 7:30 UTC (Sat) by jond (subscriber, #37669) [Link]

I use and work on a pre-processor tool, cekit - https://cekit.io - which abstracts (somewhat) over Dockerfiles. It's more declarative and lets you break the recipe up into separate modules that can be shared amongst different container sources (like mix-ins, in contrast to the single inheritance of container images).

Docker and the OCI container ecosystem

Posted Sep 3, 2022 7:54 UTC (Sat) by jond (subscriber, #37669) [Link]

Example image sources for cekit: https://GitHub.com/jboss-container-images/openjdk

Quadlet

Posted Jul 26, 2022 21:09 UTC (Tue) by tau (subscriber, #79651) [Link]

Quadlet is another interesting project that aims to bridge Podman with systemd. It is a systemd generator that consumes .container files and produces systemd service units that launch podman. The idea is that you can create a high-level, systemd-unit-like file for each of your containers, which then gets launched as a native systemd unit. This seems like an improvement over podman's own native systemd integration, which has users run a long command line to manually generate a systemd service file.
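A .container file looks roughly like this (the unit and image here are hypothetical, and the exact key names should be checked against quadlet's own documentation):

```
[Unit]
Description=Example web server container

[Container]
Image=docker.io/library/nginx:latest
PublishPort=8080:80

[Install]
WantedBy=multi-user.target
```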

Quadlet seems to be in danger of ending up stillborn, though; there hasn't been much activity in its GitHub repository lately, which is a shame because I was looking forward to using it once its RPM packaging graduates out of Rawhide.

Quadlet

Posted Jul 27, 2022 19:16 UTC (Wed) by rjones (subscriber, #159862) [Link]

One thing that kinda sucks about podman systemd integration, which may not be avoidable, is that the files it generates can be very specific to the version of systemd you are using. I toyed with the idea that I could just use the podman command to generate templates I could use via ansible or whatnot, but it didn't really work out.

That was a while ago, though. Maybe it's better now.

Quadlet

Posted Jul 27, 2022 20:05 UTC (Wed) by mathstuf (subscriber, #69389) [Link]

> I toyed with the idea that I could just use the podman command to generate templates I could use via ansible or whatnot, but it didn't really work out.

`quadlet` mentioned elsewhere in the comments seems like it'd work better. But can't you have Ansible run the generator command as your "sync" rather than expecting specific file contents?

Quadlet

Posted Jul 28, 2022 9:08 UTC (Thu) by valberg (guest, #83862) [Link]

There are plans to integrate Quadlet into the containers/podman project such that they can be released simultaneously. Quadlet and Podman can then also share the generate-systemd logic.

Starting with Podman v4.2 (to be released in early August), you can also use a systemd template to run Kubernetes YAML [1]. That simplifies the workload quite a bit, as we only need to feed the YAML file to the template, and systemd and Podman will take care of the rest.

[1] https://github.com/containers/podman/blob/main/docs/sourc...

Docker and the OCI container ecosystem

Posted Jul 27, 2022 9:13 UTC (Wed) by cortana (subscriber, #24596) [Link]

Great article! A couple of omissions that occurred to me:

  • CRI-O does have a CLI, crictl, but you'd only use it for low-level troubleshooting/debugging/etc.
  • There's a whole other container ecosystem that has sprung up around Singularity. The idea here is that we _don't_ want to isolate the code inside the container from the user's home directory, shared filesystems, etc; we just want to use containers for distributing software. You'd use this on e.g., a high performance computing cluster. Personally I think it's a bit of a shame that the community didn't coalesce around Podman for this use case, but there we are...

Docker and the OCI container ecosystem

Posted Jul 27, 2022 20:07 UTC (Wed) by jordan (subscriber, #110573) [Link]

> CRI-O does have a cli, crictl, but you'd only use it for low level troubleshooting/debugging/etc.

I was aware of crictl but left it out of the CRI-O section because it's not specific to CRI-O; it can also be used with containerd or another CRI runtime. The article was getting pretty long and it's hard to mention everything.

> There's a whole other container ecosystem that has sprung up around Singularity.

I'm aware of that ecosystem but haven't played with it. I did give a brief shoutout to Apptainer at the end; it appears to me that that is the new name for Singularity, but I could be wrong about that!

Docker and the OCI container ecosystem

Posted Jul 27, 2022 18:07 UTC (Wed) by pj (subscriber, #4506) [Link]

Just wanted to say thanks for the great overview! I've seen the names of lots of these pieces fly by; it's good to know how they all relate to each other. Any chance you'll do a followup on orchestration systems? There's k8s and various micro- and mini- self-hosted versions, and then there's Docker Swarm... are there others? Which of those are suitable for production vs. toys/exploring? Maybe there's enough there for a followup ecosystem/survey article.

Docker and the OCI container ecosystem

Posted Jul 27, 2022 20:11 UTC (Wed) by jordan (subscriber, #110573) [Link]

Thanks! A follow-up about orchestration could be interesting but I'd have to do a lot of research. I used to run Swarm but now I mostly deploy my own stuff with Compose and have thus far managed to avoid learning very much about Kubernetes :)

Docker and the OCI container ecosystem

Posted Jul 27, 2022 21:29 UTC (Wed) by mathstuf (subscriber, #69389) [Link]

I've used hand-written `.service` files for podman for some smaller deployments (though `podman-systemd-generator` is likely better long-term) and a larger one uses `podman-compose`[1] to generate its systemd units (since they collaborate, it helps to ensure they all share a common network and set of volumes).

K8s just seemed *way* overkill for the "I have a single machine and just want it to be easy to manage upgrades of containers and the base system".

[1] https://github.com/containers/podman-compose

Docker and the OCI container ecosystem

Posted Jul 27, 2022 19:42 UTC (Wed) by lobachevsky (subscriber, #121871) [Link]

Since version 242, systemd-nspawn can also run OCI bundles. I vaguely remember it having that feature in the ancient past, before they were called OCI bundles, but it being removed because back then the format was fast-moving; I might be wrong, though, because I can't find any note of it anymore.

Docker and the OCI container ecosystem

Posted Jul 27, 2022 20:03 UTC (Wed) by jordan (subscriber, #110573) [Link]

I also dimly remember it having it in the distant past and support for it being removed. Very cool that it's back!

Docker and the OCI container ecosystem

Posted Jul 27, 2022 20:04 UTC (Wed) by mathstuf (subscriber, #69389) [Link]

Great writeup, thanks. Has there been any movement on updating the container storage to be something smarter, like `restic`, `ostree`, or `casync`, which use a rolling algorithm to break blocks into "chunks", allowing sharing at more than just the "layer" level? Is this just being left to different implementations to do as they see fit, or is it basically just overwhelmed into irrelevance at this point?
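(For reference, the rolling-hash chunking those tools use looks roughly like this minimal Python sketch; the parameters are made up for illustration and not taken from any real tool:)

```python
# Toy content-defined chunker in the spirit of casync/restic: a rolling
# hash over the last WINDOW bytes decides chunk boundaries, so identical
# runs of content produce identical chunks no matter where they appear
# in the stream.

WINDOW = 48            # rolling-hash window in bytes
MASK = (1 << 12) - 1   # boundary when the low 12 hash bits are zero (~4 KiB average)
MIN_CHUNK = 1 << 10    # never cut before 1 KiB
MAX_CHUNK = 1 << 16    # always cut by 64 KiB
BASE = 257
MOD = (1 << 31) - 1
DROP = pow(BASE, WINDOW, MOD)  # weight of the byte leaving the window

def chunk(data: bytes) -> list:
    """Split data into content-defined chunks; b"".join(result) == data."""
    chunks = []
    start = 0
    h = 0
    for i in range(len(data)):
        h = (h * BASE + data[i]) % MOD
        if i - start >= WINDOW:
            # slide the window: drop the byte that just left it
            h = (h - data[i - WINDOW] * DROP) % MOD
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Because boundaries depend only on local content, inserting bytes near the front of a stream only disturbs nearby chunks; everything after the next boundary re-synchronizes and deduplicates against the old chunks.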

Docker and the OCI container ecosystem

Posted Jul 27, 2022 20:18 UTC (Wed) by jordan (subscriber, #110573) [Link]

This is an area of interest to me - the containers/storage and containers/image libraries used to contain at least some level of ostree support, but I know at least part of that was removed because nobody was using it or willing to maintain it. I've seen other parts of it get some work at least semi-recently, though; I haven't yet explored what is currently possible there.

Podman supports <a href="https://www.redhat.com/sysadmin/image-stores-podman">additional image stores</a> - I've experimented with using this feature to store images in ostree repositories and then get delta updates instead of having to pull whole layers at once.

I know Balena has some sort of delta update technology for their container runtime, but I'm unclear how it works. There are also things like <a href="https://github.com/containerd/stargz-snapshotter">stargz</a> (usable with containerd/nerdctl) that do interesting things with how images are transferred over the network.

Docker and the OCI container ecosystem

Posted Jul 28, 2022 5:36 UTC (Thu) by bartoc (subscriber, #124262) [Link]

or git! I would _love_ to see git updated to be able to handle this stuff (support for reflinking into working tree, smarter compression heuristics, etc).

Actually what would be _really_ cool would be to actually support delta compression for some common binary file types, images could delta compress using hevc or AV1 (probably AV1), and I can imagine similar schemes for videos, although making those schemes fast is probably a research project and would be somewhat codec specific.

Ditto for executable files.

Docker and the OCI container ecosystem

Posted Jul 28, 2022 22:47 UTC (Thu) by k3ninho (subscriber, #50375) [Link]

I'm not sure I understand the problem space and whether this is a solution, but Joey Hess' git-annex [1] for managing large files is an overlooked but solid tool.

1: https://git-annex.branchable.com/

K3n.

Docker and the OCI container ecosystem

Posted Jul 29, 2022 13:00 UTC (Fri) by mathstuf (subscriber, #69389) [Link]

I agree that `git-annex` is a wonderful solution (far better than `git-lfs` IMO). However, its reliance on symlinks really sucks for Windows developers.

FWIW, my main issues with `git-lfs` include:

- doesn't reuse git's authentication mechanisms
- all data is only looked for at a single location (i.e., no multi-url remotes)
- fork-based workflows get confused when data is added and removed across some history that you rebase on (it will complain about the missing object but there's no nice way to fetch just that object so you can push without `git lfs fetch --all`)

Docker and the OCI container ecosystem

Posted Jul 29, 2022 12:30 UTC (Fri) by gscrivano (subscriber, #74830) [Link]

One thing we are trying in Podman/Buildah/CRI-O is to use a new file format (still OCI-compatible) to make every file addressable, so a "podman pull" would just fetch the files that are not present locally [1].

The layering model still makes some sense with the overlay filesystem, since when you mount a container, the underlying filesystem expects that layout anyway. One attempt to improve this model and make it easier to move away from layers is composefs [2]: a new filesystem that can mount the image directly while providing both storage and memory-page deduplication for shared files.

[1] https://www.redhat.com/sysadmin/faster-container-image-pulls
[2] https://github.com/containers/composefs
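The per-file addressing boils down to a content-addressed store, roughly like this Python sketch (the layout and function names are illustrative, not the actual containers/storage code):

```python
# Minimal sketch of a content-addressed file store: files are named by
# the SHA-256 of their contents, so a pull only needs to fetch digests
# that are not already present, and identical files are stored once.
import hashlib
import os

def put(root: str, content: bytes) -> str:
    """Store content under its digest; return the digest."""
    digest = hashlib.sha256(content).hexdigest()
    path = os.path.join(root, digest[:2], digest[2:])
    if not os.path.exists(path):      # already have it: nothing to write
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(content)
    return digest

def have(root: str, digest: str) -> bool:
    """A 'pull' would skip fetching any digest for which this is True."""
    return os.path.exists(os.path.join(root, digest[:2], digest[2:]))
```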

Docker and the OCI container ecosystem

Posted Aug 5, 2022 6:04 UTC (Fri) by hsiangkao (subscriber, #123981) [Link]

Hi, you could refer to the Nydus image service [1] and erofs fscache support for container images [2].
Rolling-hash deduplication plus fixed-size output compression is a work in progress.

[1] https://github.com/dragonflyoss/image-service/
[2] https://d7y.io/blog/2022/06/06/evolution-of-nydus/

Docker and the OCI container ecosystem

Posted Jul 28, 2022 5:32 UTC (Thu) by bartoc (subscriber, #124262) [Link]

systemd actually has native support for running "container-based" services, called "portable services".

I think it might even be able to deal with OCI images. In this scheme all the configuration of what to start (and of cgroups) is done in the normal systemd unit files, but the files are contained in some fs image, and that fs image is also set up with all the needed dependencies and such.

The portable service basically just handles the filesystem bits of containers, the rest is handled by existing systemd mechanisms (systemd already manages cgroups for you, and can create user namespaces and such as well).

Docker and the OCI container ecosystem

Posted Aug 1, 2022 17:04 UTC (Mon) by ctalledo (subscriber, #80668) [Link]

Excellent article, thanks! In the runtimes section, one option that may be worth mentioning is Sysbox, a runc alternative that enhances container isolation and workloads using pure OS virtualization (I am one of the developers). Sysbox enables the user namespace on all containers, virtualizes /proc and /sys within them, traps sensitive syscalls, and lets containers run systemd, Docker, K8s, K3s, and more seamlessly. That means Docker or Kubernetes can be used to deploy not just application containers, but also system containers. Docker recently acquired the company that developed Sysbox (Nestybox) to extend container isolation and workloads.

Docker and the OCI container ecosystem

Posted Aug 1, 2022 17:06 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

Headed slightly off-topic here, but I have to ask:

> Docker recently acquired the company that developed Sysbox (Nestybox) to extend container isolation and workloads.

Why does it always seem to be "acquire" and not "fund"?

Rocket containers

Posted Aug 4, 2022 3:37 UTC (Thu) by higuita (guest, #32245) [Link]

rkt was also a very good Docker replacement; k8s could even use it. But when Red Hat acquired CoreOS, rkt was killed in favor of containerd/runc and later podman... basically Red Hat killed CoreOS and almost all their tools :(

Rocket containers

Posted Aug 4, 2022 3:53 UTC (Thu) by rahulsundaram (subscriber, #21946) [Link]

> rkt was also a very good Docker replacement; k8s could even use it. But when Red Hat acquired CoreOS, rkt was killed in favor of containerd/runc and later podman... basically Red Hat killed CoreOS and almost all their tools :(

I don't see how. rkt didn't make the cut, but plenty of things from CoreOS continue to be developed, including CoreOS itself, along with Quay, k8s operators, Tectonic (integrated with OpenShift), and so forth.

Lack of a lightweight run time

Posted Aug 8, 2022 11:47 UTC (Mon) by rganesan (guest, #1182) [Link]

One thing the container ecosystem lacks is a lightweight embedded runtime. Docker runs into hundreds of megabytes with all its dependencies, and podman is not significantly lighter. Even containerd/nerdctl seems bloated. Why would I want "docker build" or "docker compose" on a small IoT device? Something like k3s would be nice, but without the whole k8s ecosystem.

Docker and the OCI container ecosystem

Posted Aug 18, 2022 12:41 UTC (Thu) by ricab (subscriber, #156192) [Link]

This is a jewel of an article!


Copyright © 2022, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds