From time to time, we need to combine or compose multiple tools into a single docker image. Instead of running a tedious yum install and waiting for your image to build, try combining the upstream docker images directly into one. Especially for docker images containing only utilities, or ones with a main purpose plus a couple of extra utilities, it is handy to inject the needed binary straight from an upstream image.
In this tour, we combine the upstream docker image for Python with a couple of utilities, to demonstrate how to merge multiple docker images into one. Such an image can be used, for example, in a CI/CD build pipeline, providing both the build environment for our Python application and the terraform command to run deployment steps. Everybody reading this probably has a specific use case in mind, so let me know in the comments what yours is.
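As a sketch of that pipeline use case, a hypothetical GitLab CI job could use the combined image built in this tour (the job name, registry, and image name are assumptions, not real infrastructure):

```yaml
# Hypothetical CI job using the combined python + terraform image.
deploy:
  image: registry.example.com/python-tools:latest
  script:
    - pip install -r requirements.txt   # Python build environment
    - terraform init                    # terraform available in the same image
    - terraform apply -auto-approve
```

Without the combined image, this single job would need two images, or an image build step of its own.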
Let us start with the base Python image:
```dockerfile
FROM python:3.9.16-bullseye

CMD [ "/bin/bash" ]
```
We add a terraform image in the same Dockerfile, and then we cut the terraform binary out of it:
```dockerfile
FROM hashicorp/terraform:1.4.0 AS terraform

FROM python:3.9.16-bullseye
COPY --from=terraform /bin/terraform /bin/terraform

CMD [ "/bin/bash" ]
```
If this is the first time you have seen multiple FROM statements in a Dockerfile, this is a feature called multi-stage builds.
You might want to check the documentation after completing this tour.
Update 2023-03-17: After a comment on a LinkedIn post, it turns out that COPY --from searches for the image name in the central registry if no previous stage with that name is found. Quoting the documentation of COPY:
COPY accepts a flag --from=<name> that can be used to set the source location to a previous build stage (created with FROM .. AS <name>) that will be used instead of a build context sent by the user. In case a build stage with a specified name can’t be found an image with the same name is attempted to be used instead.
The Dockerfile may therefore be simplified to just:
```dockerfile
FROM python:3.9.16-bullseye
COPY --from=hashicorp/terraform:1.4.0 /bin/terraform /bin/terraform

CMD [ "/bin/bash" ]
```
```shell
docker build -t python-tools .
```
Will it work?
```
> docker run --rm -it python-tools /bin/bash
root@9cb293fcc406:/# terraform --version
Terraform v1.4.0
on linux_amd64
root@9cb293fcc406:/# python --version
Python 3.9.16
```
It works. But why?
Some pieces of software can just be dropped onto our system and work. In this case, the upstream terraform image is actually built on the Alpine operating system. Nevertheless, the Go compiler builds statically linked binaries. Even though the GNU libc installed on our Debian operating system is not the same as Alpine's musl, the terraform binary itself does not expect any external compiled code at runtime. It was built as an all-in-one binary.
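One way to check whether a binary is self-contained is ldd, which lists the shared libraries a binary needs. A sketch, run inside the image we just built (the "not a dynamic executable" line is what glibc's ldd typically prints for a statically linked binary):

```
> docker run --rm -it python-tools /bin/bash
root@9cb293fcc406:/# ldd /bin/terraform
        not a dynamic executable
```

A dynamically linked binary would instead list its required shared libraries, and each of those would have to be present in the target image as well.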
However, the CPU architecture must match. In our case the architecture is amd64. If the upstream image does not have the same architecture as the image we want to inject the binary into, we are out of luck.
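Before copying from an upstream image, we can check its OS and architecture, for example (a sketch, using the same tag as above; the image must already be pulled locally):

```
> docker image inspect --format '{{.Os}}/{{.Architecture}}' hashicorp/terraform:1.4.0
linux/amd64
```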
This was easy; let us continue with one more challenge.
```dockerfile
FROM hashicorp/terraform:1.4.0 AS terraform
FROM curlimages/curl:7.87.0 AS curl

FROM python:3.9.16-bullseye
COPY --from=terraform /bin/terraform /bin/terraform
COPY --from=curl /usr/bin/curl /usr/bin/curl

CMD [ "/bin/bash" ]
```
How did we know that the curl binary lives in /usr/bin? Well, we didn't. Most of the time, the easiest way to find out is just to try the usual locations such as /bin and /usr/bin and see where the authors of the upstream image put the binary. If we would like to take a more scientific approach, we can examine the image:
```
> docker run --entrypoint '' --rm -it curlimages/curl:7.87.0 /bin/sh
/ $ which curl
/usr/bin/curl
```
Some images do not even have the which command inside, for example the terraform one. We can take a look at the source Dockerfile, examine the build scripts, or inspect the docker image. Usually the binary of interest is the image's entrypoint:
```
> docker inspect hashicorp/terraform:1.4.0 | grep -i entrypoint
            "ENTRYPOINT [\"/bin/terraform\"]"
> docker inspect hashicorp/terraform:1.4.0 | grep -i cmd
            "Cmd": [
```
Or take a look at how the image was built:
```
> docker image history --no-trunc hashicorp/terraform:1.4.0
IMAGE        CREATED
sha256:c43   3 days ago   /bin/sh -c #(nop) ENTRYPOINT ["/bin/terraform"]
```
After this where-is-the-binary excursion, let us build and try our image:
```shell
docker build -t python-tools .
```
Will it work?
```
> docker run --rm -it python-tools /bin/bash
root@866400d6b362:/# terraform --version
Terraform v1.4.0
on linux_amd64
root@866400d6b362:/# python --version
Python 3.9.16
root@866400d6b362:/# curl --version
bash: /usr/bin/curl: No such file or directory
```
It does not work. But why?
The error message is not exactly helpful in finding out what the problem actually is. When we examine the upstream curl docker image, we see it is an Alpine-based image. We can try to make it work, but spoiler alert: it is not worth it.
We can change the python image to an Alpine-based one.
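A minimal sketch of that change, assuming the alpine variant tag of the same Python version:

```dockerfile
FROM hashicorp/terraform:1.4.0 AS terraform
FROM curlimages/curl:7.87.0 AS curl

# Alpine-based Python, same libc family as the upstream curl image
FROM python:3.9.16-alpine
COPY --from=terraform /bin/terraform /bin/terraform
COPY --from=curl /usr/bin/curl /usr/bin/curl

CMD [ "/bin/sh" ]
```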
After the build, the next error will be:
```
> docker run --rm -it python-tools /bin/sh
/ # curl --version
Error loading shared library libcurl.so.4: No such file or directory (needed by /usr/bin/curl)
...
```
We can start copying lib files to our docker image, but after copying /usr/lib/libcurl.so we will see that other linked libraries are needed. The process becomes tedious and is not worth the time. It is easier to install curl using the OS package manager.
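For comparison, a sketch of the package-manager route on the original Debian-based image:

```dockerfile
FROM python:3.9.16-bullseye

# Install curl via the OS package manager; it pulls in libcurl and
# all other shared-library dependencies automatically.
RUN apt-get update \
    && apt-get install --no-install-recommends -y curl \
    && rm -rf /var/lib/apt/lists/*

CMD [ "/bin/bash" ]
```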
Depending on the tool we would like to use, we may be lucky and our image is already equipped with its dependencies. If we are on the same architecture (amd64) and the same version of the underlying operating system, it may work to just copy the binary.
When to use this approach
Use the combine approach:
- when we want to insert a tool which is not the main software for which we are building our docker image
- when the tool is a self-sufficient binary/script and ideally confined to one directory, so it is easy to cut out
- when the tool does not have additional dependencies on the operating system, or those are already met; e.g. Python tools on a Python image could work (presuming a compatible Python version)
In all other cases, it is easier to just run

```dockerfile
RUN apt update && apt install --no-install-recommends -y <tool> && apt clean
```

or similar, based on the underlying OS.
Pros and cons of combined docker images
- when it works, the Dockerfile is easy to write, and it is also easy to understand what is going on (e.g. we just want the terraform binary to be present)
- no hassle around apt/apk caches, whether we clean them, or a larger image because of them
- fast implementation; once we have done this and know the tool is self-sufficient, the next time some docker image needs the tool, the trick is 5 minutes of work
- stable versions and a deterministic docker image build. We literally just take layers from another, deterministically chosen docker image. There is no non-deterministic apt update or fetching of a latest version. The built image is always exactly, binary-wise, the same. This also speeds up subsequent builds, as layers are re-used.
- stable versions can also be a downside. If the image goes to production, we might forget to update our tools/dependencies and keep deploying older versions, which may be a security concern. By installing the latest version from upstream repositories instead, we would build an image with the latest patched version of the tools every time.
Let me know in the discussion about any other pros and cons you see.
Tools that work
From experience, generally all Go-based tools work. Scripts that are just one file also work, as do Java all-in-one uber-JAR files, assuming the image already has a compatible Java.