How to Build Slim Docker Images Fast

Do you remember those days when you wrote awesome software but you couldn’t install it on someone else’s machine, or it crashed there? Though this is never a nice experience, we could always say: “Well, it works on my machine!”
Nowadays, that’s not an excuse any more due to containerization.
Very briefly, with containerization, you pack your application and all necessary dependencies into an image. On execution, you run that image as a container. With this, you don’t have to mess around with another person’s system to make your software run. The container, and hence your software, should just run anywhere if it runs on your machine. This is also useful for data scientists deploying models that depend on different packages and different versions of those packages. In my opinion, data scientists must know how to create images and containers.
As you all know, Docker is the major player in that realm, and Docker images are ubiquitous. This is awesome, as you can, for example, start databases of different versions side by side without any hassle. Hacking together images for your apps is also very simple. This is due to the large number of base images and the simple definition language. However, when you hack images together without knowing what you are doing, you face two issues.
  1. You waste disk space as your images become unnecessarily fat.
  2. You waste time by waiting for builds that take too long.
In this article, I want to show you how to mitigate these two issues. Luckily, this only requires knowing a few tricks and techniques offered by Docker. To make the tutorial fun and useful, I’ll show you how to pack a Python app into a Docker image. You can find all the code referenced below in my accompanying code repository.
Are you ready? Let’s get it on.

The Tutorial

Let’s assume all our code lives in a single Python file, main.py. As we are cool kids, we use the latest and greatest Python version, which is 3.8 at the time of writing this article. Our app is just a simple webserver, and it depends on pandas, fastapi, and uvicorn (a minimal sketch of the app follows the steps below). We store the dependencies in a requirements.txt file. Locally, we develop the app in a virtual environment. This environment resides in a folder named .venv within the same folder as the code (this becomes important soon). Now, we decide to pack all that into a Docker image. To do so, all we have to do is
  1. Use a base image with Python 3.8 available.
  2. Copy over the code and the requirements file.
  3. Install the requirements and dependencies in the image.
  4. Expose a command that runs our app.
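For concreteness, here is a minimal sketch of what main.py and requirements.txt could look like. The endpoint and its logic are assumptions for illustration; only the dependencies are taken from above.
# main.py
import pandas as pd
from fastapi import FastAPI

app = FastAPI()

@app.get("/")
def index():
    # A trivial pandas computation, just to exercise the dependency
    df = pd.DataFrame({"value": [1, 2, 3]})
    return {"sum": int(df["value"].sum())}
The requirements file simply lists the three packages:
# requirements.txt
fastapi
pandas
uvicorn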
The first version of our Docker image looks like
FROM python:3.8.0-slim
COPY . /app
RUN apt-get update \
&& apt-get install gcc -y \
&& apt-get clean
WORKDIR /app
RUN pip install --user -r requirements.txt
ENTRYPOINT uvicorn main:app --reload --host 0.0.0.0 --port 8080
Apart from copying our code and requirements, we need to install GCC, as FastAPI requires it at installation time. We build our image with
docker build -t my-app:v1 .
The size of this image is about 683 MB, and it takes about a minute to build (excluding downloading the base image). Let’s see how we can reduce that.
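If you want to check the size on your own machine, the standard docker images command lists it in the SIZE column:
docker images my-app:v1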

Base Image

Regarding the base image, I already made a conscious choice by using Python slim. Why exactly did I choose that?
I could have taken, for example, a full Ubuntu or CentOS image, which would lead to an image size of more than 1 GB. But as I only need Python, there is no reason to install all of that.
On the lower end of the size spectrum, we could take python:3.8.0-alpine. But my code relies on pandas, which is a pain to install on Alpine. Alpine also has issues regarding stability and security. Furthermore, slim is only ~80 MB larger than Alpine, which is still fine. For more information about how to choose the optimal Python image, there are dedicated comparisons worth reading.
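If you want to compare the candidates yourself, pulling the tags and listing them side by side shows the difference (the exact numbers depend on the tag you pull):
docker pull python:3.8.0-slim
docker pull python:3.8.0-alpine
docker images python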

Build Context

When you build the image, the first line printed to your console says: Sending build context to Docker daemon. On my computer, this took around 5 seconds, and 154 MB were sent over. What is happening here? Docker copies all files and folders within the build context to the daemon. Here, the build context is the directory the Dockerfile is stored in. As we only need two text files, 154 MB sounds like quite a lot, doesn’t it? The reason is that Docker copies over everything, for example, the .venv folder that contains the virtual environment, or the .git folder.
To fix that, you only need to add a file named .dockerignore next to your Dockerfile. In this file, you list line by line what Docker should not copy over. This is like what git does with the .gitignore file. As a small example, let’s say we have a couple of Excel files and PNGs in our folder which we don’t want to copy over. The .dockerignore file looks like
*.xlsx
*.png
.venv
.git
In our example, after I added this file, “Sending build context to Docker daemon” only takes a couple of milliseconds, and only 7.2 kB are sent over. I also reduced the image size from 683 MB to 529 MB, a reduction of roughly the size of the former build context. Nice! Adding a .dockerignore file contributes to both accelerating builds and reducing image size.

Layer Caching

As already said, it takes around 60 seconds to build this image on my machine. Most of that time, I estimate ~99.98%, is spent installing the requirements and dependencies. You may now assume there is not much room for improvement here. But there is, when you have to build the image frequently! Why? Because Docker can make use of layer caching.
Every instruction in a Dockerfile represents a layer. By adding or removing something from a line, or by changing a file or folder it references, you change the corresponding layer. When this happens, this layer and all layers below it get rebuilt. Otherwise, Docker uses a cached version of that layer. To exploit that, you should structure your Dockerfiles such that
  1. Layers that don’t change often should appear close to the beginning of the Dockerfile. Installing the compiler is a good example here.
  2. Layers that change often should appear close to the end of the Dockerfile. Copying source code is the perfect example here.
Cool. Enough theory, let’s get back to our example.
Assume you don’t change the requirements, and you only update your code. This is quite common when developing software. Now, every time you build your image, these nasty dependencies are reinstalled. Building the image always takes the same amount of time. Annoying! We don’t make use of caching yet.
Here comes the magic new Dockerfile that resolves this problem:
FROM python:3.8.0-slim
RUN apt-get update \
&& apt-get install gcc -y \
&& apt-get clean
COPY requirements.txt /app/requirements.txt
WORKDIR /app
RUN pip install --user -r requirements.txt
COPY . /app
ENTRYPOINT uvicorn main:app --reload --host 0.0.0.0 --port 1234
That doesn’t look very magical or very different, right? The only things we did were to install GCC first and to separate copying the requirements from copying the source code.
GCC and system dependencies change very rarely; that is why this layer now appears very early. Requirements also change slowly, but more frequently than GCC; that is why this layer comes after the GCC one. Our source code changes very often; consequently, copying it over happens last. Now, when we make changes to our source code and rebuild the image, the dependencies are not reinstalled, as Docker uses the cached layers. Rebuilding now takes almost no time. That is great, as we can spend more time on testing and executing our app!
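To see the effect, change only main.py and rebuild with the same command as before. Depending on your Docker version, the unchanged steps are reported as cached, for example with lines like “---> Using cache” (classic builder) or “CACHED” (BuildKit):
docker build -t my-app:v2 .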

Multi-Stage Builds

In our example image, we have to install GCC to install FastAPI and uvicorn. But for running the application, we don’t need a compiler. Now imagine you need not only GCC but also other programs like Git, CMake, NPM, or …. Your production image becomes fatter and fatter.
Multi-stage builds to the rescue!
With multi-stage builds, you can define multiple images in the same Dockerfile. Each image performs a different step. You can copy files and artifacts produced in one image over to another. Most commonly, you have one image for building your app and another one for running it. All you need to do is copy the build artifacts and dependencies from the build image to the app image.
For our example, this looks like
# Here is the build image
FROM python:3.8.0-slim as builder
RUN apt-get update \
&& apt-get install gcc -y \
&& apt-get clean
COPY requirements.txt /app/requirements.txt
WORKDIR /app
RUN pip install --user -r requirements.txt
COPY . /app

# Here is the production image
FROM python:3.8.0-slim as app
COPY --from=builder /root/.local /root/.local
COPY --from=builder /app/main.py /app/main.py
WORKDIR /app
ENV PATH=/root/.local/bin:$PATH
ENTRYPOINT uvicorn main:app --reload --host 0.0.0.0 --port 1234
When we build that, we arrive at a final production image size of 353 MB. This is roughly half the size of our first version. Congratulations, not too bad. Remember: the smaller your production image gets, the better!
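As a bonus, you can build and inspect the build stage on its own, which is handy for debugging; the --target flag of docker build stops at the named stage:
docker build --target builder -t my-app:builder .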
As a side note, multi-stage builds also increase security. Why is that? Assume you need a secret, like an SSH key, to access certain resources at build time. Even if you delete that secret in a later layer, it is still present in the previous layers. This means someone with access to your image can extract that secret. With multi-stage builds, you only copy the necessary runtime artifacts into the production image. Consequently, the production image never sees the secret, and the issue is resolved.
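To make that concrete, here is a hedged sketch of the anti-pattern; the key file and the private package URL are hypothetical:
# Anti-pattern: the secret ends up baked into a layer of the shipped image
FROM python:3.8.0-slim
RUN apt-get update && apt-get install -y git && apt-get clean
# This layer contains the key and remains part of the image history
COPY id_rsa /root/.ssh/id_rsa
RUN pip install git+ssh://git@github.com/example/private-package.git
# This removes the file from the final filesystem, but not from the earlier layer
RUN rm /root/.ssh/id_rsa
In a multi-stage setup, the key would only ever enter the builder stage, and the production stage would copy just the installed packages, never the key.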
There are many more details on multi-stage builds than I can cover here; for further reading, the official Docker documentation on multi-stage builds is a good starting point.

Wrap Up

In this article, I have shown you a few easy tips and tricks for creating smaller Docker images that build faster. Remember:
  • Always add a .dockerignore file.
  • Think about the order of your layers and order them from slow- to fast-changing actions.
  • Try to use and exploit multi-stage builds.
I hope this will save you some disk space and time in the future.
Thank you for following this post. As always, feel free to contact me for questions, comments, or suggestions.
