
Three simple steps to reduce Docker image size


When building Docker containers, you should always strive to minimize image size. Images that share layers and weigh less are transferred and deployed faster.


But how do you control the size when every RUN instruction creates a new layer, and you still need intermediate artifacts before the image itself can be built?


You may have noticed that most Dockerfiles use some rather strange tricks, for example:


    FROM ubuntu
    RUN apt-get update && apt-get install vim

Why the && ? Wouldn't it be easier to run two RUN instructions, like this?


    FROM ubuntu
    RUN apt-get update
    RUN apt-get install vim

Starting with Docker 1.10, the COPY , ADD and RUN instructions each add a new layer to the image, so the previous example creates two layers instead of one.




Layers are like Git commits.


Docker layers store the differences between the previous and the current version of the image. And, like Git commits, they are convenient when shared with other repositories or images: when you request an image from a registry, only the missing layers are downloaded, which makes distributing images between hosts much cheaper.


But layers come at a price: each one takes up space, and the more of them there are, the heavier the final image. Git repositories are similar in this respect: the size of a repository grows with the number of commits, because it has to store every change between them. It used to be good practice to combine several RUN statements into one line, as in the first example. Alas, that alone is no longer enough.
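A related classic trick, given here as a sketch rather than as part of this article's recipe: do the cleanup inside the same RUN that created the files, because files deleted by a later instruction still live on in the layer that created them.

```dockerfile
FROM ubuntu
# Update, install and clean up within a single RUN: removing the apt
# cache in a *later* layer would not shrink the image, since the files
# would still be stored in the layer that created them.
RUN apt-get update \
 && apt-get install -y vim \
 && rm -rf /var/lib/apt/lists/*
```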


1. Merge several layers into one with a multi-stage Docker build


When a Git repository grows too large, you can squash the entire change history into a single commit and forget about it. It turns out something similar can be done in Docker, via a multi-stage build.


Let's create a Node.js container.


Let's start with index.js :


    const express = require('express')
    const app = express()

    app.get('/', (req, res) => res.send('Hello World!'))

    app.listen(3000, () => {
      console.log(`Example app listening on port 3000!`)
    })

and package.json :


    {
      "name": "hello-world",
      "version": "1.0.0",
      "main": "index.js",
      "dependencies": {
        "express": "^4.16.2"
      },
      "scripts": {
        "start": "node index.js"
      }
    }

Package the application with the following Dockerfile :


    FROM node:8

    EXPOSE 3000
    WORKDIR /app
    COPY package.json index.js ./
    RUN npm install

    CMD ["npm", "start"]

Create an image:


 $ docker build -t node-vanilla . 

Check that everything works:


 $ docker run -p 3000:3000 -ti --rm --init node-vanilla 

Now open http://localhost:3000 and you should see "Hello World!" there.


The Dockerfile now contains COPY and RUN instructions, so we expect at least two new layers on top of the base image:


    $ docker history node-vanilla
    IMAGE          CREATED BY                                      SIZE
    075d229d3f48   /bin/sh -c #(nop) CMD ["npm" "start"]           0B
    bc8c3cc813ae   /bin/sh -c npm install                          2.91MB
    bac31afb6f42   /bin/sh -c #(nop) COPY multi:3071ddd474429e1…   364B
    500a9fbef90e   /bin/sh -c #(nop) WORKDIR /app                  0B
    78b28027dfbf   /bin/sh -c #(nop) EXPOSE 3000                   0B
    b87c2ad8344d   /bin/sh -c #(nop) CMD ["node"]                  0B
    <missing>      /bin/sh -c set -ex && for key in 6A010…         4.17MB
    <missing>      /bin/sh -c #(nop) ENV YARN_VERSION=1.3.2        0B
    <missing>      /bin/sh -c ARCH= && dpkgArch="$(dpkg --print…   56.9MB
    <missing>      /bin/sh -c #(nop) ENV NODE_VERSION=8.9.4        0B
    <missing>      /bin/sh -c set -ex && for key in 94AE3…         129kB
    <missing>      /bin/sh -c groupadd --gid 1000 node && use…     335kB
    <missing>      /bin/sh -c set -ex; apt-get update; apt-ge…     324MB
    <missing>      /bin/sh -c apt-get update && apt-get install…   123MB
    <missing>      /bin/sh -c set -ex; if ! command -v gpg > /…    0B
    <missing>      /bin/sh -c apt-get update && apt-get install…   44.6MB
    <missing>      /bin/sh -c #(nop) CMD ["bash"]                  0B
    <missing>      /bin/sh -c #(nop) ADD file:1dd78a123212328bd…   123MB

As you can see, the final image gained five new layers: one for each instruction in our Dockerfile . Now let's try a multi-stage build, splitting the same Dockerfile into two stages:


    FROM node:8 as build

    WORKDIR /app
    COPY package.json index.js ./
    RUN npm install

    FROM node:8

    COPY --from=build /app /
    EXPOSE 3000
    CMD ["index.js"]

The first stage of the Dockerfile creates three layers. They are then merged and copied into the second, final stage, and two more layers are added on top of it. In total, the final image gains three layers.




Let's try it. First, build the image:


 $ docker build -t node-multi-stage . 

Checking history:


    $ docker history node-multi-stage
    IMAGE          CREATED BY                                      SIZE
    331b81a245b1   /bin/sh -c #(nop) CMD ["index.js"]              0B
    bdfc932314af   /bin/sh -c #(nop) EXPOSE 3000                   0B
    f8992f6c62a6   /bin/sh -c #(nop) COPY dir:e2b57dff89be62f77…   1.62MB
    b87c2ad8344d   /bin/sh -c #(nop) CMD ["node"]                  0B
    <missing>      /bin/sh -c set -ex && for key in 6A010…         4.17MB
    <missing>      /bin/sh -c #(nop) ENV YARN_VERSION=1.3.2        0B
    <missing>      /bin/sh -c ARCH= && dpkgArch="$(dpkg --print…   56.9MB
    <missing>      /bin/sh -c #(nop) ENV NODE_VERSION=8.9.4        0B
    <missing>      /bin/sh -c set -ex && for key in 94AE3…         129kB
    <missing>      /bin/sh -c groupadd --gid 1000 node && use…     335kB
    <missing>      /bin/sh -c set -ex; apt-get update; apt-ge…     324MB
    <missing>      /bin/sh -c apt-get update && apt-get install…   123MB
    <missing>      /bin/sh -c set -ex; if ! command -v gpg > /…    0B
    <missing>      /bin/sh -c apt-get update && apt-get install…   44.6MB
    <missing>      /bin/sh -c #(nop) CMD ["bash"]                  0B
    <missing>      /bin/sh -c #(nop) ADD file:1dd78a123212328bd…   123MB

Now check whether the image size has changed:


    $ docker images | grep node-
    node-multi-stage    331b81a245b1    678MB
    node-vanilla        075d229d3f48    679MB

Yes, it has become smaller, but not by much: both images still carry the entire node:8 base, so all we saved were npm's intermediate files.
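One more squeeze is possible at this step (a sketch, not part of the original recipe): install only runtime dependencies in the build stage. Our demo package.json has no devDependencies, so here it changes nothing, but in real projects it often does.

```dockerfile
FROM node:8 as build

WORKDIR /app
COPY package.json index.js ./
# --production skips devDependencies, so test runners, linters
# and other build-time tools never make it into the final image
RUN npm install --production

FROM node:8

COPY --from=build /app /
EXPOSE 3000
CMD ["index.js"]
```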


2. Strip everything unnecessary from the container with distroless


The current image ships Node.js, yarn , npm , bash and many other binaries. It is also based on a full Debian distribution, so by deploying it we deploy an entire operating system's worth of binaries and utilities.


However, we do not need them to run the container. The only dependency needed is Node.js.


A Docker container should run a single process and contain only the minimum set of tools needed to start it. A whole operating system is not required for that.


So we can strip out everything except Node.js itself.


But how?


Google has already solved this problem: GoogleCloudPlatform/distroless .


The description for the repository reads as follows:


"Distroless" images contain only the application and its runtime dependencies. There are no package managers, shells or other programs you would expect to find in a standard Linux distribution.


This is what you need!


Update the Dockerfile to build on the new base image:


    FROM node:8 as build

    WORKDIR /app
    COPY package.json index.js ./
    RUN npm install

    FROM gcr.io/distroless/nodejs

    COPY --from=build /app /
    EXPOSE 3000
    CMD ["index.js"]

Build the image as usual:


 $ docker build -t node-distroless . 

The application should run as before. To check, start the container:


 $ docker run -p 3000:3000 -ti --rm --init node-distroless 

And open http://localhost:3000 . Has the image become lighter without the extra binaries?


    $ docker images | grep node-distroless
    node-distroless    7b4db3b7f1e5    76.7MB

It certainly has! It now weighs only 76.7MB, a whole 600MB less!


All well and good, but there is one important caveat. Normally, when a container is running and you need to inspect it, you attach to it with:


 $ docker exec -ti <insert_docker_id> bash 

Attaching to a running container and launching bash feels much like opening an SSH session.


But since distroless is a stripped-down version of the original operating system, there are no extra binaries and, in fact, no shell at all!


How to connect to a running container if there is no shell?


The most interesting part: you can't.


That may sound bad, since only binaries can be executed inside the container, and the only one available is Node.js:


 $ docker exec -ti <insert_docker_id> node 

In fact, there is an upside here: if an attacker somehow gains access to the container, they can do far less damage than they could with a shell. In other words: fewer binaries, less weight, better security. But, admittedly, at the cost of more painful debugging.


It's worth noting that you shouldn't be attaching to and debugging containers in production anyway; it's better to rely on properly configured logging and monitoring.


But what if we do need to debug, while still keeping the Docker image as small as possible?


3. Reduce Base Images with Alpine


You can replace the distroless base image with an Alpine one.


Alpine Linux is a security-oriented, lightweight distribution based on musl libc and busybox . But let's not take that on faith; let's check.


Change the Dockerfile to use node:8-alpine :


    FROM node:8 as build

    WORKDIR /app
    COPY package.json index.js ./
    RUN npm install

    FROM node:8-alpine

    COPY --from=build /app /
    EXPOSE 3000
    CMD ["npm", "start"]

Create an image:


 $ docker build -t node-alpine . 

Check the size:


    $ docker images | grep node-alpine
    node-alpine    aa1f85f8e724    69.7MB

We end up with 69.7MB, even less than the distroless image.


Let's check whether we can attach to a running container (with the distroless image we couldn't).


We start the container:


    $ docker run -p 3000:3000 -ti --rm --init node-alpine
    Example app listening on port 3000!

And connect:


    $ docker exec -ti 9d8e97e307d7 bash
    OCI runtime exec failed: exec failed: container_linux.go:296: starting container process caused "exec: \"bash\": executable file not found in $PATH": unknown

No luck. But maybe the container at least has sh ...:


    $ docker exec -ti 9d8e97e307d7 sh
    / #

Great! We managed to attach to the container, and its image is smaller, too. But it doesn't come without caveats.


Alpine images are based on musl , an alternative implementation of the standard C library, while most Linux distributions such as Ubuntu, Debian and CentOS are based on glibc . Both libraries are supposed to provide the same interface to the kernel.


However, they pursue different goals: glibc is the most widespread and the faster of the two, while musl uses less space and is written with an emphasis on security. When an application is compiled, it is, as a rule, compiled against a particular C library; to use it with another one, it has to be recompiled.


In other words, building containers on Alpine images can lead to surprises, because the standard C library underneath is different. The difference shows up when working with precompiled binaries, such as C++ extensions for Node.js.


For example, the PhantomJS package does not work on Alpine.
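When a native module merely needs to be recompiled against musl (unlike PhantomJS, which ships prebuilt glibc binaries), a common workaround is to install a temporary toolchain for node-gyp and remove it in the same layer. A sketch, assuming the dependencies compile cleanly:

```dockerfile
FROM node:8-alpine

WORKDIR /app
COPY package.json index.js ./
# Install the build toolchain, compile the native modules, then delete
# the toolchain again - all in one RUN, so it never persists in a layer.
RUN apk add --no-cache --virtual .build-deps python make g++ \
 && npm install \
 && apk del .build-deps
EXPOSE 3000
CMD ["npm", "start"]
```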


So which base image to choose?


Alpine, distroless or a vanilla image: which to choose is, of course, best decided case by case.


If you are running in production and security matters, distroless is probably the best fit.


Every binary added to a Docker image adds some risk to the stability of the application as a whole. That risk can be reduced by installing just one binary in the container.


For example, if an attacker finds a vulnerability in an application running on a distroless image, they won't be able to spawn a shell in the container, because there isn't one!


If for some reason image size is critically important to you, definitely look at Alpine-based images.


They are really small, although at the cost of compatibility: Alpine uses a slightly different standard C library, musl, so problems occasionally pop up. Examples can be found here: https://github.com/grpc/grpc/issues/8528 and https://github.com/grpc/grpc/issues/6126 .


Vanilla images are ideal for testing and development.


Yes, they are big, but they are as close as you can get to a full-fledged machine, with all the usual OS binaries available.


Let's summarize the size of the received Docker images:


node:8 681MB
node:8 with multi-stage build 678MB
gcr.io/distroless/nodejs 76.7MB
node:8-alpine 69.7MB


Parting words from the translator


Read other articles on our blog:


Stateful backups in Kubernetes


Backup a large number of heterogeneous web-projects


Telegram-bot for Redmine. How to simplify the life of yourself and people



Source: https://habr.com/ru/post/437372/