Docker Image Optimization
This post was originally published on Developer.com.
Docker is everything. Some have even claimed that it can prevent crime and famine. (I probably don’t need to ask you to note the sarcastic tone.)
But seriously, for the disruption it has brought to the industry, and the real market gap it has filled for many, it deserves respect. However, as with all tools, Docker has its upsides and downsides. One pain point we constantly ran into when using Docker was image size. To truly leverage Docker in our daily work, we needed to find a way to significantly reduce the size of the output image, mostly because the majority of our clients require offline installations, which renders DockerHub unusable for us.
This post is a hands-on guide to the process we went through to cut our Dockerfiles down to half their size (using a single example), ending with the final, real, working example, complete with documentation.
So let’s start from the top…
Docker? Why?
Working on Cloudify eventually led us to work with containers. Why, you ask? Well, for us, containers provide the following:
- Multi-distro support: We’re able to run our stack on different distros with very minimal adjustments.
- Easier and more robust upgrades: We’re not there yet, but we’re trying to provide an environment where you can easily upgrade different components. Using containers will allow us to reduce the number of moving parts to a minimum so that we can have more confidence in our upgrade process. Fewer unknowns. Using containers will also allow us to generate a more unified upgrade flow.
- Modularity and Composability: Users will be able to build a different Cloudify topology by replacing or adding services. We’re aiming at having our stack completely composable. That means you’ll be able to deploy our containers on multiple hosts and cluster them easily by using well-defined deployment methods. Using containers will allow us to do just that.
New Problems
However, as we started journeying into using Docker, we stumbled onto several problems:
- Most of our customers require offline installations. Thus, we can’t use DockerHub. In turn, this means that we must export or save the images and allow customers to download them and import or load them on Cloudify’s Management machine(s).
- Due to the above, the images should be as small as possible. It’s very easy to create very large Docker images (just install openjdk), and as our stack is an extremely large one (as of the time this post is being written, the stack comprises Nginx, InfluxDB, Java, Riemann, logstash, Elasticsearch, Erlang, RabbitMQ, Python and NodeJS), we can’t allow our images to grow substantially.
- The stack has to be maintainable. Managing a stack of this complexity on every build is cumbersome. We have to make everything as organized as possible.
- As Cloudify is open source, we would like to provide a way for users/customers to build their own Cloudify stack. This requires that our Dockerfiles are tidy and that the environment’s setup is simplified as much as possible.
So what to do?
Naturally, we need to perform several steps to get to the holy grail of optimized Docker images and Dockerfiles. This is still a work in progress, but we’re getting there. Let’s review how we can optimize our Dockerfiles while taking all of the above considerations into account.
Orange you glad I didn’t say Dockerfile?
A lot of articles on the web suggest different methods for optimizing Docker images.
As Docker images are made up of layers, and each layer is added on top of the previous layer rather than replacing it, it’s important to understand the bits and bytes of building consolidated Dockerfiles.
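To make the layering point concrete, here is a minimal sketch. The scratch file name and its size are made up for illustration and are not part of our stack. A file written by one instruction stays stored in that instruction’s layer even if a later instruction deletes it, so the image does not shrink.

```dockerfile
FROM debian:jessie

# Layer 1: writes a 100MB scratch file (hypothetical, for illustration only).
RUN dd if=/dev/zero of=/tmp/scratch.bin bs=1M count=100

# Layer 2: removes the file, but layer 1 still carries its 100MB,
# so the resulting image does not get any smaller.
RUN rm /tmp/scratch.bin

# Consolidated alternative: create and remove the file in a single
# instruction, so it never ends up in any layer.
# RUN dd if=/dev/zero of=/tmp/scratch.bin bs=1M count=100 \
#  && rm /tmp/scratch.bin
```

Building this as written and then swapping the two RUN instructions for the commented-out one should show roughly a 100MB difference between the two images.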
Let’s take an example Dockerfile and see how we can optimize it. We’ll use a logstash Dockerfile as an example.
We won’t be spending time on learning how to write Dockerfiles though. To learn about the Dockerfile syntax, see Docker’s documentation.
Our Dockerfile:
Now let’s build this image.
docker build .
will produce this:
802.7 MB just for logstash? I don’t think so. Will this work? Yes. Will it scale? No.
Let’s summarize what we did there:
- We declared the base image we’ll be using.
- We copied a NOTICE file into the container.
- We set some environment variables.
- We created the logstash service directory.
- We copied a logstash configuration file.
- We downloaded and extracted logstash.
- We declared a volume for logstash’s logs so that we can mount it into the host or a data container to keep the logs persistent.
- We exposed ports to the underlying host.
- We declared the command which will be executed when the container is started.
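The steps above map onto Dockerfile instructions roughly as follows. This is only an illustrative sketch and not the original file: the paths, the download URL, the exposed port and the start command are all placeholders, and it assumes the logstash archive unpacks into a single top-level directory.

```dockerfile
# Base image (the one questioned later in this post).
FROM ubuntu:trusty

# Copy a NOTICE file into the image (destination path is a placeholder).
COPY NOTICE.txt /root/

# Environment variables (names and values are placeholders).
ENV LOGSTASH_SERVICE_DIR=/opt/logstash
ENV LOGSTASH_SOURCE_URL=https://example.com/logstash.tar.gz

# Create the logstash service directory.
RUN mkdir -p ${LOGSTASH_SERVICE_DIR}

# Copy a logstash configuration file.
COPY logstash.conf /etc/logstash.conf

# Download logstash (ADD fetches remote URLs but does not unpack them).
ADD ${LOGSTASH_SOURCE_URL} /tmp/logstash.tar.gz

# Extract it into the service directory.
RUN tar -xzf /tmp/logstash.tar.gz -C ${LOGSTASH_SERVICE_DIR} --strip-components=1

# Volume for logstash's logs, so they can be kept persistent.
VOLUME /var/log/logstash

# Expose a port (placeholder number).
EXPOSE 9999

# Command executed when the container starts (placeholder invocation).
CMD ["/opt/logstash/bin/logstash", "-f", "/etc/logstash.conf"]
```

Note that every instruction here produces its own layer, and the downloaded archive stays in the image after extraction.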
Let’s Optimize
We have several problems here.
Our image is bulky for no good reason, and our Dockerfile is disorganized and lacks any in-file documentation. Additionally, we would like the development process to be as short as possible, so that we waste little time waiting for images to be built.
Please note that I’m not declaring any of the following as best practices but rather potential practices.
These depend on the specific case you’re trying to solve. What is true for one case is not necessarily true for another. Think for yourselves.
Unnecessary base image
Do we really need to use the ubuntu:trusty image? This image is 188MB, while debian:jessie is 122MB. debian:wheezy is even smaller at 85MB.
Use the smallest base image possible.
Go binaries are typically statically linked. This allows us to use base images like scratch or BusyBox, which take up only a few MBs. If we’re running Go, we might use those images to end up with an image a few tens of MBs in size.
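As a minimal sketch of what that could look like, assuming a prebuilt, statically linked binary named myapp (a hypothetical name) sits next to the Dockerfile:

```dockerfile
# Start from the empty base image; nothing is inherited from it.
FROM scratch

# Copy the prebuilt static binary (hypothetical name) from the build context.
COPY myapp /myapp

# Run the binary directly; there is no shell in a scratch image.
ENTRYPOINT ["/myapp"]
```

The resulting image is essentially the size of the binary itself. This only works if the binary really has no dynamic library dependencies, which for Go usually means building with cgo disabled.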
In our case, we require Java to run. We could try to use either scratch or BusyBox and install apt and the rest of the dependencies on top of them, which would reduce the image size substantially.
By all means, go ahead and use BusyBox with only the dependencies required for your application to run.
For this example, we’ll use debian:jessie instead.
So we would do something like this:
FROM debian:jessie
And get:
While this isn’t necessarily a must, it just makes good sense. Why cram all kinds of junk you don’t need into your image?