What is the disk I/O when docker containers are launched with host mounted volumes and how can I reduce it?


When running a docker container with host mounted volumes, both from docker and docker-compose, on RHEL, I observe a large amount of disk I/O (using dstat) before the container is launched. The I/O is associated with the dockerd process, and I can clearly increase or reduce the I/O by adding or removing host volume mounts. If I do not mount any host volumes, the container launches immediately. If I mount volumes that cover a large part of the file system, the I/O is significant: in my case about 20 GB that takes about three minutes before the container launches. In some cases, this causes docker-compose up orchestration to simply time out.
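For reference, I watched the I/O roughly like this, with monitoring in one terminal and the container launch (full command below) in another; the exact dstat/pidstat invocations are my own choice and not essential to the question:

# per-process disk I/O; dockerd dominates while the container is being set up
dstat --top-io --top-bio 1
# or follow dockerd directly (requires the sysstat package)
pidstat -d -p $(pidof dockerd) 1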

A typical run command looks like this:

# /host/app/src and /host/data/write are host volumes; my_ro_data is an external named volume
docker run -it --rm --name my_container \
  -v /host/app/src:/app:ro,z \
  -v my_ro_data:/data/read_only/files:ro,z \
  -v /host/data/write:/data/container_output/files:z \
  my/image:latest

The I/O occurs regardless of whether the volume is a pre-defined named volume, and regardless of whether the read-only syntax is used. For reference, the external named volume is defined like this:

docker volume create \
--driver local \
--opt type=none \
--opt o=bind \
--opt device=/host/data/files \
my_ro_data
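Inspecting the volume confirms it is just a bind mount of the host directory, with the same options passed at creation:

docker volume inspect my_ro_data
# the Options block should show: "type": "none", "o": "bind", "device": "/host/data/files"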

I assume the I/O is related to the overlay file system, but I cannot find any clear explanation of what exactly is being written, where it is being written, and how to perhaps optimize a configuration to require less I/O before container launch. It is clearly not the contents of the entire volume, so it would seem to be some sort of differential? However, imagine I have some sort of large-scale data pipeline and I want to point my container at host source or target directories with terabytes of files... How can I mount host volumes with less impact on container startup latency?

Update: Based on guidance from @BMitch, I focused on the SELinux-related ":z" label.

Brief history: Originally (about a year prior to this post) the mounted volumes were not accessible to the docker container on our RHEL w/SELinux server without this label. Even though --volumes-from is a different CLI option, its documentation had the best explanation, which other sources were referring to when solving the access issues:

Labeling systems like SELinux require that proper labels are placed on volume content mounted into a container. Without a label, the security system might prevent the processes running inside the container from using the content. By default, Docker does not change the labels set by the OS. To change the label in the container context, you can add either of two suffixes :z or :Z to the volume mount. These suffixes tell Docker to relabel file objects on the shared volumes. The z option tells Docker that two containers share the volume content. As a result, Docker labels the content with a shared content label. Shared volume labels allow all containers to read/write content. The Z option tells Docker to label the content with a private unshared label.

This explanation is accompanied elsewhere by a warning:

Bind-mounting a system directory such as /home or /usr with the Z option renders your host machine inoperable and you may need to relabel the host machine files by hand.

So I used ":z" and sometimes ":ro,z" and everything worked fine.

It turns out this label is what causes the pre-launch disk I/O. I do not understand SELinux security and labels well, but I imagine the I/O is the actual relabeling of files when the volume is mounted, so the more files, the longer the disk I/O.
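As a rough sanity check (I may be misreading the details), the SELinux context on the host files can be compared before and after a run with ":z"; on recent container-selinux policies the relabeled files should show a container_file_t type (older systems used svirt_sandbox_file_t):

# before a run with :z, note the existing context on the host directory
ls -dZ /host/app/src
# after a run with -v /host/app/src:/app:ro,z the type should change to container_file_t,
# applied recursively to every file underneath, which is where the relabeling I/O would go
ls -dZ /host/app/src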

My observation is that removing the labels, and doing nothing else, results in the same behavior, meaning the default behavior of the docker engine now appears to be to treat SELinux mounted volumes as if they were labeled ":z". I believe this is a new behavior that may have been introduced over the past year... or some other system change... because now the volumes are accessible without the label (or maybe the labels are permanent, allowing subsequent docker access).

However, removing the :z does not solve the I/O and long startup time. I then found this GitHub conversation, which claims both :z and :Z are potentially dangerous choices, along with a comment:

If your container does require broader access to system directories, then use of '--security-opt label:disable' with the 'docker run' command is a better alternative. Note that using the above option instead will disable SELinux checks for that container.

So I added this option and, in fact, the volumes were accessible and there was zero (or minimal) disk I/O and startup latency.
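For reference, the run command now looks roughly like this (same placeholder paths and image as above):

# no :z/:Z suffixes, so no recursive relabeling at startup;
# newer Docker documentation spells the option --security-opt label=disable
docker run -it --rm --name my_container \
  --security-opt label:disable \
  -v /host/app/src:/app:ro \
  -v my_ro_data:/data/read_only/files:ro \
  -v /host/data/write:/data/container_output/files \
  my/image:latest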

That said, I truly do not understand the repercussions of --security-opt label:disable and would welcome any additional advice or explanation.

1 Answer

Answer from BMitch:

There are several possibilities, and it's not clear which applies based on your question.

If it were the overlay filesystem, that would be unexpected, since there's no copying of files to set it up. And that also doesn't match your description, since the I/O only happens with volumes, and you have an overlay filesystem for every container (assuming the graph driver is set to overlay).

For a rootless docker daemon (dockerd not running as root), you could see the graph driver fall back to the copy-based vfs driver, which does mean it copies the image filesystem for every layer and container, which is very expensive. But you would notice this both from lots of disk space being used, and because it would happen even without volumes.
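You can check which graph driver is in use with docker info, e.g.:

docker info --format '{{.Driver}}'
# expect "overlay2" on a typical rootful install; "vfs" would explain heavy copying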

With a named volume, when the volume is empty and a container is created with it mounted, docker will initialize the volume with the contents of the image at the mount path. This includes all files, permissions, ownership, and other metadata. This initialization step is skipped when the named volume already has data, and it doesn't happen with host volumes.
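For example, you can watch that copy happen with a throwaway volume (assuming a default rootful install with volumes under /var/lib/docker):

docker volume create demo_vol
docker run --rm -v demo_vol:/etc alpine true      # first use: the image's /etc is copied into demo_vol
sudo ls /var/lib/docker/volumes/demo_vol/_data    # the volume now contains those files
docker run --rm -v demo_vol:/etc alpine true      # later runs find a non-empty volume and skip the copy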

If your issue is specific to the host volumes, the only thing left I can think of is the SELinux labels being set with the "z" option on the volume. Otherwise, both named and host volumes are Linux bind mounts by default, and these are typically very quick operations.

Lastly, it's not clear whether "before the container launches" includes the time taken by the app inside the container to get to a ready state. To separate the docker steps from the application inside the container, replace the app with something trivial, e.g. pass --entrypoint true to docker run, which will cause the container to exit immediately after being created. If it takes 30 seconds for that to exit, you know docker is slow; but if it immediately goes to an exited state, then the issue has nothing to do with docker and the problem is what you're running inside the container.
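For example, something like this, reusing the volumes from your question, separates the docker time from the app time:

time docker run --rm --entrypoint true \
  -v /host/app/src:/app:ro,z \
  -v /host/data/write:/data/container_output/files:z \
  my/image:latest
# a long wall-clock time here points at docker (volume setup / relabeling), not the application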