I was asked this question in an interview and I'm not sure of the correct answer, so I would like your suggestions.

I was asked whether we should persist production-critical data inside the Docker container or outside of it. What would my choice be, and what are the reasons for it?

Would your answer differ in case the data is non-production and non-critical?

Back your answers with reasons.

2 Answers

DazWilkin

Most data should be managed externally to containers and container images. I tend to view data confined to a container as temporary (intermediate or discardable) data. Otherwise, if it's being captured but isn't important to my business, why create it?

The name "container" is misleading. Containers aren't like VMs where there's a strong barrier (isolation) between VMs. When you run multiple containers on a single host, you can enumerate all their processes using ps aux on the host.

There are good arguments for maintaining separation between processes and data, and running both within a single container makes it more challenging to retain this separation.

Files in container layers are, however, more isolated than processes. Although the layers are manifested as files on the host OS, you can't simply ls a container layer's files from the host. This makes accessing data in a container more complex. There's also a performance penalty for effectively running one file system atop another.

While it's common and trivial to move container images between machines (viz. docker push and docker pull), it's less easy to move containers between machines. This isn't generally a problem for processes: these (config aside) are stateless and easy to recreate. Your data, however, is state, and you want to be able to move it easily (for backups and recovery) and, increasingly, among a dynamic pool of nodes that process it.

Less importantly but not unimportantly, it's relatively easy to perform the equivalent of rm -rf * with Docker by removing containers (docker container rm ...), thereby deleting both the application and your data.
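One common mitigation for that risk is a named volume, which is managed separately from the container's lifecycle. A hypothetical Compose sketch (service and volume names are made up for illustration):

```yaml
# docker-compose.yml (illustrative only)
services:
  db:
    image: postgres:16
    volumes:
      # Named volume: survives "docker container rm" and "docker compose down";
      # it is only deleted by "docker compose down -v" or "docker volume rm".
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:
```

With this layout, removing the container deletes only the process and its writable layer; the data in the named volume remains until you explicitly remove the volume.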

David Maze

The two most basic considerations you should keep in mind here:

  1. Whenever a container gets deleted, everything in the container filesystem is lost.
  2. It's extremely common to delete containers; deleting and recreating a container is required to change many startup options or to update to a newer image.

So you don't really want to keep anything "in the container" as its primary data storage: it's inaccessible from outside the container, and it will be lost the next time there's a critical security update and you must delete the container.

In plain Docker, I'd suggest keeping

...in the image: your actual application (the compiled binary or its interpreted source as appropriate; this does not go in a volume)

...in the container: /tmp

...in a bind-mounted host directory: configuration files you need to push into the container at startup time; directories of log files produced by the container (things where you as an operator need to directly interact with the files)

...in either a named volume or bind-mounted host directory: persistent data the container records in the filesystem
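The four categories above might look like this in a single hypothetical Compose service (the image name and all paths are assumptions for illustration, not a prescribed layout):

```yaml
services:
  app:
    image: myapp:1.2.3          # the application itself lives in the image
    tmpfs:
      - /tmp                    # scratch data stays in the container (here, a tmpfs)
    volumes:
      - ./config:/etc/myapp:ro  # bind mount: configuration pushed in at startup
      - ./logs:/var/log/myapp   # bind mount: log files the operator reads directly
      - appdata:/var/lib/myapp  # named volume: persistent application data

volumes:
  appdata:
```

Deleting and recreating the app container here loses nothing but /tmp, which is the intent.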

On this last point, consider trying to avoid this layer altogether: keeping data in a database running "somewhere else" (which could be another container, or a cloud service like RDS, ...) simplifies things like backups and makes it easier to run multiple replicas of the same service. A host directory is easier to back up, but in some environments (macOS) bind mounts are unacceptably slow.
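Moving the state into a separate database service might look like this sketch (the image tag, credentials, and URL format are placeholders, not a recommendation):

```yaml
services:
  web:
    image: myapp:1.2.3           # stateless: safe to delete, recreate, or scale out
    environment:
      DATABASE_URL: postgres://app:secret@db:5432/app
    depends_on:
      - db
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: app
    volumes:
      - dbdata:/var/lib/postgresql/data   # the only stateful piece to back up

volumes:
  dbdata:
```

Swapping the db service for a managed database (RDS and similar) removes even that volume from your backup responsibilities.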

My answers don't change here for "production" vs. "non-production" or "critical" vs. "non-critical", with limited exceptions you can justify by saying "it's okay if I lose this data" ("because it's not the master copy of it").