We have set up a test cluster with Mesosphere on AWS, in a private VPC. We have some Docker images which are public, which are easy enough to deploy. However most of our services are private images, hosted on the Docker Hub private plan, and require authentication to access.
Mesosphere is capable of private registry authentication, but it achieves this in a not-exactly-ideal way: a HTTPS URI to a .dockercfg file needs to be specified in all Mesos/Marathon task definitions.
As the title suggests, the question is basically: how should the .dockercfg file be hosted within AWS so that access may be restricted to only the Mesos master+slaves as tightly as possible?
Since the Mesos docs are pretty poor on this, I'm going to answer this wiki-style and update this answer as I go.
Strategies that should work
Host it on S3 (with networking-based access restrictions)
Host the .dockercfg file on S3. For better security, you should consider putting it in its own bucket, or otherwise a bucket dedicated to storing secrets. This presents some interesting challenges in creating a security policy that will actually work to lock the S3 bucket down such that only Mesos can see it, but it can be done.
Mesos task configuration:
S3 bucket policy (using a VPC Endpoint):
Note: this policy lets the allowed principal do anything, which is too sloppy for production, but should help when debugging in a test cluster.
You'll also need a VPCE configuration, to give you a VPCE ID to plug into the S3 bucket condition above. (I guess if you don't use VPC endpoints you could just match on a VPC id instead?)
You can check whether this is working by going to the Mesos UI (if you are using DCOS, this is not the pretty DCOS UI) and observing whether tasks with the name of your app appear in either the Active Tasks or Completed Tasks lists.
Tempting strategies that don't work (yet)
Host it on S3 (with signed URLs)
In this S3 variant, rather than use networking-based access restrictions, we use a signed URL to the .dockercfg file instead.
The Mesos task config should look like:
Unfortunately the above S3 signed URL strategy does not work due to Mesos-1686 which observes that any downloaded file retains the remote filename exactly, including the query string, leading to a filename like ".dockercfg?AWSAccessKeyId=foo&Expires=bar&Signature=baz". Since the Docker client does not recognise the file unless it is exactly named ".dockercfg" it fails to see the auth credentials.
Transfer the .dockercfg file directly to each slave
One could SCP the .dockercfg to each Mesos slave. While this is a quick fix, it:
This could be turned into a more viable production approach if automated with a Configuration Management tool like Chef, which would run on the slaves, and pull the .dockercfg file in to the right place.
This will lead to a config like:
Since 'core' is the default user on the CoreOS based Mesos slaves, and the .dockercfg is expected by convention to be in the home directory of the current user that wants to use Docker.
Update: this should have been the most reliable approach, but I have not found a way to do it yet. the app is still eternally stuck in the 'Deploying' phase as far as Marathon is concerned.
Use a keystore service
As we are dealing with usernames and passwords, the AWS Key Management Service (or even CloudHSM at the extreme) thing seems like it should be a good idea - but AFAIK Mesos has no built-in support for this, and we are not handling individual variables but a file.
Troubleshooting
After you have set up your solution of choice, you may find that the .dockercfg file is being pulled down OK but your app is still stuck in the 'Deploying' phase. Check these things...
Ensure your .dockercfg is the right format for the Mesos Docker version
At some point, the format for the 'auth' field was changed. If the .dockercfg you supply doesn't match this format then the docker pull will silently fail. The format that the Mesos Docker version on the cluster slaves expects is:
Do not use port 80 for your app
If you are trying to deploy a Web app, make sure you did not use the host port 80 - it's not written anywhere in the docs, but Mesos Web services require port 80 for themselves, and if you try and take 80 for your own app it will just hang forever. The astute reader will notice that, among other reasons, this is why the Mesosphere "Oinker" Web app binds to the slightly unusual choice of port 0 instead.