In our development workflow we build images, push them to a registry, and then deploy services from them in a staging cluster. The workflow is severely bogged down by huge image pushes, because layers built from the exact same codebase on different workstations tend to end up with different hashes. We do understand how Docker works (i.e. one bit changes, the layer changes; a layer changes, all subsequent layers change too), but we still believe there is a lot of layer invalidation going on that isn't explainable by anything we do to our codebase or dependencies, and is exclusively due to the builds being performed on different machines. Our builds aren't terribly platform-dependent in principle (we don't compile anything to machine code), and the machines are all x86_64 Linux boxes anyway.
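To illustrate what we mean by "different hashes" (the image name below is just a placeholder): the mismatch shows up when the layer digests of the same tag, built from the same commit, are listed on two workstations and compared, e.g. with something like:

```
# List the content-addressable digests of every layer in the built image;
# running this on two workstations and diffing the output shows the divergence.
docker image inspect \
  --format '{{range .RootFS.Layers}}{{println .}}{{end}}' \
  registry.example.com/myservice:latest
```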
What are the tools, strategies and best practices that would help us debug why this is happening and possibly alleviate the situation?
(Important: one known best practice that we currently absolutely cannot afford is moving the build process to a single dedicated machine, possibly in the cloud. Please don't suggest this solution).
You can use a tool such as Dive (https://github.com/wagoodman/dive/) to inspect each layer's contents and size, which should help pin down which layers actually differ between two builds.
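For example, a minimal invocation (the image tag here is just a placeholder):

```
# Explore an already-built image layer by layer
dive registry.example.com/myservice:latest

# Or build from the Dockerfile in the current directory and inspect the result
dive build -t registry.example.com/myservice:latest .
```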
Apart from that, I can't help much without seeing the Dockerfiles.
Another good practice for me: use Docker-in-Docker to build your images. Usually a flow along the lines of the sketch below suffices (the docker:dind and docker:cli images are the upstream ones; the registry address and service name are placeholders, and registry login is omitted):
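```
docker network create build-net

# 1. Start a throwaway Docker daemon (the "plate"); TLS is disabled here
#    only to keep the sketch short.
docker run --privileged -d --name clean-builder --network build-net \
  -e DOCKER_TLS_CERTDIR="" docker:dind

# 2. Build and push from inside it, using only the checked-out source tree.
docker run --rm --network build-net \
  -e DOCKER_HOST=tcp://clean-builder:2375 \
  -v "$PWD":/workspace -w /workspace \
  docker:cli \
  sh -c 'docker build -t registry.example.com/myservice:latest . \
         && docker push registry.example.com/myservice:latest'

# 3. Destroy the plate so nothing from this build leaks into the next one.
docker rm -f clean-builder
docker network rm build-net
```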
The general idea is to always start from a "clean plate" and, once you're done, destroy the plate and repeat for the next build.