Git: How are copies of a file with a shared history handled?


I back up my CSS userstyles to a git repo like so:

❯ fd
stylus-2021-05-18.json
stylus-2021-05-20.json

These backup files are obviously mostly the same, i.e., stylus-2021-05-18.json is the past history of stylus-2021-05-20.json. How is this handled by git?

Obviously, I could just rename the files to stylus.json and let git handle the version control completely, but I was wondering if git is smart enough that it could work with these files automatically.


There are 2 answers

joanis On BEST ANSWER

TL;DR

Commits are always created as full file snapshots, but garbage collection creates packfiles, which store similar blobs efficiently using delta compression, whether they come from the same file or not.

Intro

My understanding of Git storing "diffs" rather than full files was all wrong. After some reading and a few experiments, I see that it doesn't matter whether you modify a file or create a copy of it: when you commit the change or the new file, Git creates a brand-new blob every time (unless the content is byte-for-byte identical to a blob it already has).

But that's pretty inefficient, because you end up with many copies of nearly the same text, with only small diffs between blobs. That problem gets fixed when Git creates packs. I don't fully understand how Git chooses what to compare, but inside a pack it stores some blobs whole and others as deltas against other blobs.

Experiment

# create a big file and commit it
seq 1 1000000 | shuf > bigfile
git add bigfile
git commit -m'bigfile'

At this point, find .git -ls shows me one big blob (3.5MB) storing this 6.9MB file. The blob is smaller than the file because loose objects are zlib-compressed.

# modify the big file and commit the change
echo change >> bigfile
git commit -m'modify bigfile' bigfile

At this point, find .git -ls shows me two big blobs, each about 3.5MB. Seems pretty inefficient to me, but read on...

# Add another big file, similar to the first one, and commit it
cp bigfile bigfile2
echo some trivial change >> bigfile2
git add bigfile2
git commit -m'bigfile2'

Things don't get better: find .git -ls shows me three big blobs, each about 3.5MB!

Now, Git will pack these objects on its own at some point (some commands trigger an automatic gc), but we can force that to happen right now: run git gc. That's not just garbage collection, as I incorrectly thought; it also creates packfiles. After running git gc, find .git -ls reports a single pack of about 3.2MB. So my three big blobs were identified as similar, compressed better, and stored efficiently. Git calls this "delta compression".
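To check the numbers without reading find output by eye, the experiment can be repeated with git count-objects and git verify-pack. This is only a sketch: it runs in a throwaway directory, uses a smaller file than above (200000 lines, my choice), and sets a made-up demo identity per command so it does not depend on your global config.

```shell
# Sketch: compare loose-object size with packed size in a scratch repository.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
# Hypothetical identity, passed per command so no global config is needed.
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q "$@"; }

seq 1 200000 | shuf > bigfile
git add bigfile
commit -m 'bigfile'

echo change >> bigfile
commit -m 'modify bigfile' bigfile

cp bigfile bigfile2
echo some trivial change >> bigfile2
git add bigfile2
commit -m 'bigfile2'

# "size" is the disk usage of loose objects, in KiB.
size_loose=$(git count-objects -v | awk '/^size:/ {print $2}')

git gc -q

# After gc, "count" (loose objects) should drop and "size-pack" reports the pack.
loose_after=$(git count-objects -v | awk '/^count:/ {print $2}')
size_pack=$(git count-objects -v | awk '/^size-pack:/ {print $2}')
echo "loose before gc: ${size_loose} KiB, pack after gc: ${size_pack} KiB"

# Blobs stored as deltas show two extra columns: delta depth and base object ID.
git verify-pack -v .git/objects/pack/pack-*.idx | grep blob
```

In the verify-pack listing, the blob lines with a trailing object ID are the ones stored as deltas against another blob; the line without one is stored whole.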


Joachim Sauer On

Purely from a technical perspective, this is easy: if two files in a git history ever have exactly (byte-for-byte) the same content, then they will reference the same blob object* and the actual content will only be stored once. So if your current version of fileA is the same as fileB from 2 commits ago, then the content will still only be stored once in the .git subdirectory. This works regardless of whether the files have different names, are in the same commit or different ones, or live on different paths: as long as the content is identical, the blob will be reused.
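A quick way to see this for yourself is git hash-object, which prints the object ID Git would assign to a file's content. The sketch below uses made-up JSON content as a stand-in for the real backups; only the file names come from the question.

```shell
# Sketch: identical bytes under different names map to one blob ID.
set -e
dir=$(mktemp -d)
cd "$dir"

# Hypothetical content standing in for a real Stylus export.
printf '{"style": "body { color: red }"}\n' > stylus-2021-05-18.json
cp stylus-2021-05-18.json stylus-2021-05-20.json

# The ID depends only on the content, not on the file name or path.
id_a=$(git hash-object stylus-2021-05-18.json)
id_b=$(git hash-object stylus-2021-05-20.json)
echo "$id_a"
echo "$id_b"

# Appending a single newline changes the content, so the ID changes
# and Git would store a second, separate blob.
echo >> stylus-2021-05-20.json
id_c=$(git hash-object stylus-2021-05-20.json)
echo "$id_c"
```

The first two IDs match and the third differs, which is why backups that are "mostly the same" but not identical still each get their own blob until they are packed.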

On the other hand, if that happens too often, it's a sign that you're using version control in a way it's not really meant to be used. A given commit shouldn't contain any "historical data" or "archive"; that's what other commits, tags, and branches are for. The HEAD of any given branch should contain exactly (and only) the stuff that's currently relevant for that branch. That part is not technically required, though: it's just a convention for how git is usually used.
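For the backups in the question, following that convention could look roughly like the sketch below: each dated export is copied over a single stylus.json and committed, with the date taken from the file name. The file contents, the demo identity, and the noon timestamp are all assumptions made for illustration.

```shell
# Sketch: turn dated backup files into the history of one tracked file.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

# Hypothetical stand-ins for the real exports.
echo '{"version": 1}' > stylus-2021-05-18.json
echo '{"version": 2}' > stylus-2021-05-20.json

# The glob expands in sorted order, which is chronological for these names.
for f in stylus-*.json; do
    day=${f#stylus-}      # strip the prefix  -> 2021-05-18.json
    day=${day%.json}      # strip the suffix  -> 2021-05-18
    cp "$f" stylus.json
    git add stylus.json
    git -c user.name=demo -c user.email=demo@example.com \
        commit -q -m "stylus backup $day" --date "${day}T12:00:00"
done

# One commit per backup, all on the same path.
git log --format='%ad %s' --date=short -- stylus.json
```

The dated files stay untracked here; only stylus.json is in the repository, and git log or git diff can then compare any two backups.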

* Note that this reuse even extends to the directory level, i.e. if two directories contain identical sub-directories and files, they'll reference the same tree object. This makes storing "very similar" commits very efficient: effectively, only the differences have to be stored in addition. Note that commits are still snapshots and not diffs.
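The tree reuse can be checked the same way. In this sketch, the directory names a and b and the CSS file are invented for the example; git rev-parse HEAD:<path> prints the ID of the object stored at that path in the commit.

```shell
# Sketch: two directories with identical contents share one tree object.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

# Hypothetical layout: a/ and b/ hold the same sub-directory and file.
mkdir -p a/styles b/styles
echo 'body { color: red }' > a/styles/site.css
cp a/styles/site.css b/styles/site.css

git add .
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m 'two identical directories'

# Resolve each directory to the tree object recorded in the commit.
tree_a=$(git rev-parse HEAD:a)
tree_b=$(git rev-parse HEAD:b)
echo "$tree_a"
echo "$tree_b"
```

Both lines print the same ID, so the commit's root tree simply points at one shared tree twice, under two names.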