Git: the meaning of object 'size' returned by git verify-pack


The git verify-pack command has a -v option which outputs a lot of diagnostic information for each object found in the packfile. However, the value in the size field for a deltified object does not match my hazy expectations: I thought it would be something like the uncompressed 'true' size of the Git object. What's the actual meaning of this field?

Specifically, I have a Git packfile which contains a large object:

$ git cat-file -s 7daa9e75f86aa168748aef6c16c76b2acee1acca
61464170

(i.e. the object size is about 58 MiB, which is indeed what I see when I check the file out)

However, the line returned for this object by git verify-pack -v is this:

7daa9e75f86aa168748aef6c16c76b2acee1acca blob   568352 529608 770759074 1 27e47895a3822906eb31b05fe674ad470296c12e

(a full copy of the verify-pack output is available here)

As you can see (after reading the documentation for git verify-pack), this object is stored deltified, and the columns are defined as follows:

SHA1 type size size-in-packfile offset-in-packfile depth base-SHA1

So 'size' for this object is 568352 (and 'size-in-packfile' is 529608), but what does that mean, given that the actual object size is 61464170 bytes? The difference of two orders of magnitude must mean that the size figure refers just to the delta?


There are 3 answers

Answer by torek (best answer)

First, see this diagram. Then, based on the source (builtin/index-pack.c), the value in the fourth field is:

(unsigned long)(obj[1].idx.offset - obj->idx.offset)

which is the raw packed size: obj[1] is the next object after this one (or the pack trailer), so the difference between the two offsets is the number of bytes this entry occupies in the packfile. Because the stored item is deltified, that is the size of the deflated delta data plus header overhead. The value in the third field is obj->size (the first size value from the overhead area).
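
The subtraction above can be illustrated with a small sketch. It is not Git's code: the function name packed_sizes and the sample offsets are hypothetical, and it assumes the caller already knows every entry's offset plus the offset at which the pack trailer begins.

```python
def packed_sizes(offsets, trailer_offset):
    """Return {offset: bytes occupied in the pack} for each entry.

    Mirrors the idea of obj[1].idx.offset - obj->idx.offset: an entry
    ends where the next one (or the trailer) begins.
    """
    ordered = sorted(offsets)
    boundaries = ordered[1:] + [trailer_offset]
    return {start: end - start for start, end in zip(ordered, boundaries)}


# Hypothetical layout: the entry from the question at offset 770759074,
# followed by a made-up neighbour that starts 529608 bytes later.
sizes = packed_sizes([770759074, 771288682], trailer_offset=771290000)
print(sizes[770759074])  # 529608, the fourth field in the question
```

Note that nothing here looks inside the entry: the figure is purely a distance between offsets, which is why it includes the per-object header as well as the deflated data.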

(To get the actual data, or even its size, you have to inflate the stream a bit and then look at the delta headers. The object's "true" size is encoded in the header as the second size value. See get_size_from_delta in sha1_file.c, get_delta_hdr_size in delta.h, and the "offset encoding" in the diagram.)
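
As a rough illustration of that last step, the sketch below decodes the two sizes at the start of an already-inflated delta, using the same 7-bits-per-byte, least-significant-group-first scheme as get_delta_hdr_size. The Python function names are hypothetical, the base size 61000000 is invented for the example, and inflating the pack entry itself is left out.

```python
def read_delta_size(buf, pos=0):
    """Decode one variable-length size; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift   # low 7 bits carry data
        shift += 7
        if not byte & 0x80:               # high bit clear: last byte
            return value, pos


def encode_delta_size(n):
    """Inverse of read_delta_size, used here only to build sample input."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)       # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def delta_header_sizes(inflated_delta):
    """Return (base object size, reconstructed object size)."""
    base_size, pos = read_delta_size(inflated_delta)
    result_size, _ = read_delta_size(inflated_delta, pos)
    return base_size, result_size


# Made-up delta header: invented base size, then the size from the question.
sample = encode_delta_size(61000000) + encode_delta_size(61464170)
print(delta_header_sizes(sample))  # (61000000, 61464170)
```

The second number is the one that git cat-file -s reports; the delta instructions that follow the header are ignored here.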


Edit to add: OK, re-reading the question, you're asking more about why the third field is so much smaller than the object's real size. That is because the third field is the inflated (but not de-delta-ed) size, i.e. the size of the delta itself. So: size-in-packfile (field 4) is measured after deflating, and also includes a bit of header overhead; size (field 3) is the size of the delta data once inflated; and the size of the ultimate file, after undoing delta compression, is stored in the delta header, whose bytes are counted (in deflated form) within size-in-packfile (field 4).
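
To keep the three sizes apart, here is a small sketch that splits a verify-pack -v line into named fields and sets it beside the git cat-file -s figure from the question. PackEntry and parse_verify_pack_line are hypothetical names, and the parser assumes the column layout quoted in the question (lines for non-delta objects simply lack the last two columns).

```python
from typing import NamedTuple, Optional


class PackEntry(NamedTuple):
    sha1: str
    type: str
    size: int                  # field 3: inflated size of what is stored
    size_in_pack: int          # field 4: bytes occupied in the packfile
    offset: int                # field 5: where the entry starts
    depth: Optional[int]       # only present for deltified objects
    base_sha1: Optional[str]   # only present for deltified objects


def parse_verify_pack_line(line):
    parts = line.split()
    depth = int(parts[5]) if len(parts) > 5 else None
    base = parts[6] if len(parts) > 6 else None
    return PackEntry(parts[0], parts[1], int(parts[2]), int(parts[3]),
                     int(parts[4]), depth, base)


line = ("7daa9e75f86aa168748aef6c16c76b2acee1acca blob   568352 529608 "
        "770759074 1 27e47895a3822906eb31b05fe674ad470296c12e")
entry = parse_verify_pack_line(line)
full_size = 61464170  # from git cat-file -s in the question

print(entry.size_in_pack)        # 529608: deflated delta plus header
print(entry.size)                # 568352: the delta once inflated
print(full_size // entry.size)   # 108: the real object is ~108x the delta
```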

Extra edit: the offset-in-packfile (field 5) is obj->idx.offset. That's where you have to lseek() in the pack file to start reading the object (I think, I've got some confusing code in front of me for handling OBJ_OFS_DELTA too :-) ).

Answer by Philip Oakley

There was a recent patch series discussion, "[RFC/PATCH 0/4] cat-file --batch-disk-sizes", which included "[PATCH 07/10] cat-file: add %(objectsize:disk) format atom". It may be of interest if you are into compiling from source.

Answer by VonC

With Git 2.21 (Q1 2019), the meaning of "objectsize" is clarified, as the "--format=<placeholder>" option of for-each-ref, branch and tag learned to show a few more traits of objects that can be learned by the object_info API.

See commit 59012fe, commit 5610d9f, commit 33311fa, commit f4ee22b, commit 5305a55, commit 1867ce6 (24 Dec 2018) by Olga Telezhnaya (telezhnaya).
(Merged by Junio C Hamano -- gitster -- in commit 55574bd, 18 Jan 2019)

ref-filter: add objectsize:disk option

Add new formatting option objectsize:disk to know exact size that object takes up on disk.

The git for-each-ref man page now states:

objectsize:

The size of the object (the same as 'git cat-file -s' reports).
Append :disk to get the size, in bytes, that the object takes up on disk.

deltabase:

This expands to the object name of the delta base for the given object, if it is stored as a delta.
Otherwise, it expands to the null object name (all zeroes).

Caveats:

Note that the sizes of objects on disk are reported accurately, but care should be taken in drawing conclusions about which refs or objects are responsible for disk usage.
The size of a packed non-delta object may be much larger than the size of objects which delta against it, but the choice of which object is the base and which is the delta is arbitrary and is subject to change during a repack.

Note also that multiple copies of an object may be present in the object database; in this case, it is undefined which copy's size or delta base will be reported.

So you can compare those values with the ones reported by git verify-pack -v, as git for-each-ref is now (5+ years later: 2013-2018) able to display more data.
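
One way to make that comparison is to ask git for-each-ref for both sizes and the delta base in one go. The sketch below assumes Git 2.21 or later on PATH and uses only the objectname, objectsize, objectsize:disk, deltabase and refname atoms; the Python helper names are hypothetical.

```python
import subprocess

FORMAT = "%(objectname) %(objectsize) %(objectsize:disk) %(deltabase) %(refname)"


def parse_ref_sizes(output):
    """Turn for-each-ref output in FORMAT into a list of dicts."""
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue
        oid, size, disk, base, ref = line.split(" ", 4)
        rows.append({
            "ref": ref,
            "oid": oid,
            "size": int(size),   # same figure as git cat-file -s
            "disk": int(disk),   # bytes the object takes up on disk
            # An all-zero name means the object is not stored as a delta.
            "delta_base": None if set(base) == {"0"} else base,
        })
    return rows


def ref_sizes(repo="."):
    """Run git for-each-ref in the given repository and parse the result."""
    out = subprocess.run(
        ["git", "-C", repo, "for-each-ref", "--format=" + FORMAT],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_ref_sizes(out)
```

Refs point at commits and tags, so this reports on those objects rather than on blobs such as the one in the question. For an arbitrary object, git cat-file --batch-check accepts the same %(objectsize:disk) and %(deltabase) atoms.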


Git 2.44 (Q1 2024) illustrates that on-disk object sizes should not be hard-coded, since they depend on the compression library and compression level in use.

See commit fbc6526 (12 Dec 2023) by René Scharfe (rscharfe).
(Merged by Junio C Hamano -- gitster -- in commit 6db745e, 27 Dec 2023)

t6300: avoid hard-coding object sizes

Reported-by: Ondrej Pohorelsky
Signed-off-by: René Scharfe

f4ee22b ("ref-filter: add tests for objectsize:disk", 2018-12-24, Git v2.21.0-rc0 -- merge listed in batch #3) hard-coded the expected object sizes.
Coincidentally the size of commit and tag is the same with zlib at the default compression level.

1f5f8f3 ("t6300: abstract away SHA-1-specific constants", 2020-02-22, Git v2.27.0-rc0 -- merge listed in batch #2) encoded the sizes as a single value, which coincidentally also works with sha256.

Different compression libraries like zlib-ng may arrive at different values.
Get them from the file system instead of hard-coding them to make switching the compression library (or changing the compression level) easier.
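
The point about compression settings can be seen without Git at all. The sketch below builds the bytes of a loose blob object (the header "blob <size>", a NUL byte, then the content) and deflates them at two zlib levels; the helper names and the sample content are hypothetical. The logical size never changes, while the deflated size is whatever the compressor happens to produce.

```python
import hashlib
import zlib


def loose_blob_bytes(content):
    """Bytes that get hashed and deflated for a loose blob object."""
    return b"blob " + str(len(content)).encode() + b"\0" + content


def deflated_sizes(content, levels=(1, 9)):
    """Map each zlib level to the length of the deflated object."""
    raw = loose_blob_bytes(content)
    return {level: len(zlib.compress(raw, level)) for level in levels}


content = b"hello\n"
raw = loose_blob_bytes(content)
print(hashlib.sha1(raw).hexdigest())  # the blob's object name under SHA-1
print(len(content))                   # 6, the size cat-file -s would report

# Deflated sizes for a larger, made-up blob; these depend on the zlib
# build and level, so no particular values should be expected.
print(deflated_sizes(b"some sample text\n" * 1000))
```

A test that needs the on-disk figure can therefore read the file length (for example with os.path.getsize on the loose object file) instead of baking in a constant.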