Sparse checkouts - how do they work?


I've been looking for a way to clone only a subdirectory of one of my projects. Naturally, I found this answer. It is well designed, and its step-by-step solution explains clearly how to implement this. At the end, it refers to the documentation, which states:

"Sparse checkout" allows to sparsely populate working directory. It uses skip-worktree bit (see git-update-index(1)) to tell Git whether a file on working directory is worth looking at.

The problem is that I can't understand what this means. One thing I've learned about Git is that it is a fantastic tool, but knowing what happens behind the scenes before implementing something really helps in the long run.

So, here is the question:

How does a sparse checkout work, and what is the output?


There are 2 answers

ElpieKay

A commit points to a tree. A tree describes a directory and points to other trees, blobs, or commits: the other trees are subdirectories, the blobs are files, and the commits are submodules. Leaving submodules aside, git checkout can be seen as "copying" all of these subdirectories and files from the database (inside the hidden .git directory) to the work area (by default, the directory that contains .git). A sparse checkout copies only some of the subdirectories or files, so it saves space in the work area. It does not reduce the space occupied by the database.
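The "copying" picture above can be sketched as a toy model in Python. This is only an illustration, not how Git is implemented: the nested-dict tree, the `checkout` function, and the `wanted` filter are all hypothetical names made up for this example.

```python
# Toy model: a tree is a dict. A nested dict is a subdirectory (another tree),
# a string is the content of a file (a blob). Submodules are ignored.
def checkout(tree, wanted=lambda path: True, prefix=""):
    """Return {path: content} for every blob whose path passes `wanted`."""
    work_area = {}
    for name, entry in sorted(tree.items()):
        path = prefix + name
        if isinstance(entry, dict):
            # Recurse into the subdirectory.
            work_area.update(checkout(entry, wanted, path + "/"))
        elif wanted(path):
            # "Copy" the file from the database to the work area.
            work_area[path] = entry
    return work_area


repo_tree = {
    "README.md": "readme",
    "client": {
        "android": {"app.kt": "android app"},
        "ios": {"app.swift": "ios app"},
    },
    "service": {"main.go": "service"},
}

full = checkout(repo_tree)
sparse = checkout(repo_tree, wanted=lambda p: p.startswith("client/android/"))
```

Here `full` holds every file while `sparse` holds only the one under client/android/. In both cases `repo_tree` (the "database") is untouched, which mirrors the point that only the work area shrinks.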

VonC

The new git sparse-checkout command (introduced in Git 2.25, Q1 2020) comes from a Microsoft contribution based on its Scalar project:

At Microsoft, we support the Windows OS repository using VFS for Git (formerly GVFS). VFS for Git uses a virtualized filesystem to bypass many assumptions about repository size, enabling the Windows developers to use Git at a scale previously thought impossible.

While supporting VFS for Git, we identified performance bottlenecks using a custom trace system and collecting user feedback.
We made several contributions to the Git client, including the commit-graph file and improvements to git push and sparse-checkout.

Building on these contributions and many other recent improvements to Git, we began a project to support very large repositories without needing a virtualized filesystem.

Hence the Scalar project, which has transitioned (mid 2021) from a modified version of VFS for Git into a thin shell around core Git features.
The Scalar executable has now been ported to be included in the microsoft/git fork.

It is integrated with Git for Windows 2.38 (Oct. 2022).

The 2020 article "Bring your monorepo down to size with sparse-checkout" from Derrick Stolee explains how sparse checkout is managed nowadays (2020+):

Using sparse-checkout with an existing repository

To restrict your working directory to a set of directories, run the following commands:

git sparse-checkout init --cone
git sparse-checkout set <dir1> <dir2> ...

If you get stuck, run git sparse-checkout disable to return to a full working directory.

The init subcommand sets the necessary Git config options and fills the sparse-checkout file with patterns that mean "only match files in the root directory".

The set subcommand modifies the sparse-checkout file with patterns to match the files in the given directories.
Further, any files that are immediately in a directory that’s a parent to a specified directory are also included.

For example, if you ran git sparse-checkout set A/B, then Git would include files with names A/B/C.txt (immediate child of A/B) and A/D.txt (immediate sibling of A/B) as well as E.txt (immediate sibling of A).
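The cone-mode rule described above (everything under a listed directory, plus the files sitting directly in each of its parent directories) can be sketched in Python. This is a simplified model of the matching rule as the article states it, not Git's actual pattern engine; `in_cone` is a hypothetical helper name.

```python
import posixpath


def in_cone(path, cone_dirs):
    """Return True if the file `path` would be present in the working directory."""
    parent = posixpath.dirname(path)  # "" for files in the root directory
    if parent == "":
        # Files in the root directory are always included.
        return True
    for d in cone_dirs:
        d = d.strip("/")
        # Rule 1: any file under a listed directory, at any depth.
        if parent == d or parent.startswith(d + "/"):
            return True
        # Rule 2: files located directly in a parent of a listed directory.
        if d.startswith(parent + "/"):
            return True
    return False


cone = ["A/B"]
results = {p: in_cone(p, cone) for p in
           ["A/B/C.txt", "A/D.txt", "E.txt", "A/X/Y.txt"]}
```

With `A/B` as the only cone directory, the three files from the article's example are included, while a file such as A/X/Y.txt (inside a sibling directory of A/B) is not.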

For instance:

The team building the Android app can usually get away with only the files in client/android and run all integration testing with the currently-deployed services.

The Android team needs a much smaller set of files as they work.
This means they can use the git sparse-checkout set command to restrict to that directory:

$ git sparse-checkout set client/android

$ ls
bootstrap.sh*  client/  LICENSE.md  README.md

$ ls client/
android/

$ find . -type f | wc -l
62

https://i2.wp.com/user-images.githubusercontent.com/121322/72286599-50af8e00-35fa-11ea-9025-d7cbb730192c.png?ssl=1


git sparse-checkout uses a sparse index since Git 2.32 (Q2 2021).
See the article "Make your monorepo feel small with Git’s sparse index" from Derrick Stolee.

The sparse index differs from a normal “full” index in one aspect: it can store directory paths with the object ID for its tree object.

This is in addition to the file paths which are paired with blob objects.

Since the cone mode sparse-checkout patterns match on a directory level, we can determine that an entire directory is out of the sparse-checkout cone and replace all of its contained file paths with a single directory path.

https://github.blog/wp-content/uploads/2021/11/Fig-6-sparse-index.png?resize=432%2C314?w=432

The sparse directory entries correspond to directories that are just outside of the sparse-checkout definition.
These directories also have a cache-tree node whose range is only one entry: that sparse directory entry.
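To make the idea of collapsing out-of-cone directories concrete, here is a rough Python sketch. It assumes a simplified index modelled as a `{path: object_id}` dict and a separate, hypothetical `tree_ids` lookup for directory tree IDs. The real index format and cache-tree are considerably more involved.

```python
def dir_in_or_above_cone(directory, cone_dirs):
    """True if `directory` is inside a cone directory or is a parent of one."""
    for d in cone_dirs:
        if (directory == d
                or directory.startswith(d + "/")
                or d.startswith(directory + "/")):
            return True
    return False


def to_sparse_index(entries, tree_ids, cone_dirs):
    """Replace all file entries under an out-of-cone directory with one directory entry."""
    sparse = {}
    for path, blob_id in entries.items():
        parts = path.split("/")[:-1]  # directory components of the file path
        for depth in range(1, len(parts) + 1):
            directory = "/".join(parts[:depth])
            if not dir_in_or_above_cone(directory, cone_dirs):
                # Topmost directory just outside the cone: a single entry,
                # paired with its tree ID, stands in for every file below it.
                sparse[directory + "/"] = tree_ids[directory]
                break
        else:
            # No out-of-cone directory on the way down: keep the file entry.
            sparse[path] = blob_id
    return sparse


full_index = {
    "README.md": "b1",
    "client/android/app.kt": "b2",
    "client/build.sh": "b3",
    "client/ios/a.swift": "b4",
    "client/ios/b.swift": "b5",
    "service/x/main.go": "b6",
    "service/y/util.go": "b7",
}
tree_ids = {"client/ios": "t1", "service": "t2"}  # made-up object IDs
sparse = to_sparse_index(full_index, tree_ids, ["client/android"])
```

In this toy example the seven file entries shrink to five: the two files under client/ios/ become a single `client/ios/` entry, and everything under service/ becomes a single `service/` entry.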


With Git 2.36 (Q2 2022), "git update-index", "git checkout-index", and "git clean" are taught to work better with the sparse checkout feature.

See commit b9ca5e2, commit c35e9f5, commit e015d4d, commit 35682ad, commit 88078f5, commit b553ef6, commit 1e9e10e, commit 1624333, commit bb01b26 (11 Jan 2022) by Victoria Dye (vdye).
(Merged by Junio C Hamano -- gitster -- in commit 2f45f3e, 17 Feb 2022)

update-index: integrate with sparse index

Signed-off-by: Victoria Dye
Reviewed-by: Elijah Newren

Enable use of the sparse index with update-index.
Most variations of update-index work without explicitly expanding the index or making any other updates in or outside of update-index.c.

The one usage requiring additional changes is --cacheinfo; if a file inside a sparse directory was specified, the index would not be expanded until after the cache tree is invalidated, leading to a mismatch between the index and cache tree.
This scenario is handled by rearranging add_index_entry_with_check, allowing index_name_stage_pos to expand the index before attempting to invalidate the relevant cache tree path, avoiding cache tree/index corruption.


With Git 2.36 (Q2 2022), the git sparse-checkout cone patterns are better controlled.

See commit 8dd7c47, commit 4ce5043, commit bb8b5e9, commit d526b4d, commit f748012 (19 Feb 2022) by Elijah Newren (newren).
(Merged by Junio C Hamano -- gitster -- in commit 9671764, 06 Mar 2022)

sparse-checkout: reject arguments in cone-mode that look like patterns

Reviewed-by: Derrick Stolee
Signed-off-by: Elijah Newren

In sparse-checkout add/set under cone mode, the arguments passed are supposed to be directories rather than gitignore-style patterns.

However, given the amount of effort spent in the manual discussing patterns, it is easy for users to assume they need to pass patterns such as

/foo/*

or

!/bar/*/

or perhaps they really do ignore the directory rule and specify a random gitignore-style pattern like

*.c

To help catch such mistakes, throw an error if any of the positional arguments:

* starts with any of '/!'
* contains any of '*?[]'  

Inform users they can pass --skip-checks if they have a directory that really does have such special characters in its name.
(We exclude '\' because of sparse-checkout's special handling of backslashes; see the MINGW test in t1091.46.)
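The check described in this commit message can be approximated in a few lines of Python. This is a sketch of the stated rule only; Git implements it in C, and the function names and error text below are assumptions made for illustration.

```python
def looks_like_pattern(arg):
    """True if `arg` starts with '/' or '!' or contains any of '*?[]'."""
    return arg[:1] in ("/", "!") or any(ch in "*?[]" for ch in arg)


def check_cone_args(args, skip_checks=False):
    """Reject pattern-like arguments unless the caller opts out (as with --skip-checks)."""
    if not skip_checks:
        for arg in args:
            if looks_like_pattern(arg):
                # Hypothetical message; the real wording in Git differs.
                raise ValueError(f"'{arg}' looks like a pattern, not a directory")
    return list(args)


checks = {arg: looks_like_pattern(arg) for arg in
          ["/foo/*", "!/bar/*/", "*.c", "client/android", "dir\\name"]}
```

The three patterns from the commit message are flagged, a plain directory such as client/android passes, and a name containing a backslash is deliberately not flagged, matching the exclusion noted above.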

Still with Git 2.36 (Q2 2022), git sparse-checkout received further polishing.

See commit 8dd7c47, commit 4ce5043, commit bb8b5e9, commit d526b4d, commit f748012 (19 Feb 2022) by Elijah Newren (newren).
(Merged by Junio C Hamano -- gitster -- in commit 9671764, 06 Mar 2022)

sparse-checkout: pay attention to prefix for {set, add}

Helped-by: Junio Hamano
Reviewed-by: Derrick Stolee
Signed-off-by: Elijah Newren

In cone mode, non-option arguments to set & add are clearly paths, and as such, we should pay attention to prefix.

In non-cone mode, it is not clear that folks intend to provide paths since the inputs are gitignore-style patterns.
Paying attention to prefix would prevent folks from doing things like

git sparse-checkout add /.gitattributes
git sparse-checkout add '/toplevel-dir/*'

In fact, the former will result in

fatal: '/.gitattributes' is outside repository...

while the latter will result in:

fatal: Invalid path '/toplevel-dir': No such file or directory

despite the fact that both are valid gitignore-style patterns that would select real files if added to the sparse-checkout file.

This might lead people to just use the path without the leading slash, potentially resulting in them grabbing files with the same name throughout the directory hierarchy contrary to their expectations.
See also this thread and this one.

Adding prefix seems to just be fraught with error; so for now simply throw an error in non-cone mode when sparse-checkout set/add are run from a subdirectory.
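The behaviour this commit message settles on can be illustrated with a small Python model. It assumes `prefix` is the subdirectory (relative to the repository root) from which the command is run, or an empty string at the top level. `resolve_args` and its error text are hypothetical, not Git's API.

```python
import posixpath


def resolve_args(args, prefix, cone_mode):
    """Model how set/add treat their arguments depending on mode and prefix."""
    if not cone_mode:
        if prefix:
            # Non-cone mode from a subdirectory: refuse instead of guessing
            # whether a gitignore-style pattern was meant as a path.
            raise ValueError("run set/add from the top level in non-cone mode")
        return list(args)  # patterns are passed through untouched
    # Cone mode: arguments are paths, so resolve them against the prefix.
    return [posixpath.normpath(posixpath.join(prefix, a)) for a in args]


from_subdir = resolve_args(["android", "../service"], "client", cone_mode=True)
from_top = resolve_args(["/toplevel-dir/*"], "", cone_mode=False)
```

In this model, running the command from client/ in cone mode turns `android` into client/android and `../service` into service, while a non-cone pattern is accepted unchanged only when run from the top level.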