Undo git checkout... with a twist

First of all, let me make one thing clear: although there are a LOT of questions about undoing a git checkout, this is not (at least as far as I can assess) a duplicate question.

Now let me explain my use-case: I am using the sparse-checkout feature to have a working copy which does not contain all the files in the central remote repo.

Now let's suppose I want to add a file to my working copy, but I make a mistake and check out the wrong file.

I want to revert my working copy as if that file was never checked out.

That is: I want to remove that file from my working copy, but I do not want that file to be removed from the remote repo. I have been looking everywhere but still have not found a way to do what I want.

torek On BEST ANSWER

You literally don't have to do anything. You can do something but it's not required, and if the file you accidentally extracted isn't creating any problems, you should probably just leave it there.

This may require a bit of explaining.

"I am using the sparse-checkout feature to have a working copy which does not contain all the files in the central remote repo."

While your working copy can omit some files, your repository cannot omit these files. So you already have them. The only thing the sparse checkout option does is keep them from showing up in your working tree.
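To see this for yourself, here is a minimal sketch in a throwaway repository. The file names (wanted.txt, unwanted.txt), the temporary directory, and the placeholder commit identity are all made up for illustration, and the sparse patterns use the old-style core.sparseCheckout setup rather than whatever your own configuration looks like.

```shell
set -eu
# Hypothetical scratch repository; nothing here touches your real repo.
repo=$(mktemp -d)
cd "$repo"
git init -q .
echo "visible" > wanted.txt
echo "hidden" > unwanted.txt
git add wanted.txt unwanted.txt
git -c user.name=Example -c user.email=example@example.invalid commit -q -m "two files"

# Old-style sparse checkout: include everything except unwanted.txt.
git config core.sparseCheckout true
mkdir -p .git/info
printf '/*\n!/unwanted.txt\n' > .git/info/sparse-checkout
git read-tree -mu HEAD

ls                                  # working tree: unwanted.txt is gone
git ls-files                        # index: still lists both files
git cat-file -p HEAD:unwanted.txt   # repository: the content is still stored
```

The file disappears from the working tree only; the index and the commit both still have it.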

You might already know this, but let's review some items about Git to make sure that we have a shared vocabulary:

  • A Git repository consists, in essence, of two databases. The (usually much larger) main database holds commits and other supporting Git objects. The second, usually much smaller, database holds names—branch names, tag names, and other such names—and, for each name, one corresponding object-hash-ID. For branch names, these hash IDs are invariably commit hash IDs; other names can sometimes hold hash IDs of some of the other internal Git objects.

    Both databases are simple key-value stores. Each has an ad hoc, Git-specific implementation. An off-the-shelf database would work too, though it would be slower and harder to use and manage (or at least, that's the excuse for using a private one).

    All of the objects—including all of the commits—inside the main database are entirely read-only. This is a consequence of the fact that the keys are hash IDs, and the hash IDs are the result of applying a cryptographic checksum algorithm to the contents (the value stored under that key). Git does a verification when extracting the content: the content must hash back to the key. This detects (but cannot correct) any database corruption.

  • Commits, then, are objects in the main database. They have two parts: a snapshot (of all files, as of the form those files had at the time the snapshot was made) and some metadata. We'll skip all the details here as they're irrelevant, but the effect of this is that each commit stores every file. That includes files that you deliberately did not check out via sparse checkout.

  • Git makes new commits from what Git calls the index, or the staging area, or the cache. The last term is rare these days and found mostly in the --cached flag arguments to various Git commands. These three names describe an intermediate data structure that Git uses for multiple purposes:

    • to keep tabs on your working tree (the cache aspect), and
    • to store the file names, modes, and blob hash IDs for the proposed next snapshot (the staging area aspect).

    There's a third purpose that comes up when the index gets expanded during a conflicted merge, but we'll skip over it here as this is irrelevant to the issue at hand.

  • Finally, in your working tree, Git extracts files from a commit. Normally Git extracts all the files from the commit. The actual practice here is that Git first copies all the files to Git's index. This creates space for the cache part, and creates the name-and-mode part and stores a blob object hash ID to represent the file's actual content.
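If you'd like to poke at the two databases described above, something like the following sketch should do. The repository, file names, and commit identity are placeholders invented for the example.

```shell
set -eu
# Hypothetical scratch repository for illustration only.
repo=$(mktemp -d)
cd "$repo"
git init -q .
printf 'hello\n' > a.txt
cp a.txt b.txt                      # same content under a second name
git add a.txt b.txt
git -c user.name=Example -c user.email=example@example.invalid commit -q -m "first"

# Names database: each name maps to one object hash ID.
git for-each-ref --format='%(refname) %(objecttype) %(objectname)'

# Objects database: keys are hashes of the content, so identical
# content is stored once and found under the same key.
id_a=$(git rev-parse HEAD:a.txt)
id_b=$(git rev-parse HEAD:b.txt)
echo "a.txt -> $id_a"
echo "b.txt -> $id_b"
git cat-file -t "$id_a"             # object type of the file content
git cat-file -t HEAD                # object type of the commit
```

Both file names point at the same blob hash ID, which is the de-duplication that makes "every commit stores every file" affordable.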

Git needs this index to hold all the files from the commit, and that's true even when using sparse checkout. So Git's index always holds every file. This takes relatively little space since the actual contents are stored as blob objects in the big database. However, if you're not using sparse checkout, Git then expands every index-entry file into a working tree copy that's an actual, readable and writable, file, not just some internal blob object in the database.

We need the real files to get any actual work done. If all we need to do is keep the files around for use in git diff and to go into new commits and such, and we don't have to actually read and write them, we can keep them as internal blob objects, so that's what Git does with all the commits that aren't checked out.
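A quick way to look at the index entries themselves is git ls-files --stage, which lists mode, blob hash ID, stage number, and name for each entry. The sketch below uses an invented scratch repository and invented file names; it also removes one working tree file to show that the index entry and its blob are unaffected.

```shell
set -eu
# Hypothetical scratch repository for illustration only.
repo=$(mktemp -d)
cd "$repo"
git init -q .
echo one > one.txt
echo two > two.txt
git add one.txt two.txt

git ls-files --stage        # one line per index entry: mode, hash, stage, path

rm two.txt                  # remove only the working tree copy
git ls-files --stage        # the index entry for two.txt is still listed
git cat-file -p :two.txt    # ":path" reads the blob recorded in the index
```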

So, this is where sparse checkout enters the picture. We just tell Git: Oh, by the way, when you get around to extracting all the files from the index, skip some of them. To tell this to Git, at the low level interface between the index and the working tree, we have Git set one bit in the cache data. This bit is called the skip-worktree bit, and we can explicitly set or clear it with:

git update-index --skip-worktree path/to/file

or:

git update-index --no-skip-worktree path/to/file

Note that this has no effect on any object stored in the big database, and no effect on any file in our working tree (whether or not the file is actually present there). It simply sets or clears the bit on the index entry. For this to work, the index entry has to exist.
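You can watch the bit flip with git ls-files -v, which tags ordinary cached entries with H and skip-worktree entries with S. This is just a sketch in a scratch repository with a made-up file name.

```shell
set -eu
# Hypothetical scratch repository for illustration only.
repo=$(mktemp -d)
cd "$repo"
git init -q .
echo data > file.txt
git add file.txt

before=$(git ls-files -v file.txt)
git update-index --skip-worktree file.txt
during=$(git ls-files -v file.txt)
git update-index --no-skip-worktree file.txt
after=$(git ls-files -v file.txt)

echo "before: $before"      # H file.txt
echo "set:    $during"      # S file.txt
echo "after:  $after"       # H file.txt
```

The working tree file is untouched throughout; only the flag on the index entry changes.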

We could, then, implement sparse checkout by:

  • picking a commit;
  • reading that commit into the index, without creating a working tree yet;
  • setting all the skip-worktree bits we like; and
  • checking out the index to our working tree.

There are low level commands in Git that will do exactly this. The reason we have the sparse checkout feature, rather than using those low level commands, is that doing this for every file is a royal pain in the ass. So the sparse checkout feature just makes git checkout do this automatically: we tell Git which files should appear in our working tree, and which ones should go into Git's index but have the skip-worktree bit set.
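For the curious, here is one way the by-hand version might look. Treat it as a sketch under assumptions, not as what git checkout does internally: the repositories and file names are invented, and the last step explicitly checks out only the entries whose skip-worktree bit is clear (tag H in git ls-files -t), since how git checkout-index --all treats the bit has changed between Git versions.

```shell
set -eu
# Hypothetical source repository with two files.
work=$(mktemp -d)
cd "$work"
git init -q src
cd src
echo "keep me" > wanted.txt
echo "skip me" > unwanted.txt
git add wanted.txt unwanted.txt
git -c user.name=Example -c user.email=example@example.invalid commit -q -m "two files"
cd ..

# 1. Pick a commit, but do not create a working tree yet.
git clone -q --no-checkout src sparse
cd sparse

# 2. Read that commit into the index.
git read-tree HEAD

# 3. Set the skip-worktree bits we like.
git update-index --skip-worktree unwanted.txt

# 4. Check out the index entries that do not have the bit set.
git ls-files -t | sed -n 's/^H //p' | while IFS= read -r path; do
    git checkout-index -u -- "$path"
done

ls                      # only wanted.txt was extracted
git status --short      # the missing unwanted.txt is not reported
```

Doing step 3 one path at a time is exactly the tedium the sparse checkout feature exists to remove.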

Now let's go back to git commit and make a note about how it really works. When we run git commit, we're telling Git to make a new commit. Git does not use our working tree at this time. We can run git status first and get a listing, or we can let git commit run git status for us and populate our commit message template with the result (it does that by default; we have to explicitly suppress it if we don't want that). Either way, the commit isn't built from our working tree.[1] It comes from the index, which already holds every file, including those not extracted to our working tree.

What this means is that when you work with a sparse checkout, you still work with every file. It's just that all the files are in Git's index, where you (and programs) cannot see or change them. Your working tree omits the expanded, normal-file form of some files, so that you can't see or change them. It holds the expanded, normal-file form of other files, so that you can see and change them. But if you do change them, you still need to run git add to copy them back into the index.[2] Git is, after all, going to build the next commit from what's in the index, not what is in your working tree!

A good way to think about this is the index holds your proposed next commit. Since the index has all files (as taken from the current commit), it doesn't matter what's in your working tree. That's why you don't have to do anything. You can leave the working tree file there, even if you plan to do nothing with it. It's going to be in new commits whether or not it's there in your working tree as long as it is in Git's index. So don't bother removing it.
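If the "commit comes from the index" part sounds abstract, this small sketch may help. It stages one version of a file, edits the working tree copy without running git add, and commits. The repository, file name, and commit identity are placeholders.

```shell
set -eu
# Hypothetical scratch repository for illustration only.
repo=$(mktemp -d)
cd "$repo"
git init -q .
echo "staged version" > note.txt
git add note.txt                    # index now holds "staged version"
echo "unstaged edit" > note.txt     # working tree differs; no git add

git -c user.name=Example -c user.email=example@example.invalid commit -q -m "commit what is staged"

git show HEAD:note.txt              # what the commit recorded
cat note.txt                        # what the working tree holds
```

The commit records the staged content, not the later edit, which is why a file that is absent from (or extra in) the working tree does not change what gets committed.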


[1] When using git commit --only or git commit --include with pathspecs, the commit code first makes an extra temporary index, then updates the temporary index, as if via git add, and then makes the new commit from the temporary index. It then adjusts the real index if and only if the commit succeeds. We'll skip all these details, but note that even in these modes, the commit is built from an index. It's just that instead of using "the" index, Git is using a temporary auxiliary index.

[2] Not that it really matters, but the git add step works by squishing the working tree copy back down into an internal Git object, producing a blob hash ID. This is automatically immediately de-duplicated against any existing matching blob, so that the repository database only grows if the content has never been seen before. Git then stuffs the hash ID into the index, so that the index is now updated.


What if the working tree file is in your way?

Suppose that the working tree file is so big that it's filling up a small (SSD?) drive. You don't need it and it is in the way. How can you remove it now, from your sparse checkout, without removing it from future commits?

If you read through the mechanism description above, the answer is obvious—at least, the high level answer; the set of Git commands might still be a little obscure (though I did mention them). You just need to remove the copy of the file from your working tree. This part is entirely straightforward. You don't need any special commands. The regular everyday computer command to remove a file, whether that's rm or DEL or whatever, works, because your working tree is a regular everyday set of files. So just rm bigfile or whatever.

Once you do, however, git status will start whining about it: it will say that the working tree copy of the file is gone. Worse, a blanket git add operation might remove the index copy,[3] so from this point forward you may need to be careful with git add commands. This is where you want to use a Git command:

git update-index --skip-worktree bigfile

This sets that skip-worktree bit that I mentioned earlier, that the sparse checkout code uses. The skip-worktree bit simply tells various Git commands, including git status and blanket en-masse git add commands, that the working tree copy, or lack thereof, should be completely ignored. Just keep whatever is in the index, in the index.

Hence, those two commands—the everyday "remove a file" one, and the git update-index one with the --skip-worktree flag—suffice to get rid of the file from your working tree without affecting the copy in Git's index. The index copy will go into future commits, as it should. Remember that the commits are de-duplicating files, so this is just re-using the copy from earlier commits and takes essentially no space.

The choice is thus yours: do nothing at all (because nothing needs to be done), or remove the file without using a Git command, and if git status gets complain-y, set the skip-worktree bit.
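Put together, the "remove it and set the bit" route might look like the sketch below. Everything in it is invented for the demonstration (a scratch repository, a tiny stand-in for bigfile, a second file notes.txt, and a placeholder commit identity), so adapt the paths to your own situation.

```shell
set -eu
# Hypothetical scratch repository for illustration only.
repo=$(mktemp -d)
cd "$repo"
git init -q .
echo "pretend this is huge" > bigfile
echo "small" > notes.txt
git add bigfile notes.txt
git -c user.name=Example -c user.email=example@example.invalid commit -q -m "initial"

rm bigfile                                  # ordinary file removal
before_bit=$(git status --short)            # Git notices the missing file
git update-index --skip-worktree bigfile    # tell Git to ignore that
after_bit=$(git status --short)             # nothing to report any more

# Later work: a blanket add no longer stages the deletion.
echo "more" >> notes.txt
git add -A
git -c user.name=Example -c user.email=example@example.invalid commit -q -m "edit notes"

echo "status before bit: '$before_bit'"
echo "status after bit:  '$after_bit'"
git ls-tree --name-only HEAD                # bigfile is still in the new commit
```

To get the file back later, clearing the bit with git update-index --no-skip-worktree bigfile and then running git checkout -- bigfile should restore the working tree copy.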


[3] To make this make sense, think of git add as meaning: make the index copy of some file match the working tree copy of that file. If the working tree copy has been removed, this removes the index entry.