I recently started working with git trees and temporary index files to construct commits without modifying my working directory, for the purpose of automating some tasks. The ultimate goal is to effectively rebase some branch (e.g. feature/x_y_z
on top of main
), but without modifying the working directory to do it. Obviously I still want to detect conflicts, and definitely not clobber changes on main
(as one could using git commit-tree
). I read through the "Git Internals" chapter of the book, which was very educational regarding trees, blobs, indexes, etc. — but didn't explicitly explain how a rebase works.
(Aside: The motivation for this is 1) it's way faster, and 2) I want to enable developers to kick off tests for some commit/branch, quickly, consuming latest canonical changes, without clobbering their working directory.)
To that end, how does git rebase
work under the hood? What plumbing commands does it use? How does it analyze trees to detect conflicts? Links to useful sources, and/or a direct explanation of these things, would be very helpful.
It's complicated. It's particularly complicated because of history: it was originally a small shell script using
git format-patch
andgit am
, but that has some flaws, so it was rewritten as a much fancier set of shell scripts. These included a merge based back-end and an interactive back-end, splitting off the oldam
-based back-end into a third variant. Since then, it's been rewritten yet again, in C, to use what Git calls the sequencer. The interactive code has been fancied up to allow re-performing merges as well. I'll ignore these cases as they're harder to draw and explain.Now that it has been rewritten in C, it does not use any of them.
In the old days, the interactive back end mostly used
git cherry-pick
(which technically isn't quite a plumbing command), plusgit commit --amend
for squash operations, after usinggit rev-list
to collect the hash IDs of the commits to copy with cherry-pick.The C variant is being modified now to build in more and more parts (to make things go faster on Windows, mainly) but does still invoke the merge separately at the moment.
This is the fundamental work of
git cherry-pick
: it invokesgit merge
but sets the merge base to be the parent of the commit being copied. The current commit at this time is the commit at the tip of the branch being extended to achieve the rebase.That is, we have something along these lines:
The branch we want to "rebase" is the one identified by the name
to-copy
; the commit we want the copies to appear after is commitC
. So we run:to make sure we're starting from the right place, then run:
Or, if what we have looks like this:
and we want to copy just
H-I-J
to land afterC
, we run:so as to exclude commits
D-E
from the copy list.The rebase operation first generates a list of commit hash IDs to copy, in this case, the actual raw hash IDs for commits
H
throughJ
inclusive.Rebase will normally omit, from this list, certain commits:
-r
or-p
options I'm deliberately ignoring); andgit patch-id
matches that of a commit in the other half of a symmetric difference is also omitted.1For most simple linear chains of commits, this omission step does nothing at all; that's the case with the commits I'm illustrating here.
Having built the list of commit hash IDs to copy, rebase now checks out the
--onto
target commitC
as a detached HEAD. If there is no--onto
argument, the target commit is that specified by theupstream
argument after thegit rebase
command, or is that specified by the upstream of the branch before the HEAD-detaching step. So, for the more-complicated--onto
variant, we now have this:Rebase now cherry-picks each to-be-copied commit, one at a time, in the appropriate and necessary order (
H
first, thenI
, thenJ
). Each of these cherry-pick operations is handled as if it were agit merge
, but with a peculiarly forced merge base commit. I'll go into detail in a moment, but let's just assume that the cherry-pick ofH
works and makes a new commit; let's call the new commitH'
, to indicate that it's a "copy" ofH
, and draw it in:We now repeat this with
I
andJ
to get:Once the last of the to-be-copied commits has been copied,
git rebase
yanks the name of the original branch off the original commitJ
and pastes it on to point to the final copied commit, in this caseJ'
, and then re-attaches HEAD:With no name to find commit
J
, it disappears from our view, and it now seems as though Git has somehow changed three commits. (It hasn't—the originals are still in the repository. You can find them through reflogs, or throughORIG_HEAD
, though the C rewrite of rebase introduced a bug whereORIG_HEAD
is sometimes wrong.)1The actual symmetric difference used is
HEAD...target
, more or less. (Since it's symmetric, you can swap the left and right sides, as long as you remember which side is which.) So these are the commits that have their patch-IDs computed. Git can compute patch IDs even for merge commits, though rebase normally omits merges. I've never dived in far enough to see if it does compute them when you tell it to copy merges, and what happens in this case if a merge commit does have a duplicate, but that's an interesting question.Git's merge engine
To understand cherry-picking, let's start with a much more normal, everyday operation: a true merge. When we do a true merge, we're combining work. Let's suppose we have the following commit graph:
That is, we have two branches,
br1
andbr2
, each of which has two commits of its own that follow some shared sequence of commits ending at commitH
.As you know by now from reading up on Git internals, each commit has a full snapshot of every file. Being a snapshot, not a set of changes, there's no obvious way to view the snapshot as changes, until you realize that what Git does is play, over and over again, a game of Spot the Difference. We put two commits up somewhere, as two snapshots, then programmatically eyeball each one and figure out what's changed. That's what
git diff
does.Now, if we run a diff from commit
H
to commitJ
, that will tell us what changed between those two snapshots. By itself, that's not particularly useful, but suppose we save this information somewhere. Let's run:to find out what we changed on
br1
since commitH
. We'll save all this somewhere, perhaps in a temporary file or (if we have enough memory) memory.Now let's repeat this operation but with the hash of commit
L
instead:This tells us what happened on
br2
.If we add the changes together, being careful to take only one copy of any given change that's on both branches, and apply the sum of the two sets of changes to snapshot
H
, we'll get the correct merge result.2This, then, is exactly what merge does. It just runs two diffs—using
--find-renames
to find any tree-wide file rename operations, so that it knows that fileold/path/to/file
in the merge base is "the same file" asnew/name/of/it
in the left and/or right side tip commit—and then combines the change-sets from the two diffs, applying them to each file.3If the combining goes well, and the merge isn't inhibited with
--no-commit
,4 Git will go on to make merge commitM
on its own. Instead of the normal single parent, which in this case would be commitJ
, the merge commit has two parents. The first one is the normal one, and the second one is the other branch-tip commit, commitL
:and the merge is complete.
If there are conflicts, Git leaves a mess in both its index (the expanded slots remain expanded) and your work-tree: the built in equivalent of
git merge-file
has scribbled over your work-tree files, to put in its best guess at the correct merge, plus merge conflict markers and sections of two—or withmerge.conflictStyle
set todiff3
, all three—of the input files where there were merge conflicts.Note that using
-X ours
or-X theirs
tells Git to resolve conflicted sections by blindly picking our or their side. This only affects these low-level conflicts: add/add, modify/delete, and other high-level or tree-level conflicts still cause the merge to stop and get help.(For cherry-pick, these options are all currently handled through the
git-merge-recursive
back-end, with no way to choose any other merge back-end. Forgit merge
, the-s
argument, e.g.,git merge -s abcd
, makes Git try to rungit-merge-abcd
. Of course there is nogit-merge-abcd
back end in thegit --exec-path
directory, so this would just fail, but the point here is that regular merge lets you select a strategy. Recursive merge is just the default. Cherry-picking does not allow strategy selection.)2For some definition of "correct", of course. Git's is entirely line-based: the diffs are line-by-line and the merge works line-by-line.
3Well, that's the high level overview anyway. Internally, it does the rename-finding as one pass, then the high-level or tree-level file name things as needed—this also handles file creations and deletions, and detects add/add, modify/delete, rename/rename, and other such conflicts—and then goes on to use a separate second pass that has
git merge-file
built into it to merge each individual group of three files: merge-base, ours, and theirs. The merge process takes place in Git's index, which is temporarily expanded to hold up to three copies of each file, with slot numbers to tell which one is the merge base version (slot 1), which one is the--ours
version (slot 2), and which one is the--theirs
version (slot 3).4Note that
--squash
turns on--no-commit
and there's currently no way to turn it off again, so that--squash
always requires a manualgit commit
at the end.How cherry-pick uses Git's merge engine
To achieve a cherry-pick, Git simply runs its merge engine with a forced parent.
Suppose we have this commit graph:
We are on the current branch
cur-branch
, with commitH
as its tip commit, so that commitH
is the current commit. We now run:What Git does is find
C
's parentP
and use that as a fake merge base for a standard merge operation, but make sure that at the end of the merging process, we make a normal, non-merge commit:Commit
C'
ends up being a "copy" of commitC
. To see why, let's look at what diffs turn up.Now, finding "what they changed" seems pretty natural. If they added a line after line 42 of some file, that's what we want to do here. So this diff makes perfect sense. But finding "what we changed" seems a bit odd at first. It turns out, though, that it's exactly what we need.
If they only changed one line of one file, we want to know: did we touch the same line of that file? If not, the changes combine nicely: we take all our changes, to convert all the files in
P
to match all the files inH
, and that gives us back commitH
; then we add the one change they made, to the one file, with any line number adjustments required as well, and that adds what they changed in the one file. So that's exactly right.If we did both touch the same line of that file, we get a merge conflict, right on that one line. That's exactly right too.
As we ponder all possible changes—including things like file renamings—we will see that, indeed, making a diff from
P
toH
is just the right thing to do. So that's why we do it. It also means we get to use the existing merge code.Of course, the existing merge code operates on/in the index, and uses our working tree as scratch storage. That's why rebasing is relatively slow and painful. The only real way to improve on this is to do more of the merge work directly in memory. Someone is doing this now: search for "merge-ort" from Elijah Newren in the Git mailing list.