git log after moving file to separate branch


I've moved a bunch of files to a separate branch, to avoid mixing separate histories. Then I moved the files on the new branch to / (so that they can be checked out in a worktree rooted exactly where they were initially located):

git branch -c scripts
git rm -rf scripts/
git commit -m 'scripts -> new branch'
git switch scripts
# git rm -rf files NOT under scripts/
git mv scripts/* .
git commit -m "scripts -> top level of new branch"

Now, of course, while new history looks clean on both branches, I'm having trouble viewing past history. git log --follow expects a single file name.

On the main branch, I can

  • xargs -0 -a <(git ls-files -z) git log -- (get log for existing files)

On the scripts branch, it's not so simple; I think I need to either

  • find all previous names of existing files, then git log current + old names (though that could be a lot of arguments), or
  • make git log ignore commits which refer to deleted (but not renamed) files (this seems unfeasible).

How can I achieve this?


There is 1 answer

torek On

You're looking at this all wrong, from Git's point of view anyway. (You could equally say that Git looks at this all wrong—but that's hardcoded into Git and is not going to change!)

In particular, there is no such thing as file history, in Git. History, in Git, is commits. The commits are the history. That's all there is.

Git does have, as an option, the ability to guess about file renames. This is useful when you use git log with flags and/or arguments that tell it to show something that isn't the actual history, such as: show me a reduced history that includes only commits in which, from parent to child, the given path-name(s) changed contents. This is what git log -- scripts/ does: it tells git log to walk revisions (commits), but only print information about a commit where, between parent and child, some scripts/* file has been modified.

Side note: As you mentioned, the --follow flag to git log only works with a single file name (because the implementation of --follow is a horrible hack) and has a bunch of defects. I believe the --follow handling should be completely thrown out and rewritten, but this is a nontrivial undertaking (to put it lightly). At some point, even if --follow isn't completely rewritten, it might start to work for one directory at a time; that point might even have been Git 2.17 or so; I have not tested this, and for this particular case, it wouldn't be the right answer anyway.

The -M flag won't really help because it doesn't change the file name(s) that git log is using for its restrictions. Here is the rest of what you need to know.

Commits form a Directed Acyclic Graph or DAG

Each commit is numbered, with a hash ID, and each commit stores two things:

  • a full snapshot of every file (as of the form it had when the commit was made); and
  • metadata: information such as the author and committer, a commit log message, and—crucially for Git itself—the hash IDs of the parents of this commit.

The commits themselves form the nodes (or vertices) of a graph, with the hash IDs stored as parent-numbers in each commit forming the links (one-way edges or arcs) between these nodes.
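To see the two parts of a commit for yourself, here is a minimal sketch in a throwaway repository. The file name, the commit messages, and the demo identity are made up for illustration; git cat-file -p simply prints the raw commit object.

```shell
# Throwaway repository; the identity values are placeholders.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo" && git init -q

echo one > a.txt && git add a.txt && git commit -q -m first
echo two > a.txt && git commit -q -am second

# Raw commit object: a "tree" line (the snapshot), a "parent" line,
# then author, committer, and the log message.
git cat-file -p HEAD

# The parent line is the backwards-pointing edge of the graph.
parent=$(git cat-file -p HEAD | sed -n 's/^parent //p')
first=$(git rev-parse HEAD~1)
echo "parent recorded in HEAD: $parent"
```

The root commit ("first") has no parent line at all, which is how the backwards walk knows where to stop.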

This DAG is a generalization of a tree. A tree data structure would allow for branching:

          I--J
         /
...--G--H
         \
          K--L

Here, each uppercase letter stands in for a commit. Newer commits appear towards the right. The edges between commits are the one-way (backwards-looking) arcs: from commit J, we can work backwards to I, and thence to H and G and so on. From L, we work backwards to K, then H and so on as before.

Merge commits turn this tree structure into the DAG:

          I--J
         /    \
...--G--H      M--N--...
         \    /
          K--L

Merge commit M points backwards to two commits, so that as Git moves backwards through history, it has to branch: to visit both L and J, in some order.

When git log is run without options, it starts at the current tip commit as found by the name HEAD and works backwards. At merges, it follows both branches. Because it can only actually work on one commit at a time, it handles this using a priority queue.

Run git log without options or commit-specifying arguments, and this priority queue starts with just one entry for the HEAD commit. The one entry is popped from the queue and the commit is shown. Then that commit's parent or parents are put into the queue, and we repeat. As long as each commit visited this way has just one parent, the queue never grows any longer, and the fact that there is a queue is invisible. But when we hit merge commit M, both parents go into the priority queue. Now the priority matters: the next commit shown is whichever one is at the front of the queue.

The default is to visit in committer-date order, with higher date values (later commits) being higher priority. So if the committer-date of L is higher than that of J, we'll see L next, and commit K will enter the queue. Otherwise we'll see J next, and I will enter the queue. The queue continues to have two entries, so git log moves on to the highest priority entry.
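The effect of the priority queue can be observed by pinning committer dates. This is a sketch only: the mk helper, the branch name side, and the dates are invented for the demonstration, and the letters match the graph drawn earlier.

```shell
# Throwaway repository; the identity values are placeholders.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo" && git init -q

# mk <name> <date>: commit one new file with a fixed author/committer date.
mk() {
  echo "$1" > "$1.txt" && git add "$1.txt" &&
    GIT_AUTHOR_DATE="$2" GIT_COMMITTER_DATE="$2" git commit -q -m "$1"
}

mk H "2020-01-01 12:00:00 +0000"
base=$(git symbolic-ref --short HEAD)   # whatever the initial branch is called
git checkout -q -b side
mk K "2020-01-03 12:00:00 +0000"
mk L "2020-01-05 12:00:00 +0000"
git checkout -q "$base"
mk I "2020-01-02 12:00:00 +0000"
mk J "2020-01-04 12:00:00 +0000"
GIT_AUTHOR_DATE="2020-01-06 12:00:00 +0000" \
GIT_COMMITTER_DATE="2020-01-06 12:00:00 +0000" \
  git merge -q -m M side

# Default walk: after M, the two sides interleave by committer date.
order=$(git log --format=%s | paste -sd' ' -)
echo "$order"   # with these dates: M L J K I H
```

Swapping the dates of J and L should swap which side is shown first after M; the graph itself is unchanged.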

Branch names

Branch names like main or master simply gain you (and Git) an entry point into the graph. Without the names, you would be forced to resort to raw hash IDs, which nobody wants to use. A branch name is simply a moveable pointer, pointing to some commit node in the graph. The names themselves don't matter: what matters is that each name gives us the hash ID of one commit, which lets us find other, earlier commits.

Run with a branch name, git log starts from that commit. Run with more than one branch name, git log puts each commit found via the branch name into the priority queue, and once again, the priority determines which commit is shown next.
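A small sketch of the "moveable pointer" idea, again in a scratch repository. The branch name topic and the file contents are arbitrary choices for the example.

```shell
# Throwaway repository; the identity values are placeholders.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo" && git init -q

echo a > a.txt && git add a.txt && git commit -q -m base
git branch topic                 # a second name for the same commit

tip=$(git rev-parse HEAD)
topic=$(git rev-parse topic)     # same hash as $tip at this point

# A new commit moves only the current branch name; topic stays put.
echo b > a.txt && git commit -q -am next
moved=$(git rev-parse HEAD)
still=$(git rev-parse topic)

# Several starting points: each one is put into the queue.
git log --oneline HEAD topic
```

Deleting the name topic here would not delete any commit, since the same commit is still reachable from the other branch.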

How git log handles the queue and shows commits

Note that in all cases, git log is simply walking the queue, inserting parents as it goes, and showing one commit at a time as it goes. The default action, with no options, is to show each commit with the --pretty=medium format (though this default is adjustable).

We can, however, restrict git log from showing all commits. We can also alter how it feeds the queue and how it sorts the queue (i.e., what the relative priorities are). Giving a pathname or pathspec argument does both, which is important to keep in mind. The part about which commits get shown is easier to describe, because you see its effects immediately. Before we get into this, though, it's worth a quick side trip through git diff.

A two-commit diff

Except when we use git checkout or git switch to extract an entire commit, we're often not really interested in the fact that a commit is a full archival snapshot. We're often more interested in what the difference is between two commits.

The git diff command can show us that. For instance, given two commits E and H, we can run git diff E H to see what's different. Because of Git's internal storage format (files are stored de-duplicated as well as compressed and such), Git can very quickly tell that some file in E is exactly the same as the file of the same name in H, and not bother showing us anything about that file. For files that are different, Git can play a game of Spot the Difference and tell us what changed.[1]

If we choose commits that are adjacent (that are parent-and-child, like G and H) we can see what changed in that one specific commit H. This is particularly useful to humans. The commit has all the changed files (because it has all the files), and the log message tells us why the people who made the changes made them: we can look and see whether the change they made achieves their goal. That's just one example; there are plenty of examples where this is useful. The point is that git diff can do this pretty easily.
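As a sketch of a parent-and-child comparison, the following builds two commits (standing in for G and H; the file names are invented) and asks which files differ between them.

```shell
# Throwaway repository; the identity values are placeholders.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo" && git init -q

printf 'x\n'  > keep.txt
printf 'v1\n' > change.txt
git add . && git commit -q -m G
printf 'v2\n' > change.txt && git commit -q -am H

# Both snapshots contain both files; only the differing one is reported.
changed=$(git diff --name-only HEAD~1 HEAD)
echo "$changed"

# The full "recipe" for turning G's copy into H's copy:
git diff HEAD~1 HEAD
```

keep.txt is stored in both commits but never appears in the output, because its content is identical on both sides.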


[1] More precisely, the output from git diff is a recipe for changing the left-side file into the right-side file. It doesn't necessarily reflect how we made that happen, just a way to make it happen. This sometimes matters when the change involves bits of syntax, like close-bracket lines, that Git doesn't understand properly. Git will suggest deleting the wrong bracket, because some other bracket seems the same, but sometimes that's not quite right.


Log, again

When git log is working on a normal non-merge commit, it has both the original commit and the hash ID of its (single) parent. This means it can easily run git diff. In fact, finding out if some file(s) have changed is even easier, as we don't need to have Git play spot-the-difference, which is the slow part: we just have Git find out whether there is a difference. So git log has this built in. (It has the full diff built in too, if we want it, but it can do this part fast.)

For normal git log, this doesn't really matter: it is going to show the commit's metadata, with the appropriate --pretty format, no matter what. But when we run git log with path names, what git log does is to first filter out files that aren't listed—it does this with both this commit and its parent—and then compare the resulting files. If they are all the same, git log simply doesn't print the commit at all.

What this means is that:

git log -- file/in/question.ext

only prints (with the --pretty format) those commits in which the file in question differs from the copy in its parent. People like to call this "file history", but it's not: it's simply filtered commit history. Where it falls down, or apart, is precisely where you're running into a problem now: what, exactly, does it mean for something to be the same file in commits G and H?
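Here is a minimal sketch of that filtering, with two invented files. Three commits exist, but only two of them change a.txt relative to their parent, so only those two are printed when a.txt is given as the path.

```shell
# Throwaway repository; the identity values are placeholders.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo" && git init -q

echo 1 > a.txt && echo 1 > b.txt
git add . && git commit -q -m "add a and b"
echo 2 > b.txt && git commit -q -am "change b"
echo 2 > a.txt && git commit -q -am "change a"

all=$(git rev-list --count HEAD)              # every commit reachable from HEAD
only_a=$(git rev-list --count HEAD -- a.txt)  # commits where a.txt differs from the parent

# "change b" is still walked, it just is not printed.
git log --format=%s -- a.txt
```

The commit "change b" is still part of the history; it is only hidden from this particular view.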

Git does not store renames, but it has the ability to reconstruct renames on the fly, through its diff engine. When a left-side commit (G) has some file X that does not exist in the right-side (H) commit, and the right-side commit has some file Y that does not exist in the left-side commit, Git will, optionally, examine that file-pair to see if the content matches or is similar.

Exact matches are very fast to find (due to the de-duplication trick). "Similar" files are harder and slower. Using -M enables both kinds of rename detection, both in git diff and in git log. The --follow option also enables this rename detection—but it's a horrible hack: when using --follow, git log has just one pathname, and what --follow does is to enable -M and catch renames and change the one file name that it is looking for. So if G-vs-H renames the one file, then by the time git log is looking at commits G and earlier, the one name it is looking for is the old name from G, rather than the new name from H.
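The difference between a plain path filter and --follow can be sketched with one exact rename. The names old.txt and new.txt are placeholders.

```shell
# Throwaway repository; the identity values are placeholders.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo" && git init -q

printf 'line 1\nline 2\nline 3\n' > old.txt
git add old.txt && git commit -q -m "add old.txt"
git mv old.txt new.txt && git commit -q -m "rename"

# The diff engine pairs the deleted and added paths; content is identical,
# so the status line starts with R100.
status=$(git diff -M --name-status HEAD~1 HEAD)
echo "$status"

# Plain path filter: the name new.txt only exists in the newest commit.
plain=$(git rev-list --count HEAD -- new.txt)

# --follow swaps the name it is looking for once it sees the rename.
followed=$(git log --follow --format=%s -- new.txt | grep -c .)
echo "plain: $plain, followed: $followed"
```

Nothing about the rename is stored in either commit; the pairing is recomputed each time the two snapshots are compared.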

Merges: feeding the queue, handling names

A merge commit differs from a regular commit in that a merge commit has two (or more, but usually two) parents. This means we need two diffs. Git can do this, but there are multiple catches.

First, git log normally doesn't bother. If you're doing a git log with no arguments, or even with -p turned on to show a diff from parent to child at each commit, when git log hits a merge commit, it just throws up its virtual hands and declares this to be too hard. It prints the log message but doesn't run any diffs at all, and then it adds both (or all) parents to the priority queue as usual.

Second, if you add a pathname or pathspec, git log will do the filtering as usual. To do that, it strips down both this commit and all of its parents to the set of files in your pathspec(s). It then checks to see whether everything matches between this commit and any one of the parents. If so, it does two things:

  • Since there's a match, it follows just that one parent. The theory here goes that you're trying to see why these files look the way they do in the commit you started from—e.g., the HEAD commit—and those other parents didn't contribute anything, so why bother looking?

  • Since that is a match, it does not print this commit.

So, at merges, we prune off all the branches—or rather, the merge parents, which are branches in the backwards direction that Git uses—that didn't contribute, and then we don't print the merge commit either. This particular bit of simplification is never directly visible!

If the merge commit's filtered snapshot differs from every parent, though, git log will (a) print the commit and (b) follow all parents. We can also force git log to follow all parents with --full-history, regardless of whether the filtering shows differences.
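A sketch of this pruning, under made-up names: a side branch changes f.txt and then changes it back, while the other side never touches it. The merge matches its first parent for f.txt, so the default walk never visits the side branch at all.

```shell
# Throwaway repository; the identity values are placeholders.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo" && git init -q

echo v1 > f.txt && git add f.txt && git commit -q -m H
base=$(git symbolic-ref --short HEAD)   # whatever the initial branch is called

git checkout -q -b side
echo v2 > f.txt && git commit -q -am K  # change f.txt
echo v1 > f.txt && git commit -q -am L  # change it back

git checkout -q "$base"
echo x > other.txt && git add other.txt && git commit -q -m J
git merge -q -m M side

# Default: M matches first parent J for f.txt, so only J is followed.
pruned=$(git rev-list --count HEAD -- f.txt)

# --full-history: both parents are followed, so K and L show up too.
full=$(git rev-list --count --full-history HEAD -- f.txt)
echo "default: $pruned, full history: $full"
```

In this setup the default view shows only H, while --full-history also shows K and L. M itself stays hidden in both cases because its f.txt matches every parent.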

Note that if we have --follow turned on, rename detection can make a mess of this. Suppose we are at merge M, with parents J and L, and we find that in M, file xyz.ext is renamed from abc.ext. But it has names in both J and L, and one of them is probably still abc.ext, while the other is xyz.ext. If we visit both branches, and the rename happens between I and J before we go on to look at L and K, we'll be looking for the wrong name when we get to the other commits. (How well or poorly this works, in which cases, depends on a lot of factors.)

We have several extra options here as well:

  • -m causes git log (and git diff) to split a merge into multiple virtual non-merge commits. That is, instead of considering commit M, our merge, as a single commit with two parents, we have git log or git diff pretend that there are two commits: M' has parent J, and M'' has parent L. These are now normal single-parent commits, which can be displayed using the single-parent displaying methods and diffed using the single-parent diff methods.

  • --first-parent causes git log to ignore the extra parents of a merge. This affects both its diff—M will be compared only against J, assuming J is the first parent—and the revision walk: M will be treated as a single-parent commit and its (lone) parent will be put on the queue.

Note that the first parent of any merge is the branch you (or whoever) were on when you (or they) made the merge.
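As a sketch of --first-parent, the following rebuilds the small merge graph from earlier (the mk helper and the branch name side are invented) and walks it while ignoring the second parent of M.

```shell
# Throwaway repository; the identity values are placeholders.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo" && git init -q

# mk <name>: commit one new file named after the commit.
mk() { echo "$1" > "$1.txt" && git add "$1.txt" && git commit -q -m "$1"; }

mk H
base=$(git symbolic-ref --short HEAD)   # whatever the initial branch is called
git checkout -q -b side
mk K
mk L
git checkout -q "$base"
mk I
mk J
j=$(git rev-parse HEAD)                 # the branch we are on when merging
git merge -q -m M side

# The walk treats M as if J were its only parent; K and L are never queued.
order=$(git log --first-parent --format=%s | paste -sd' ' -)
echo "$order"   # M J I H

first_parent=$(git rev-parse HEAD^1)
```

HEAD^1 names the first parent and HEAD^2 the second, so the same check works on any merge you want to inspect.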

Conclusion

History, in Git, is nothing but the commits, as found by starting from branch names (or other starting points) and working backwards. Moving files won't avoid "mixing history": it just means that the snapshots in the commits will be different. The branch names are not the history at all: they're just entry points into the graph.

Git does not have true file identity. It cannot be sure that a file named X in commit A is the same as, or different from, a file named X—or one named Y—in commit H. Its default assumption is same-name = same-file. This can be adjusted on a diff-by-diff basis. The tools for this are a bit crude, though.