I've moved a bunch of files to a separate branch, to avoid mixing separate histories. Then I moved the files on the new branch to /
(so that they can be checked out in a worktree rooted exactly where they were initially located):
git branch -c scripts
git rm -rf scripts/
git commit -m 'scripts -> new branch'
git switch scripts
# git rm -rf files NOT under scripts/
git mv scripts/* .
git commit -m "scripts -> top level of new branch"
Now, of course, while new history looks clean on both branches, I'm having trouble viewing past history. git log --follow
expects a single file name.
On the main branch, I can
xargs -0 -a <(git ls-files -z) git log --
(get log for existing files)
On the scripts branch, it's not so simple; I think I need to either
- find all previous names of existing files, then git log current + old names (though that could be a lot of arguments), or
- make git log ignore commits which refer to deleted (but not renamed) files (this seems unfeasible).
How can I achieve this?
You're looking at this all wrong, from Git's point of view anyway. (You could equally say that Git looks at this all wrong—but that's hardcoded into Git and is not going to change!)
In particular, there is no such thing as file history, in Git. History, in Git, is commits. The commits are the history. That's all there is.
Git does have, as an option, the ability to guess about file renames. This is useful when you use git log with flags and/or arguments that tell it to show something that isn't the actual history, such as: show me a reduced history that includes only commits in which, from parent to child, the given path-name(s) changed contents. This is what git log -- setup does: tell git log to walk revisions (commits), but only print information about a commit where, between parent and child, some scripts/* file has been modified.
The -M flag won't really help because it doesn't change the file name(s) that git log is using for its restrictions. Here is the rest of what you need to know.
Commits form a Directed Acyclic Graph or DAG
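Each commit is numbered, with a hash ID, and each commit stores two things:
- a full snapshot of every file, as of the time the commit was made; and
- some metadata, such as the author's name and the log message, including the hash ID(s) of the commit's parent commit(s).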
The commits themselves form the nodes (or vertices) of a graph, with the hash IDs stored as parent-numbers in each commit forming the links (one-way edges or arcs) between these nodes.
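To make the graph idea concrete, here is a minimal Python sketch. It is a toy model, not Git's actual object format: the Commit class, the ancestors helper, and the single-letter IDs are all made up for illustration. It only shows that each node stores the IDs of its parents, and that "history" is whatever you can reach by following those links backwards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """Toy stand-in for a commit: an ID plus the IDs of its parents."""
    hash_id: str
    parents: tuple = ()  # backwards-pointing edges to parent commits


def ancestors(graph, start):
    """Return every commit ID reachable from `start` via parent links."""
    seen = set()
    stack = [start]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(graph[current].parents)
    return seen


# Hypothetical graph: G <- H <- I <- J, plus H <- K <- L, joined by merge M.
graph = {c.hash_id: c for c in [
    Commit("G"),
    Commit("H", ("G",)),
    Commit("I", ("H",)),
    Commit("J", ("I",)),
    Commit("K", ("H",)),
    Commit("L", ("K",)),
    Commit("M", ("J", "L")),
]}

print(sorted(ancestors(graph, "J")))  # ['G', 'H', 'I', 'J']
print(sorted(ancestors(graph, "M")))  # ['G', 'H', 'I', 'J', 'K', 'L', 'M']
```

Note that nothing in a node points forwards: from J you can never discover K or L, which is why the starting point matters so much.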
This DAG is the generalization of a tree. A tree data structure would allow for branching:
...--G--H--I--J
         \
          K--L
Here, each uppercase letter stands in for a commit. Newer commits appear towards the right. The edges between commits are the one-way (backwards-looking) arcs: from commit J, we can work backwards to I, and thence to H and G and so on. From L, we work backwards to K, then H and so on as before.
Merge commits turn this tree structure into the DAG:
          I--J
         /    \
...--G--H      M
         \    /
          K--L
Merge commit M points backwards to two commits, so that as Git moves backwards through history, it has to branch: to visit both L and J, in some order.
When git log is run without options, it starts at the current tip commit as found by the name HEAD and works backwards. At merges, it follows both branches. Because it can only actually work on one commit at a time, it handles this using a priority queue.
Run git log without options or commit-specifying arguments, and this priority queue starts with just one entry for the HEAD commit. The one entry is popped from the queue and the commit is shown. Then that commit's parent or parents are put into the queue, and we repeat. As long as each commit visited this way has just one parent, the queue never grows any longer, and the fact that there is a queue is invisible. But when we hit merge commit M, both parents go into the priority queue. Now the priority matters: the next commit shown is whichever one is at the front of the queue.
The default is to visit in committer-date order, with higher date values (later commits) being higher priority. So if the committer-date of L is higher than that of J, we'll see L next, and commit K will enter the queue. Otherwise we'll see J next, and I will enter the queue. The queue continues to have two entries, so git log moves on to the highest priority entry.
Branch names
Branch names like main or master simply gain you (and Git) an entry point into the graph. Without the names, you would be forced to resort to raw hash IDs, which nobody wants to use. A branch name is simply a moveable pointer, pointing to some commit node in the graph. The names themselves don't matter: what matters is that each name gives us the hash ID of one commit, which lets us find other, earlier commits.
Run with a branch name, git log starts from that commit. Run with more than one branch name, git log puts each commit found via the branch name into the priority queue, and once again, the priority determines which commit is shown next.
How git log handles the queue and shows commits
Note that in all cases, git log is simply walking the queue, inserting parents as it goes, and showing one commit at a time as it goes. The default action, with no options, is to show each commit with the --pretty=medium format (though this default is adjustable).
We can, however, restrict git log from showing all commits. We can also alter how it feeds the queue and how it sorts the queue (i.e., what the relative priorities are). Giving a pathname or pathspec argument does both: it restricts which commits are shown, and it changes how the queue is fed. This is important. The part about which commits get shown is easier to describe, because you see its effects immediately. Before we get into this, though, it's worth a quick side trip through git diff.
A two-commit diff
Except when we use git checkout or git switch to extract an entire commit, we're often not really interested in the fact that a commit is a full archival snapshot. We're often more interested in what the difference is between two commits.
The git diff command can show us that. For instance, given two commits E and H, we can run git diff E H to see what's different. Because of Git's internal storage format (files are stored de-duplicated as well as compressed and such), Git can very quickly tell that some file in E is exactly the same as the file of the same name in H, and not bother showing us anything about that file. For files that are different, Git can play a game of Spot the Difference and tell us what changed.[1]
If we choose commits that are adjacent (that is, parent-and-child, like G and H), we can see what changed in that one specific commit H. This is particularly useful to humans. The commit has all the changed files (because it has all the files), and the log message tells us why the humans who made the changes made them: we can look and see whether the change they made achieves their goal. That's just one example; there are plenty of examples where this is useful. The point is that git diff can do this pretty easily.
[1] More precisely, the output from git diff is a recipe for changing the left-side file into the right-side file. It doesn't necessarily reflect how we made that happen, just a way to make it happen. This sometimes matters when the change involves bits of syntax, like close-bracket lines, that Git doesn't understand properly. Git will suggest deleting the wrong bracket, because some other bracket seems the same, but sometimes that's not quite right.
Log, again
When git log is working on a normal non-merge commit, it has both the original commit and the hash ID of its (single) parent. This means it can easily run git diff. In fact, finding out if some file(s) have changed is even easier, as we don't need to have Git play spot-the-difference, which is the slow part: we just have Git find out whether there is a difference. So git log has this built in. (It has the full diff built in too, if we want it, but it can do this part fast.)
For normal git log, this doesn't really matter: it is going to show the commit's metadata, with the appropriate --pretty format, no matter what. But when we run git log with path names, what git log does is to first filter out files that aren't listed (it does this with both this commit and its parent) and then compare the resulting files. If they are all the same, git log simply doesn't print the commit at all.
What this means is that:
git log with a path name only prints (with the --pretty format) those commits in which the file in question differs from the copy in its parent. People like to call this "file history", but it's not: it's simply filtered commit history. Where it falls down, or apart, is precisely where you're running into a problem now: what, exactly, does it mean for something to be the same file in commits G and H?
Git does not store renames, but it has the ability to reconstruct renames on the fly, through its diff engine. When a left-side commit (G) has some file X that does not exist in the right-side (H) commit, and the right-side commit has some file Y that does not exist in the left-side commit, Git will, optionally, examine that file-pair to see if the content matches or is similar.
Exact matches are very fast to find (due to the de-duplication trick). "Similar" files are harder and slower. Using -M enables both kinds of rename detection, both in git diff and in git log. The --follow option also enables this rename detection, but it's a horrible hack: when using --follow, git log has just one pathname, and what --follow does is to enable -M, catch renames, and change the one file name that it is looking for. So if G-vs-H renames the one file, then by the time git log is looking at commits G and earlier, the one name it is looking for is the old name from G, rather than the new name from H.
Merges: feeding the queue, handling names
A merge commit differs from a regular commit in that a merge commit has two (or more, but usually two) parents. This means we need two diffs. Git can do this, but there are multiple catches.
First, git log normally doesn't bother. If you're doing a git log with no arguments, or even with -p turned on to show a diff from parent to child at each commit, when git log hits a merge commit, it just throws up its virtual hands and declares this to be too hard. It prints the log message but doesn't run any diffs at all, and then it adds both (or all) parents to the priority queue as usual.
Second, if you add a pathname or pathspec, git log will do the filtering as usual. To do that, it strips down both this commit and all of its parents to the set of files in your pathspec(s). It then checks to see whether everything matches in this commit and any one of the parents. If so, it does two things:
- Since there's a match, it follows just that one parent. The theory here goes that you're trying to see why these files look the way they do in the commit you started from (e.g., the HEAD commit) and those other parents didn't contribute anything, so why bother looking?
- Since that is a match, it does not print this commit.
So, at merges, we prune off all the branches—or rather, the merge parents, which are branches in the backwards direction that Git uses—that didn't contribute, and then we don't print the merge commit either. This particular bit of simplification is never directly visible!
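The whole walk described so far (priority queue, path filtering, and pruning at merges) can be sketched in a few lines of Python. This is a toy simulation under loud assumptions, not Git's implementation: commits are plain dicts, "date" is an integer standing in for the committer date, "tree" maps path names to file contents, and the names walk_log, restrict, and make_commit are invented here. Real Git has more simplification cases than this handles.

```python
import heapq


def restrict(tree, paths):
    """Keep only the listed paths from a snapshot (path -> content)."""
    return {p: tree[p] for p in paths if p in tree}


def walk_log(commits, start, paths=None):
    """Return commit IDs in the order a simplified log walk would show them."""
    shown = []
    seen = {start}
    # heapq is a min-heap, so negate the date: the newest commit pops first.
    queue = [(-commits[start]["date"], start)]
    while queue:
        _, cid = heapq.heappop(queue)
        commit = commits[cid]
        parents = commit["parents"]
        show, follow = True, parents
        if paths is not None:
            mine = restrict(commit["tree"], paths)
            same = [p for p in parents
                    if restrict(commits[p]["tree"], paths) == mine]
            if same:
                # Matches a parent: hide this commit, follow only that parent.
                show, follow = False, same[:1]
            elif not parents:
                # Root commit: show it only if it has any of the files.
                show = bool(mine)
        if show:
            shown.append(cid)
        for parent in follow:
            if parent not in seen:
                seen.add(parent)
                heapq.heappush(queue, (-commits[parent]["date"], parent))
    return shown


def make_commit(date, parents, a, s):
    """Hypothetical helper: build a toy commit with two tracked files."""
    return {"date": date, "parents": parents,
            "tree": {"a.txt": a, "scripts/s.sh": s}}


commits = {
    "G": make_commit(1, [], "a1", "s1"),
    "H": make_commit(2, ["G"], "a1", "s2"),
    "I": make_commit(3, ["H"], "a2", "s2"),
    "J": make_commit(4, ["I"], "a3", "s2"),
    "K": make_commit(5, ["H"], "a1", "s3"),
    "L": make_commit(6, ["K"], "a1", "s4"),
    "M": make_commit(7, ["J", "L"], "a3", "s4"),
}

print(walk_log(commits, "M"))                    # ['M', 'L', 'K', 'J', 'I', 'H', 'G']
print(walk_log(commits, "M", ["a.txt"]))         # ['J', 'I', 'G']
print(walk_log(commits, "M", ["scripts/s.sh"]))  # ['L', 'K', 'H', 'G']
```

In the second call, merge M matches parent J for a.txt, so M is hidden and the K-L side is never visited at all. That is the invisible pruning in action.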
If the merge commit's filtered snapshot differs from every parent, though, git log will (a) print the commit and (b) follow all parents. We can also force git log to follow all parents with --full-history, regardless of whether the filtering shows differences.
Note that if we have --follow turned on, rename detection can make a mess of this. Suppose we are at merge M, with parents J and L, and we find that in M, file xyz.ext is renamed from abc.ext. But it has names in both J and L, and one of them is probably still abc.ext, while the other is xyz.ext. If we visit both branches, and the rename happens between I and J before we go on to look at L and K, we'll be looking for the wrong name when we get to the other commits. (How well or poorly this works, in which cases, depends on a lot of factors.)
We have several extra options here as well:
- -m causes git log (and git diff) to split a merge into multiple virtual non-merge commits. That is, instead of considering commit M, our merge, as a single commit with two parents, we have git log or git diff pretend that there are two commits: M' has parent J, and M'' has parent L. These are now normal single-parent commits, which can be displayed using the single-parent displaying methods and diffed using the single-parent diff methods.
- --first-parent causes git log to ignore the extra parents of a merge. This affects both its diff (M will be compared only against J, assuming J is the first parent) and the revision walk: M will be treated as a single-parent commit and its (lone) parent will be put on the queue.
Note that the first parent of any merge is the commit that was the tip of the branch you (or whoever) were on when you (or they) made the merge.
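As a rough illustration of these two options, here is a small Python sketch over a toy parent table. The names parents_of, first_parent_chain, and split_merge are invented for this example, and the graph is hypothetical; the sketch only models which parent links get used, not any diff output.

```python
# Hypothetical graph: each commit ID maps to its ordered list of parent IDs.
parents_of = {
    "G": [],
    "H": ["G"],
    "I": ["H"],
    "J": ["I"],
    "K": ["H"],
    "L": ["K"],
    "M": ["J", "L"],  # merge: first parent J, second parent L
}


def first_parent_chain(parents_of, start):
    """Mimic a --first-parent walk: at each commit, follow only parent #1."""
    chain = []
    current = start
    while current is not None:
        chain.append(current)
        parents = parents_of[current]
        current = parents[0] if parents else None
    return chain


def split_merge(parents_of, commit):
    """Mimic -m: one (commit, parent) pair per parent, each of which can be
    diffed as if it were an ordinary single-parent commit."""
    return [(commit, parent) for parent in parents_of[commit]]


print(first_parent_chain(parents_of, "M"))  # ['M', 'J', 'I', 'H', 'G']
print(split_merge(parents_of, "M"))         # [('M', 'J'), ('M', 'L')]
```

With the first-parent walk, K and L never show up, since they are reachable only through the second parent of M.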
Conclusion
History, in Git, is nothing but the commits, as found by starting from branch names (or other starting points) and working backwards. Moving files won't avoid "mixing history": it just means that the snapshots in the commits will be different. The branch names are not the history at all: they're just entry points into the graph.
Git does not have true file identity. It cannot be sure that a file named X in commit A is the same as, or different from, a file named X (or one named Y) in commit H. Its default assumption is same-name = same-file. This can be adjusted on a diff-by-diff basis. The tools for this are a bit crude, though.
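To show what "guessing about renames on a diff-by-diff basis" amounts to, here is a minimal Python sketch. It is not Git's algorithm: Git computes its own similarity index, while this uses difflib.SequenceMatcher from the standard library as a stand-in. The detect_renames name and the two sample trees are made up; the 0.5 threshold mirrors the 50% default of -M.

```python
from difflib import SequenceMatcher


def detect_renames(old_tree, new_tree, threshold=0.5):
    """Pair paths that vanished from old_tree with paths new in new_tree.

    Trees are dicts mapping path -> file content. Returns {old: new}.
    """
    deleted = {p: c for p, c in old_tree.items() if p not in new_tree}
    added = {p: c for p, c in new_tree.items() if p not in old_tree}
    renames = {}

    # Pass 1: exact content matches (the cheap case).
    for old_path in list(deleted):
        match = next((p for p, c in added.items()
                      if c == deleted[old_path]), None)
        if match is not None:
            renames[old_path] = match
            del deleted[old_path]
            del added[match]

    # Pass 2: the most similar remaining candidate at or above the threshold.
    for old_path, content in deleted.items():
        best_path, best_score = None, threshold
        for new_path, new_content in added.items():
            score = SequenceMatcher(None, content, new_content).ratio()
            if score >= best_score:
                best_path, best_score = new_path, score
        if best_path is not None:
            renames[old_path] = best_path
            del added[best_path]
    return renames


old = {
    "scripts/build.sh": "echo build\n",
    "scripts/run.sh": "echo one\necho two\necho three\n",
    "README": "docs\n",
}
new = {
    "build.sh": "echo build\n",
    "run.sh": "echo one\necho two\necho 3\n",
    "README": "docs\n",
}

print(detect_renames(old, new))
# {'scripts/build.sh': 'build.sh', 'scripts/run.sh': 'run.sh'}
```

The key point is that the pairing is recomputed from the two snapshots every time. Nothing about the rename is recorded in either commit, so a different threshold, or a larger edit made in the same commit as the move, can change the answer.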