Labeling / Marking a group of files in Git


Is there a way to logically mark a group of files with a label in Git? I understand how tags and Gitlab labels work but... in both cases those markers are applied to commits.

The ETL applications we use rely on a specific directory tree that is not flexible enough to show that 3 files belong to Solution X and 2 other files (in the same directory) belong to Solution Y.

Marking / labeling a subset of files instead of placing them in feature-specific folders or encoding in the naming convention... would make them easier to identify.
The files, BTW, are either XML (ETL jobs) or flat files (SQL/DDL).

How would you do this?


There are 4 answers

1
torek (best answer)

As others have said, there's nothing built in to do this. We might, however, note that Git stores four kinds of objects in the repository itself: blobs (files), trees (representing directories-full-of-files), commits (which form a directed acyclic graph through parent identifiers, with each commit carrying one tree object, one author, one committer, its set-of-parents, and an arbitrary log message), and annotated-tag objects (which have no strongly defined relationship, but every tag has exactly one target object, normally a commit).
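The four object kinds can be inspected with git cat-file -t. The following is only a minimal sketch in a throwaway repository; the file name job.sql, the tag name v1, and the user identity are placeholders chosen for illustration.

```shell
#!/bin/sh
# Sketch: build a scratch repository and ask Git for the type of each object.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "select 1;" > job.sql                 # hypothetical file
git add job.sql
git commit -q -m "add job.sql"
git tag -a -m "first release" v1 HEAD      # annotated tag, so a tag object exists

blob_type=$(git cat-file -t HEAD:job.sql)    # the file's contents
tree_type=$(git cat-file -t 'HEAD^{tree}')   # the commit's top-level directory
commit_type=$(git cat-file -t HEAD)          # the commit itself
tag_type=$(git cat-file -t v1)               # the annotated tag object
echo "$blob_type $tree_type $commit_type $tag_type"
# prints: blob tree commit tag
```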

Aside: how git notes work

Git's notes are represented internally as commits. Each notes commit has, as its tree, a set of "files" whose names merely happen to be, entirely by accident :-) ... cough, ahem, exactly the same [1] as the hash IDs of some set of commits in the repository. When git log goes to display a commit whose ID is C, it "accidentally" checks to see if refs/notes/commits exists, and if so, whether a file named C exists in the commit to which refs/notes/commits points, and if so, it appends the contents of that file to the log message. So this is how notes attach to commits: one built-in part of Git checks to see if a special reference (refs/notes/commits) points to a commit containing a tree containing a "file" (and it really is a file in the end, as it's an ordinary Git blob object) that should be tacked on to the commit log message.

When you revise the set of notes, Git simply makes a new commit with a new tree. The new commit points back to the previous refs/notes/commits commit as its parent, so that the older notes remain in existence and can (with some difficulty) be viewed as they were in the past (this used to be very hard; I believe it has become easier). Git's natural pack-file compression handles these quite well, so that the space occupied by notes grows only linearly with new note additions.
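To see this for yourself, here is a small sketch in a scratch repository. The file name, note text, and identity are made up for the example; the point is only that refs/notes/commits names a commit, that its tree holds an entry named after the annotated commit, and that revising a note adds a new notes commit on top of the old one.

```shell
#!/bin/sh
# Sketch: look inside refs/notes/commits after adding and revising a note.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "a" > a.txt
git add a.txt
git commit -q -m "first"
commit_id=$(git rev-parse HEAD)

git notes add -m "belongs to Solution X" HEAD

# The notes ref points to an ordinary commit object.
notes_type=$(git cat-file -t refs/notes/commits)

# Its tree contains a "file" named after the annotated commit. Slashes are
# stripped here because Git may split the name into subdirectories.
note_name=$(git ls-tree -r --name-only refs/notes/commits | tr -d '/')

# Revising the note creates a second notes commit chained to the first.
git notes append -m "also reviewed" HEAD
notes_commits=$(git rev-list --count refs/notes/commits)

echo "$notes_type $notes_commits"
```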


[1] For efficiency, the "name" gets modified somewhat, so that instead of the note file being named deadbeefcafebabec0ffeedecadefadedbedcede, for instance, it might be named de/ad/beefcafe....


Thus, the solution is obvious (ahem)

You want to represent a set of files arranged in a directory or series of directories. That is, of course, a tree, and Git has tree objects. Therefore you should create a tree object to hold one of these states.

You did not say whether you wish to keep multiple historical versions of this tree. If you do, the solution is obvious: handle them just as git notes does, using a new commit object to store each new tree, chaining the commits to make past versions retrievable. If not, it's up to you whether to create a commit object at all, as a single tag object could point directly to the tree object. (Some non-Git tools may have issues with tags to anything other than a commit or another tag. There are a few of these tags in the Git repository for Git itself, though.)

In any case, you will also need one top-level reference to point to your commit-or-annotated-tag-object that points to the most current tree.

Creating the tree is easy: simply populate an index file—not the regular one, but an alternate, whose name you write into the environment variable GIT_INDEX_FILE—with the file names as you wish them to appear in your tree, by git add-ing the files (this will also put the necessary blobs into the repository if they are not there already), then invoke git write-tree. This will turn the index into the desired tree object (after which you may, and probably should, discard the exported GIT_INDEX_FILE setting), printing the new object's ID to standard output as usual. Writing a commit and/or tag object to point to this tree is then merely a matter of invoking git commit-tree and/or git mktag (which will write their own new objects to the repository, printing the IDs to standard output as before). Last, use git update-ref to create or update a reference to point to the tag or commit object—and you have now re-implemented git notes, but in a form more suitable for your own desires.
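The steps above can be sketched roughly as follows, using a commit rather than a tag to hold the tree. Everything named here is an assumption made for illustration: the jobs/*.xml files, the index file name solution-x.index, and the reference name refs/solutions/x are invented, and any name under refs/ would do.

```shell
#!/bin/sh
# Sketch: record "the files of Solution X" as a tree, commit it, and point
# a private reference at the result.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

mkdir jobs
echo '<job name="x1"/>' > jobs/x1.xml      # hypothetical ETL jobs
echo '<job name="x2"/>' > jobs/x2.xml
echo '<job name="y1"/>' > jobs/y1.xml
git add jobs
git commit -q -m "add ETL jobs"

# 1. Populate an alternate index with only the files that carry the label.
GIT_INDEX_FILE="$repo/.git/solution-x.index"
export GIT_INDEX_FILE
git add jobs/x1.xml jobs/x2.xml

# 2. Turn that index into a tree object, then drop the alternate index.
tree=$(git write-tree)
unset GIT_INDEX_FILE
rm -f "$repo/.git/solution-x.index"

# 3. Wrap the tree in a commit, chained to the previous label commit if any.
ref=refs/solutions/x                       # hypothetical reference name
if parent=$(git rev-parse -q --verify "$ref"); then
    commit=$(git commit-tree -p "$parent" -m "Solution X file set" "$tree")
else
    commit=$(git commit-tree -m "Solution X file set" "$tree")
fi

# 4. Create or move the reference.
git update-ref "$ref" "$commit"

labelled=$(git ls-tree -r --name-only "$ref")
echo "$labelled"
```

Running the same steps again after the file set changes adds a new commit on top of the old one, so git log refs/solutions/x would show how the label evolved.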

You can extract any saved tree any time, to any work-tree of your choice, with a simple git checkout using another temporary index.
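One way to do the extraction is sketched below. Note that it swaps in the plumbing pair git read-tree and git checkout-index in place of git checkout, since those make the temporary index explicit. The reference name refs/solutions/x, the index file names, and the sample files are assumptions for the example.

```shell
#!/bin/sh
# Sketch: save a one-file label tree, then extract it into a separate
# directory without touching the regular index or work-tree.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

mkdir jobs
echo '<job name="x1"/>' > jobs/x1.xml      # hypothetical ETL jobs
echo '<job name="y1"/>' > jobs/y1.xml
git add jobs
git commit -q -m "add ETL jobs"

# Save a label tree holding only jobs/x1.xml (hypothetical ref name).
GIT_INDEX_FILE="$repo/.git/label.index" git add jobs/x1.xml
tree=$(GIT_INDEX_FILE="$repo/.git/label.index" git write-tree)
rm -f "$repo/.git/label.index"
git update-ref refs/solutions/x "$(git commit-tree -m "Solution X" "$tree")"

# Extract: load the saved tree into a temporary index, then write every
# entry of that index under the target directory.
out=$(mktemp -d)
GIT_INDEX_FILE="$repo/.git/extract.index" git read-tree refs/solutions/x
GIT_INDEX_FILE="$repo/.git/extract.index" git checkout-index -a --prefix="$out/"
rm -f "$repo/.git/extract.index"

extracted=$(cd "$out" && find . -type f | sort)
echo "$extracted"
```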

0
Makoto

Git has no natural way to accomplish this, since it really only cares about the changes in files it's instructed to monitor.

A solution to this, albeit a hodgepodge one, would be to use commit tags to refer to the project at a particular state. Again, Git cares not for the individual files or even for how they're related; it's your responsibility to create that relationship.

1
VonC

Since it is not practical to group those files in a commit each time you want to apply a label, you can at least consider leaving yourself a note recording which files belong to which Solution.

See git notes: that would be a text note, managed by you, listing the files to be considered for a given solution.
For a given commit, you can attach multiple notes, one per notes ref (namespace).
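As a rough sketch of that idea, the example below keeps one notes namespace per solution and stores a plain file list in each note. The namespace names solution-x and solution-y and the file names are invented for the example.

```shell
#!/bin/sh
# Sketch: one notes ref per solution, each note listing that solution's files.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

mkdir jobs
echo '<job name="x1"/>' > jobs/x1.xml      # hypothetical ETL jobs
echo '<job name="x2"/>' > jobs/x2.xml
echo '<job name="y1"/>' > jobs/y1.xml
git add jobs
git commit -q -m "add ETL jobs"

# Each --ref value selects a separate namespace under refs/notes/.
git notes --ref=solution-x add -m "jobs/x1.xml
jobs/x2.xml" HEAD
git notes --ref=solution-y add -m "jobs/y1.xml" HEAD

x_files=$(git notes --ref=solution-x show HEAD)
y_files=$(git notes --ref=solution-y show HEAD)
echo "$x_files"
```

Something like git log --notes=solution-x would then display the list alongside the commit message.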

This is a workaround for the fact that Git, being based on commits, has no way to label individual files.

2
yelsayed

Are you trying to track them separately? If so, I'm wondering if you can use two different local git repos to achieve this. Take a look at this answer to see how to manage two different git repos in the same directory.

Otherwise, if you just need a way of tagging files on your filesystem, git won't help much. How about including two files, .sol1 and .sol2, that list the files in the respective solutions, then writing a small script to wrap your git operations? For example, when you run git status -sol1, it would first run git status on the entire directory, then filter the output down to the files listed in .sol1, and so on. Writing such a script shouldn't be difficult, and I think it might prove useful in other scenarios as well. If you need help let me know.
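A minimal sketch of such a wrapper follows. The function name sol_status is made up, and instead of filtering the output afterwards it hands the listed paths to git status as pathspecs, which gives the same result for plain file names. It assumes .sol1 holds one path per line with no blank lines, and that xargs supports -0.

```shell
#!/bin/sh
# Sketch: show git status only for the files listed in a solution file.
set -e

# sol_status NAME: read paths from .NAME and pass them to git status.
sol_status() {
    tr '\n' '\0' < ".$1" | xargs -0 git status --short --
}

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

mkdir jobs
echo '<job name="x1"/>' > jobs/x1.xml      # hypothetical ETL jobs
echo '<job name="y1"/>' > jobs/y1.xml
echo "jobs/x1.xml" > .sol1                 # Solution 1 owns x1
echo "jobs/y1.xml" > .sol2                 # Solution 2 owns y1
git add .
git commit -q -m "add jobs and solution lists"

# Modify one file from each solution.
echo '<!-- changed -->' >> jobs/x1.xml
echo '<!-- changed -->' >> jobs/y1.xml

result=$(sol_status sol1)
echo "$result"
# prints " M jobs/x1.xml"; the change to jobs/y1.xml is left out
```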