I have a public repository for my students to which I pushed a large dataset and some corrections. The thing is, we have a storm here at the moment and the connection is really poor. Moreover, the students have already downloaded the datasets on their own for the purpose of the exercises.
My question, for which I can't figure out an easy solution: is there a way for them to pull the repo without the dataset (just for now), so that whenever the connection is back, on their next pull they will be able to pull everything?
I was thinking of `git fetch` + `git merge` of only the wanted files, but that's not exactly what I want, since ideally it would be a `git fetch` + `git merge` excluding one folder (the data folder).
I hope my issue is clear enough and that it has an easy solution! Thanks for your help.
Pull is just fetch + merge (or fetch + rebase); it's the fetch step that brings in new Git objects.
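For example (with an assumed remote named `origin` and a branch named `main`), these are roughly equivalent:

```shell
git pull origin main
# is approximately the same as:
git fetch origin main
git merge FETCH_HEAD
```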
Fetch works on a commit-by-commit basis. If the objects are part of a commit that is wanted, your students will get the whole commit, or not get the commit at all.¹ The trick, then, is to create some new and different commits that are easier to get and that provide just what your students need.
To do that:
1. Find the commit before the one that adds the large dataset. This commit has some hash ID, `$hash`.
2. Create a new branch name pointing to this commit.
3. Make new commits from here as needed, e.g., to add corrections to files, but without adding the large dataset.
4. Have your students fetch just this branch, then check out the new branch and work there.
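As a concrete sketch of those steps (assuming the remote is named `origin`, the new branch is called `newbranch`, and `corrections.txt` stands in for whatever files you actually change):

```shell
# Instructor: steps 1-3.
git branch newbranch $hash        # new name pointing at the pre-dataset commit
git switch newbranch              # or: git checkout newbranch
git add corrections.txt           # corrections, but no large dataset
git commit -m "Add corrections (no dataset)"
git push origin newbranch

# Each student: step 4 -- fetch only the new branch.
git fetch origin newbranch
git switch newbranch              # makes a local newbranch from origin/newbranch
```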
Later, when things are good, merge the main branch, with its large data, into the new branch.
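Concretely, once the connection is good again, that merge might look like this (assuming the branch with the data is still `main` on the remote `origin`):

```shell
git switch newbranch        # work on the new branch
git fetch origin main       # now the big objects do get downloaded
git merge origin/main       # the merge brings the dataset into newbranch
```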
Continue working with the new branch from here on: you can even just delete the old main branch entirely now. The new branch is now the main branch.
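If you do delete the old name, a sketch (the plain `-d` delete works here because `main` is now fully merged into `newbranch`; `origin` is an assumed remote name):

```shell
git branch -d main                # remove your local main, if you have one
git push origin --delete main     # remove the name from the server, too
```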
Pictorially, what we're doing is this. We start with:
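Drawing commits left to right, newer commits toward the right (each uppercase letter is a commit; `G` here just stands for some earlier commit, while `H`, `I`, and `J` are the commits discussed next):

```
...--G--H--I--J   <-- main
```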
where commit `H` is the one with the hash we care about: the one just before the big data are added in commit `I`. Commit `J` may or may not exist: if it does not, commit `I` is the last commit on the main branch. Commits `I` and `J` both have the large files in them, so if anyone goes to fetch commit `J`, or commit `I`, they will get all the data. So we add a new name that points to commit `H`:
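With a new name `newbranch` pointing at commit `H`, the picture becomes something like:

```
...--G--H   <-- newbranch
         \
          I--J   <-- main
```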
Now we grab any updates from `I` and/or `J` that we'd like and use those to update some files and make a new commit-snapshot `K`:
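The new commit `K` grows on the `newbranch` line, something like:

```
          K   <-- newbranch
         /
...--G--H
         \
          I--J   <-- main
```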
Commit `K` does not have the big files. By avoiding fetching `I` and `J`, nobody has to wait for the big files.

Any additional work adds more commits:
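For instance, after one more commit `L`:

```
          K--L   <-- newbranch
         /
...--G--H
         \
          I--J   <-- main
```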
which is fine, and eventually we bring the big files in with `git merge`:
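The merge commit (call it `M`) joins the two lines, roughly:

```
          K--L---M   <-- newbranch
         /      /
...--G--H      /
         \    /
          I--J   <-- main
```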
so that commit `M` has the big files. We now delete the name `main` as it's no longer useful to anyone. New commits get added to `newbranch` as usual; the big data files arrived through commit `I`; and there either were no merge conflicts at `M`, or if there were, you solved them by taking the appropriate files from commit `L`; nobody else had to solve anything.

---

¹There is a new feature in very modern versions of Git that would allow partially fetching a single commit. But using this is tricky, and not the right way to do what you want.