git pull excluding some file?


I have a public repository for my students in which I pushed a large dataset and some corrections. The thing is, we have a storm here at the moment and the connection is really poor. Moreover, the students already downloaded the datasets on their own for the purpose of the exercises.

My question, for which I can't figure out an easy solution: is there a way for them to pull the repo without the dataset (just for now), and then, once the connection is back, pull everything on their next pull?

I was thinking of git fetch + git merge of only the wanted files, but that's not exactly what I want, since ideally it would be a git fetch + git merge excluding one folder (the data folder).

I hope my issue is clear enough and that it has an easy solution! Thanks for your help.


There is 1 answer.

torek (best answer)

Pull is just fetch + merge (or fetch + rebase); it's the fetch step that brings in new Git objects.

Fetch works on a commit-by-commit basis. If a wanted commit contains the large objects, your students will either get the whole commit, with those objects, or not get the commit at all.¹ The trick, then, is to create some new and different commits that are easier to get and provide just what your students need.

To do that:

  • Find the commit before the one that has the large dataset added. This commit has some hash ID, $hash.

  • Create a new branch name pointing to this commit:

     git branch newbranch $hash
    

    Make new commits from here as needed, e.g., to add corrections to files but without adding the large dataset.

  • Have your students fetch just this branch:

     git fetch origin newbranch
    

    and then check out this new branch and work there.

  • Later, when things are good, merge the main branch with the large data into the new branch:

     git checkout newbranch
     git merge main
    

    Continue working with the new branch from here on: you can even just delete the old main branch entirely now. The new branch is now the main branch.
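The steps above can be sketched end to end in a throwaway repository. The branch names (`main`, `newbranch`), file names, and commit messages below are illustrative, and both the instructor and student sides are collapsed into one local repo for the demo:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email instructor@example.com
git config user.name Instructor

# Commit H: the last commit before the large dataset.
echo base > notes.txt
git add notes.txt
git commit -qm "H: corrections only"
git branch -M main
hash=$(git rev-parse HEAD)

# Commit I: the large dataset lands on main (dataset.bin stands in for it).
dd if=/dev/zero of=dataset.bin bs=1024 count=64 2>/dev/null
git add dataset.bin
git commit -qm "I: add large dataset"

# Branch off the pre-dataset commit and put new corrections there (commit K).
git branch newbranch "$hash"
git checkout -q newbranch
echo correction >> notes.txt
git commit -qam "K: more corrections, still no dataset"

# Later, when the connection is back: merge the dataset in (merge commit M).
git merge -q --no-edit main
ls dataset.bin    # the big file is now present on newbranch
```

On the student side, the equivalent of the middle step is simply `git fetch origin newbranch` followed by `git checkout newbranch`; only the small commits H and K travel over the wire.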

Pictorially, what we're doing is this. We start with:

...--F--G--H--I--J   <-- main

where commit H is the one with the hash we care about: the commit just before the big data is added in commit I. Commit J may or may not exist; if it does not, commit I is the last commit on the main branch. Commits I and J both have the large files in them, so anyone who fetches commit I or commit J will get all the data. So we add a new name that points to commit H:

             I--J   <-- main
            /
...--F--G--H   <-- newbranch

Now we grab any updates from I and/or J that we'd like and use those to update some files and make a new commit-snapshot K:

             I--J   <-- main
            /
...--F--G--H------K   <-- newbranch

Commit K does not have the big files. By avoiding fetching I and J, nobody has to wait for the big files.

Any additional work adds more commits:

             I--J   <-- main
            /
...--F--G--H--K--L   <-- newbranch

which is fine, and eventually we bring the big files in with git merge:

             I----J   <-- main
            /      \
...--F--G--H--K--L--M   <-- newbranch

so that commit M has the big files. We now delete the name main as it's no longer useful to anyone:

             I----J
            /      \
...--F--G--H--K--L--M   <-- newbranch

New commits get added to newbranch as usual; the big data files arrived through commit I; and there either were no merge conflicts at M, or if there were, you solved them by taking the appropriate files from commit L; nobody else had to solve anything.


¹There is a new feature in very modern versions of Git that allows partially fetching a single commit. But using it is tricky, and it is not the right way to do what you want here.
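For reference, the feature the footnote alludes to is "partial clone" (Git 2.19 and later). A minimal local sketch, assuming `uploadpack.allowFilter` is enabled on the serving repository so that the filter is honored:

```shell
set -e
src=$(mktemp -d)
cd "$src"
git init -q
git config user.email demo@example.com
git config user.name Demo

# A repo with one big(ish) file.
echo data > big.bin
git add big.bin
git commit -qm "add big file"

# Let clients request filtered (partial) fetches from this repo.
git config uploadpack.allowFilter true

# Blobless partial clone: commits and trees transfer now, file contents
# are fetched lazily on demand (here suppressed with --no-checkout).
git clone -q --no-checkout --filter=blob:none "file://$src" "$src-partial"
```

Note that the clone then depends on the original remote staying reachable to fill in the missing blobs later, which is part of why the answer steers you toward ordinary branches instead.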