Troubleshooting getting rid of large files in a git repository on GitHub


I have a project called geoplot that does geospatial plotting in Python. The code for it is distributed via git on GitHub. You can check it out here.

As part of the development process for this package, I uploaded and stored in the geoplot repo a folder called data/, which contained a large number of data files in various formats. These data files were used to populate the examples in the complementary example gallery.

However, these files bloat the overall repository size to ~150 MiB (issue). This is clearly far too much, and it's time for me to get rid of them.

The problem is that I need to not just remove these files from the current HEAD; I also need to scrub them out of the entire git history. I tried a manual approach using git rebase, which didn't work. Then I tried the BFG Repo-Cleaner tool, as recommended in the canonical SO question on the matter.

BFG rid me of the files all right; they no longer exist anywhere in the history. However, the size of the repo (as seen when running git clone https://github.com/ResidentMario/geoplot.git) did not go down at all!

Here is what I tried (minus printouts):

java -jar ../bfg-1.12.15.jar --delete-folders "data" .
git reflog expire --expire=now --all && git gc --prune=now --aggressive
git push --set-upstream https://github.com/ResidentMario/geoplot.git master --force

The full printout is in an issue on GitHub.

What, if anything, did I do wrong? How do I diagnose the source of and expunge this wasted space?


There are 2 answers

5
VonC (best answer)

I mentioned reflog and gc back in 2010, along with removing old objects.
(Note: gc should be followed by a repack.)

First, clone your repo again and check whether the fresh clone is still the same size.
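
That check can be scripted. Below is a minimal sketch, assuming git is on your PATH; fresh_clone_size is a hypothetical helper name, not a git command. It makes a bare clone (branches and tags only, no working tree) into a temporary directory and reports the packfile size, so leftovers in your local repo such as reflogs and unreachable objects cannot skew the number.

```shell
# Hypothetical helper: report the packed size of a fresh bare clone.
fresh_clone_size() {
    url="$1"
    tmp="$(mktemp -d)" || return 1
    # A bare clone copies the remote's branches and tags, without a checkout
    git clone --quiet --bare "$url" "$tmp/repo.git" || { rm -rf "$tmp"; return 1; }
    # size-pack is the on-disk size of the packfiles, in KiB
    git -C "$tmp/repo.git" count-objects -v | awk '/^size-pack:/ {print $2 " KiB"}'
    rm -rf "$tmp"
}

# Example call, using the URL from the question:
# fresh_clone_size https://github.com/ResidentMario/geoplot.git
```

If the number is still large, some branch or tag on the remote is probably still pointing at the old history.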

As the OP Aleksey Bilogur mentions in the comments:

  • you need to make sure your tags are not referencing the old data, and then you need to force-push all the tags and branches as well (not just master)

    git push --tags origin --force
    
  • generated data must be removed from the repo history.
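
To see which refs are still holding on to the old data before force-pushing, something like the following may help. It is only a sketch: refs_touching_path is a hypothetical helper name, and it assumes every branch and tag points at a commit.

```shell
# Hypothetical helper: list each local branch and tag whose history
# still contains a commit touching the given path.
refs_touching_path() {
    target="$1"
    git for-each-ref --format='%(refname)' refs/heads refs/tags |
    while read -r ref; do
        # rev-list prints a commit only if some commit reachable from
        # $ref added, changed, or removed something under $target
        if [ -n "$(git rev-list -1 "$ref" -- "$target")" ]; then
            printf '%s\n' "$ref"
        fi
    done
}

# Example call from the repo root:
# refs_touching_path data

# Once the list is empty, overwrite every branch and tag on the remote.
# These are two separate pushes because --all and --tags cannot be combined:
# git push origin --force --all
# git push origin --force --tags
```

Any ref the helper prints has to be rewritten or deleted first; otherwise the force-push uploads the old objects again.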

0
Zach Olivare

This sounds like an issue that could be solved without external tools, by using git filter-branch.

If you want to remove all history of the data directory, you can run the following from the root of your repo.

git filter-branch --index-filter 'git rm --cached --ignore-unmatch -r path/to/data' HEAD

That rewrites every commit in the ancestry of your current HEAD. You would then have to point all other branches and tags at the newly created commits to completely remove the baggage from your repo.
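
As a sketch of how the branches and tags could be handled in the same pass, filter-branch can be told to rewrite every ref instead of only HEAD. The helper name purge_dir_from_history is hypothetical, and the directory name must not contain a single quote. Run it from the root of a clean working tree, on a backup copy first, since it rewrites history irreversibly.

```shell
# Hypothetical helper: strip a directory from every branch and tag,
# then drop the backups and unreachable objects left behind.
purge_dir_from_history() {
    dir="$1"
    # --tag-name-filter cat re-points each tag at its rewritten commit;
    # "-- --all" rewrites every ref rather than only HEAD
    git filter-branch --force \
        --index-filter "git rm -r --cached --ignore-unmatch --quiet -- '$dir'" \
        --tag-name-filter cat -- --all || return 1
    # filter-branch keeps the old commits alive under refs/original/
    git for-each-ref --format='%(refname)' refs/original |
    while read -r ref; do git update-ref -d "$ref"; done
    # Expire reflogs and prune so the old objects actually leave the repo
    git reflog expire --expire=now --all
    git gc --quiet --prune=now --aggressive
}

# Example call from the repo root:
# purge_dir_from_history path/to/data
```

Recent git versions print a warning before filter-branch starts; the rewrite still runs. The result then needs to be force-pushed for every branch and tag, not just master.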