How to modify a blob considering both its file path and data with git filter-repo?

In the question "How to use git filter-repo as a library with the Python module interface?", I managed to modify the blobs of older commits, for refactoring purposes, with something like:

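import git_filter_repo

def blob_callback(blob, callback_metadata):
    # blob.data is bytes, so the patterns must be bytes as well.
    blob.data = blob.data.replace(b'd1', b'asdf')

# args is a FilteringOptions object, built as in the full script below.
git_filter_repo.RepoFilter(
    args,
    blob_callback=blob_callback
).run()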

But I could not find the path of the blob, which would be useful to have, notably to determine the file type from the file extension and adapt the data modifications accordingly.

If that is not possible with blob_callback, I would expect commit_callback to allow it, so I tried something like:

#!/usr/bin/env python

# https://stackoverflow.com/questions/64160917/how-to-use-git-filter-repo-as-a-library-with-the-python-module-interface/64160918#64160918

import git_filter_repo

def blob_callback(blob, callback_metadata):
    blob.data = blob.data.replace(b'd1', b'asdf')

def commit_callback(commit, callback_metadata):
    for file_change in commit.file_changes:
        print(commit)
        print(file_change)
        print(file_change.filename)
        print(file_change.blob_id)
        print(callback_metadata)
        print()

# Args deduced from:
# print(git_filter_repo.FilteringOptions.parse_args(['--refs', 'HEAD', '--force'], error_on_empty=False))
args = git_filter_repo.FilteringOptions.default_options()
args.force = True
args.partial = True
args.refs = ['HEAD']
args.repack = False
args.replace_refs = 'update-no-add'

git_filter_repo.RepoFilter(
    args,
    # blob_callback=blob_callback,
    commit_callback=commit_callback
).run()

This time, I did manage to get the blob path with print(file_change.filename), but not the blob data.

I have that blob_id, but I don't know how to use it.

I guess that I could do it in two passes: a first pass with a commit callback that builds a map from blob IDs to paths, and a second pass with a blob callback that uses that map. But that feels a bit ugly.

Is there a better way to have access to both, e.g. some fields of commit_callback arguments that I missed?

Ping on issue tracker: https://github.com/newren/git-filter-repo/issues/158

Tested in git filter-repo ac039ecc095d.

There is 1 answer

Answer by Ciro Santilli OurBigBook.com

Elijah, the filter-repo project lead, replied at https://github.com/newren/git-filter-repo/issues/158#issuecomment-702962073 and explained that this is not possible without "hacks".

He pointed me to this in-tree example: https://github.com/newren/git-filter-repo/blob/7b3e714b94a6e5b9f478cb981c7f560ef3f36506/contrib/filter-repo-demos/lint-history#L152 which does it with a commit callback that calls git cat-file.

The underlying problem is that a blob may have been sent on the git fast-export stream much earlier and only be referenced by ID later on, for example by a second commit that adds an identical blob. Keeping every blob in memory to handle this would, in general, exhaust memory on large repositories.
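
For reference, here is a minimal, untested sketch of that approach, loosely adapted from the linked lint-history demo. It assumes that, when no blob_callback is passed, file_change.blob_id holds the original object hash that git cat-file can resolve, and it uses Blob, RepoFilter.insert, change.type and blob.id the way that demo appears to use them. The names transform_for_path, read_object and run_filter are hypothetical helpers made up for this illustration, and the ".py only" rule is just an example policy.

```python
import subprocess


def transform_for_path(path: bytes, data: bytes) -> bytes:
    """Hypothetical path-aware edit: only rewrite Python files."""
    if path.endswith(b'.py'):
        return data.replace(b'd1', b'asdf')
    return data


def read_object(cat_file, blob_id: bytes) -> bytes:
    """Fetch one object's contents from a running `git cat-file --batch`."""
    cat_file.stdin.write(blob_id + b'\n')
    cat_file.stdin.flush()
    # Reply: b'<hash> <type> <size>\n', then <size> bytes, then b'\n'.
    # A b'<id> missing' reply has only two fields and raises ValueError here.
    _hash, _type, size = cat_file.stdout.readline().split()
    return cat_file.stdout.read(int(size) + 1)[:-1]


def run_filter(args):
    """Rewrite history; `args` is assumed to be built as in the question."""
    import git_filter_repo as fr  # lazy import: the helpers above work without it

    cat_file = subprocess.Popen(['git', 'cat-file', '--batch'],
                                stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    rewritten = {}  # (old blob id, path) -> blob id to use instead

    def commit_callback(commit, metadata):
        for change in commit.file_changes:
            if change.type == b'D':
                continue  # deletions carry no blob
            key = (change.blob_id, change.filename)
            if key not in rewritten:
                old = read_object(cat_file, change.blob_id)
                new = transform_for_path(change.filename, old)
                if new == old:
                    rewritten[key] = change.blob_id  # keep the original blob
                else:
                    blob = fr.Blob(new)
                    repo_filter.insert(blob)  # emit the new blob into the stream
                    rewritten[key] = blob.id
            change.blob_id = rewritten[key]

    # No blob_callback here: the sketch relies on blob IDs being object hashes.
    repo_filter = fr.RepoFilter(args, commit_callback=commit_callback)
    repo_filter.run()
    cat_file.stdin.close()
    cat_file.wait()
```

Calling run_filter(args) from the repository root, with args set up as in the question, should then apply the path-dependent replacement. The cache is keyed on both blob ID and path because the same blob can appear under paths with different extensions, and reading blobs on demand through git cat-file avoids holding every blob in memory.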