git-filter-branch to remove strings, but where strings contain $ ' \ and other characters


I'm trying to rewrite history, using:

git filter-branch --tree-filter 'git ls-files -z "*.php" |xargs -0 perl -p -i -e "s#(PASSWORD1|PASSWORD2|PASSWORD3)#xXxXxXxXxXx#g"' -- --all

as described in this tutorial.

However, my password strings contain all kinds of characters outside A-Z, e.g. $ ' and \, unlike the nice simple 'PASSWORD1'-type strings in the example above.

Can someone explain what escaping I need? I've not been able to find this anywhere, and I've been battling with this for hours.


There are 4 answers

10
Roberto Tyley On BEST ANSWER

try the BFG instead of git filter-branch...

You can use a much more friendly substitution format if you use The BFG rather than git-filter-branch. Create a passwords.txt file, with one password per line like this:

PASSWORD1==>xXxXx      # Replace literal string 'PASSWORD1' with 'xXxXx'
ezxcdf\fr$sdd%==>xXxXx # ...all text is matched as a *literal* string by default

Then run the BFG with this command:

$ java -jar bfg.jar -fi '*.php' --replace-text passwords.txt  my-repo.git

Your entire repository history will be scanned, and all .php files (under 1MB in size) will have the substitutions performed: any matching string (that isn't in your latest commit) will be replaced.

...no escaping needed

Note that the only parsing the BFG does with the substitution file here is to split each line on the '==>' string, which probably isn't in your passwords. All other text is interpreted literally by default.

If you want to be even more concise, you can drop the '==>' and everything that comes after it on each line (i.e. just have a file of passwords), and The BFG will replace each password with the string '***REMOVED***' by default.
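One way to create passwords.txt from the shell without the shell itself altering the passwords is a quoted heredoc. This is only a minimal sketch; the two entries are the placeholder values from the example above, not real data.

```shell
# Quoting the delimiter ('EOF') turns off $, backtick and backslash expansion,
# so each line is written to the file exactly as typed.
cat > passwords.txt <<'EOF'
PASSWORD1==>xXxXx
ezxcdf\fr$sdd%==>xXxXx
EOF

# Print the file to confirm that nothing was expanded.
cat passwords.txt
```

With an unquoted delimiter (plain EOF), the shell would try to expand $sdd before the text reached the file.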

The BFG is typically hundreds of times faster than running git-filter-branch on a big repo and the options are tailored around these two common use-cases:

  • Removing Crazy Big Files
  • Removing Passwords, Credentials & other Private data

Full disclosure: I'm the author of the BFG Repo-Cleaner.

0
ikegami On

Build it from the inside out. Say the password is

a$b'c\d

The regex pattern would be

a\$b'c\\d

One possibility for the perl command would be

perl -i -pe's/a\$b'\''c\\d/.../g'

(Note how each ' was replaced with '\''.)

Now you need to include that in single quotes, so you repeat the process.

... '... perl -i -pe'\''s/a\$b'\''\'\'''\''c\\d/.../g'\''' ...
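The same inside-out rule can be applied mechanically. The sketch below assumes bash; sq_quote is a hypothetical helper name used only for illustration, not something provided by git or Perl. It wraps a string in single quotes and rewrites each embedded ' as '\'', so calling it twice produces the two layers of quoting.

```shell
#!/bin/bash

# sq_quote (hypothetical helper): single-quote a string for the shell.
sq_quote() {
    local s=$1
    s=${s//\'/\'\\\'\'}     # every ' becomes '\''
    printf "'%s'" "$s"
}

perl_code='s/a\$b'\''c\\d/.../g'       # the Perl code itself: s/a\$b'c\\d/.../g

once=$(sq_quote "$perl_code")          # quoted for one shell
twice=$(sq_quote "perl -i -pe$once")   # quoted again for the outer single quotes

printf '%s\n' "$once" "$twice"
```

Feeding either result back through eval should return the string that went in, which is a quick way to check a hand-written version.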
0
fooquency On

Building on the brilliant help given by konsolebox which really helped me solve this, the solution I ended up using in terms of doing it via the shell was:

Define the strings in a file, strings.txt

string1
another$string
yet! @nother string
some more stuff to re\move

Create a Perl script, perl-escape-strings.pl, which escapes the strings and emits one substitution per string, where xXxXxXxXxXx is the string they will all be replaced with:

#!/usr/bin/perl

use strict;
use warnings;

while (<>)
{
        chomp;
        # Skip blank lines: an empty pattern would reuse the last successful match
        next unless length;
        my $passwd = quotemeta($_);
        print qq|s/$passwd/xXxXxXxXxXx/g;\n|;
}

exit 0;
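To preview what the script emits, the same quotemeta call can be run as a one-liner on a sample string. This is a sketch that assumes perl is on the PATH; the sample is one of the placeholder strings above.

```shell
# Feed one sample string through quotemeta and build the resulting s/// line.
line=$(printf '%s\n' 'another$string' |
    perl -ne 'chomp; print "s/", quotemeta($_), "/xXxXxXxXxXx/g;"')

echo "$line"
# prints: s/another\$string/xXxXxXxXxXx/g;
```

quotemeta puts a backslash in front of every character that is not a letter, digit or underscore, so spaces, @, $ and \ all come out escaped.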

Bash script:

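# Pre-process the strings
./perl-escape-strings.pl strings.txt > strings-perl-escaped.txt

# Remember the absolute path: filter-branch runs the tree filter from a
# temporary checkout directory, so a relative path would not resolve there
SCRIPT="$PWD/strings-perl-escaped.txt"

# Change directory to the repo
cd repo/

# Define the filter command (perl runs the generated file as its program)
FILTER="git ls-files -z '*.html' '*.php' | xargs -0 perl -p -i '$SCRIPT'"

# Run the filter
git filter-branch --tree-filter "$FILTER" -- --all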

However, because the number of strings is large and my repository is large, with many thousands of commits, the filter-branch method is taking a long time. So I'm also going to try The BFG, mentioned in another answer, in parallel to see whether it completes more quickly.

30
konsolebox On

Using a wrapper script:

#!/bin/bash

readarray -t PASSWORDS < list_file

REPLACEMENT='xXxXxXxXxXx'

# Escape each password for the Perl regex *before* joining, so that the
# grouping parentheses and the '|' separators keep their special meaning.
# quotemeta backslash-escapes every non-word character ($ @ ' \ # | ...).
EXPR=$(printf '%s\n' "${PASSWORDS[@]}" |
    perl -ne 'chomp; push @P, quotemeta($_) if length($_); END { print join "|", @P }')
[[ -n $EXPR ]] || { echo "No passwords found in list_file" >&2; exit 1; }
EXPR="s#(${EXPR})#${REPLACEMENT}#g"

# Export the expression and let the shell that evaluates the filter expand
# it at run time, so no shell-level escaping of the passwords is needed.
export EXPR
FILTER="git ls-files -z '*.php' | xargs -0 perl -p -i -e \"\$EXPR\""

echo "Number of passwords: ${#PASSWORDS[@]}"    
echo "Passwords:" "${PASSWORDS[@]}"
echo "EXPR: $EXPR"
echo "FILTER: $FILTER"

git filter-branch --tree-filter "$FILTER" -- --all
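Before rewriting history, it can help to try a generated expression on a scratch file first. The following is a minimal sketch, assuming perl and mktemp are available; the password and the file contents are made up for illustration.

```shell
#!/bin/bash

password='a$b'\''c\d'                      # sample password: a$b'c\d
scratch=$(mktemp)
printf 'define("DB_PASS", "%s");\n' "$password" > "$scratch"

# Escape the password for the regex, then build the substitution around it.
escaped=$(printf '%s' "$password" | perl -ne 'print quotemeta($_)')
expr="s#${escaped}#xXxXxXxXxXx#g"

# Same perl invocation as in the filter, but on the scratch file only.
perl -p -i -e "$expr" "$scratch"

result=$(cat "$scratch")
echo "$result"
# prints: define("DB_PASS", "xXxXxXxXxXx");
rm -f "$scratch"
```

If the output still shows the original password, the escaping is wrong and is cheaper to fix here than after a long filter-branch run.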