How to use "git grep" to search for 8-bit encoded text in files with the same encoding


I have a project whose files use an 8-bit encoding (Win-1251). Is there a way to use git grep to find a phrase made up of characters from the upper half of the code page (i.e. with byte values from 0x80 to 0xFF)?

I work on Windows and use the console to run git. It seems that the text I pass to git grep (for example, git grep "привет") is treated as a sequence of UTF-8 characters, i.e. git grep actually tries to find the byte sequence "\xD0\xBF\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82".

I also tried searching with git grep "\xEF\xF0\xE8\xE2\xE5\xF2" (where the byte sequence in quotes is the Win-1251 encoding of the word "привет"), but it turned out that git grep does not interpret these escape sequences by default.


There are 2 answers

LeGEC

Try the -P (--perl-regexp) flag: Perl-compatible regular expressions should understand escape sequences such as \xEF.


Alternatively, write your search patterns to a file and tell git grep to read them from it: git grep -f patterns.txt ...

The bonus of a file is that you can more easily control the encoding of its content.
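
One way to get exact bytes into the pattern file, sketched here under assumptions, is to write them with printf octal escapes instead of relying on an editor's encoding settings. The helper name write_privet_pattern and the default file name are illustrative; the -F (fixed string) option and LC_ALL=C in the usage comment are additions of mine, chosen so the bytes are compared literally.

```shell
# Hypothetical helper: write the Windows-1251 bytes of "привет" to a pattern file.
# printf octal escapes are POSIX: \357\360\350\342\345\362 is EF F0 E8 E2 E5 F2 in hex.
write_privet_pattern() {
    printf '\357\360\350\342\345\362\n' > "${1:-patterns.txt}"
}

# Usage sketch: build the file, then search with it.
#   write_privet_pattern patterns.txt
#   LC_ALL=C git grep -F -f patterns.txt
```

Since git grep only searches tracked files by default, an untracked patterns.txt should not match itself.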

You can also use this feature to build a script that takes a UTF-8 string and encodes it as Win-1251 before feeding it to git grep:

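pattern=$1
shift
# Write the pattern, re-encoded from UTF-8 to Windows-1251, to a temporary file
tmpfile=$(mktemp)
trap 'rm -f "$tmpfile"' EXIT
printf '%s\n' "$pattern" | iconv -f UTF-8 -t WINDOWS-1251 > "$tmpfile"
git grep -f "$tmpfile" "$@"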

Omer Tuchfeld

Inspired by this gist and @LeGEC's answer, you can do something like this:

git grep -P "$(printf '%s' 'привет' | iconv -f utf-8 -t 'Windows-1251' | od -An -v -tx1 | tr -d ' \n' | sed 's/../\\x&/g')"

You can put this in a bash function:

function gitbingrep {
    # Re-encode $1 as Windows-1251 and spell each byte as a \xHH escape for -P
    git grep -P "$(printf '%s' "$1" | iconv -f utf-8 -t 'Windows-1251' | od -An -v -tx1 | tr -d ' \n' | sed 's/../\\x&/g')"
}

And now you can simply run gitbingrep привет
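
Before searching, it can help to look at the pattern that will actually reach git grep. The sketch below uses a hypothetical helper, cp1251_hex_pattern, which only prints the escaped pattern and does not run git; it assumes the argument arrives as UTF-8 and that iconv and od are available.

```shell
# Hypothetical helper: print a UTF-8 string as \xHH escapes of its Windows-1251 bytes.
cp1251_hex_pattern() {
    # 1. re-encode the argument; iconv fails on characters outside Windows-1251
    # 2. dump the bytes as hex with no offsets (-An) and no "*" line folding (-v)
    # 3. join the dump into one run of hex digits
    # 4. prefix every two-digit byte with \x
    printf '%s' "$1" |
        iconv -f UTF-8 -t WINDOWS-1251 |
        od -An -v -tx1 |
        tr -d ' \n' |
        sed 's/../\\x&/g'
}

cp1251_hex_pattern 'привет'   # prints \xef\xf0\xe8\xe2\xe5\xf2
```

The printed string matches the byte sequence given in the question and can be passed to git grep -P as is.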