Python: Bytes to string with accented characters


Git reads the file name "ùàèòùèòùùè.txt" as a plain string of bytes, so when I ask git for a list of committed files, I get the following string:

r"\303\271\303\240\303\250\303\262\303\271\303\250\303\262\303\271\303\271\303\250.txt"

How can I use Python 2 to convert it back to "ùàèòùèòùùè.txt"?


There is 1 answer

Answer by Martijn Pieters (best answer)

If the git output contains literal \ddd octal escape sequences (so up to 4 characters per filename byte), you can use the string_escape (Python 2) or unicode_escape (Python 3) codec to have Python interpret them.

The result is a UTF-8 encoded byte string; my terminal is set to interpret UTF-8, so printing it shows the original name:

>>> git_data = r"\303\271\303\240\303\250\303\262\303\271\303\250\303\262\303\271\303\271\303\250.txt"
>>> git_data.decode('string_escape')
'\xc3\xb9\xc3\xa0\xc3\xa8\xc3\xb2\xc3\xb9\xc3\xa8\xc3\xb2\xc3\xb9\xc3\xb9\xc3\xa8.txt'
>>> print git_data.decode('string_escape')
ùàèòùèòùùè.txt

Decode that as UTF-8 to get text (a unicode object):

>>> git_data.decode('string_escape').decode('utf8')
u'\xf9\xe0\xe8\xf2\xf9\xe8\xf2\xf9\xf9\xe8.txt'
>>> print git_data.decode('string_escape').decode('utf8')
ùàèòùèòùùè.txt

In Python 3, the unicode_escape codec gives you (Unicode) text, so an extra encode to Latin-1 is required to make it bytes again. Latin-1 works here because it maps code points 0-255 one-to-one onto byte values:

>>> git_data = rb"\303\271\303\240\303\250\303\262\303\271\303\250\303\262\303\271\303\271\303\250.txt"
>>> git_data.decode('unicode_escape').encode('latin1').decode('utf8')
'ùàèòùèòùùè.txt'

Note that git_data is a bytes object before decoding.
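
The Python 3 steps above can be wrapped in a small helper. This is only a sketch: the function name unquote_git_path is made up for illustration, and it assumes the input is a bytes object holding \ddd octal escapes of UTF-8 data, with no surrounding quotes and no \u or \N escapes (those would produce code points that Latin-1 cannot encode).

```python
def unquote_git_path(raw: bytes) -> str:
    """Turn backslash-octal escaped bytes into text (hypothetical helper).

    Assumes the escapes encode UTF-8 bytes, as in the question.
    """
    # Step 1: interpret the escape sequences; each \ddd becomes one
    # character whose code point equals the byte value (0-255).
    escaped_as_text = raw.decode("unicode_escape")
    # Step 2: Latin-1 maps code points 0-255 straight back to bytes.
    utf8_bytes = escaped_as_text.encode("latin1")
    # Step 3: decode the recovered bytes as UTF-8 to get the real text.
    return utf8_bytes.decode("utf8")


git_data = rb"\303\271\303\240\303\250\303\262\303\271\303\250\303\262\303\271\303\271\303\250.txt"
print(unquote_git_path(git_data))
```

If the recovered bytes turn out not to be valid UTF-8, the last step raises UnicodeDecodeError, so callers that cannot trust the file name encoding may want to catch that.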