Recovering filenames with bad encoding


I've been struggling with this problem for a while, but working with encodings is so painful that I have to turn to your smarter minds for some help.

On a trip to Ukraine a friend copied some Ukrainian-named files to my pen drive. However, as you might expect, in the process of copying them to my computer the filenames turned into unreadable rubbish, such as this:

Ôàíòîì

Well, I have strong reasons to believe that the original filenames were encoded in CP1251 (I know this because I manually checked the code tables and managed to translate the band's name correctly). What apparently happened is that, in the copying process, the CP1251 byte values were kept and the OS now simply interprets them as Unicode code points.
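This suspected failure can be reproduced end to end in Python 3 (a minimal sketch; the sample string is the one from the question): encoding the real name with cp1251 and then misreading the bytes as latin1 yields exactly the garbled filename, and reversing the two steps recovers it.

```python
original = 'Фантом'   # the real filename ("Phantom")

# What (apparently) happened: CP1251 bytes were reinterpreted as Latin-1.
garbled = original.encode('cp1251').decode('latin1')
print(garbled)        # Ôàíòîì

# The repair is the same round trip in reverse.
recovered = garbled.encode('latin1').decode('cp1251')
print(recovered)      # Фантом
```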

I tried to "interpret" the codes in Python with the following script:

print u"Ôàíòîì".decode('cp1251')

It doesn't feel right, though. The result is complete rubbish as well:

Ôàíòîì

If I do:

print repr(u"Ôàíòîì".decode('cp1251'))

I obtain:

u'\u0413\u201d\u0413\xa0\u0413\xad\u0413\u0406\u0413\xae\u0413\xac'

I found out that if I could take all the Unicode code points and just offset them by 0x350, I would land on the correct Ukrainian Cyrillic characters. But I don't know how to do that, and there is probably an answer that is more conceptually correct than this.

Any help would be greatly appreciated!

Edit: Here is an example of the correct translation

Ôàíòîì should translate to Фантом.

Ô 0x00D4 -> Ф 0x0424
à 0x00E0 -> а 0x0430
í 0x00ED -> н 0x043D
ò 0x00F2 -> т 0x0442
î 0x00EE -> о 0x043E
ì 0x00EC -> м 0x043C

As I stated before, there is a 0x0350 offset between the wrong and the correct code points.
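The constant 0x350 is no accident: for bytes 0xC0–0xFF, CP1251 places the Cyrillic letters at U+0410 onward, while Latin-1 simply maps each byte to U+00C0 onward, and 0x0410 − 0x00C0 = 0x0350. A quick Python 3 check against the sample from the table:

```python
# CP1251 maps byte 0xC0 to А (U+0410); Latin-1 maps byte 0xC0 to À (U+00C0).
offset = ord('А') - ord('À')
print(hex(offset))   # 0x350

# Shifting every code point of the garbled name by that offset
# reproduces the original Cyrillic name.
garbled = 'Ôàíòîì'
fixed = ''.join(chr(ord(c) + offset) for c in garbled)
print(fixed)         # Фантом
```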

(ok, the files are music files... I guess you suspected that...)

Some other test strings (whose translation I don't know): Áåç êîíò›îëfl Äâîº Êàï_òîøêà Ïîäèâèñü


There are 4 answers

0
Felipe Ferri On BEST ANSWER

I found out that, besides the filenames, all my files had incorrectly encoded metadata.

I found out that the ID3 metadata standard for mp3 files only supports the latin1, utf16 and utf8 encodings.

My files all contained CP1251 data that was labelled as latin1 in the mp3 tags. Probably in Russia and other Cyrillic-writing countries music players are configured to interpret latin1 tags as CP1251, which was not the case for me.

I used Python and mutagen to correct the metadata. When reading the mp3 metadata, mutagen assumed the data was encoded as latin1, producing garbled characters as a result. What I had to do was take those garbled characters, encode them back into latin1 AND decode the resulting bytes as CP1251, obtaining proper unicode. Then I overwrote the mp3 metadata, and mutagen saved the unicode as utf-8. With that, all the metadata was correct.

To correct the files metadata I used the following Python script:

import os

from mutagen.easyid3 import EasyID3

def decode_song_metadata(filename):
    id3 = EasyID3(filename)
    for key in id3.valid_keys:
        val = id3.get(key)
        if val:
            print key
            # The tag was read as latin1 but really holds CP1251 data:
            # re-encode to recover the raw bytes, then decode properly.
            decoded = val[0].encode('latin1').decode('cp1251')
            print decoded
            id3[key] = decoded
    id3.save()

def correct_metadata():
    paths = [u'/Users/felipe/Downloads/Songs']

    for path in paths:
        print 'path: ' + path
        for dirpath, dirnames, filenames in os.walk(path):
            for filename in filenames:
                try:
                    decode_song_metadata(os.path.join(dirpath, filename))
                except Exception:
                    print filename


if __name__ == '__main__':
    correct_metadata()

This corrected the mp3 metadata; correcting the filenames, however, required a different trick, because they had a different encoding problem. What I think happened is that the original filenames were in CP1251, but when they were copied from my FAT32-formatted usb-stick to my Mac, macOS interpreted the filenames as latin1. This produced filenames with weird accented characters, which macOS stored in UTF-16 in "Normal Form Decomposed", where each accent is kept as a separate unicode code point from the base letter. Also, a BOM ends up in front of the UTF-16 bytes and pollutes the result. So in order to correct this I had to do the reverse operation:

  • get the filename. This returns a unicode string with the latin accented characters in Normal Form Decomposed.
  • we convert it back to Normal Form Composed.
  • then we encode it as UTF-16.
  • we remove the BOM.
  • we decode the bytes, interpreting them as CP1251.

In order to decode the filenames then I used the following script:

import codecs
import unicodedata

def decode_filename(filename):
    # macOS stores filenames in Unicode in "Normal Form Decomposed",
    # where the accents are saved separately from the base character.
    # Because the original characters weren't really accented letters,
    # in order to recover them we first have to recompose the
    # filenames (NFC).
    # http://stackoverflow.com/a/16467505/212292
    norm_filename = unicodedata.normalize('NFC', filename)
    utf16 = norm_filename.encode('utf16')
    bom = codecs.BOM_UTF16

    if utf16.startswith(bom):
        # Python's utf16 codec prepends a BOM; strip it
        utf16 = utf16[len(bom):]

    cp1251 = utf16.decode('cp1251')
    return cp1251

This should be used with the unicode filenames returned by os.walk().
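The NFC/NFD distinction can be seen in isolation (a Python 3 sketch; the sample character is illustrative): in decomposed form the circumflex lives in its own combining code point, and normalize('NFC') fuses it back into the single Latin-1-range character that can then be re-encoded.

```python
import unicodedata

# 'Ô' as macOS stores it: base letter + combining circumflex (NFD).
decomposed = 'O\u0302'
print(len(decomposed))            # 2

composed = unicodedata.normalize('NFC', decomposed)
print(composed, len(composed))    # Ô 1

# Round trip: decomposing the composed form gives the two code points back.
print(unicodedata.normalize('NFD', '\u00d4') == decomposed)   # True
```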

Though the above script works, I ended up not using it to correct the filenames. I was using iTunes with the "Auto organizer" function enabled. This was great because every time I played a song, iTunes would take its mp3 metadata (which I had already corrected using the first script above) and rename the mp3 file, and even its folder, after the song. I find this better than correcting the filenames directly, because it also renames the folders correctly and produces filenames that actually match the songs.

0
Rolf of Saxony On
>>> a = u'Ôàíòîì'.encode('8859').decode('cp1251')   
>>> print a   
Фантом    

If you look at the individual characters in your samples, most of them come from Cyrillic, but you have others in there from Greek and Coptic and Latin Extended-B, and U+FE52 is a full stop from the back of beyond. So it's a bit of a mess.
EDIT:

a = u'Ôàíòîì'.encode('cp1252').decode('cp1251')
print a
Фантом
a = u'Äâîº Êàï_òîøêà'.encode('cp1252').decode('cp1251')
print a
Двоє Кап_тошка
a = u'Ïîäèâèñü'.encode('cp1252').decode('cp1251')
print a
Подивись
a = u'Áåç êîíò›îë'.encode('cp1252').decode('cp1251')
print a
Без конт›ол

cp1252 works for the given samples, except for Áåç êîíò›îëfl, where the Latin Small Ligature fl (U+FB02) appears to be superfluous.
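cp1252 and latin1 agree everywhere except bytes 0x80–0x9F, which latin1 maps to control characters and cp1252 to printable punctuation such as › (U+203A); that is why cp1252 also handles the › in Áåç êîíò›îë, while plain latin1 cannot encode it at all. The stray characters can be identified with the stdlib unicodedata module (a Python 3 sketch):

```python
import unicodedata

# Name the two odd characters from the samples.
for ch in '\u203a\ufb02':
    print('U+%04X' % ord(ch), unicodedata.name(ch))
# U+203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
# U+FB02 LATIN SMALL LIGATURE FL

# cp1252 can encode U+203A (as byte 0x9B); latin1 cannot.
print('\u203a'.encode('cp1252'))   # b'\x9b'
```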

1
user2386841 On

You can add this 0x350 offset like this:

Python 2:

>>> s = u'Ôàíòîì'
>>> decoded = u''.join([unichr(ord(c)+0x350) for c in s])
>>> print decoded
Фантом
1
dan04 On
>>> u'Ôàíòîì'.encode('latin1').decode('cp1251')
'Фантом'