I am having an issue with character encoding in Mutagen. I cast dict[key] to Unicode, but all I get are errors. The character in question is U+00E9 (é), but what it prints is ├⌐. I am assuming the default character set for Mutagen is UTF-8, but is there a way to fix this?
Output:
Winter Wonderland.mp3
Album       : Christmas
Album Artist: Michael Bubl├⌐
Artist      : Michael Bubl├⌐
Composer    : None
Disk        : None
Encoded By  : None
Genre       : Christmas
Title       : Winter Wonderland
Track       : 17/19
Year        : 2011
Code:
#!/usr/bin/env python
import os
import re
from mutagen.mp3 import MP3

first_cap_re = re.compile('(.)([A-Z][a-z]+)')
all_cap_re = re.compile('([a-z0-9])([A-Z])')

def convertCamelCase2Underscore(name):
    s1 = first_cap_re.sub(r'\1_\2', name)
    return all_cap_re.sub(r'\1_\2', s1).lower()

def convertCamelCase2CapitalizedWords(name):
    return ' '.join([x.capitalize() for x in convertCamelCase2Underscore(name).split('_')])

def safeValue(dict, key):
    return None if key not in dict else dict[key]

class Track:
    def __init__(self, path):
        audio = MP3(path)
        self.title = safeValue(audio, 'TIT2')
        self.artist = safeValue(audio, 'TPE1')
        self.albumArtist = safeValue(audio, 'TPE2')
        self.album = safeValue(audio, 'TALB')
        self.genre = safeValue(audio, 'TCON')
        self.year = safeValue(audio, 'TDRL')
        self.encodedBy = safeValue(audio, 'TENC')
        self.composer = safeValue(audio, 'TXXX:TCM')
        self.track = safeValue(audio, 'TRCK')
        self.disk = safeValue(audio, 'TXXX:TPA')

    def __repr__(self):
        ret = ''
        fields = self.__dict__
        for k, v in sorted(self.__dict__.iteritems()):
            ret += '{:12s}: {:s}\n'.format(convertCamelCase2CapitalizedWords(k), v)
        return ret

files = os.listdir('.')
for filename in files:
    print filename
    print Track(filename)
Mutagen returns Unicode strings, though wrapped in a TextFrame object. When you print that object, there is an implicit str() conversion of the text property to bytes, and Mutagen (arbitrarily) chooses UTF-8 for that encoding.
Unfortunately the Windows console doesn't support UTF-8[1]. The encoding it uses varies, but in your case you are getting the US DOS code page 437, where the byte sequence 0xC3 0xA9 represents ├⌐ and not é.
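
The mismatch described above can be reproduced without Mutagen or a Windows console. This is only a minimal sketch using Python's built-in codecs; the variable names are made up for illustration, and the sample string stands in for the value of the artist frame.

```python
# Stand-in for the Unicode text stored in the artist frame.
text = u'Michael Bubl\u00e9'

# Bytes produced when the text is encoded as UTF-8.
utf8_bytes = text.encode('utf-8')

# What a console that interprets those bytes as code page 437 will display.
as_cp437_console_shows = utf8_bytes.decode('cp437')

# The single byte that code page 437 itself uses for the same character.
cp437_bytes = text.encode('cp437')
```

Here `utf8_bytes` ends in the two bytes 0xC3 0xA9, which code page 437 renders as two separate glyphs, while `cp437_bytes` ends in the single byte 0x82.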
You could try to print to the console in the encoding that it wants by explicitly encoding to it (for example, to whatever sys.stdout.encoding reports), but this will still only allow you to print characters that are supported in that code page. 437 is OK for Michael Bublé, but not so good for 東京事変. There isn't a good way to get Unicode out to the Windows console.[2]
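
A rough sketch of that explicit encoding step follows. The helper name `encode_for_console` is hypothetical, and the choice of the 'replace' error handler is an assumption made here so that unsupported characters degrade to '?' instead of raising UnicodeEncodeError.

```python
# -*- coding: utf-8 -*-
import sys


def encode_for_console(text, encoding=None):
    """Encode Unicode text to the console's encoding (hypothetical helper).

    Falls back to ASCII when the stream reports no encoding, which can
    happen when output is redirected.
    """
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'ascii'
    # 'replace' substitutes '?' for anything the code page cannot represent.
    return text.encode(encoding, 'replace')


# Code page 437 has a byte for e-acute, so this survives intact.
supported = encode_for_console(u'Michael Bublé', 'cp437')

# None of these characters exist in code page 437; each becomes '?'.
unsupported = encode_for_console(u'東京事変', 'cp437')
```

In the Python 2 script from the question, the result could be passed straight to print, since printing a byte string there writes the bytes unchanged. The Unicode text itself would presumably come from the frame's text property, for example the first entry of `audio['TPE1'].text`.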
[1] There is code page 65001 which is supposed to be UTF-8, but there are bugs in the MS implementation which usually make it unusable.
[2] You can, if you must, call the Win32 API WriteConsoleW directly using ctypes, but then you have to take care to only do that when you are connected to a Windows console and not any other type of stream so you don't break everywhere else. It's usually not worth it; Windows users are assumed to be used to a console where non-ASCII characters just break all the time.
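
For illustration, the approach in [2] might look roughly like the sketch below. It is an assumption-laden outline rather than a tested implementation: the function name `write_console` is made up, the isatty() check is only an approximation of "connected to a Windows console", and error handling is reduced to returning the API's success flag.

```python
import sys

STD_OUTPUT_HANDLE = -11  # Win32 constant identifying the standard output handle


def write_console(text):
    """Write Unicode text, using WriteConsoleW only on a Windows console.

    Returns True only when WriteConsoleW was used and reported success;
    returns False when the text went through sys.stdout instead.
    """
    if sys.platform != 'win32' or not sys.stdout.isatty():
        # Files, pipes, and non-Windows terminals: leave the stream alone.
        sys.stdout.write(text)
        return False

    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.windll.kernel32
    kernel32.GetStdHandle.restype = wintypes.HANDLE
    kernel32.WriteConsoleW.argtypes = [
        wintypes.HANDLE, wintypes.LPCWSTR, wintypes.DWORD,
        ctypes.POINTER(wintypes.DWORD), wintypes.LPVOID,
    ]

    sys.stdout.flush()  # keep ordering with anything already buffered
    handle = kernel32.GetStdHandle(STD_OUTPUT_HANDLE)
    # WriteConsoleW counts UTF-16 code units, not code points.
    units = len(text.encode('utf-16-le')) // 2
    written = wintypes.DWORD(0)
    ok = kernel32.WriteConsoleW(handle, text, units,
                                ctypes.byref(written), None)
    return bool(ok)
```

The early return is the part the footnote warns about: when output is redirected to a file or pipe, or the script runs on another platform, the text goes through the ordinary stream and the console API is never touched.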