How to deal with strings where encoding is unclear

255 views Asked by At

I know there is quite a lot on the web and on stackoverflow about Python and character encoding, but I haven't really found the answer I'm looking for. So at the risk of creating a duplicate, I'm going to ask anyway.

It's a script that gets a dictionary, where all keys are properly as unicode. The values are strings with unknown encoding. For the keys it wouldn't matter that much, keys are all very simple very unlike the values. The values can (and do) contain a large variety of encodings. There are some dictionaries, where some values are in ASCII others as UTF-16BE yet others cp1250.

That totally messes up further processing, which currently consists mainly printing or concatenating (yes, that simple).

The work-around that I came up with, which makes Python print statements work properly is:

for key in data.keys():
   # hope they did not chose a funky encoding
   try:
       print key+":"+data[key] # this triggers a UnicodeDecodeError on many encodings
       current_data = data[key]
   except UnicodeDecodeError:
   # trying to cope with a funky encoding             
        current_data = data[key].decode(chardet.detect(data[key])['encoding']) # doing this on each value, because the dictionary sometimes contains multiple encodings
        print key+":", # printing without newline was a workaround, because connecting didn't work
        print current_data.encode('UTF-8')

In Python this works just fine. In Jython 2.7rc1 which I use in the project (not an option to switch), it prints characters which are definitely not the original encoding (funky looking characters). If anyone has an idea how I can make this also work in Jython that'd be great!

Edit (Example): Sample-Value:

Our latest scenarios explore two possible versions of the future seen through fresh “lenses”. 

Creates a string where the right and left double quotes turn to \x8D and \x8E. I don't know what encoding that is. In Python after using the above code it strips them. In Jython it turns them into white squares.

1

There are 1 answers

4
Andrew Winterbotham On BEST ANSWER

I'm not familiar with Jython, but the following link I found may prove useful: http://python.6.x6.nabble.com/character-encoding-issues-td1766833.html

It says that you should keep all unicode strings in separate files to your source, and read them with codecs.open. This seemed to work for the person who was experiencing a problem similar to yours.

The following link also mentions something about specifying an encoding parameter to the JVM: https://answers.launchpad.net/sikuli/+question/156443

Without seeing any actual error output, this is the extent of the help I can provide.