Decode function tries to encode Python


I am trying to print a unicode string without the hex escape sequences in it. I'm grabbing this data from Facebook, whose HTML headers declare the encoding as UTF-8. When I print the type, it says it's unicode, but when I try to decode it with unicode-escape, I get an encoding error. Why is it trying to encode when I use the decode method?

Code

a='really long string of unicode html text that i wont reprint'
print type(a)
 >>> <type 'unicode'>   
print a.decode('unicode-escape')
 >>> Traceback (most recent call last):
  File "scfbp.py", line 203, in myFunctionPage
    print a.decode('unicode-escape')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 1945: ordinal not in range(128)

There are 3 answers

8
Mark Byers On BEST ANSWER

It's not the decode that's failing; it's the attempt to display the result on the console. When you use print, Python has to encode the unicode string for output, and here the encoding it falls back to is ASCII. Don't use print and it should work.

>>> a=u'really long string containing \\u20ac and some other text'
>>> type(a)
<type 'unicode'>
>>> a.decode('unicode-escape')
u'really long string containing \u20ac and some other text'
>>> print a.decode('unicode-escape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 30: ordinal not in range(128)

I'd recommend using IDLE or some other interpreter that can output unicode; then you won't get this problem.
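As a rough sketch of what print runs into, the snippet below performs the encoding step by hand. The sample string and the variable names are made up for illustration, and it assumes the output encoding is ASCII, as in the traceback above. It is written to run unchanged on Python 2 and Python 3.

```python
from __future__ import print_function

# Illustrative sample text; the euro sign sits at index 7.
text = u'price: \u20ac5'

# Roughly what print has to do when the output encoding is ASCII.
try:
    text.encode('ascii')
    ascii_ok = True
    error_position = None
except UnicodeEncodeError as exc:
    ascii_ok = False
    error_position = exc.start  # index of the first unencodable character

# Encoding explicitly avoids the implicit ASCII step.
utf8_bytes = text.encode('utf-8')                        # lossless
replaced = text.encode('ascii', 'replace')               # lossy: '?' placeholder
escaped = text.encode('ascii', 'backslashreplace')       # keeps a \u20ac escape

print(ascii_ok, error_position)
print(repr(utf8_bytes), repr(replaced), repr(escaped))
```

Which of the three explicit encodings is appropriate depends on where the output is going; 'replace' and 'backslashreplace' only make sense when a strictly ASCII result is acceptable.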


Update: Note that this is not the same as the situation with one fewer backslash, where the failure happens during the decode itself, but with the same error message:

>>> a=u'really long string containing \u20ac and some other text'
>>> type(a)
<type 'unicode'>
>>> a.decode('unicode-escape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 30: ordinal not in range(128)
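The difference between the two cases can be sketched by spelling out the two steps that Python 2 performs when decode is called on a unicode object. The helper name py2_style_decode is hypothetical and only mimics that behaviour; it is not a real library function.

```python
from __future__ import print_function

def py2_style_decode(text, codec='unicode-escape'):
    """Hypothetical helper mimicking unicode.decode() in Python 2."""
    as_bytes = text.encode('ascii')   # implicit step 1: unicode -> bytes via ASCII
    return as_bytes.decode(codec)     # step 2: the decode that was asked for

# Two backslashes in the literal: the string holds the six ASCII
# characters \u20ac, so step 1 succeeds and step 2 produces the euro sign.
escaped = u'contains \\u20ac'
decoded = py2_style_decode(escaped)

# One backslash in the literal: the string already holds the euro sign,
# so step 1 fails before any decoding happens.
real = u'contains \u20ac'
try:
    py2_style_decode(real)
    failed_in_encode = False
except UnicodeEncodeError:
    failed_in_encode = True

print(repr(decoded), failed_in_encode)
```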
0
Skurmedel On

When you print to the console, Python tries to encode (convert) the string to the character set of your terminal. If that character set is not UTF-8, or is one that can't represent every character in the string, Python will complain and throw an exception.

This snags me every now and then when I do quick processing of data, with for example Turkish characters in it.

If you are running python.exe through the Windows command prompt, you can find some solutions in the question "What encoding/code page is cmd.exe using". Basically you can change the code page with chcp, but it's quite cumbersome. I would follow Mark's advice and use something like IDLE.
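One way to check in advance whether a string will survive the trip to the terminal is to try the encoding yourself. This is only a sketch: the helper encodable is a made-up name, and the code pages used below are examples, not necessarily what your console uses.

```python
from __future__ import print_function
import sys

def encodable(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# The encoding Python believes the terminal uses. It can be None when
# output is piped, so fall back to ASCII for this check.
stdout_encoding = getattr(sys.stdout, 'encoding', None) or 'ascii'

euro = u'\u20ac'
turkish_g = u'\u011f'   # LATIN SMALL LETTER G WITH BREVE

print(stdout_encoding, encodable(euro, stdout_encoding))
print(encodable(turkish_g, 'cp1252'))   # Western European code page
print(encodable(turkish_g, 'cp1254'))   # Turkish code page
```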

0
Lennart Regebro On
>>> print type(a)
<type 'unicode'>
>>> a.decode('unicode-escape')

Why is it trying to encode when I use the decode method?

Because you decode to Unicode, and you encode from Unicode. You just tried to decode a unicode string to unicode. The first thing Python 2 then does is try to convert it to a byte string (str), using the ascii codec. That's why you get:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2110' in position 3: ordinal not in range(128)

Remember: Unicode is not an encoding. Everything else is, like ascii, utf8, latin-1 etc.

This implicit encoding is gone in Python 3, btw, because it confuses people.
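For comparison, here is a small Python 3 sketch (it will not behave the same on Python 2). The variable names are illustrative. Text strings only encode and byte strings only decode, so there is no place left for the implicit ASCII step.

```python
# Python 3 only: str is text, bytes is encoded data.
text = 'contains \u20ac'

# str has encode() but no decode(); bytes has decode() but no encode().
str_has_decode = hasattr(text, 'decode')
data = text.encode('utf-8')
bytes_has_encode = hasattr(data, 'encode')

# The round trip is always explicit about the encoding.
round_trip = data.decode('utf-8')

print(str_has_decode, bytes_has_encode, round_trip == text)
```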