cx_Oracle - encoding query result to Raw

6.2k views Asked by At

EDIT:

The following print shows my intended value.

(both sys.stdout.encoding and sys.stdin.encoding are 'UTF-8').

Why is the variable value different than its print value? I need to get the raw value into a variable.

>>username = 'Jo\xc3\xa3o'
>>username.decode('utf-8').encode('latin-1')
'Jo\xe3o'
>>print username.decode('utf-8').encode('latin-1')
João

Original question:

I'm having an issue querying a BD and decoding the values into Python.

I confirmed my DB NLS_LANG using

select property_value from database_properties where property_name='NLS_CHARACTERSET';

'''AL32UTF8 stores characters beyond U+FFFF as four bytes (exactly as Unicode defines 
UTF-8). Oracle’s “UTF8” stores these characters as a sequence of two UTF-16 surrogate
characters encoded using UTF-8 (or six bytes per character)'''

os.environ["NLS_LANG"] = ".AL32UTF8"

....
conn_data = str('%s/%s@%s') % (db_usr, db_pwd, db_sid)

sql = "select user_name apex.users where user_id = '%s'" % userid

...

cursor.execute(sql)
ldap_username = cursor.fetchone()
...

where

print ldap_username
>>'Jo\xc3\xa3o'

I've both tried (which return the same)

ldap_username.decode('utf-8')
>>u'Jo\xe3o'
unicode(ldap_username, 'utf-8')
>>u'Jo\xe3o'

where

u'João'.encode('utf-8')
>>'Jo\xc3\xa3o'

how to get the queries result back to the proper 'João' ?

1

There are 1 answers

6
jro On BEST ANSWER

You already have the proper 'João', methinks. The difference between >>> 'Jo\xc3\xa3o' and >>> print 'Jo\xc3\xa3o' is that the former calls repr on the object, while the latter calls str (or probably unicode, in your case). It's just how the string is represented.

Some examples might make this more clear:

>>> print 'Jo\xc3\xa3o'.decode('utf-8')
João
>>> 'Jo\xc3\xa3o'.decode('utf-8')
u'Jo\xe3o'
>>> print repr('Jo\xc3\xa3o'.decode('utf-8'))
u'Jo\xe3o'

Notice how the second and third result are identical. The original ldap_username currently is an ASCII string. You can see this on the Python prompt: when it is displaying an ACSII object, it shows as 'ASCII string', while Unicode objects are shown as u'Unicode string' -- the key being the leading u.

So, as your ldap_username reads as 'Jo\xc3\xa3o', and is an ASCII string, the following applies:

>>> 'Jo\xc3\xa3o'.decode('utf-8')
u'Jo\xe3o'
>>> print 'Jo\xc3\xa3o'.decode('utf-8') # To Unicode...
João
>>> u'João'.encode('utf-8')             # ... back to ASCII
'Jo\xc3\xa3o'

Summed up: you need to determine the type of the string (use type when unsure), and based on that, decode to Unicode, or encode to ASCII.