Python: same character, different behavior

Question

Asked by Federico Leoni on 25 June 2015 at 18:09

I'm generating file names from a list pulled from a Postgres DB with Python 2.7.9. The list contains words with special characters. Normally I use ''.join() to build the name and pass it to my loader, but there is just one name that won't be recognized. The .py file is set for utf-8 coding, but the words are in Portuguese, so I think they are latin-1 encoded.

from pydub import AudioSegment
from pydub.playback import play
templist = ['+ Orégano','- Búfala','+ Rúcola']
count_ins = (len(templist)-1)
while (count_ins >= 0 ):
    kot_istructions = AudioSegment.from_ogg('/home/effe/voice_orders/Voz/'+"".join(templist[count_ins])+'.ogg')
    count_ins-=1
    play(kot_istructions)

The first two files are loaded:

/home/effe/voice_orders/Voz/+ Orégano.ogg

/home/effe/voice_orders/Voz/- Búfala.ogg

The third should be:

/home/effe/voice_orders/Voz/+ Rúcola.ogg

But python is trying to load

/home/effe/voice_orders/Voz/+ R\xc3\xbacola.ogg

Why just this one? I've tried to use normalize() to remove the accent, but since this is a string the method didn't work. Printing works well, as does the DB update. Only the file name creation doesn't work as expected. Suggestions?

There are 2 answers

Federico Leoni On 27 June 2015 at 00:37

Solved: it was a problem with the file. Deleting it and building it again did the job.

**Danver Braganza** · Accepted Answer · 2015-06-25T18:47:49+00:00

It seems the root cause might be that the encoding of these names is inconsistent within your database.

If you run:

>>> 'R\xc3\xbacola'.decode('utf-8')

You get

u'R\xfacola'

which is in fact a Python unicode object, correctly representing the name. So, what should you do? Although it's a really unclean programming style, you could play .encode()/.decode() whack-a-mole: try to decode the raw string from your db as utf-8 and, failing that, as latin-1. It would look something like this:

try:
    clean_unicode = dirty_string.decode('utf-8')
except UnicodeDecodeError:
    clean_unicode = dirty_string.decode('latin-1')
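
One detail about the order of the two attempts may be worth spelling out. The sketch below wraps the same fallback in a helper (the name decode_name is hypothetical, not from any library) and checks it against both encodings of "Rúcola". It assumes the raw values arrive as byte strings.

```python
# -*- coding: utf-8 -*-

def decode_name(raw):
    """Decode bytes as UTF-8, falling back to Latin-1 (hypothetical helper)."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw.decode('latin-1')

utf8_bytes = b'R\xc3\xbacola'   # "Rúcola" encoded as UTF-8
latin1_bytes = b'R\xfacola'     # "Rúcola" encoded as Latin-1

# Both byte strings end up as the same text.
assert decode_name(utf8_bytes) == u'R\xfacola'
assert decode_name(latin1_bytes) == u'R\xfacola'

# Latin-1 decoding never raises, so trying it first would silently
# misread UTF-8 bytes as two separate characters.
assert utf8_bytes.decode('latin-1') == u'R\xc3\xbacola'
```

UTF-8 has to be tried first because only the UTF-8 attempt can fail and trigger the fallback. Text that is genuinely Latin-1 but happens to form valid UTF-8 would still be misread, so this remains a heuristic.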

As a general rule, always work with clean unicode objects within your own code, and only convert to an encoding when saving the data out. Also, don't let people insert data into a database without specifying the encoding; enforcing that will stop you from having this problem in the first place.
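
As a rough illustration of that rule, here is a minimal sketch that decodes once at the input boundary, builds the paths as text, and encodes only at the output boundary. The raw_rows list is made-up sample data, and it assumes the database consistently returns UTF-8 bytes. The directory is the one from the question.

```python
# -*- coding: utf-8 -*-

BASE_DIR = u'/home/effe/voice_orders/Voz/'  # directory from the question

# Sample data standing in for rows from the DB (assumed to be UTF-8 bytes).
raw_rows = [b'+ Or\xc3\xa9gano', b'- B\xc3\xbafala', b'+ R\xc3\xbacola']

# Input boundary: decode once, as early as possible.
names = [row.decode('utf-8') for row in raw_rows]

# Inside the program: work with text only.
paths = [BASE_DIR + name + u'.ogg' for name in names]

# Output boundary: encode only if the consumer requires bytes.
encoded_paths = [path.encode('utf-8') for path in paths]
```

Encoded this way, the last path contains R\xc3\xbacola, the same byte sequence shown in the question, which is simply the UTF-8 form of "Rúcola". Whether the final consumer wants text or bytes depends on the API in use, so the last step may not be needed.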

Hope that helps!
