How to get rid of some characters from a string? .replace() doesn't work


I need to remove Polish characters from a string that I got from an XML file. I use .replace(), but in this case it doesn't work. Why? The code:

# -*- coding: utf-8 -*-
from prestapyt import PrestaShopWebService
from xml.etree import ElementTree

prestashop = PrestaShopWebService('http://localhost/prestashop/api', 
                              'key')
prestashop.debug = True

name = ElementTree.tostring(prestashop.search('products', options=
{'display': '[name]', 'filter[id]': '[2]'}), encoding='cp852',  
method='text')

print name
print name.replace('ł', 'l')

Output:

Naturalne mydło odświeżające
Naturalne mydło odświeżające

But when I try to replace a non-Polish character, it works fine.

print name
print name.replace('a', 'o')

Result:

Naturalne mydło odświeżające
Noturolne mydło odświeżojące

This also works fine:

name = "Naturalne mydło odświeżające"
print name.replace('ł', 'l')

Any advice?


There are 2 answers

Answer by Mark Tolonen (best answer):

You are mixing encodings with your byte strings. Here's a short working example reproducing the issue. I assume you are running in a Windows console that defaults to an encoding of cp852:

#!python2
# coding: utf-8
from xml.etree import ElementTree as et
name_element = et.Element('data')
name_element.text = u'Naturalne mydło odświeżające'
name = et.tostring(name_element,encoding='cp852', method='text')
print name
print name.replace('ł', 'l')

Output (no replacement):

Naturalne mydło odświeżające
Naturalne mydło odświeżające

The reason is that name was encoded in cp852, but the byte string constant 'ł' is encoded in the source file's encoding, UTF-8, so the two byte sequences never match.

print repr(name)
print repr('ł')

Output:

'Naturalne myd\x88o od\x98wie\xbeaj\xa5ce'
'\xc5\x82'
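
The mismatch can also be checked directly by encoding the same character both ways. This is a minimal sketch in Python 3 syntax (an assumption, since the question uses Python 2), and the variable names are only illustrative:

```python
# The same character produces different bytes in each encoding.
cp852_bytes = 'ł'.encode('cp852')   # what ends up inside name
utf8_bytes = 'ł'.encode('utf-8')    # what the 'ł' literal is in a UTF-8 source file

print(cp852_bytes)  # b'\x88'
print(utf8_bytes)   # b'\xc5\x82'

name = 'Naturalne mydło odświeżające'.encode('cp852')
# Searching cp852 data for the UTF-8 bytes finds nothing, so replace() is a no-op.
print(utf8_bytes in name)                      # False
print(name.replace(utf8_bytes, b'l') == name)  # True
```

Replacing with the cp852 byte instead would match, but it ties the code to one code page, which is why decoding to Unicode first is the cleaner fix.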

The best solution is to use Unicode strings:

#!python2
# coding: utf-8
from xml.etree import ElementTree as et
name_element = et.Element('data')
name_element.text = u'Naturalne mydło odświeżające'
name = et.tostring(name_element,encoding='cp852', method='text').decode('cp852')
print name
print name.replace(u'ł', u'l')
print repr(name)
print repr(u'ł')

Output (replacement was made):

Naturalne mydło odświeżające
Naturalne mydlo odświeżające
u'Naturalne myd\u0142o od\u015bwie\u017caj\u0105ce'
u'\u0142'

Note that in Python 3, et.tostring accepts encoding='unicode' to return a text string directly, and string constants are Unicode by default. The repr() of a string is more readable as well, since it shows non-ASCII characters as-is, while ascii() gives the old escaped behavior. You'll also find that Python 3.6 and later on Windows can print Polish even to consoles not using a Polish code page, so you may not need to replace the characters at all.

#!python3
# coding: utf-8
from xml.etree import ElementTree as et
name_element = et.Element('data')
name_element.text = 'Naturalne mydło odświeżające'
name = et.tostring(name_element,encoding='unicode', method='text')
print(name)
print(name.replace('ł','l'))
print(repr(name),repr('ł'))
print(ascii(name),ascii('ł'))

Output:

Naturalne mydło odświeżające
Naturalne mydlo odświeżające
'Naturalne mydło odświeżające' 'ł'
'Naturalne myd\u0142o od\u015bwie\u017caj\u0105ce' '\u0142'
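
If you still end up with cp852 bytes in Python 3 (for example because another API hands them to you), the same decode-first approach applies. A sketch, assuming the bytes really are cp852; the names raw, text and replaced are illustrative:

```python
from xml.etree import ElementTree as et

name_element = et.Element('data')
name_element.text = 'Naturalne mydło odświeżające'

# With a real codec name, tostring() returns bytes rather than str.
raw = et.tostring(name_element, encoding='cp852', method='text')

# Decode once at the boundary, then work only with str.
text = raw.decode('cp852')
replaced = text.replace('ł', 'l')
print(replaced)
```

Calling raw.replace('ł', 'l') directly would raise a TypeError in Python 3, because bytes and str cannot be mixed. That makes this class of bug much harder to miss than in Python 2.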

Answer by Eric Duminil:

If I understand your problem correctly, you can use unidecode:

>>> from unidecode import unidecode
>>> unidecode("Naturalne mydło odświeżające")
'Naturalne mydlo odswiezajace'

You might have to decode your cp852-encoded string with name.decode('cp852') first, since unidecode expects a Unicode string.
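
If installing unidecode is not an option, a hand-written translation table is one alternative for Polish text specifically. This is only a sketch in Python 3: the function name strip_polish and the table POLISH_TO_ASCII are hypothetical, and the table covers just the nine Polish diacritic letters in both cases, nothing else.

```python
# Map each Polish diacritic letter to its closest ASCII letter.
# (Hypothetical helper, not part of any library.)
POLISH_TO_ASCII = str.maketrans(
    'ąćęłńóśźżĄĆĘŁŃÓŚŹŻ',
    'acelnoszzACELNOSZZ',
)

def strip_polish(text):
    """Return text with Polish diacritic letters replaced by ASCII letters."""
    return text.translate(POLISH_TO_ASCII)

print(strip_polish('Naturalne mydło odświeżające'))
```

A table is used here rather than Unicode normalization because ł has no decomposition into l plus a combining mark, so stripping combining marks after normalizing would leave it untouched.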