Russian symbols in Python output corrupted (ENCODING)

Question

3.8k views Asked by aaaapppp At 11 November 2014 at 16:42

I parsed an HTML document that contains Russian text. When I try to print it in Python, I get this:

ÐÐ»ÑÐ±Ð½Ð¸ÑÐ½ÑÐ¹ Ð½Ð¾Ð²Ð¾Ð³Ð¾Ð´Ð½Ð¸Ð¹ Ð¿ÑÐ½Ñ

I tried to detect its encoding and got ISO-8859-1, so I'm trying to decode it like this:

print drink_name.decode('iso8859-1')

But I get an error. How can I print this text, or encode it in Unicode?

**Martijn Pieters** · Answer 1 · 2014-11-11T16:44:56+00:00

You have a Mojibake; in this case, UTF-8 bytes decoded as Latin-1 or CP1252.

You can repair it by reversing the process:

>>> print u'ÐÐ»ÑÐ±Ð½Ð¸ÑÐ½ÑÐ¹ Ð½Ð¾Ð²Ð¾Ð³Ð¾Ð´Ð½Ð¸Ð¹ Ð¿ÑÐ½Ñ'.encode('latin1').decode('utf8')
Клубничный новогодний пунш

(I had to copy the string from the original post source to capture all the non-printable bytes in the Mojibake).
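For readers on Python 3, the same round trip can be sketched as below. This is only an illustration under stated assumptions: `fix_mojibake` is a hypothetical helper name, and the garbled string is produced in the script itself (by decoding UTF-8 bytes as Latin-1) rather than pasted in, so that no non-printable characters are lost in transit.

```python
def fix_mojibake(text, wrong_codec="latin-1", right_codec="utf-8"):
    """Undo a wrong decode: re-encode with the wrong codec, then decode with the right one.

    Hypothetical helper for illustration; the codec names are assumptions
    that match the case described in this answer.
    """
    return text.encode(wrong_codec).decode(right_codec)


original = "Клубничный новогодний пунш"

# Simulate the bug: UTF-8 bytes mistakenly decoded as Latin-1.
garbled = original.encode("utf-8").decode("latin-1")

# Reverse the process to recover the Russian text.
repaired = fix_mojibake(garbled)
print(repaired)
```

The repair only works when the wrong decode was lossless. Latin-1 maps every byte to a character, so the original bytes can always be recovered from it.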

The better method would be to not decode incorrectly in the first place. You decoded the original text with the wrong encoding; use UTF-8 as the codec instead.
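A small Python 3 sketch of why the wrong codec goes unnoticed: Latin-1 accepts any byte sequence, so a bad decode never raises, whereas UTF-8 is strict and tends to fail loudly on bytes that are not UTF-8. The variable names and the CP1251 sample are assumptions made for this illustration.

```python
# Stand-in for the raw bytes of the page; assumed to be UTF-8 as in the question.
raw = "Клубничный новогодний пунш".encode("utf-8")

wrong = raw.decode("latin-1")  # never raises, but yields Mojibake
right = raw.decode("utf-8")    # the codec the bytes were actually written in
print(right)

# UTF-8 is strict: bytes written in another encoding (CP1251 here, as an
# assumed example) are rejected instead of being silently garbled.
cp1251_bytes = "Клубничный".encode("cp1251")
try:
    cp1251_bytes.decode("utf-8")
    utf8_rejected = False
except UnicodeDecodeError:
    utf8_rejected = True
```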

If you used requests to download the page, do not use response.text in this case; if the server failed to specify a codec then the HTTP RFC default is to use Latin-1, but HTML documents often embed the encoding in a <meta> header instead. Leave decoding in such cases to your parser, like BeautifulSoup:

import requests
from bs4 import BeautifulSoup
response = requests.get(url)
soup = BeautifulSoup(response.content)  # pass in undecoded bytes
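To show why passing undecoded bytes helps, here is a deliberately crude, standard-library-only sketch of the kind of charset sniffing a parser can do on raw bytes. It is not how BeautifulSoup is implemented; `decode_html`, `META_CHARSET`, the 1024-byte search window and the UTF-8 fallback are all assumptions made for this illustration.

```python
import re

# Hypothetical pattern: finds charset=... inside a <meta> tag in raw bytes.
META_CHARSET = re.compile(rb'<meta[^>]*charset=["\']?([\w-]+)', re.IGNORECASE)


def decode_html(raw, fallback="utf-8"):
    """Decode HTML bytes with the charset declared in a <meta> tag, if any."""
    match = META_CHARSET.search(raw[:1024])  # declarations sit near the top
    encoding = match.group(1).decode("ascii") if match else fallback
    return raw.decode(encoding)


# Two in-memory pages with the same text in different encodings.
utf8_page = (
    '<html><head><meta charset="utf-8"></head>'
    "<body>Клубничный пунш</body></html>"
).encode("utf-8")
cp1251_page = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=windows-1251"></head>'
    "<body>Клубничный пунш</body></html>"
).encode("cp1251")

print(decode_html(utf8_page))
print(decode_html(cp1251_page))
```

Both pages decode to the same readable text because the codec is read from the document itself rather than taken from a default. That information is no longer usable once the bytes have already been decoded with the wrong codec.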
