How to convert a string from cp1251 to UTF-8 in Python3?

I need help with a pretty simple Python 3.6 script.

First, it downloads an HTML file from an old-fashioned server which uses cp1251 encoding.

Then I need to put the file contents into a UTF-8 encoded string.

Here is what I'm doing:

import requests
import codecs

#getting the file
ri = requests.get('http://old.moluch.ru/_python_test/0.html')

#checking that it's in cp1251
print(ri.encoding)

#encoding using cp1251
text = ri.text
text = codecs.encode(text,'cp1251')

#decoding using utf-8 - ERROR HERE!
text = codecs.decode(text,'utf-8')

print(text)

Here is the error:

Traceback (most recent call last):
  File "main.py", line 15, in <module>
    text = codecs.decode(text,'utf-8')
  File "/var/lang/lib/python3.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 43: invalid continuation byte

I'd really appreciate any help with it.

There are 4 answers

2
Tomalak On BEST ANSWER

Not sure what you are trying to do.

.text is the text of the response, a Python string. Encodings don't play any role in Python strings.

Encodings only play a role when you have a stream of bytes that you want to convert to a string (or the other way around). And the requests module already does that for you.

import requests

ri = requests.get('http://old.moluch.ru/_python_test/0.html')
print(ri.text)

For example, assume you have a text file (i.e. bytes). Then you must pick an encoding when you open() the file; the choice of encoding determines how the bytes in the file are converted into characters. This manual step is necessary because open() cannot know what encoding the bytes of the file are in.
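As a minimal sketch of the open() case, the snippet below writes a small cp1251 file into a temporary directory, reads it back by naming the encoding explicitly, and saves a UTF-8 copy. The helper name convert_file, the file names, and the sample text are made up for illustration.

```python
import tempfile
from pathlib import Path


def convert_file(src, dst, src_encoding="cp1251", dst_encoding="utf-8"):
    """Read src as src_encoding and write the same text to dst as dst_encoding."""
    # Decoding happens here: bytes on disk -> str in memory.
    with open(src, "r", encoding=src_encoding, newline="") as f:
        text = f.read()
    # Encoding happens here: str in memory -> bytes on disk.
    with open(dst, "w", encoding=dst_encoding, newline="") as f:
        f.write(text)
    return text


# Hypothetical sample data, standing in for the downloaded page.
sample = "Привет, мир"
workdir = Path(tempfile.mkdtemp())
src = workdir / "page_cp1251.html"
dst = workdir / "page_utf8.html"
src.write_bytes(sample.encode("cp1251"))

text = convert_file(src, dst)
utf8_bytes = dst.read_bytes()
print(text)
```

The str held in text is the same in both steps; only the bytes on disk differ between the two files.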

HTTP, on the other hand, sends this in the response headers (Content-Type), so requests can know this information. Being a high-level module, it helpfully looks at the HTTP headers and converts the incoming bytes for you. (If you used the much more low-level urllib, you would have to do your own decoding.)
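To illustrate what "doing your own decoding" could look like, here is a rough sketch that pulls the charset out of a Content-Type header value and uses it to decode the raw bytes. It runs offline on a made-up body and header; the helper names are hypothetical, and the UTF-8 fallback is an assumption of this sketch, not necessarily what requests does.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # Returns `default` when the header carries no charset parameter.
    return msg.get_content_charset(default)


def decode_body(body, content_type):
    """Decode raw response bytes using the charset the server declared."""
    return body.decode(charset_from_content_type(content_type))


# Hypothetical response: raw bytes plus the header a cp1251 server might send.
body = "Привет".encode("cp1251")
header = "text/html; charset=windows-1251"

charset = charset_from_content_type(header)
text = decode_body(body, header)
print(charset, text)
```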

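When you read the .text of the response, the .encoding property simply tells you which encoding requests used to decode it. You only need to care about it if requests guessed wrong (in that case you can assign the correct encoding to .encoding before reading .text) or if you work with the undecoded bytes in .content or .raw. For servers that return regular text responses, that is seldom necessary.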

0
PythonSherpa On

You don't need to do the encoding/decoding.

"When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text"

So this will work:

import requests

#getting the file
ri = requests.get('http://old.moluch.ru/_python_test/0.html')

text = ri.text

print(text)

You can also access the response body as bytes, for non-text requests:

ri.content

Please check out the requests documentation

0
Hadi Rahjoo On

You can simply ignore the error by passing errors='ignore' to the decode function:

text = codecs.decode(text,'utf-8',errors='ignore')
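
It is worth checking what errors='ignore' actually returns before relying on it. The sketch below uses a made-up sample string (not the real page) and compares the result with decoding the same bytes as cp1251.

```python
import codecs

# Hypothetical sample text; cp1251 bytes as produced in the question.
original = "Привет, world"
cp1251_bytes = codecs.encode(original, "cp1251")

# Bytes that are not valid UTF-8 are dropped silently,
# so only the ASCII part of the text survives.
lossy = codecs.decode(cp1251_bytes, "utf-8", errors="ignore")

# Decoding with the codec the bytes were written in keeps everything.
intact = codecs.decode(cp1251_bytes, "cp1251")

print(repr(lossy))
print(repr(intact))
```

With this sample, the error goes away but the Cyrillic characters go with it, so this option only hides the encoding mismatch.
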
0
NoorJafri On

Others have already pointed out that requests.get gives you text that is already decoded, so I will address the error you are actually seeing.

This line:

text = codecs.encode(text,'cp1251')

encodes the text into cp1251 bytes. You are then trying to decode those bytes as UTF-8, which raises the error here:

text = codecs.decode(text,'utf-8')

To detect the encoding of a byte string, you can use chardet:

import codecs
import chardet

# 'text' is the str taken from ri.text
raw = codecs.encode(text, 'cp1251')
print(chardet.detect(raw))  # output {'encoding': 'windows-1251', 'confidence': 0.99, 'language': 'Russian'}

# OR
raw = codecs.encode(text, 'utf-8')
print(chardet.detect(raw))  # output {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

So encoding in one format and then decoding in another is what causes the error.
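
The mismatch can be reproduced offline with a short sample string (made up here, not taken from the actual page). This sketch also shows what "converting to UTF-8" amounts to once you have a str: encoding it again with a different codec.

```python
sample = "Привет"                       # a str: no encoding attached
cp1251_bytes = sample.encode("cp1251")  # str -> bytes

# Mismatched codecs: these cp1251 bytes are not valid UTF-8.
try:
    cp1251_bytes.decode("utf-8")
    mismatch_failed = False
except UnicodeDecodeError:
    mismatch_failed = True

# Matching codecs round-trip cleanly.
round_trip = cp1251_bytes.decode("cp1251")

# Getting UTF-8 output means encoding the str, not decoding the old bytes.
utf8_bytes = round_trip.encode("utf-8")

print(mismatch_failed, round_trip, utf8_bytes)
```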