I need help with a fairly simple Python 3.6 script.
First, it downloads an HTML file from an old-fashioned server which uses cp1251 encoding.
Then I need to put the file contents into a UTF-8 encoded string.
Here is what I'm doing:
import requests
import codecs
#getting the file
ri = requests.get('http://old.moluch.ru/_python_test/0.html')
#checking that it's in cp1251
print(ri.encoding)
#encoding using cp1251
text = ri.text
text = codecs.encode(text,'cp1251')
#decoding using utf-8 - ERROR HERE!
text = codecs.decode(text,'utf-8')
print(text)
Here is the error:
Traceback (most recent call last):
  File "main.py", line 15, in <module>
    text = codecs.decode(text,'utf-8')
  File "/var/lang/lib/python3.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 43: invalid continuation byte
I'd really appreciate any help with it.
I'm not sure what you are trying to do.
.text is the text of the response, a Python string. Encodings don't play any role in Python strings.
Encodings only play a role when you have a stream of bytes that you want to convert to a string (or the other way around). And the requests module already does that for you.
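To make the bytes-versus-string distinction concrete, here is a small sketch that does not touch the network. The sample word and the variable names are made up for illustration. The failing step mirrors what the question's code does: it encodes a string to cp1251 bytes and then tries to decode those bytes as UTF-8.

```python
# Illustrative sample; any Cyrillic text behaves the same way.
sample = "Привет"

# A str holds characters, not bytes. Encoding is what produces bytes.
cp1251_bytes = sample.encode("cp1251")  # one byte per Cyrillic character
utf8_bytes = sample.encode("utf-8")     # two bytes per Cyrillic character

# Decoding must use the encoding the bytes were written in.
assert cp1251_bytes.decode("cp1251") == sample
assert utf8_bytes.decode("utf-8") == sample

# Decoding cp1251 bytes as UTF-8 is the mismatch from the question.
try:
    cp1251_bytes.decode("utf-8")
    mismatch_error = None
except UnicodeDecodeError as exc:
    mismatch_error = exc

print(type(mismatch_error).__name__)
```

The string itself never "has" an encoding; only the two byte sequences do, and they differ even though they represent the same characters.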
For example, assume you have a text file (i.e., bytes). Then you must pick an encoding when you open() the file: the choice of encoding determines how the bytes in the file are converted into characters. This manual step is necessary because open() cannot know what encoding the bytes of the file are in.
HTTP, on the other hand, sends this in the response headers (Content-Type), so requests can know this information. Being a high-level module, it helpfully looks at the HTTP headers and converts the incoming bytes for you. (If you used the much more low-level urllib, you'd have to do your own decoding.)
The .encoding property is mostly informational when you use the .text of the response: it reports the encoding that requests used to decode the body, and you would only set it yourself if the server declared the wrong one. It might be relevant if you work with the undecoded bytes (.content, or the .raw stream), though. For work with servers that return regular text responses, using .raw is seldom necessary.
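As a rough sketch of where an encoding choice actually belongs, the snippet below decodes a cp1251 body once and then names UTF-8 only when writing the text to a file. It avoids the network on purpose: body_bytes and declared_encoding are hypothetical stand-ins for what the server would send, and the temporary file name is arbitrary.

```python
import os
import tempfile

# Stand-in for the bytes an old server would send (an assumption for this
# sketch). With requests, ri.text would already be the decoded str.
body_bytes = "<p>Привет</p>".encode("cp1251")
declared_encoding = "cp1251"  # assumed charset from the Content-Type header

# Decode once, using the encoding the bytes were actually written in.
text = body_bytes.decode(declared_encoding)

# Name the target encoding only when turning the str back into bytes,
# for example when saving it to disk.
path = os.path.join(tempfile.mkdtemp(), "page_utf8.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading the file back as text requires naming the same encoding again.
with open(path, encoding="utf-8") as f:
    round_tripped = f.read()

# In binary mode no decoding happens, so the UTF-8 bytes are visible.
with open(path, "rb") as f:
    saved_bytes = f.read()

print(round_tripped == text)
```

With requests, the first two steps collapse into text = ri.text, so the question's script needs no codecs calls at all. If a server declares the wrong charset, assigning ri.encoding before reading ri.text changes how the body is decoded.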