I'm trying to use urllib2 to download a web page and save it to a MySQL database, like this:
result_text = result.read()
result_text = result_text.decode('utf-8')
However, I get this error:
Data: 'utf8' codec can't decode byte 0x88
Now, the HTML meta tag states that the encoding is indeed UTF-8. I've managed to get around this with this line:
result_text = result_text.decode('utf-8','replace')
This replaces the bad characters. However, I'm not sure whether this indicates that something is wrong with the downloaded data, or whether I'm removing valuable characters. I should add that the page also contains JavaScript, although I believe this shouldn't be a problem.
Can anyone tell me why this is happening? Thanks
Analysis of your tiny data sample:
(1) That's certainly not UTF-8 with an occasional invalid sequence; over 50% of the Unicode characters are invalid. In other words, pressing ahead and using data.decode('utf8', 'replace') is NOT a good idea (based on this TINY sample).
(2) The characters \x01 (twice) and \x08 make me suspect that you have got binary data somehow.
(3) The (truncated) error message that you quoted in a comment mentioned 0x88, but there is no 0x88 in the sample data.
(4) Please edit your question to show what you should have done at the start:
(a) the minimal code necessary to reproduce the problem, including the URL that you are accessing
(b) the full error message and traceback
(c) an assurance that you have copied/pasted (a) and (b) rather than typing from memory
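The checks in points (1) to (3) can be automated before deciding whether 'replace' is safe. The following is only a rough sketch, not a definitive test: the function names (describe_decode_failure, replacement_ratio, control_bytes) are made up for illustration, and it assumes you pass in the raw bytes exactly as returned by result.read(). It reports where strict decoding first fails, what fraction of the text becomes U+FFFD under 'replace', and which control bytes are present.

```python
def describe_decode_failure(data):
    """Return None if data is valid UTF-8, else details of the first bad byte."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first undecodable byte.
        return {
            'offset': exc.start,
            'byte': data[exc.start:exc.start + 1],
            'reason': exc.reason,
        }
    return None


def replacement_ratio(data):
    """Fraction of decoded characters that are U+FFFD after 'replace'.

    Note: a U+FFFD that was legitimately present in the input is counted too.
    """
    text = data.decode('utf-8', 'replace')
    if not text:
        return 0.0
    return text.count(u'\ufffd') / float(len(text))


def control_bytes(data):
    """Sorted distinct byte values below 0x20, ignoring tab, LF and CR."""
    allowed = bytearray(b'\t\n\r')
    return sorted(set(b for b in bytearray(data)
                      if b < 0x20 and b not in allowed))
```

A high ratio from replacement_ratio, or a non-empty result from control_bytes, would support the suspicion that the payload is binary rather than slightly damaged UTF-8. The offset from describe_decode_failure is also useful to include alongside the full traceback when editing the question.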