How can I handle mal-encoded characters with Python 2?

Question

114 views Asked by Daisuki Honey At 27 November 2014 at 09:28

The HTML file I am fetching contains some characters that are not supported by the encoding specified in the HTML header:

I found that the following ones are not supported by the Shift_JIS encoding but are actually used. My browser shows those characters correctly.

When I try to read this HTML file and decode it for processing, I get a UnicodeDecodeError.

import urllib2

url = 'http://matsucon.net/material/dic/kao09.html'
response = urllib2.urlopen(url)
# Raises UnicodeDecodeError on bytes that are not valid in this encoding
response.read().decode('shift_jis_2004')

Is there a good way to process HTML that has mal-encoded characters without getting an error?

There is 1 answer

**Irshad Bhat** · Accepted Answer · 2014-11-27T09:40:07+00:00

Try this:

response.read().decode('shift_jis_2004', errors='ignore')
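
To make the effect of the `errors` argument concrete, here is a minimal sketch that does not need network access. The helper name `decode_lenient` and the sample bytes are made up for illustration; the stray `0xFF` byte stands in for whatever invalid bytes the real page contains. Note that `'ignore'` silently drops the undecodable bytes, while `'replace'` substitutes U+FFFD so you can still see where data was lost.

```python
from __future__ import print_function


def decode_lenient(raw, encoding='shift_jis_2004', errors='replace'):
    """Decode a byte string without raising on invalid byte sequences.

    Hypothetical helper, not part of any library: it only forwards to
    the built-in decode() with a non-strict error handler.
    """
    return raw.decode(encoding, errors)


# Assumed sample data: two valid kanji, one invalid byte (0xFF is never
# a valid lead byte in Shift_JIS), then plain ASCII.
good = u'\u65e5\u672c'.encode('shift_jis_2004')
raw = good + b'\xff' + b'abc'

# 'ignore' drops the bad byte; 'replace' marks its position with U+FFFD.
ignored = decode_lenient(raw, errors='ignore')
replaced = decode_lenient(raw, errors='replace')

print(repr(ignored))
print(repr(replaced))
```

If you need to know which bytes were lost, `'replace'` is probably the safer choice, since you can search the result for `u'\ufffd'` afterwards.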