The HTML file I am fetching contains characters that are not supported by the encoding declared in its HTML header.
I found that the following characters are used in the page even though Shift_JIS does not support them. My browser displays them correctly.
- ∑ n-ary summation U+2211
- ゚ halfwidth katakana semi-voiced sound mark U+FF9F
- Д cyrillic capital letter de U+0414
When I try to read this HTML file and decode it for processing, I get a UnicodeDecodeError.
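import urllib2

url = 'http://matsucon.net/material/dic/kao09.html'
response = urllib2.urlopen(url)
# Raises UnicodeDecodeError when the body contains bytes that are invalid in Shift_JIS-2004
response.read().decode('shift_jis_2004')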
Is there a good way to process HTML that contains mis-encoded characters without getting an error?
Try this:
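A minimal sketch of one possible approach, not necessarily the only fix: try a few Japanese codecs strictly, and if all of them fail, decode with the `errors='replace'` handler so that undecodable bytes become U+FFFD instead of raising. The helper name `decode_lenient` and the codec order are assumptions made for illustration. `cp932` is included because it is Microsoft's superset of Shift_JIS and many pages labelled Shift_JIS are really encoded with it.

```python
def decode_lenient(raw, encodings=('shift_jis_2004', 'cp932'),
                   fallback='shift_jis'):
    """Decode raw bytes without ever raising UnicodeDecodeError.

    decode_lenient is a hypothetical helper, not a library function.
    Each codec in `encodings` is tried strictly first. If none of them
    accepts the data, `fallback` is used with errors='replace', which
    substitutes U+FFFD for every byte sequence it cannot decode.
    """
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    return raw.decode(fallback, 'replace')


# Usage with bytes already in memory. With the code from the question,
# `raw` would be the result of response.read().
raw = b'abc\x81'              # ends with a dangling Shift_JIS lead byte
text = decode_lenient(raw)    # no exception; the bad byte is replaced
```

If you would rather drop the bad bytes than mark them, pass `'ignore'` instead of `'replace'` in the final `decode` call. Either way some information is lost, so it is worth checking how many replacement characters end up in the result.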