While writing a scraper on ScraperWiki, I was repeatedly getting this message when trying to save a UTF-8-encoded string:
UnicodeDecodeError('utf8', ' the \xe2...', 49, 52, 'invalid data')
I eventually worked out, by trial and UnicodeDecodeError, that the ScraperWiki datastore seems to expect Unicode.
So I'm now decoding from UTF-8 and converting everything to Unicode immediately before saving to the datastore:
import scraperwiki

# Decode every value from UTF-8 to Unicode before saving to the datastore
try:
    for k, v in record.items():
        record[k] = unicode(v.decode('utf-8'))
except UnicodeDecodeError:
    print "Record %s, %s has encoding error" % (k, v)
scraperwiki.datastore.save(unique_keys=["ref_no"], data=record)
This avoids the error, but is it sensible? Can anyone confirm what encoding the ScraperWiki datastore supports?
Thanks!
The datastore requires either UTF-8 byte strings or Unicode strings.
This example shows both ways of saving a pound sterling currency sign in Python:
http://scraperwiki.com/scrapers/unicode_test/
The same applies in other languages.
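For illustration, here is a minimal sketch of the two forms in Python 2 (not the linked scraper itself; the "id" key and the table layout are assumptions made for the example):

# -*- coding: utf-8 -*-
import scraperwiki

# 1. UTF-8 byte string: '\xc2\xa3' is the UTF-8 encoding of the pound sign.
scraperwiki.datastore.save(unique_keys=["id"],
                           data={"id": 1, "currency": "\xc2\xa3"})

# 2. Unicode string: u'\xa3' is the same character as a Unicode object.
scraperwiki.datastore.save(unique_keys=["id"],
                           data={"id": 2, "currency": u"\xa3"})

Either form should save without raising a UnicodeDecodeError, so decoding to Unicode before saving, as in your snippet, is a sensible approach.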
For debugging purposes you can still print strings in other encodings to the console; any characters it can't interpret are simply stripped.