Caching large non-unicode dictionary in database?


I have a large dictionary (366 MB when written out as a string, a text file of ~383764153 lines) that I want to store in a database for fast access, so that I can skip the computation time involved in populating it.

My dictionary is a dictionary of dictionaries of filename/contents pairs. A small subset:

    'Reuters/19960916': {
        '54826newsML': '<?xml version="1.0"
encoding="iso-8859-1" ?>\r\n<newsitem itemid="54826" id="root"
date="1996-09-16" xml:lang="en">\r\n<title>USA: RESEARCH ALERT -
Crestar Financial cut.</title>\r\n<headline>RESEARCH ALERT - Crestar
Financial cut.</headline>\r\n<text>\n<p>-- Salomon Brothers analyst
Carole Berger said she cut her rating on Crestar Financial Corp to
hold from buy, at the same time lowering her 1997 earnings per share
view to $5.40 from $5.85.</p>\n<p>-- Crestar said it would buy
Citizens Bancorp in a $774 million stock swap.</p>\n<p>-- Crestar
shares were down 2-1/2 at 58-7/8. Citizens Bancorp soared 14-5/8 to
46-7/8.</p>\n</text>\r\n<copyright>(c) Reuters Limited',
        '55964newsML': '<?xml version="1.0" encoding="iso-8859-1"
?>\r\n<newsitem itemid="55964" id="root" date="1996-09-16"
xml:lang="en">\r\n<title>USA: Nebraska cattle sales thin at

I thought MongoDB would be a good fit, but it looks like it requires both keys and values to be Unicode, and since I am getting the filenames from namelist() on a ZipFile, they are not guaranteed to be Unicode.

How would you recommend I serialise this dictionary into a database?


There are 2 answers

georg On

pymongo doesn't require strings to be unicode. It sends ASCII strings as is and encodes unicode objects as UTF-8. When retrieving data from pymongo, you always get unicode.

If your input contains "international" byte strings with high-order bytes (like ab\xC3cd) you need to convert these strings to unicode or encode them as UTF-8. Here's a simple recursive converter that handles arbitrary nested dicts:

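    def unicode_all(s, encoding='iso-8859-1'):
        # Python 2: recursively decode byte strings in nested dicts and lists.
        # The default encoding is an assumption; use the one your data is in.
        if isinstance(s, dict):
            return dict((unicode_all(k, encoding), unicode_all(v, encoding))
                        for k, v in s.items())
        if isinstance(s, list):
            return [unicode_all(v, encoding) for v in s]
        if isinstance(s, str):
            return s.decode(encoding)  # plain unicode(s) fails on high-order bytes
        return unicode(s)
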
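
For readers on Python 3, where `unicode` no longer exists and byte strings are `bytes`, the same recursive idea could be sketched as below. The name `decode_all`, the default `iso-8859-1` encoding, and the sample data are assumptions for illustration, not part of the original answer; pick the encoding your data is really in.

```python
def decode_all(obj, encoding='iso-8859-1'):
    """Recursively decode bytes keys and values in nested dicts and lists."""
    if isinstance(obj, dict):
        return {decode_all(k, encoding): decode_all(v, encoding)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [decode_all(v, encoding) for v in obj]
    if isinstance(obj, bytes):
        return obj.decode(encoding)
    # Anything else (already text, numbers, None) is passed through unchanged.
    return obj


# Hypothetical sample: byte-string keys and a value with a high-order byte.
sample = {b'Reuters/19960916': {b'54826newsML': b'caf\xe9'}}
decoded = decode_all(sample)
```

After this step every key and value is a text string, which is the form a document store expects.
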
gilesc On

If you have the RAM (and you apparently do, because you populated the dictionary to begin with), use cPickle. If you want something that needs less RAM but is slower, use shelve.
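
A minimal sketch of both options, assuming a tiny stand-in dictionary and temporary file paths (the `corpus` contents, `workdir`, and file names are hypothetical). On Python 3 `cPickle` is gone and plain `pickle` uses the C implementation automatically, so the import falls back to it.

```python
import os
import shelve
import tempfile

try:
    import cPickle as pickle  # Python 2, as named in the answer
except ImportError:
    import pickle  # Python 3

# Hypothetical stand-in for the real dictionary of dictionaries.
corpus = {
    'Reuters/19960916': {
        '54826newsML': '<?xml version="1.0" encoding="iso-8859-1" ?>...',
        '55964newsML': '<?xml version="1.0" encoding="iso-8859-1" ?>...',
    },
}

workdir = tempfile.mkdtemp()

# Option 1: pickle the whole dictionary. Loading reads everything into RAM.
pickle_path = os.path.join(workdir, 'corpus.pkl')
with open(pickle_path, 'wb') as f:
    pickle.dump(corpus, f, protocol=pickle.HIGHEST_PROTOCOL)
with open(pickle_path, 'rb') as f:
    restored = pickle.load(f)

# Option 2: shelve stores one pickled value per top-level key on disk,
# so a lookup loads only the inner dictionary that was asked for.
shelf_path = os.path.join(workdir, 'corpus.shelf')
shelf = shelve.open(shelf_path)
try:
    for dirname, files in corpus.items():
        shelf[dirname] = files
finally:
    shelf.close()

shelf = shelve.open(shelf_path)
try:
    one_day = shelf['Reuters/19960916']
finally:
    shelf.close()
```

Keying the shelf by the outer directory name is a design choice made here for illustration: it keeps each lookup small, at the cost of one disk read per directory.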