My goal is to create a database of MD5 hashes, and then check another list of hashes against it to see if they exist in the database.
I have a working solution using anydbm, which (if you're unfamiliar with it) works exactly like a Python dictionary, except that you don't have to load the whole thing into memory at once. Creation of the db is rather slow (100 million entries in roughly two and a half hours) and retrieval speed is moderate (100,000 entries in 1 to 2 seconds). Checking whether a hash exists is as simple as if hash in dbm.
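For reference, a minimal sketch of that anydbm approach (Python 2; the file name and sample hash are placeholders, and the "in" check behaves as described above, depending on which dbm backend anydbm picks):

import anydbm

# Build the database: one record per MD5 hash, value is just a marker.
db = anydbm.open('hashes.dbm', 'c')                 # placeholder file name
for md5 in ['d41d8cd98f00b204e9800998ecf8427e']:    # stand-in for the real hash source
    db[md5] = '1'
db.close()

# Later: check another list of hashes against the database.
db = anydbm.open('hashes.dbm', 'r')
for md5 in ['d41d8cd98f00b204e9800998ecf8427e']:
    if md5 in db:
        print(md5 + ' found')
db.close()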
For a performance improvement, I attempted a working solution with sqlite. Creation speed is blazing fast: all 100 million entries go in within a few minutes. But retrieval took upwards of 15 seconds for a single entry. This is unreasonable!
I'm not a SQL pro, so I'm wondering if I'm just using commands that overly complicate things.
They are as follows:
creation of the table:
c.execute('''create table keys(id integer not null primary key autoincrement, hash text, alert text) ''')
adding entries (in a loop):
c.execute('''insert into keys(hash, alert) values (?,?) ''', (hash, "1"))
retrieval (also looped):
c.execute('''select * from keys where hash = ? ''', (hash,))
hits = c.fetchall()
numhits += len(hits)
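Put together, a self-contained version of that sqlite attempt looks roughly like this (the file name and sample hash are placeholders):

import sqlite3

conn = sqlite3.connect('hashes.sqlite')             # placeholder file name
c = conn.cursor()
c.execute('''create table keys(id integer not null primary key autoincrement, hash text, alert text)''')

# Adding entries (in a loop).
for md5 in ['d41d8cd98f00b204e9800998ecf8427e']:    # stand-in for the real 100 million hashes
    c.execute('''insert into keys(hash, alert) values (?,?)''', (md5, '1'))
conn.commit()

# Retrieval (also looped): without an index, each query scans the whole table.
numhits = 0
for md5 in ['d41d8cd98f00b204e9800998ecf8427e']:
    c.execute('''select * from keys where hash = ?''', (md5,))
    numhits += len(c.fetchall())
conn.close()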
You have to create an index on the hash column of your table. Even if the hashes are not unique, not having an index means every lookup is a linear scan through all the records.
Looking at the docs, creating your index is as easy as issuing
CREATE INDEX hash ON keys (hash)
on your database.
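From Python, that is a one-off statement after the bulk load; a sketch (file name is a placeholder, same sample hash as above):

import sqlite3

conn = sqlite3.connect('hashes.sqlite')             # placeholder file name
c = conn.cursor()

# Build the index once, after all the inserts; it takes a while on 100 million
# rows, but afterwards each lookup is an index seek instead of a full table scan.
c.execute('CREATE INDEX hash ON keys (hash)')
conn.commit()

# Lookups now hit the index.
c.execute('select * from keys where hash = ?', ('d41d8cd98f00b204e9800998ecf8427e',))
print(c.fetchall())
conn.close()

Creating the index after loading, rather than before, also keeps the inserts fast, since sqlite does not have to update the index on every row as it goes.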