Python: The most efficient way to join two very big (20+ GB) datasets?


I have a 21 GB dataset df_ns:

domain|ns
123.com|ns1.domanihost.com
hymi.net|ns2.hostinger.com

and another 12 GB dataset df_ip:

ip|domain
28.76.2.2|myname.com
86.90.234.5| 123.com

I would like to join them on the domain name and, for the domains that appear in both files, extract the ip and ns.

My idea was to load the df_ip data into a dictionary, then go through the df_ns data line by line, check whether each domain is in the dictionary, and if so extract the ns. But this is still very resource-consuming.
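For reference, the dictionary approach described above might look roughly like the following. This is only a minimal sketch under some assumptions: the function name `join_with_dict` is made up for illustration, the two small in-memory strings stand in for the real files, and each domain is assumed to have a single ip (with duplicates, the last one read wins).

```python
import csv
import io

def join_with_dict(ip_file, ns_file):
    """Yield (domain, ip, ns) for domains present in both inputs."""
    ip_reader = csv.reader(ip_file, delimiter="|")
    next(ip_reader, None)  # skip the "ip|domain" header
    # domain -> ip; this dictionary is the part that consumes memory
    ip_by_domain = {domain.strip(): ip for ip, domain in ip_reader}

    ns_reader = csv.reader(ns_file, delimiter="|")
    next(ns_reader, None)  # skip the "domain|ns" header
    for domain, ns in ns_reader:
        ip = ip_by_domain.get(domain.strip())
        if ip is not None:
            yield domain.strip(), ip, ns

# Tiny in-memory stand-ins for the two files (hypothetical sample only)
df_ip = io.StringIO("ip|domain\n28.76.2.2|myname.com\n86.90.234.5| 123.com\n")
df_ns = io.StringIO("domain|ns\n123.com|ns1.domanihost.com\nhymi.net|ns2.hostinger.com\n")

result = list(join_with_dict(df_ip, df_ns))
print(result)
```

Only the df_ns side is streamed here; the whole df_ip mapping still has to fit in memory, which is why this gets expensive at 12 GB.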

Does anybody have another, more efficient idea for how to do this?


There are 2 answers

Robᵩ On BEST ANSWER
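# Sort each file on its domain column (field 1 in df_ns, field 2 in df_ip).
# LC_ALL=C makes sort and join agree on a plain byte-order collation.
LC_ALL=C sort -t'|' -k1,1 -o df_ns.csv df_ns.csv && \
LC_ALL=C sort -t'|' -k2,2 -o df_ip.csv df_ip.csv && \
LC_ALL=C join -t'|' -1 1 -2 2 df_ns.csv df_ip.csv > df_combined.csv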

Reference: http://linux.die.net/man/1/join

distort86 On

Sort both files by the domain column, e.g., with GNU sort (run it with LC_ALL=C so the order matches Python's string comparison). After that, you do not need to hold the data in memory; just walk the two files with two iterators like this:

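import csv
import sys

# Both files must already be sorted by domain in byte order (LC_ALL=C sort),
# and the header is assumed to still be the first line of each file.
f_ns = open("df_ns", newline="")
f_ip = open("df_ip", newline="")
it1 = csv.reader(f_ns, delimiter="|")  # rows: domain|ns
it2 = csv.reader(f_ip, delimiter="|")  # rows: ip|domain
# skip the headers
next(it1, None)
next(it2, None)
try:
    dm1, ns = next(it1)  # first row
except StopIteration:
    sys.exit(0)
try:
    ip, dm2 = next(it2)  # domain is the second column in df_ip
except StopIteration:
    sys.exit(0)
while True:
    if dm1 == dm2:
        print(dm1, ns, ip)
    if dm1 < dm2:
        # df_ns is behind: advance it
        try:
            dm1, ns = next(it1)
        except StopIteration:
            break
        continue
    # otherwise advance df_ip (this also steps past a match)
    try:
        ip, dm2 = next(it2)
    except StopIteration:
        break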
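
The loop above advances only the df_ip side after a match, so a domain that occurs several times in df_ns is paired once per df_ip row but not the other way round. If both files can contain repeated domains, one possible variation is to group each sorted stream by domain and emit every pairing. The sketch below is only an illustration: `merge_join` is a made-up name, the inputs are assumed to be iterables of (domain, value) tuples already sorted by domain, and the sample rows are invented.

```python
from itertools import groupby
from operator import itemgetter

def merge_join(ns_rows, ip_rows):
    """Yield (domain, ns, ip) for every matching pair.

    ns_rows yields (domain, ns), ip_rows yields (domain, ip);
    both must already be sorted by domain.
    """
    ns_groups = groupby(ns_rows, key=itemgetter(0))
    ip_groups = groupby(ip_rows, key=itemgetter(0))
    ns_key, ns_grp = next(ns_groups, (None, None))
    ip_key, ip_grp = next(ip_groups, (None, None))
    while ns_grp is not None and ip_grp is not None:
        if ns_key == ip_key:
            # Only the ips of the current domain are held in memory
            ips = [ip for _, ip in ip_grp]
            for _, ns in ns_grp:
                for ip in ips:
                    yield ns_key, ns, ip
            ns_key, ns_grp = next(ns_groups, (None, None))
            ip_key, ip_grp = next(ip_groups, (None, None))
        elif ns_key < ip_key:
            ns_key, ns_grp = next(ns_groups, (None, None))
        else:
            ip_key, ip_grp = next(ip_groups, (None, None))

# Invented sample rows, already sorted by domain
ns_rows = [
    ("123.com", "ns1.domanihost.com"),
    ("123.com", "ns2.domanihost.com"),
    ("hymi.net", "ns2.hostinger.com"),
]
ip_rows = [
    ("123.com", "86.90.234.5"),
    ("myname.com", "28.76.2.2"),
]
joined = list(merge_join(ns_rows, ip_rows))
print(joined)
```

With real files, the two arguments could be generators built on csv.reader (swapping the columns for df_ip), so memory use stays bounded by the largest group of rows sharing one domain.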