I have a 21 GB dataset df_ns:
domain|ns
123.com|ns1.domanihost.com
hymi.net|ns2.hostinger.com
and another 12 GB dataset df_ip:
ip|domain
28.76.2.2|myname.com
86.90.234.5| 123.com
I would like to join them on the domain name and, for the domains that appear in both files, extract the ip and ns.
The way I thought of doing it is to load the df_ip data into a dictionary, then go through the df_ns data line by line, check whether each domain is in the dictionary, and if so extract the ns. But that is still very resource-consuming.
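A minimal sketch of that dictionary approach, assuming pipe-separated lines with the header row already skipped and at most one ip per domain (a later duplicate would overwrite an earlier one). The function name `join_via_dict` and the in-memory sample files are only for illustration:

```python
import io

def join_via_dict(ip_lines, ns_lines, sep="|"):
    """Yield (domain, ip, ns) for every domain present in both inputs."""
    # Build the lookup table from the ip file: domain -> ip.
    # Assumption: one ip per domain; duplicates keep only the last ip seen.
    ip_by_domain = {}
    for line in ip_lines:
        ip, _, domain = line.rstrip("\n").partition(sep)
        # Strip surrounding whitespace in case a field is padded.
        ip_by_domain[domain.strip()] = ip.strip()

    # Stream the ns file and emit only the domains found in the table.
    for line in ns_lines:
        domain, _, ns = line.rstrip("\n").partition(sep)
        domain = domain.strip()
        ip = ip_by_domain.get(domain)
        if ip is not None:
            yield domain, ip, ns.strip()

# Small stand-ins for the real files (hypothetical, for illustration only).
ip_file = io.StringIO("ip|domain\n28.76.2.2|myname.com\n86.90.234.5| 123.com\n")
ns_file = io.StringIO("domain|ns\n123.com|ns1.domanihost.com\nhymi.net|ns2.hostinger.com\n")
next(ip_file)  # skip the header row
next(ns_file)  # skip the header row

rows = list(join_via_dict(ip_file, ns_file))
print(rows)
```

The whole df_ip mapping lives in memory here, which is where most of the resource cost comes from.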
Does anybody have another, more efficient idea for how to do this?
Reference: http://linux.die.net/man/1/join
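join(1) expects both inputs to be sorted on the join field and then walks them in parallel. The same idea can be sketched in Python as a merge join, assuming both files have already been sorted by domain (for example with an external sort) and that the sort order matches Python's string comparison. The name `merge_join` and the tuple layout are assumptions for illustration, not part of any tool:

```python
from itertools import groupby

def merge_join(ns_rows, ip_rows):
    """Merge-join two iterables that are each sorted by domain.

    ns_rows yields (domain, ns) tuples, ip_rows yields (domain, ip) tuples.
    Yields (domain, ip, ns) for every domain present in both inputs.
    """
    # Group consecutive rows that share a domain, so duplicates are handled.
    ns_groups = groupby(ns_rows, key=lambda row: row[0])
    ip_groups = groupby(ip_rows, key=lambda row: row[0])
    ns_item = next(ns_groups, None)
    ip_item = next(ip_groups, None)

    while ns_item is not None and ip_item is not None:
        ns_key, ns_group = ns_item
        ip_key, ip_group = ip_item
        if ns_key < ip_key:
            # Domain only in the ns input: advance that side.
            ns_item = next(ns_groups, None)
        elif ns_key > ip_key:
            # Domain only in the ip input: advance that side.
            ip_item = next(ip_groups, None)
        else:
            # Domain in both inputs: emit every ip/ns combination.
            ns_values = [ns for _, ns in ns_group]
            for _, ip in ip_group:
                for ns in ns_values:
                    yield ns_key, ip, ns
            ns_item = next(ns_groups, None)
            ip_item = next(ip_groups, None)

# Sample rows, already sorted by domain.
ns_sorted = [("123.com", "ns1.domanihost.com"), ("hymi.net", "ns2.hostinger.com")]
ip_sorted = [("123.com", "86.90.234.5"), ("myname.com", "28.76.2.2")]
joined = list(merge_join(ns_sorted, ip_sorted))
print(joined)
```

Only the rows for the current domain are held in memory at any time, so the memory cost does not grow with file size; the sorting step is where the work moves to.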