Processing large Pandas Dataframes (fuzzy matching)


I would like to do fuzzy matching where I match strings from a column of a large dataframe (130,000 rows) to a list (400 rows). The code I wrote was tested on a small sample (matching 3000 rows to 400 rows) and works fine. It is too large to copy here, but it roughly works like this:

1) Normalize the data in the columns.
2) Create the Cartesian product of the columns and calculate the Levenshtein distance for each pair.
3) Select the highest scoring matches and store the 'large_csv_name' values in a separate list.
4) Compare the list of 'large_csv_names' to 'large_csv', pull out all the intersecting data and write it to a csv.
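
For concreteness, a minimal sketch of those four steps (the column name 'name', the match_names helper and the 0.8 cutoff are placeholders, and difflib's SequenceMatcher stands in for the Levenshtein scoring my script actually uses):

from difflib import SequenceMatcher

def normalize(s):
    # step 1: basic normalization (lowercase, strip whitespace)
    return s.strip().lower()

def match_names(large_df, name_list, threshold=0.8):
    # name_list is the 400-row list; threshold is an illustrative cutoff
    targets = [normalize(t) for t in name_list]
    matched_names = []
    for raw in large_df['name']:
        name = normalize(raw)
        # step 2: score this name against every list entry
        # (all rows times all targets is the Cartesian product)
        score, _ = max((SequenceMatcher(None, name, t).ratio(), t) for t in targets)
        # step 3: keep only the highest scoring matches
        if score >= threshold:
            matched_names.append(raw)
    # step 4: pull the intersecting rows out of the large dataframe
    return large_df[large_df['name'].isin(matched_names)]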

Because the Cartesian product contains over 50 million records, I quickly run into memory errors.

That's why I would like to know how to divide the large dataset up into chunks on which I can then run my script.

So far I have tried:

df_split = np.array_split(df, x)   # x = number of chunks, e.g. 50 or 500
for i in df_split:
    # run steps 1-4 from above on the chunk i

As well as:

for chunk in pd.read_csv('large_csv.csv', chunksize=x):   # x = chunk size, e.g. 50 or 500
    # run steps 1-4 from above on the chunk

None of these methods seem to work. I would like to know how to run the fuzzy matching in chunks, that is, cut the large csv up into pieces, take a piece, run the code, take a piece, run the code, and so on.
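
To make the goal concrete, what I am after would look roughly like this, reusing the match_names sketch from above (the chunk size, the name_list variable and the 'matches.csv' output name are placeholders):

import pandas as pd

first_chunk = True
for chunk in pd.read_csv('large_csv.csv', chunksize=500):
    # run steps 1-4 on this chunk only, so the Cartesian product is
    # at most 500 x 400 pairs instead of 130,000 x 400
    matched = match_names(chunk, name_list)
    # append each chunk's results to a single output csv
    matched.to_csv('matches.csv', mode='w' if first_chunk else 'a',
                   header=first_chunk, index=False)
    first_chunk = False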


1 Answer

Michiel V.

In the meantime I wrote a script that slices a dataframe into chunks, each of which is then ready to be processed further. Since I'm new to Python the code is probably a bit messy, but I still wanted to share it with anyone who might be stuck on the same problem as I was.

import pandas as pd
import math

# df is the (large) dataframe to be split, loaded beforehand

partitions = 3    # number of ways to split df
length = len(df)

list_index = list(df.index.values)
counter = 0     # var that will be used to stop slicing when df ends
block_counter0 = 0      # var which indicates the begin index of the slice
block_counter1 = block_counter0 + math.ceil(length / partitions)  # end index of the slice

while counter < int(len(list_index)):      # stop slicing when df ends
    df1 = df.iloc[block_counter0:block_counter1]  # temp df that forms the chunk
    for i in range(block_counter0, block_counter1):
        # insert operations on the rows of df1 here
        pass

    counter += 1  # increase counter by 1 to stop slicing in time
    block_counter0 = block_counter1   # when the for loop ends the indices are updated
    if block_counter0 + math.ceil(length / partitions) > int(len(list_index)):
        block_counter1 = len(list_index)
        counter += 1
    else:
        block_counter1 = block_counter0 + math.ceil(length / partitions)
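
For anyone reusing this: the same chunks can also be produced a bit more compactly by stepping through the dataframe with iloc, which avoids the counter bookkeeping (a minimal equivalent sketch, reusing length and partitions from above):

import math

chunk_size = math.ceil(length / partitions)
for start in range(0, length, chunk_size):
    df1 = df.iloc[start:start + chunk_size]   # same chunk as above
    # insert operations on the rows of df1 here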