I would like to do fuzzy matching where I match strings from a column of a large dataframe (130,000 rows) against a list (400 entries). The code I wrote was tested on a small sample (matching 3,000 rows against the 400) and works fine. It is too large to copy here, but it roughly works like this:
1) normalize the relevant columns
2) create the Cartesian product of the columns and calculate the Levenshtein distance for every pair
3) select the highest-scoring matches and store the 'large_csv_name' values in a separate list
4) compare that list of 'large_csv_names' to 'large_csv', pull out all the intersecting data and write it to a csv
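In condensed form the matching step looks roughly like this (a simplified sketch, not my actual code; the 'large_csv_name' column name and the python-Levenshtein package are assumptions):

    import pandas as pd
    import Levenshtein  # assumes the python-Levenshtein package

    def best_matches(large_df, name_list):
        # step 1: normalize both sides
        large_names = large_df['large_csv_name'].str.lower().str.strip()
        candidates = [n.lower().strip() for n in name_list]
        # steps 2-3: score every pair and keep the best-scoring candidate per row
        rows = []
        for name in large_names:
            scores = [Levenshtein.ratio(name, c) for c in candidates]
            best = max(scores)
            rows.append((name, candidates[scores.index(best)], best))
        return pd.DataFrame(rows, columns=['large_csv_name', 'best_match', 'score'])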
Because the Cartesian product contains over 50 million records, I quickly run into memory errors.
That's why I would like to know how to divide the large dataset into chunks on which I can then run my script.
So far I have tried:
    import numpy as np

    df_split = np.array_split(df, x)  # e.g. x = 50 or 500
    for chunk in df_split:
        # steps 1-4 as above, run on each chunk
As well as:
    import pandas as pd

    for chunk in pd.read_csv('large_csv.csv', chunksize=x):  # e.g. x = 50 or 500
        # steps 1-4 as above, run on each chunk
None of these methods seems to work. I would like to know how to run the fuzzy matching in chunks: cut the large csv into pieces, take a piece, run the code, take the next piece, run the code, and so on.
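Conceptually, what I am after is something like this (a sketch only; fuzzy_match stands for steps 1-4 of my script, and the file name and chunk size are just examples):

    import pandas as pd

    results = []
    # read the large csv in manageable pieces instead of all at once
    for chunk in pd.read_csv('large_csv.csv', chunksize=10000):
        # run the whole matching pipeline (steps 1-4) on this piece only
        results.append(fuzzy_match(chunk, name_list))

    # combine the per-chunk results and write them out once
    pd.concat(results).to_csv('matches.csv', index=False)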
In the meanwhile I wrote a script that slices a dataframe into chunks, each of which is then ready to be processed further. Since I'm new to Python the code is probably a bit messy, but I still wanted to share it with those who might be stuck with the same problem I was.
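Stripped down, the idea behind it is roughly the following (an illustrative sketch with placeholder names, not the full script):

    import pandas as pd

    # slice the dataframe into pieces of chunk_size rows and hand each piece
    # to the rest of the pipeline
    def slice_df(df, chunk_size):
        for start in range(0, len(df), chunk_size):
            yield df.iloc[start:start + chunk_size]

    for piece in slice_df(large_df, 5000):
        process(piece)  # placeholder: steps 1-4 run on 'piece' instead of the full dataframe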