I'm sorry if my code looks confusing. What it does is read in 300,000 items and try to cross-reference them against another file; that is, for each item it tries to find the best-matching item description from the other file.
I know the matching step is slow with the library I'm using (rapidfuzz), since for every single item it calls process.extractOne to pick the one best-matching description out of the whole list of candidates. That's why I'm trying to use multiprocessing, and I'm running the code in a Jupyter Notebook.
I'm still new to multiprocessing, so I'm not really sure what the error below is about.
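For context, here is a minimal standalone example of the kind of lookup every single item needs (the strings here are made up):

from rapidfuzz import process, fuzz, utils

choices = ["CHERRY COLA 12Z", "DIET COLA 12Z", "ORANGE SODA 2L"]

# find the single best match for one query string
best = process.extractOne(
    "cherry cola 12 oz",
    choices,
    processor=utils.default_process,
    scorer=fuzz.token_sort_ratio,
    score_cutoff=65)

print(best)  # (matched choice, score, index), or None if nothing reaches the cutoff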
Here is my code. The first part lives in a separate file, find_match.py (more on why below):
import csv
import re

import pandas as pd
from rapidfuzz import process, fuzz, utils

with open('file', 'r') as o:
    vis_reader = csv.reader(o, delimiter='|')
    vis_items = list(vis_reader)

vis_data = pd.DataFrame(vis_items)
all_vis_item_desc = list(vis_data[1])  # item descriptions
vis_item_size = vis_data[2]            # item sizes

# pre-process every candidate description once up front
choice_mappings = {choice: utils.default_process(choice) for choice in all_vis_item_desc}

def find_match(x):
    match_description = process.extractOne(
        utils.default_process(x),
        choice_mappings,
        processor=None,
        scorer=fuzz.token_sort_ratio,
        score_cutoff=65)
    if match_description:
        # normalize the size notation in the query, then check if the sizes match
        eby_item_size = re.sub(r'(?<=\d)Z\b', r' oz ', x)
        eby_item_size = re.sub(r'/(?=\d)', r' ct ', eby_item_size)
        # regex_size is a pattern defined elsewhere in this file (omitted here)
        eby_item_size = {m[0].replace(" ", "").lower() for m in re.findall(regex_size, eby_item_size)}
        vis_item_desc = match_description[0]
        vis_item_size = vis_items[all_vis_item_desc.index(vis_item_desc)][2]
        if vis_item_size.replace(" ", "").lower() in eby_item_size:
            print("size match")
            cross_reference_values = vis_items[all_vis_item_desc.index(vis_item_desc)]
            print(x)
            print(cross_reference_values)
            message = "Description match"
            size_match = "size match"
            ratio_matching = match_description[1]
        elif not eby_item_size:
            print("match")
            cross_reference_values = vis_items[all_vis_item_desc.index(vis_item_desc)]
            print(x)
            print(cross_reference_values)
            message = "Description match"
            size_match = "review size match"
            ratio_matching = match_description[1]
    return match_description
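One thing I should mention about extractOne: since choice_mappings is a dict of {raw description: processed description}, my understanding from the rapidfuzz docs is that the first element of the returned tuple is the matched value (the processed string), and the original key only comes third. A quick toy check of what I mean:

from rapidfuzz import process, fuzz, utils

mapping = {c: utils.default_process(c) for c in ["CHERRY COLA 12Z", "ORANGE SODA 2L"]}
print(process.extractOne(
    utils.default_process("Cherry Cola 12Z"),
    mapping,
    processor=None,
    scorer=fuzz.token_sort_ratio,
    score_cutoff=65))
# -> (processed value, score, original key), if I read the docs right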
And this is what I run in the notebook:

import numpy as np
import pandas as pd
from multiprocessing import Pool

import find_match  # the separate file shown above

num_partitions = 10  # number of partitions to split the dataframe into
num_cores = 4        # number of cores on my machine

def parallelize(df, func):
    print("parallelizing")
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df
#read in the csv file of a list of items to pandas df
eby_data = read_eby1()
data = parallelize(eby_data, find_match)
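(A side note: I realize pool.map hands each chunk to func as a whole DataFrame, so even once the error is fixed I will probably need a small wrapper that applies find_match row by row. Below is an untested sketch of what I mean, assuming column 0 holds the description, but I never get that far because of the error.)

def apply_find_match(df_chunk):
    # hypothetical wrapper: run the matcher on every description in this chunk
    df_chunk['match'] = df_chunk[0].apply(find_match.find_match)
    return df_chunk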
So I ran this code, but I keep getting this error: TypeError: can't pickle module objects.
Other questions on Stack Overflow about this error deal with classes, but my code doesn't contain any classes, so I'm stuck. Also, I had to put the find_match function in a separate Python file, since apparently multiprocessing can't use functions that are defined inside a Jupyter Notebook.
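In case the setup matters, this is a stripped-down version of how the two pieces are laid out (the matching logic is stubbed out here):

# find_match.py -- the separate file, sitting next to the notebook
def find_match(x):
    # the real matching logic from above is omitted in this sketch
    return x

# notebook cell
from multiprocessing import Pool
import find_match

pool = Pool(4)
result = pool.map(find_match, ['a', 'b', 'c'])
pool.close()
pool.join()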
Basically, I'm trying to speed this code up, because it's very slow right now: 300,000 items, each compared against thousands of candidate descriptions. I'm still new to Python, so if you have any other suggestions, please share them. Thanks!