I have a dataframe of 350k rows and one column (named 'text').
I want to apply this function to my dataset:
import pke

def extract_keyphrases(caption, n, pos=None, stoplist=None):
    # pos and stoplist were undefined in the original snippet; they are
    # now optional parameters (None falls back to pke's defaults).
    extractor = pke.unsupervised.TopicRank()
    extractor.load_document(caption)
    extractor.candidate_selection(pos=pos, stoplist=stoplist)
    extractor.candidate_weighting(threshold=0.74, method='average')
    keyphrases = extractor.get_n_best(n=n, stemming=False)
    return keyphrases
df['keywords'] = df.apply(lambda row: extract_keyphrases(row['text'], 10), axis=1)
But when I run it, it takes a very long time to complete (nearly 50 hours).
Is it possible to use chunksize or some other method to parallelize DataFrame operations, and if so, how?
Thank you for your time!
Use the multiprocessing module. To avoid the overhead of creating one process per row, let each process handle a chunk of about 20,000 rows.
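A minimal sketch of that approach, assuming your frame is loaded from a hypothetical data.csv and that the default pos/stoplist settings are acceptable; the process_chunk helper and the 20,000-row chunk size are illustrative, not part of pke:

import pandas as pd
from multiprocessing import Pool, cpu_count

import pke

def extract_keyphrases(caption, n, pos=None, stoplist=None):
    # Same function as in the question, with pos/stoplist as optional parameters.
    extractor = pke.unsupervised.TopicRank()
    extractor.load_document(caption)
    extractor.candidate_selection(pos=pos, stoplist=stoplist)
    extractor.candidate_weighting(threshold=0.74, method='average')
    return extractor.get_n_best(n=n, stemming=False)

def process_chunk(chunk):
    # Runs inside a worker process: apply the extractor to every
    # caption in one ~20,000-row slice of the 'text' column.
    return chunk.apply(lambda text: extract_keyphrases(text, 10))

if __name__ == '__main__':
    df = pd.read_csv('data.csv')  # hypothetical input; load your 350k-row frame here
    # Slice the column into 20,000-row chunks (about 18 chunks for 350k rows).
    chunk_size = 20_000
    chunks = [df['text'].iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]
    with Pool(processes=cpu_count()) as pool:
        # Each chunk is pickled, sent to a worker, processed, and returned.
        results = pool.map(process_chunk, chunks)
    # iloc slices keep the original index, and pd.concat preserves it,
    # so the concatenated results align row-by-row with df.
    df['keywords'] = pd.concat(results)

Chunking amortizes the pickling and inter-process communication cost over 20,000 rows instead of paying it once per row. The if __name__ == '__main__' guard is needed so worker processes don't re-execute the pool setup when they import the module (required with the spawn start method used on Windows and recent macOS).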