Feed large text to PyTextRank

200 views Asked by At

I would like to use PyTextRank for keyphrase extraction. How can I feed feed 5 million documents (each document consisting of a few paragraphs) to the package?

This is the example I see on the official tutorial.

text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types.\n"
doc = nlp(text)
for phrase in doc._.phrases:
    ic(phrase.rank, phrase.count, phrase.text)
    ic(phrase.chunks)

Is my option only to concatenate several million documents to a single string and pass it to nlp(text)? I do not think I could use nlp.pipe(texts) as I want to create one network by computing words/phrases from all documents.

1

There are 1 answers

0
Paco On BEST ANSWER

No, instead it would almost certainly be better to run these tasks in parallel. Many use cases of pytextrank have used Spark, Dask, Ray, etc., to parallelize running documents through a spaCy pipeline with pytestrank to extract entities. For an example of parallelization with Ray, see https://github.com/Coleridge-Initiative/rclc/blob/4d5347d8d1ac2693901966d6dd6905ba14133f89/bin/index_phrases.py#L45

One question would be how you are associating the extracted entities with documents? Are these being collected into a dataset, or perhaps a database or key/value store?

However these results get collected, you could then construct a graph of co-occurring phrases, and also include additional semantics to help structure the results. A sister project kglab https://github.com/DerwenAI/kglab was created for these kinds of use cases. There are some examples in the Jupyter notebooks included with the kglab project; see https://derwen.ai/docs/kgl/tutorial/

FWIW, we'll have tutorials coming up at ODSC West about using kglab and pytextrank and there are several videos online (under Graph Data Science) for previous tutorials at conferences. We also have monthly public office hours through https://www.knowledgegraph.tech/ – message me @pacoid on Tw for details.