I am summarizing documents using the TextRank pipeline in spaCy (via pytextrank). I need to summarize both long and short documents. Can you suggest a good approach to choosing the right value of `limit_phrases`?

This is the approach I am currently using, but I am sure it can be improved:
```python
import spacy
import pytextrank

# spacy_model and text are defined elsewhere
nlp = spacy.load(spacy_model)
nlp.add_pipe("textrank", last=True)

# Process the input text
doc = nlp(text)
doc_sentences = len(list(doc.sents))
print(f"Number of document sentences = {doc_sentences}")

# Keep a fixed percentage of sentences, and twice as many phrases
limit_sentences = int(doc_sentences * percentage)
limit_phrases = limit_sentences * 2
top_sentences = doc._.textrank.summary(
    limit_phrases=limit_phrases,
    limit_sentences=limit_sentences,
    preserve_order=True,
)
```
The optimal values for `limit_phrases` will depend strongly on your content. Do you have any kind of benchmark against which you could run tests, essentially doing a grid search to find a reasonable setting for this parameter?

FWIW, I'm one of the authors of pytextrank, and this is a really good question. There's no analytic way of determining how to set this parameter, as far as our team knows.
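A grid search along those lines could be sketched as below. This assumes you have a small benchmark of `(document, reference_summary)` pairs; `score_summary` here is a toy stand-in for whatever quality metric you trust (e.g. ROUGE overlap against the reference), and the `en_core_web_sm` model name, the `percentages`, and the `phrase_factors` are all illustrative assumptions, not anything prescribed by pytextrank.

```python
import itertools


def score_summary(summary_sentences, reference):
    """Toy metric: fraction of reference words covered by the summary.
    Replace with a real metric (e.g. ROUGE) for serious tuning."""
    summary_words = {w for s in summary_sentences for w in s.lower().split()}
    reference_words = set(reference.lower().split())
    return len(summary_words & reference_words) / max(len(reference_words), 1)


def grid_search(benchmark, percentages=(0.1, 0.2, 0.3), phrase_factors=(1, 2, 4)):
    """Return the (percentage, phrase_factor) pair with the best average score
    over the benchmark, trying every combination in the grid."""
    import spacy      # deferred imports so score_summary can be used
    import pytextrank  # without spaCy installed

    nlp = spacy.load("en_core_web_sm")  # assumed model name
    nlp.add_pipe("textrank", last=True)

    best_params, best_score = None, float("-inf")
    for pct, factor in itertools.product(percentages, phrase_factors):
        total = 0.0
        for text, reference in benchmark:
            doc = nlp(text)
            limit_sentences = max(1, int(len(list(doc.sents)) * pct))
            summary = [str(s) for s in doc._.textrank.summary(
                limit_phrases=limit_sentences * factor,
                limit_sentences=limit_sentences,
                preserve_order=True,
            )]
            total += score_summary(summary, reference)
        avg = total / len(benchmark)
        if avg > best_score:
            best_params, best_score = (pct, factor), avg
    return best_params, best_score
```

Even a handful of benchmark documents is enough to see whether a phrase factor of 2 (as in your current code) beats the alternatives on your kind of content; since phrase extraction is the expensive step, you could also cache the processed `doc` objects if the grid grows.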