TextRank with Scattertext Visualisation


I recently tried to visualize TextRank output with the code below, but the terms in the graph are not lemmatized. Is there a way to fix the code so that all words in textrank_df['parse'] are lemmatized? I checked the pipeline and all the required components are in place ('tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner'), so I'm really not sure where it went wrong.

import pytextrank
import spacy
import scattertext as st
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("textrank", last=True)
   
convention_df = textrank_df.assign(
    parse=lambda textrank_df: textrank_df['Combined'].apply(nlp),
)

corpus = st.CorpusFromParsedDocuments(
    convention_df,
    category_col='Response Variable',
    parsed_col='parse',
    feats_from_spacy_doc=st.PyTextRankPhrases()).build()

I tried code 1 below, but it raises AttributeError: module 'pytextrank' has no attribute 'TextRank'. I think it has something to do with the format of the parse column after this change.

  • code 1

    convention_df = textrank_df.assign(
        parse=lambda textrank_df: textrank_df['Combined'].apply(
            lambda x: [token.lemma_ for token in nlp(x)]),
    )

I also tried code 2, which adds use_lemmas=True to PyTextRankPhrases(), but that did not work either. The words are still shown in their original form.

  • code 2

    corpus = st.CorpusFromParsedDocuments(
        convention_df,
        category_col='Response Variable',
        parsed_col='parse',
        feats_from_spacy_doc=st.PyTextRankPhrases(use_lemmas=True)).build()


1 answer

Paco

I'm one of the authors of PyTextRank and I've tried out the code shown above.

There are some issues with the usage of scattertext in that example. I don't think the line

convention_df = textrank_df.assign(
    parse=lambda textrank_df: textrank_df['Combined'].apply(nlp),
)

would work correctly. No source text is defined, as far as I can see, and textrank_df is never assigned before it is used, so Python would raise a NameError.

Is this code based on the example in scattertext? https://github.com/JasonKessler/scattertext/blob/master/demo_pytextrank.py

My suggestion would be:

  1. Start with a text source which can be used in a simple spaCy pipeline.
  2. Get the PyTextRank pipeline for spaCy configured and running the way you want it to work.
  3. Then integrate into the declarative pipeline in scattertext and debug that portion.
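
A minimal sketch of steps 1 and 2, assuming spaCy, PyTextRank 3.x and the en_core_web_sm model are installed. The sample sentence and the helper names lemmatized_phrase_counts and demo are made up for illustration and are not part of either library. The sketch relies on each PyTextRank phrase exposing text, rank, count and chunks, where the chunks are spaCy spans with a lemma_ attribute. It prints each phrase next to its lemmatized form, so the lemmatizer output can be checked before scattertext is involved.

```python
from collections import Counter


def lemmatized_phrase_counts(doc):
    """Count the TextRank phrases of one parsed doc, keyed by lemmatized text.

    Assumes doc._.phrases was filled in by PyTextRank's "textrank" component
    and that each phrase has `text`, `count` and `chunks` (spaCy spans).
    Hypothetical helper, not part of PyTextRank or scattertext.
    """
    counts = Counter()
    for phrase in doc._.phrases:
        if phrase.chunks:
            # Span.lemma_ is the lemmatized text of the whole span.
            key = phrase.chunks[0].lemma_.lower()
        else:
            # Fall back to the surface form if no span is attached.
            key = phrase.text.lower()
        counts[key] += phrase.count
    return counts


def demo():
    """Steps 1 and 2: plain text in, ranked phrases and their lemmas out."""
    import spacy
    import pytextrank  # noqa: F401  (the import registers the "textrank" factory)

    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("textrank", last=True)

    # Illustrative sample text; replace with one row of your own data.
    doc = nlp("The committees were reviewing the proposed budgets for rural schools.")

    for phrase in doc._.phrases:
        lemma = phrase.chunks[0].lemma_ if phrase.chunks else phrase.text
        print(f"{phrase.rank:.3f}  {phrase.text!r} -> {lemma!r}")

    return lemmatized_phrase_counts(doc)
```

Calling demo() loads the model and prints the ranked phrases. If the lemmas look right at this stage, whatever problem remains is on the scattertext side, which is step 3.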

It might also be worth asking Jason & co. from scattertext what they'd recommend.