Given:
I have a small sample document with a limited number of words:
d = '''
I go to school by the school bus everyday with all of my best friends.
There are several students who also take the buses to school. Buses are quite cheap in my city.
The city which I live in has an enormous number of brilliant schools with smart students.
We have a nice math teacher in my school whose name is Jane Doe.
She also teaches several other topics in our school, including physics, chemistry and sometimes literature as a substitute teacher.
Other classes don't appreciate her efforts as much as my class. She must be nominated as the best school's teacher.
My school is located far from my apartment. This is why, I am taking the bus to school everyday.
'''
Goal:
Considering my real-world documents, which are much larger (4,000 to 8,000 words), I would like to speed up my Stanza lemmatizer, probably by skipping the lemmatization of repeated words, i.e., words that have already occurred once.
I do not intend to use set() to obtain the unique lemmas in my result list; rather, I intend to avoid lemmatizing words that have already been lemmatized.
For instance, in the given sample raw document d, there are several redundant words that could be skipped in the process (a sketch that regenerates this table follows it):
Word Lemma
--------------------------------------------------
school school
school school <<<<< Redundant
bus bus
everyday everyday
friends friend
students student
buses bus
school school <<<<< Redundant
Buses bus <<<<< Redundant
cheap cheap
city city
city city <<<<< Redundant
live live
enormous enormous
number number
brilliant brilliant
schools school
smart smart
students student <<<<< Redundant
nice nice
math math
teacher teacher
school school <<<<< Redundant
Jane jane
Doe doe
teaches teach
topics topic
school school <<<<< Redundant
including include
physics physics
chemistry chemistry
literature literature
substitute substitute
teacher teacher <<<<< Redundant
classes class
appreciate appreciate
efforts effort
class class
nominated nominate
school school <<<<< Redundant
teacher teacher <<<<< Redundant
school school <<<<< Redundant
located locate
apartment apartment
bus bus <<<<< Redundant
school school <<<<< Redundant
everyday everyday <<<<< Redundant
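For reference, here is a minimal sketch (the variable names are mine) that regenerates the table above, assuming doc and STOPWORDS are already built as in the solutions below. A word counts as redundant once its lowercased surface form has been seen:

seen = set()
print(f"{'Word':<15}{'Lemma'}")
print('-' * 50)
for snt in doc.sentences:
    for w in snt.words:
        wlm = w.lemma
        # same filter as in both solutions below
        if not (wlm and len(wlm) > 2 and wlm not in STOPWORDS):
            continue
        mark = ' <<<<< Redundant' if w.text.lower() in seen else ''
        seen.add(w.text.lower())
        print(f'{w.text:<15}{wlm.lower()}{mark}')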
My [inefficient] solution:
import stanza
from stanza.pipeline.core import DownloadMethod
import nltk

nltk_modules = ['punkt',
                'averaged_perceptron_tagger',
                'stopwords',
                'wordnet',
                'omw-1.4',
                ]
nltk.download(nltk_modules, quiet=True, raise_on_error=True)

# one flat list holding the stopwords of every language NLTK ships
STOPWORDS = nltk.corpus.stopwords.words(nltk.corpus.stopwords.fileids())

nlp = stanza.Pipeline(lang='en', processors='tokenize,lemma,pos',
                      tokenize_no_ssplit=True,
                      download_method=DownloadMethod.REUSE_RESOURCES)
doc = nlp(d)
%timeit -n 10000 [ wlm.lower() for _, s in enumerate(doc.sentences) for _, w in enumerate(s.words) if (wlm:=w.lemma) and len(wlm)>2 and wlm not in STOPWORDS]
10.5 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
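As a side note (my own observation, not part of the original setup): STOPWORDS as built above is a plain list, so each wlm not in STOPWORDS test in the comprehension is a linear scan over thousands of entries. Building it once as a frozenset keeps the result identical while making every membership test O(1):

# same contents as before, but hashed lookups instead of linear scans
STOPWORDS = frozenset(nltk.corpus.stopwords.words(nltk.corpus.stopwords.fileids()))

This alone should cut a noticeable share of the measured loop time, independent of any caching of repeated words.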
My [alternative] solution, a little faster but still NOT efficient for large documents (4,000 to 8,000 words):
def get_lm():
    words_list = list()
    lemmas_list = list()
    for _, vsnt in enumerate(doc.sentences):
        for _, vw in enumerate(vsnt.words):
            wlm = vw.lemma.lower() if vw.lemma else ''  # guard: some tokens have no lemma
            wtxt = vw.text.lower()
            if wtxt in words_list and wlm in lemmas_list:
                lemmas_list.append(wlm)
            elif (wtxt not in words_list and wlm and len(wlm) > 2
                  and wlm not in STOPWORDS):
                lemmas_list.append(wlm)
                words_list.append(wtxt)
    return lemmas_list
%timeit -n 10000 get_lm()
7.85 ms ± 66.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
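Note that both timings above measure only the post-processing of doc; the nlp(d) call itself runs outside the %timeit loop, so the neural lemmatizer still processes every word. Within the post-processing, get_lm() also pays a linear scan for every wtxt in words_list and wlm in lemmas_list test. Below is a sketch of the same logic with a dict cache keyed on the lowercased surface form (get_lm_cached is a hypothetical name of mine, not an established recipe); for this sample it should return the same 47-item list:

def get_lm_cached():
    cache = {}    # lowercased surface form -> lemma, or None if filtered out
    lemmas = []
    for snt in doc.sentences:
        for w in snt.words:
            wtxt = w.text.lower()
            if wtxt in cache:    # O(1): a repeated word skips the filtering entirely
                if cache[wtxt] is not None:
                    lemmas.append(cache[wtxt])
                continue
            wlm = w.lemma
            if wlm and len(wlm) > 2 and wlm not in STOPWORDS:
                cache[wtxt] = wlm.lower()
                lemmas.append(cache[wtxt])
            else:
                cache[wtxt] = None    # remember rejected words too
    return lemmas

One caveat: keying the cache on the surface form reuses whichever lemma Stanza produced for the first occurrence, so context-dependent lemmas of the same form (e.g., 'leaves' as a noun vs. a verb) would be collapsed; for this sample that matches what get_lm() does.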
My ideal result for this sample document, from either solution, should look like this, still containing the repeated lemmas:
lm = [ wlm.lower() for _, s in enumerate(doc.sentences) for _, w in enumerate(s.words) if (wlm:=w.lemma) and len(wlm)>2 and wlm not in STOPWORDS] # solution 1
# lm = get_lm() # solution 2
print(len(lm), lm)
47 ['school', 'school', 'bus', 'everyday', 'friend', 'student', 'bus', 'school', 'bus', 'cheap', 'city', 'city', 'live', 'enormous', 'number', 'brilliant', 'school', 'smart', 'student', 'nice', 'math', 'teacher', 'school', 'jane', 'doe', 'teach', 'topic', 'school', 'include', 'physics', 'chemistry', 'literature', 'substitute', 'teacher', 'class', 'appreciate', 'effort', 'class', 'nominate', 'school', 'teacher', 'school', 'locate', 'apartment', 'bus', 'school', 'everyday']
Is there a better or more efficient approach to this problem for a large corpus or large documents?
Cheers,