Getting RuntimeError: generator raised StopIteration while generating bigrams using nltk module


I am trying to generate bigrams with nltk.ngrams, but I get RuntimeError: generator raised StopIteration. How can I fix this error for my specific case?

My dataframe df has several columns, of which only two, FonctionsStagiaire and ExigencesParticulieres, are of interest to me. Both columns hold lists of tokens; this is how they look before the unigrams and bigrams are generated.

FonctionsStagiaire ExigencesParticulieres
[fall, 2022, human, resources, training, inter... [required, skills, what, you, need, to, succee...
[specifically, seeking, software, engineers, r... [for, game, mechanics, core, engine, tools, li...
[what, do, you, value, in, a, career, at, agni... [bachelor's, degree, student, electrical, engi...
[your, contribution, reporting, reliability, s... [fluent, english, speaking, skills, able, swim...

The output of df[['FonctionsStagiaire', 'ExigencesParticulieres']].head(4).to_dict():

{'FonctionsStagiaire': {21: ['fall',
   '2022',
   'human',
   'resources',
   'training',
   'internship',
   'remote',
   'mea01229',
   'what',
   'do',
   'you',
   'value',
   'in',
   'a',
   'career',
   'at',
   'agnico',
   'eagle',
   'values',
   'never',
   'waver',
   'we',
   'believe',
   'trust',
   'respect',
   'equality',
   'family',
   'responsibility',
   'why',
   'because',
   'express',
   'helped',
   'us',
   'succeed',
   'business',
   '60',
   'years',
   'about',
   'meadowbank',
   'our',
   'nunavut',
   'operations',
   'agnico',
   'eagle',
   'always',
   'looking',
   'new',
   'talented',
   'team',
   'members',
   'join',
   'nunavut',
   'mining',
   'operations',
   'we',
   'operating',
   'meadowbank',
   'first',
   'low',
   'arctic',
   'mine',
   'near',
   'baker',
   'lake',
   'nine',
   'years',
   'the',
   'mine',
   'produced',
   'three',
   'millionth',
   'ounce',
   'gold',
   '2018',
   '2019',
   'marked',
   'last',
   'year',
   'production',
   'meadowbank',
   'mine',
   'since',
   'transitioned',
   'process',
   'ore',
   'amaruq',
   'satellite',
   'deposit',
   'with',
   'official',
   'opening',
   'amaruq',
   'whale',
   'tail',
   'project',
   'august',
   '2019',
   'project',
   'referred',
   'meadowbank',
   'complex',
   'your',
   'contribution',
   'reporting',
   'training',
   'coordinator',
   'training',
   'intern',
   'part',
   'people',
   'development',
   'department',
   'collaborates',
   'departments',
   'mine',
   'shehe',
   'ensure',
   'goals',
   'objectives',
   'achieved',
   'promoting',
   'respecting',
   'agnico',
   "eagle's",
   'culture',
   'health',
   'safety',
   'code',
   'conduct',
   'environment',
   'coordinate',
   'manage',
   'projects',
   'related',
   'training',
   'northern',
   'mining',
   'environment',
   'develop',
   'modify',
   'training',
   'content',
   'assist',
   'planning',
   'tracking',
   'training',
   'activities',
   'develop',
   'maintain',
   'effective',
   'training',
   'materials',
   'assist',
   'management',
   'elearning',
   'training',
   'platform',
   'participate',
   'creation',
   'training',
   'practices',
   'procedures',
   'your',
   'work',
   'schedule',
   'schedule',
   '14',
   'days',
   'work',
   '12',
   'hour',
   'shifts',
   'followed',
   '14',
   'days',
   'transportation',
   'rest',
   'flights',
   'departing',
   'communities',
   'kivalliq',
   'region',
   'mirabel',
   "vald'or",
   'quebec',
   'travel',
   'room',
   'board',
   'provided',
   'agnico',
   'eagle',
   'to',
   'apply',
   'position',
   'please',
   'use',
   'following',
   'url',
   'httpsars2equestcomresponse_id1cf099f2f50a501d123d332ae1931084'],
  22: ['fall',
   '2022',
   'mechanical',
   'engineering',
   'mobile',
   'maintenance',
   'planner',
   'intern',
   'remote',
   'mea01233',
   'your',
   'contribution',
   'reporting',
   'reliability',
   'specialist',
   'part',
   'maintenance',
   'department',
   'collaborates',
   'departments',
   'mine',
   'heshe',
   'ensure',
   'goals',
   'objectives',
   'achieved',
   'promoting',
   'respecting',
   'agnico',
   "eagle's",
   'culture',
   'health',
   'safety',
   'code',
   'conduct',
   'environment',
   'your',
   'task',
   'be',
   'optimize',
   'implement',
   'preventive',
   'maintenance',
   'plans',
   'monitor',
   'oil',
   'samples',
   'fleet',
   'schedule',
   'preventive',
   'replacement',
   'benchmarked',
   'components',
   'update',
   'prediction',
   'report',
   'be',
   'responsible',
   'various',
   'optimization',
   'projects',
   'maintenance',
   'department',
   'monitor',
   '23',
   'new',
   'long',
   'haul',
   'trucks',
   'mine',
   'acquired',
   'support',
   'reliability',
   'specialist',
   'different',
   'tasks',
   'your',
   'work',
   'schedule',
   'schedule',
   '14',
   'days',
   'work',
   'followed',
   '14',
   'days',
   'transportation',
   'rest',
   'flights',
   'departing',
   'communities',
   'kivalliq',
   'region',
   'mirabel',
   "vald'or",
   'quebec',
   'to',
   'apply',
   'position',
   'please',
   'use',
   'following',
   'url',
   'httpsars2equestcomresponse_ide5baa1f06dae8d5dcf05e2b228680085'],
  23: ['fall',
   '2022',
   'mine',
   'engineering',
   'drill',
   'blast',
   'internship',
   'remote',
   'mea01235',
   'your',
   'contribution',
   'reporting',
   'production',
   'engineering',
   'coordinator',
   'production',
   'engineering',
   'intern',
   'part',
   'engineering',
   'department',
   'collaborates',
   'departments',
   'mine',
   'shehe',
   'ensure',
   'goals',
   'objectives',
   'achieved',
   'promoting',
   'respecting',
   'agnico',
   "eagle's",
   'culture',
   'health',
   'safety',
   'code',
   'conduct',
   'environment',
   'there',
   'two',
   'primary',
   'tasks',
   'engineering',
   'intern',
   'the',
   'first',
   'task',
   'position',
   'fulfilling',
   'quality',
   'assurancequality',
   'control',
   'qaqc',
   'duties',
   'drill',
   'blast',
   'by',
   'performing',
   'qaqc',
   'drill',
   'blast',
   'patterns',
   'field',
   'henceforth',
   'referred',
   'qaqc',
   'drilling',
   'loading',
   'collecting',
   'compiling',
   'qaqc',
   'data',
   'communicating',
   'qaqc',
   'data',
   'engineering',
   'team',
   'the',
   'second',
   'task',
   'position',
   'performing',
   'fragmentation',
   'analysis',
   'drill',
   'blast',
   'engineers',
   'by',
   'taking',
   'pictures',
   'muck',
   'faces',
   'field',
   'performing',
   'split',
   'desktop',
   'analysis',
   'pictures',
   'communicating',
   'fragmentation',
   'results',
   'engineering',
   'team',
   'primary',
   'duties',
   'tracking',
   'progression',
   'drilling',
   'mucking',
   'morning',
   'meeting',
   'ensure',
   'priorities',
   'qaqc',
   'fragmentation',
   'analysis',
   'met',
   'quality',
   'assurancequality',
   'control',
   'drilling',
   'loading',
   'practices',
   'field',
   'fragmentation',
   'analysis',
   'active',
   'mucking',
   'faces',
   'promote',
   'health',
   'safety',
   'participating',
   'monthly',
   'departmental',
   'hs',
   'meeting',
   'secondary',
   'duties',
   'provide',
   'technical',
   'support',
   'drill',
   'blast',
   'engineer',
   'blast',
   'optimization',
   'project',
   'proposals',
   'or',
   'needsinterests',
   'floor',
   'analysis',
   'muck',
   'floor',
   'water',
   'presence',
   'drill',
   'patterns',
   'loading',
   'statistics',
   'powder',
   'factor',
   'analysis',
   'provide',
   'relief',
   'support',
   'mine',
   'clerk',
   'vacations',
   'special',
   'projects',
   'according',
   'needs',
   'engineering',
   'mine',
   'department',
   'providing',
   'relief',
   'support',
   'engineering',
   'team',
   'vacations',
   'drill',
   'pattern',
   'design',
   'blast',
   'timing',
   'design',
   'your',
   'work',
   'schedule',
   'schedule',
   '14',
   'days',
   'work',
   'followed',
   '14',
   'days',
   'transportation',
   'rest',
   'flights',
   'departing',
   'communities',
   'kivalliq',
   'region',
   'mirabel',
   "vald'or",
   'quebec',
   'to',
   'apply',
   'position',
   'please',
   'use',
   'following',
   'url',
   'httpsars2equestcomresponse_id0fc30561bcc0715fe84e7690663f1bc8'],
  24: ['what',
   'do',
   'you',
   'value',
   'in',
   'a',
   'career',
   'at',
   'agnico',
   'eagle',
   'values',
   'never',
   'waver',
   'we',
   'believe',
   'trust',
   'respect',
   'equality',
   'family',
   'responsibility',
   'why',
   'because',
   'express',
   'helped',
   'us',
   'succeed',
   'business',
   '60',
   'years',
   'about',
   'meadowbank',
   'our',
   'nunavut',
   'operations',
   'agnico',
   'eagle',
   'always',
   'looking',
   'new',
   'talented',
   'team',
   'members',
   'join',
   'nunavut',
   'mining',
   'operations',
   'we',
   'operating',
   'meadowbank',
   'first',
   'low',
   'arctic',
   'mine',
   'near',
   'baker',
   'lake',
   'nine',
   'years',
   'the',
   'mine',
   'produced',
   'three',
   'millionth',
   'ounce',
   'gold',
   '2018',
   '2019',
   'marked',
   'last',
   'year',
   'production',
   'meadowbank',
   'mine',
   'since',
   'transitioned',
   'process',
   'ore',
   'amaruq',
   'satellite',
   'deposit',
   'with',
   'official',
   'opening',
   'amaruq',
   'whale',
   'tail',
   'project',
   'august',
   '2019',
   'project',
   'referred',
   'meadowbank',
   'complex',
   'your',
   'contribution',
   'reporting',
   'senior',
   'grade',
   'control',
   'technician',
   'geology',
   'intern',
   'part',
   'mine',
   'geology',
   'department',
   'collaborates',
   'departments',
   'mine',
   'shehe',
   'ensure',
   'goals',
   'objectives',
   'achieved',
   'promoting',
   'respecting',
   'agnico',
   "eagle's",
   'culture',
   'health',
   'safety',
   'code',
   'conduct',
   'environment',
   'drill',
   'blast',
   'excavation',
   'monitoring',
   'regards',
   'mine',
   'geology',
   'grade',
   'control',
   'standards',
   'qaqc',
   'field',
   'regular',
   'audits',
   'quality',
   'sampling',
   'layout',
   'ore',
   'packets',
   'blasted',
   'muck',
   'define',
   'different',
   'ore',
   'zones',
   'mining',
   'daily',
   'sample',
   'collection',
   'mine',
   'shipment',
   'lab',
   'monitoring',
   'any',
   'tasks',
   'senior',
   'grade',
   'control',
   'andor',
   'production',
   'geologist',
   'might',
   'identify',
   'position',
   'approximately',
   '90',
   'field',
   'work',
   '10',
   'office',
   'work',
   'your',
   'work',
   'schedule',
   'schedule',
   '14',
   'days',
   'work',
   'followed',
   '14',
   'days',
   'transportation',
   'rest',
   'flights',
   'departing',
   'communities',
   'kivalliq',
   'region',
   'mirabel',
   "vald'or",
   'quebec',
   'travel',
   'room',
   'board',
   'provided',
   'agnico',
   'eagle',
   'to',
   'apply',
   'position',
   'please',
   'use',
   'following',
   'url',
   'httpsars2equestcomresponse_idb79a50f3edc987d26ffdf6568a7c1604']},
 'ExigencesParticulieres': {21: ['required',
   'skills',
   'what',
   'you',
   'need',
   'to',
   'succeed',
   'enrolled',
   'graduated',
   "bachelor's",
   'degree',
   'human',
   'resources',
   'administration',
   'management',
   'industrial',
   'relations',
   'related',
   'field',
   'mining',
   'experience',
   'asset',
   'strong',
   'sense',
   'organization',
   'quick',
   'learner',
   'experience',
   'working',
   'multicultural',
   'environment',
   'asset',
   'excellent',
   'communication',
   'skills',
   'english',
   'written',
   'spoken',
   'must',
   'strong',
   'interpersonal',
   'communication',
   'team',
   'building',
   'skills',
   'strong',
   'computer',
   'skills',
   'including',
   'use',
   'word',
   'excel',
   'powerpoint'],
  22: ['required',
   'skills',
   'what',
   'you',
   'need',
   'to',
   'succeed',
   'being',
   'autonomous',
   'proactive',
   'valuable',
   'must',
   'quick',
   'learner',
   'organizational',
   'skills',
   'required',
   'enrolled',
   'graduated',
   "bachelor's",
   'degree',
   'mechanical',
   'mining',
   'engineering',
   'related',
   'field',
   'mining',
   'experience',
   'asset',
   'underground',
   'experience',
   'asset',
   'experience',
   'working',
   'multicultural',
   'environment',
   'asset',
   'excellent',
   'communication',
   'skills',
   'english',
   'written',
   'spoken',
   'must',
   'strong',
   'interpersonal',
   'communication',
   'team',
   'building',
   'skills'],
  23: ['required',
   'skills',
   'what',
   'you',
   'need',
   'to',
   'succeed',
   'valid',
   "driver's",
   'license',
   'enrolled',
   'graduated',
   "bachelor's",
   'degree',
   'mining',
   'engineering',
   'related',
   'field',
   'mining',
   'experience',
   'asset',
   'experience',
   'working',
   'multicultural',
   'environment',
   'asset',
   'excellent',
   'communication',
   'skills',
   'english',
   'written',
   'spoken',
   'must',
   'strong',
   'interpersonal',
   'communication',
   'team',
   'building',
   'skills'],
  24: ['required',
   'skills',
   'what',
   'you',
   'need',
   'to',
   'succeed',
   'enrolled',
   'graduated',
   "bachelor's",
   'degree',
   'kind',
   'geosciences',
   'related',
   'fields',
   'mining',
   'experience',
   'asset',
   'experience',
   'working',
   'multicultural',
   'environment',
   'asset',
   'excellent',
   'communication',
   'skills',
   'english',
   'written',
   'spoken',
   'must',
   'strong',
   'interpersonal',
   'communication',
   'team',
   'building',
   'skills']}}

The code for generating the unigrams and bigrams and their counts:

import nltk  # the code below calls nltk.ngrams and nltk.FreqDist
from nltk import FreqDist
from collections import Counter

col_list = ['FonctionsStagiaire', 'ExigencesParticulieres']


for col in col_list:
    
    df[col+'_unigrams'] = df[col].apply(lambda row: list(nltk.ngrams(row, 1))) 
    #try:
    df[col+'_bigrams'] = df[col].apply(lambda row: list(nltk.ngrams(row, 2)))
    #except RuntimeError:
    #    for i in df.index:
    #            print(df.index)
    df[col+'_unigrams_freq_dist'] = df[col+'_unigrams'].apply(lambda row: list(nltk.FreqDist(row)))
    df[col+'_bigrams_freq_dist'] = df[col+'_bigrams'].apply(lambda row: list(nltk.FreqDist(row)))
    df[col+'_unigrams_counts'] = df[col+'_unigrams_freq_dist'].apply(lambda row: list(Counter(row).most_common()))
    df[col+'_bigrams_counts'] = df[col+'_bigrams_freq_dist'].apply(lambda row: list(Counter(row).most_common()))
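
A side note on the counting step, separate from the error itself: nltk.FreqDist is a subclass of collections.Counter, and wrapping it in list() keeps only the keys, so the *_counts columns above would report 1 for every n-gram. The sketch below illustrates this with plain Counter objects; the bigrams list is a made-up example, not data from the dataframe.

```python
from collections import Counter

# Hypothetical bigrams for one row; ('team', 'building') occurs twice.
bigrams = [("team", "building"), ("building", "skills"), ("team", "building")]

# Converting a frequency table to a list keeps only its keys ...
keys_only = list(Counter(bigrams))
# ... so counting those keys again reports 1 for every bigram.
flattened = Counter(keys_only).most_common()

# Counting the bigrams directly preserves the real frequencies.
direct = Counter(bigrams).most_common()

print(flattened)
print(direct)
```

If real frequencies are wanted, storing the result of most_common() on the frequency table directly (without the intermediate list()) is probably closer to the intent.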

The error in more detail:

---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
File /opt/conda/lib/python3.10/site-packages/nltk/util.py:468, in ngrams(sequence, n, pad_left, pad_right, left_pad_symbol, right_pad_symbol)
    467 while n > 1:
--> 468     history.append(next(sequence))
    469     n -= 1

StopIteration: 

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
Cell In[100], line 12
     10 df[col+'_unigrams'] = df[col].apply(lambda row: list(nltk.ngrams(row, 1))) 
     11 #try:
---> 12 df[col+'_bigrams'] = df[col].apply(lambda row: list(nltk.ngrams(row, 2)))
     13 #except RuntimeError:
     14 #    for i in df.index:
     15 #            print(df.index)
     16 df[col+'_unigrams_freq_dist'] = df[col+'_unigrams'].apply(lambda row: list(nltk.FreqDist(row)))

File /opt/conda/lib/python3.10/site-packages/pandas/core/series.py:4771, in Series.apply(self, func, convert_dtype, args, **kwargs)
   4661 def apply(
   4662     self,
   4663     func: AggFuncType,
   (...)
   4666     **kwargs,
   4667 ) -> DataFrame | Series:
   4668     """
   4669     Invoke function on values of Series.
   4670 
   (...)
   4769     dtype: float64
   4770     """
-> 4771     return SeriesApply(self, func, convert_dtype, args, kwargs).apply()

File /opt/conda/lib/python3.10/site-packages/pandas/core/apply.py:1123, in SeriesApply.apply(self)
   1120     return self.apply_str()
   1122 # self.f is Callable
-> 1123 return self.apply_standard()

File /opt/conda/lib/python3.10/site-packages/pandas/core/apply.py:1174, in SeriesApply.apply_standard(self)
   1172     else:
   1173         values = obj.astype(object)._values
-> 1174         mapped = lib.map_infer(
   1175             values,
   1176             f,
   1177             convert=self.convert_dtype,
   1178         )
   1180 if len(mapped) and isinstance(mapped[0], ABCSeries):
   1181     # GH#43986 Need to do list(mapped) in order to get treated as nested
   1182     #  See also GH#25959 regarding EA support
   1183     return obj._constructor_expanddim(list(mapped), index=obj.index)

File /opt/conda/lib/python3.10/site-packages/pandas/_libs/lib.pyx:2924, in pandas._libs.lib.map_infer()

Cell In[100], line 12, in <lambda>(row)
     10 df[col+'_unigrams'] = df[col].apply(lambda row: list(nltk.ngrams(row, 1))) 
     11 #try:
---> 12 df[col+'_bigrams'] = df[col].apply(lambda row: list(nltk.ngrams(row, 2)))
     13 #except RuntimeError:
     14 #    for i in df.index:
     15 #            print(df.index)
     16 df[col+'_unigrams_freq_dist'] = df[col+'_unigrams'].apply(lambda row: list(nltk.FreqDist(row)))

RuntimeError: generator raised StopIteration

I do not understand why this error is raised. I have run the same code on other datasets, and it ran fine on each of them.
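
One plausible reading of the traceback: the failing statement is next(sequence) at nltk/util.py:468, which raises StopIteration when a row runs out of tokens before n - 1 of them have been read. For n = 2 that means an empty token list. Since Python 3.7 (PEP 479), a StopIteration escaping a generator is converted into RuntimeError, which matches the message above. With n = 1 that loop is skipped, which would explain why the unigram line works, and a dataset without empty rows would never hit it. The sketch below is an assumption-laden check, not a confirmed diagnosis: safe_ngrams and find_empty_rows are hypothetical helpers, the sample dict stands in for one dataframe column, and plain zip is used in place of nltk.ngrams.

```python
from collections import Counter


def safe_ngrams(tokens, n):
    """Return the n-grams of `tokens` as a list of tuples.

    Rows that are not lists/tuples (e.g. NaN) or that hold fewer than
    `n` tokens yield an empty list instead of raising.
    """
    if not isinstance(tokens, (list, tuple)) or len(tokens) < n:
        return []
    # zip over n staggered views of the token list
    return list(zip(*(tokens[i:] for i in range(n))))


def find_empty_rows(rows):
    """Return the keys of rows whose token list is empty."""
    return [
        key for key, tokens in rows.items()
        if isinstance(tokens, (list, tuple)) and len(tokens) == 0
    ]


# Hypothetical sample: index -> token list, mimicking one dataframe column.
sample = {
    21: ["required", "skills", "what", "you", "need"],
    22: [],          # an empty posting, the suspected trigger
    23: ["skills"],  # a single token: no bigrams, but no error either
}

suspects = find_empty_rows(sample)
bigram_counts = Counter(safe_ngrams(sample[21], 2)).most_common()
print(suspects)
print(bigram_counts)
```

On the real data, the same check could be written as df[col][df[col].apply(len) == 0] (assuming every cell is a list), and the guarded helper could be used as df[col].apply(lambda row: safe_ngrams(row, 2)).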

Any help would be appreciated.
