I am trying to add frequently occurring bigrams to a set of unigram tokens using the Gensim Phrases function, but I am stuck at the last stage.
What I am currently getting is shown below under "Having", where all the tokens are broken down further to the character level and some of the characters are paired up (e.g. y_o).
What I want to see instead is shown below under "Want".
In other words:
(1) From the 'col1' raw strings in a Pandas DataFrame, remove stop-words and save the output in 'col2'.
(2) Then generate bigrams using Gensim Phrases and save the output in 'col3'.
(3) Add the outputs of 'col2' and 'col3' together into 'col4', keeping all the tokens from 'col2' but including only the bigrams from 'col3'.
Which part of my code is wrong? Please see my code below.
import pandas as pd
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.models import Phrases
#example data.
data = {
    "col1": ['the mayor of new york was there machine learning good place',
             'good place machine learning can be useful sometimes in new york',
             'new york mayor was present new york machine learning new york']}
#load data into a DataFrame object.
df = pd.DataFrame(data)
#remove stop-words using simple_preprocess.
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            result.append(token)
    return result
#apply the above function.
df['col2']=df['col1'].map(preprocess)
#build a bigram model using Phrases.
def birams(texts):
    bigram = gensim.models.Phrases(texts, min_count=1, threshold=1)
    bigram_mod = gensim.models.phrases.Phraser(bigram)
    return [bigram_mod[doc] for doc in texts]
#apply the above function.
df['col3']=df['col2'].map(birams)
print (df)
Having:
col1 \
0 the mayor of new york was there machine learning good place
1 good place machine learning can be useful sometimes in new york
2 new york mayor was present new york machine learning new york
col2 \
0 ['mayor', 'new', 'york', 'machine', 'learning', 'good', 'place']
1 ['good', 'place', 'machine', 'learning', 'useful', 'new', 'york']
2 ['new', 'york', 'mayor', 'present', 'new', 'york', 'machine', 'learning', 'new', 'york']
col3
0 [[m_a, y_o, r], [n_e, w], [y_o, r, k], [m_a, c...
1 [[g, o, o, d], [p, l, a_c, e], [m, a_c, h, i_n...
2 [[n_e, w], [y_o, r_k], [m_a, y_o, r], [p, r, e...
Want:
col1 \
0 the mayor of new york was there machine learning good place
1 good place machine learning can be useful sometimes in new york
2 new york mayor was present new york machine learning new york
col2 \
0 ['mayor', 'new', 'york', 'machine', 'learning', 'good', 'place']
1 ['good', 'place', 'machine', 'learning', 'useful', 'new', 'york']
2 ['new', 'york', 'mayor', 'present', 'new', 'york', 'machine', 'learning', 'new', 'york']
col3 \
0 ['mayor', 'new_york', 'machine_learning', 'good_place']
1 ['good_place', 'machine_learning', 'useful', 'new_york']
2 ['new_york', 'mayor', 'present', 'new_york', 'machine_learning', 'new_york']
col4
0 ['mayor', 'new', 'york', 'machine', 'learning', 'good', 'place', 'new_york', 'machine_learning', 'good_place']
1 ['good', 'place', 'machine', 'learning', 'useful', 'new', 'york','good_place', 'machine_learning', 'new_york']
2 ['new', 'york', 'mayor', 'present', 'new', 'york', 'machine', 'learning', 'new', 'york', 'new_york', 'new_york', 'machine_learning', 'new_york']
First problem: your map onto the data means more than one Phrases model is being trained, each with only a single one of your texts.

Which triggers the second problem: each of your texts is a list of individual words, but Phrases expects one entire corpus: a Python re-iterable sequence (such as a list) whose items are each a list of words. So for each row you are instead passing in a "corpus" that is just a list of words, and each word then looks like a list of single-character tokens. That is why you get pairs like y_o.

I recommend eliminating the Pandas structures entirely from this part of your project; they add extra overhead and indirection. Use plain Python data structures, and train only one Phrases model on your entire (preprocessed, tokenized) corpus.
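
For illustration, here is a minimal sketch of that approach, reusing your example tokens. The col4 step (keep every unigram from col2, then append only the '_'-joined bigrams from col3) is my reading of your "Want" output, and exactly which bigrams get joined will depend on your min_count/threshold settings:

import gensim
from gensim.models import Phrases
from gensim.models.phrases import Phraser

# One corpus: a list of token lists (e.g. df['col2'].tolist()).
corpus = [
    ['mayor', 'new', 'york', 'machine', 'learning', 'good', 'place'],
    ['good', 'place', 'machine', 'learning', 'useful', 'new', 'york'],
    ['new', 'york', 'mayor', 'present', 'new', 'york', 'machine', 'learning', 'new', 'york'],
]

# Train a single Phrases model on the whole corpus, not one per row.
bigram = Phrases(corpus, min_count=1, threshold=1)
bigram_mod = Phraser(bigram)

# col3: each document with detected bigrams joined by '_'.
col3 = [bigram_mod[doc] for doc in corpus]

# col4: all original unigrams plus only the '_'-joined bigrams.
col4 = [doc + [tok for tok in phrased if '_' in tok]
        for doc, phrased in zip(corpus, col3)]

print(col3)
print(col4)

If you still want the results in your DataFrame, you can assign the plain lists back at the end, e.g. df['col3'] = col3 and df['col4'] = col4.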