I have a Spark DataFrame and I want to generate n-grams the way the Gensim bigram model does


I have a DataFrame of texts (tweets). I am using Spark to handle a high volume of data, and I want to generate bigrams the same way Gensim's bigram models do. I have been using Spark NLP to preprocess the texts, but NGramGenerator produces a bigram from every pair of consecutive words, whereas I want a word pair to count as a bigram only when that sequence is repeated frequently across the whole corpus (as Gensim does). Ideally this would be done with Spark NLP; if not, Spark MLlib would also work for me. The important thing is to stay inside the Spark context.

In the context of Gensim, "bigrams" refers to a phrase-detection technique: it finds pairs of consecutive words that co-occur often enough across the corpus to be treated as a single token, rather than producing every pair of adjacent words.
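To make the expected behaviour concrete, here is a minimal pure-Python sketch of the idea, using Gensim's default scoring formula, score = (count(ab) - min_count) * |vocab| / (count(a) * count(b)); the function name, toy corpus, and thresholds are illustrative, not Gensim's API:

```python
from collections import Counter

def detect_phrases(sentences, min_count=2, threshold=0.5):
    """Promote (a, b) to a phrase when the pair occurs at least
    min_count times corpus-wide and its score exceeds threshold."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(p for s in sentences for p in zip(s, s[1:]))
    vocab_size = len(unigrams)
    phrases = set()
    for (a, b), ab_count in bigrams.items():
        if ab_count >= min_count:
            score = (ab_count - min_count) * vocab_size / (unigrams[a] * unigrams[b])
            if score > threshold:
                phrases.add((a, b))
    return phrases

corpus = [["new", "york", "is", "big"],
          ["i", "love", "new", "york"],
          ["new", "york", "new", "york"]]
# "new york" recurs across the corpus, so only that pair is promoted;
# one-off pairs like ("is", "big") are not.
```

The key point is that the counts are taken over the entire corpus, which is exactly what a per-row Spark operation cannot see.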

I would be very grateful if someone can help me, thanks.

I tried Spark NLP's NGramGenerator, but it generates n-grams from all the words of each text.

ngrams = NGramGenerator() \
        .setInputCols(["lemmatized"]) \
        .setOutputCol("ngrams") \
        .setN(2) \
        .setEnableCumulative(False)\
        .setDelimiter("_")

I also tried using a UDF with Gensim, but it is not correct: the UDF processes one chunk (row) at a time, while Gensim needs the whole column to decide which bigrams to form.

from gensim.models import Phrases
from gensim.models.phrases import Phraser
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

def generate_bigrams(tokens):
    # Trains a fresh Phrases model on a single row's tokens, so the
    # co-occurrence statistics never see the rest of the column.
    bigram = Phrases([tokens], min_count=5, threshold=100)
    bigram_phraser = Phraser(bigram)
    return list(bigram_phraser[tokens])

generate_bigrams_udf = udf(generate_bigrams, ArrayType(StringType()))
tweets_bigrams = process.withColumn("bigrams", generate_bigrams_udf(process["lemmatized"]))

There are 0 answers