How to match ngrams for each document in Spark LDA code


I am working with the sample code for LDA in Spark given at https://gist.github.com/jkbradley/ab8ae22a8282b2c8ce33

I have a corpus file, where each line is a document, which I have read using

val corpus: RDD[String] = sc.textFile("C:/corpus.txt")

I also have an ngram file, where each line is a bigram, trigram, etc., which I have read using

val ngramFile: RDD[String] = sc.textFile("C:/ngram.txt")

I would like to modify the following code so that it keeps only the matching ngram(s) in each document:

// split each document into lowercase whitespace-separated tokens, keeping only
// purely alphabetic words longer than 3 characters
val tokenized: RDD[Seq[String]] = corpus
  .map(_.toLowerCase.split("\\s"))
  .map(_.filter(_.length > 3)
        .filter(_.forall(java.lang.Character.isLetter))
  )

What I have tried doing is

 // (iterate over each line of ngramFile and match it against the corpus line)
 val tokenized = corpus.map(line =>
   ngramFile.r.findAllMatchIn(line)
 ) // this is an error :) ngramFile is an RDD, so it has no .r, and an RDD cannot be used inside another RDD's map

So if my corpus file is

Working in Scala Language.
Spark LDA has Scala and Java API.

and my ngram file is:

Scala Language
Spark LDA
Java API

then printing the above "tokenized" variable should give me

WrappedArray(scala language)
WrappedArray(spark lda,java api)

instead of what the current version of the code gives:

WrappedArray(working,in,scala,language)
WrappedArray(spark,lda,has,scala,and,java,api)

I am new to Scala, so any help with the above would be appreciated.

Thanks in advance

1 Answer

ayan guha:

I think the problem you are trying to solve is this: find the lines in your corpus file that match the list of ngrams in the ngram file.

Then, what you need to do is:

  1. Read the corpus file. For each line, generate its ngrams (any algorithm will do) and keep them together with the line as a (line, ngrams) tuple. Create an RDD out of it.
  2. Read the ngram file. If this file is small, collect it into a hashmap/set and broadcast it.
  3. For each record of the RDD created in step 1, find out which of its ngrams appear in the broadcast ngram set from step 2 (see the sketch right after this list).
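
A rough sketch of this broadcast approach could look like the one below. It assumes the ngram file fits in driver memory and uses a plain lowercase substring check as the matching step; swap in whatever matching logic suits your data.

import org.apache.spark.rdd.RDD

// collect the (small) ngram list to the driver and broadcast it to the executors
val ngrams: Array[String] =
  ngramFile.map(_.toLowerCase.trim).filter(_.nonEmpty).collect()
val ngramsBC = sc.broadcast(ngrams)

// for each document, keep only the ngrams that actually occur in it
val tokenized: RDD[Seq[String]] = corpus.map { line =>
  val doc = line.toLowerCase
  ngramsBC.value.filter(ng => doc.contains(ng)).toSeq
}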

If the ngram file is big, make it an RDD as well. Also, flatten out the first RDD into (ngram, line) pairs (i.e. one record per ngram in each line). Finally, join the two RDDs on the ngram key, as sketched below.
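
For illustration only, a sketch of that join variant might look like the following. The ngram generation here just slides 2- and 3-word windows over each line, which is only one possible choice, and the line text itself is used as the document key (use zipWithIndex for a proper document id).

// generate candidate ngrams per document, keyed by the ngram text
val lineNgrams = corpus.flatMap { line =>
  val words = line.toLowerCase.split("\\W+").filter(_.nonEmpty)
  (2 to 3).flatMap(n => words.sliding(n).filter(_.length == n).map(_.mkString(" ")))
          .map(ngram => (ngram, line))
}

// key the ngram file by the ngram text as well
val ngramKeys = ngramFile.map(ng => (ng.toLowerCase.trim, ()))

// join on the ngram key, then regroup the matching ngrams per document
val matchedPerDoc = lineNgrams.join(ngramKeys)
  .map { case (ngram, (line, _)) => (line, ngram) }
  .groupByKey()

Either way, what you end up with per document is the subset of ngrams from the ngram file that occur in it, which is exactly the tokenized output the question asks for.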

I am purposefully not spelling out the full solution here, as I have already answered a few questions around ngrams in similar contexts here. Apparently it's part of some learning exercise and I do not want to ruin the fun :)