I am working with the sample code for LDA in Spark given at https://gist.github.com/jkbradley/ab8ae22a8282b2c8ce33
I have a corpus file, in which each line is a document, which I read with
val corpus: RDD[String] = sc.textFile("C:/corpus.txt")
I also have an ngram file, where each line is a bigram/trigram etc., which I read with
val ngramFile: RDD[String] = sc.textFile("C:/ngram.txt")
I would like to modify the following lines so that each document keeps only its matching ngram(s):
val tokenized: RDD[Seq[String]] = corpus
.map(_.toLowerCase.split("\\s"))
.map(_.filter(_.length > 3)
.filter(_.forall(java.lang.Character.isLetter))
)
What I have tried is
// (iterate over each line of ngramFile and match it against the corpus line)
val tokenized = corpus.map(line =>
  ngramFile.r.findAllMatchIn(line)
) // this is an error :)
So if my corpus file is
Working in Scala Language.
Spark LDA has Scala and Java API.
and my nGram file is:
Scala Language
Spark LDA
Java API
then printing the above "tokenized" variable should give me
WrappedArray(scala language)
WrappedArray(spark lda,java api)
instead of what the current version of the code produces:
WrappedArray(working,in,scala,language)
WrappedArray(spark,lda,has,scala,and,java,api)
I am new to Scala, so any help with the above would be appreciated.
Thanks in advance
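(For reference, the `// this is an error` attempt fails because `ngramFile` is an RDD, not a regex: RDDs have no `.r` method, and Spark does not allow using one RDD inside another RDD's `map`. If the ngram file is small, one common workaround is to collect it to the driver and match each line against that list. Below is a minimal plain-Scala sketch of the desired behaviour, with illustrative names; in Spark, `ngrams` would come from `ngramFile.collect()` or a broadcast variable, and the `map` would run on the corpus RDD.)

```scala
// Sketch: for each document line, keep only the ngrams from the
// ngram list that occur in it (case-insensitive substring match).
val ngrams = Seq("scala language", "spark lda", "java api")

val corpus = Seq(
  "Working in Scala Language.",
  "Spark LDA has Scala and Java API."
)

val tokenized: Seq[Seq[String]] =
  corpus.map(line => ngrams.filter(n => line.toLowerCase.contains(n)))

tokenized.foreach(println)
// List(scala language)
// List(spark lda, java api)
```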
I think the problem you are trying to solve is: find the lines in your corpus file that match the list of ngrams in your ngram file.
Then, what you need to do is:
In case ngrams is a big file, make it an RDD as well. Also, flatten the first RDD into (ngram, line) pairs (i.e. one ngram per record). Finally, join these 2 RDDs on that key.
I am purposely not giving full code here, as I have already answered a few questions about ngrams in similar contexts. Apparently it's part of some learning exercise and I do not want to ruin the fun :)
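For readers who want to sanity-check their own attempt afterwards, the flatten-and-join idea above can be sketched with plain Scala collections (illustrative names; in Spark, `flattened` would be a `corpus.flatMap(...)`, and the filter-by-set step would be a `join` against the ngram RDD keyed by the ngram itself):

```scala
// Generate the n-word candidates of a tokenized line, e.g. all bigrams.
def candidateNgrams(words: Seq[String], n: Int): Seq[String] =
  words.sliding(n).map(_.mkString(" ")).toList

val corpus = Seq("Working in Scala Language.", "Spark LDA has Scala and Java API.")
val dict   = Set("scala language", "spark lda", "java api") // from the ngram file

// Step 1: flatten each line into (candidateNgram, line) records.
val flattened: Seq[(String, String)] = for {
  line <- corpus
  words = line.toLowerCase.replaceAll("[^a-z\\s]", "").split("\\s+").toSeq
  n    <- 2 to 3 // bigrams and trigrams
  cand <- candidateNgrams(words, n)
} yield (cand, line)

// Step 2: keep only candidates present in the ngram dictionary
// (in Spark this is the join on the ngram key).
val matched = flattened.filter { case (cand, _) => dict(cand) }

// Step 3: regroup by line to recover the per-document matches.
val byLine: Map[String, Seq[String]] =
  matched.groupBy(_._2).map { case (line, ps) => (line, ps.map(_._1)) }
```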