How to use JohnSnowLabs NLP Spell correction module NorvigSweetingModel?

Question

Asked by user3243499 on 21 November 2018 at 18:15

I was going through the JohnSnowLabs SpellChecker here.

I found the implementation of Norvig's algorithm there, and the example section has just the following two lines:

import com.johnsnowlabs.nlp.annotator.NorvigSweetingModel
NorvigSweetingModel.pretrained()

How can I apply this pretrained model to my DataFrame (df) below to spell-correct the "names" column?

+----------------+---+------------+
|           names|age|       color|
+----------------+---+------------+
|      [abc, cde]| 19|    red, abc|
|[eefg, efa, efb]|192|efg, efz efz|
+----------------+---+------------+

I have tried to do it as follows:

val schk = NorvigSweetingModel.pretrained().setInputCols("names").setOutputCol("Corrected")

val cdf = schk.transform(df)

But the above code gave me the following error:

java.lang.IllegalArgumentException: requirement failed: Wrong or missing inputCols annotators in SPELL_a1f11bacb851. Received inputCols: names. Make sure such columns have following annotator types: token
  at scala.Predef$.require(Predef.scala:224)
  at com.johnsnowlabs.nlp.AnnotatorModel.transform(AnnotatorModel.scala:51)
  ... 49 elided

There is 1 answer

**10465355** · Accepted Answer · 2018-11-21T19:04:57+00:00

spark-nlp annotators are designed to be used in the library's own specific pipelines, and the input columns for the different transformers have to include special metadata.

The exception already tells you that input to the NorvigSweetingModel should be tokenized:

Make sure such columns have following annotator types: token

If I am not mistaken, at minimum you'll have to assemble documents and tokenize them here.

import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.NorvigSweetingModel
import com.johnsnowlabs.nlp.annotators.Tokenizer
import org.apache.spark.ml.Pipeline

val df = Seq(Seq("abc", "cde"), Seq("eefg", "efa", "efb")).toDF("names")

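val nlpPipeline = new Pipeline().setStages(Array(
  // Wrap the raw text column in "document" annotations
  new DocumentAssembler().setInputCol("names").setOutputCol("document"),
  // Split each document into "token" annotations, the type the spell checker asks for
  new Tokenizer().setInputCols("document").setOutputCol("tokens"),
  // Apply the pretrained Norvig spell checker to the tokens
  NorvigSweetingModel.pretrained().setInputCols("tokens").setOutputCol("corrected")
))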

A Pipeline like this can be applied to your data with a small adjustment: the input data has to be string, not array<string>*:

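import org.apache.spark.sql.functions.concat_ws

val result = df
  // Join the array elements into a single space-separated string
  .transform(_.withColumn("names", concat_ws(" ", $"names")))
  // Fit the pipeline and apply it to the same data
  .transform(df => nlpPipeline.fit(df).transform(df))
result.show()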

+------------+--------------------+--------------------+--------------------+
|       names|            document|              tokens|           corrected|
+------------+--------------------+--------------------+--------------------+
|     abc cde|[[document, 0, 6,...|[[token, 0, 2, ab...|[[token, 0, 2, ab...|
|eefg efa efb|[[document, 0, 11...|[[token, 0, 3, ee...|[[token, 0, 3, ee...|
+------------+--------------------+--------------------+--------------------+

If you want output that can be exported, you should extend your Pipeline with Finisher, which converts the annotation column into a plain array of strings.

import com.johnsnowlabs.nlp.Finisher

new Finisher().setInputCols("corrected").transform(result).show

+------------+------------------+
|       names|finished_corrected|
+------------+------------------+
|     abc cde|        [abc, cde]|
|eefg efa efb|  [eefg, efa, efb]|
+------------+------------------+
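The snippet above applies Finisher as a separate step. If you would rather have it inside the Pipeline itself, a minimal sketch could look like the following. It assumes a spark-shell style session with Spark NLP on the classpath; the names `fullPipeline`, `input` and `finished` and the app name are only illustrative, and the output column `finished_corrected` is the default name seen in the output above.

```scala
import com.johnsnowlabs.nlp.{DocumentAssembler, Finisher}
import com.johnsnowlabs.nlp.annotator.NorvigSweetingModel
import com.johnsnowlabs.nlp.annotators.Tokenizer
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws}

// Reuses the existing session when run inside spark-shell
val spark = SparkSession.builder()
  .appName("norvig-finisher-sketch") // illustrative name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq(Seq("abc", "cde"), Seq("eefg", "efa", "efb")).toDF("names")

// Same stages as before, with Finisher appended as the last stage
val fullPipeline = new Pipeline().setStages(Array(
  new DocumentAssembler().setInputCol("names").setOutputCol("document"),
  new Tokenizer().setInputCols("document").setOutputCol("tokens"),
  NorvigSweetingModel.pretrained().setInputCols("tokens").setOutputCol("corrected"),
  new Finisher().setInputCols("corrected")
))

// DocumentAssembler needs a string column, so join the array first
val input = df.withColumn("names", concat_ws(" ", col("names")))

val finished = fullPipeline.fit(input).transform(input)
finished.select("names", "finished_corrected").show(false)
```

With this layout a single fit/transform call produces the exportable column, so there is no intermediate DataFrame of annotations to keep around.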

* According to the docs, DocumentAssembler "can read either a String column or an Array[String]", but it doesn't look like it works in practice in 1.7.3:

df.transform(df => nlpPipeline.fit(df).transform(df)).show()

org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(names)' due to data type mismatch: argument 1 requires string type, however, '`names`' is of array<string> type.;;
'Project [names#62, UDF(names#62) AS document#343]
+- AnalysisBarrier
      +- Project [value#60 AS names#62]
         +- LocalRelation [value#60]
