Tokenization by Stanford parser is slow?


Question summary: tokenization by the Stanford parser is slow on my local machine, but unreasonably much faster on Spark. Why?


I'm using the Stanford CoreNLP tool to tokenize sentences.

My Scala script looks like this:

import java.util.Properties
import scala.collection.JavaConversions._ 
import scala.collection.immutable.ListMap
import scala.io.Source

import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation
import edu.stanford.nlp.ling.CoreLabel
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP
val properties = new Properties()
val coreNLP = new StanfordCoreNLP(properties)

def tokenize(s: String)  = { 
  properties.setProperty("annotators", "tokenize")
  val annotation = new Annotation(s)
  coreNLP.annotate(annotation)
  annotation.get(classOf[TokensAnnotation]).map(_.value.toString)
}

tokenize("Here is my sentence.")

One call to the tokenize function takes roughly (at least) 0.1 s. This is very slow because I have 3 million sentences. (3M * 0.1 s = 300,000 s ≈ 83 hours)
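For reference, here is a minimal sketch of how the per-call time can be measured in the REPL (the warm-up call and the sample sentence are illustrative, not part of the original script):

// Rough timing sketch; assumes the pipeline and tokenize from the script above.
// The warm-up call keeps one-time initialization out of the measurement.
tokenize("Warm-up sentence.")

val start = System.nanoTime()
tokenize("Here is my sentence.")
val elapsedMs = (System.nanoTime() - start) / 1e6
println(f"one tokenize call took $elapsedMs%.1f ms")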


As an alternative approach, I ran the tokenizer on Spark (with four worker machines).

import java.util.List
import java.util.Properties
import scala.collection.JavaConversions._  
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation
import edu.stanford.nlp.ling.CoreLabel
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP

val file = sc.textFile("hdfs:///myfiles")

def tokenize(s: String)  = { 
  val properties = new Properties()
  properties.setProperty("annotators", "tokenize")
  val coreNLP = new StanfordCoreNLP(properties)
  val annotation = new Annotation(s)
  coreNLP.annotate(annotation)
  annotation.get(classOf[TokensAnnotation]).map(_.toString)
}

def normalizeToken(t: String) = {
  val ts = t.toLowerCase
  val num = "[0-9]+[,0-9]*".r
  ts match {
    case num() => "NUMBER"
    case _ => ts
  }
}

val tokens = file.map(tokenize(_))
val tokenList = tokens.flatMap(_.map(normalizeToken))
val wordCount = tokenList.map((_,1)).reduceByKey(_ + _).sortBy(_._2, false)
wordCount.saveAsTextFile("wordcount")

This script finishes the tokenization and word count of 3 million sentences in just 5 minutes, and the results seem reasonable. Why is this so fast? Or rather, why is the first Scala script so slow?

1 Answer

Sebastian Schuster:

The problem with your first approach is that you set the annotators property after you initialize the StanfordCoreNLP object. CoreNLP is therefore initialized with the default list of annotators, which includes the part-of-speech tagger and the parser, both of which are orders of magnitude slower than the tokenizer.

To fix this, simply move the line

properties.setProperty("annotators", "tokenize")

before the line

val coreNLP = new StanfordCoreNLP(properties)

This should even be slightly faster than your second approach, since you don't have to reinitialize CoreNLP for each sentence.
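Put together, a minimal sketch of the corrected first script (the same code as in the question, with only the property line moved) would be:

import java.util.Properties
import scala.collection.JavaConversions._

import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP

// Set the annotators BEFORE constructing the pipeline, so that only the
// tokenizer is loaded instead of the full default annotator list.
val properties = new Properties()
properties.setProperty("annotators", "tokenize")
val coreNLP = new StanfordCoreNLP(properties)

def tokenize(s: String) = {
  val annotation = new Annotation(s)
  coreNLP.annotate(annotation)
  annotation.get(classOf[TokensAnnotation]).map(_.value.toString)
}

tokenize("Here is my sentence.")

As a side note, the same principle applies to the Spark version: constructing the pipeline once per partition (e.g. inside mapPartitions) rather than once per sentence would avoid the repeated initialization there as well.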