Lazy parsing with Stanford CoreNLP to get sentiment only of specific sentences


I am looking for ways to optimize the performance of my Stanford CoreNLP sentiment pipeline. Specifically, I want to get the sentiment of sentences, but only of those that contain specific keywords given as input.

I have tried two approaches:

Approach 1: StanfordCoreNLP pipeline annotating the entire text with sentiment

I defined a pipeline of annotators: tokenize, ssplit, parse, sentiment. I ran it on the entire article, then looked for keywords in each sentence and, if they were present, ran a method returning the keyword's value. I was not satisfied, though, that processing took a couple of seconds.

This is the code:

List<String> keywords = ...;
String text = ...;
Map<Integer,Integer> sentenceSentiment = new HashMap<>();

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
props.setProperty("parse.maxlen", "20");
props.setProperty("tokenize.options", "untokenizable=noneDelete");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

Annotation annotation = pipeline.process(text); // takes 2 seconds!!!!
List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
for (int i=0; i<sentences.size(); i++) {
    CoreMap sentence = sentences.get(i);
    if (sentenceContainsKeywords(sentence, keywords)) {
        int sentiment = RNNCoreAnnotations.getPredictedClass(sentence.get(SentimentCoreAnnotations.SentimentAnnotatedTree.class));
        sentenceSentiment.put(i, sentiment);
    }
}

Approach 2: StanfordCoreNLP pipeline splitting the entire text into sentences, separate annotators running on sentences of interest

Because of the poor performance of the first solution, I defined a second one. I defined a pipeline with only the tokenize and ssplit annotators, looked for keywords in each sentence and, if they were present, created an annotation for just that sentence and ran the remaining annotators on it: ParserAnnotator, BinarizerAnnotator and SentimentAnnotator.

The results were really unsatisfying because of the ParserAnnotator, even though I initialized it with the same properties. Sometimes it took even more time than the entire pipeline run on the whole document in Approach 1.

List<String> keywords = ...;
String text = ...;
Map<Integer,Integer> sentenceSentiment = new HashMap<>();

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit"); // parsing, sentiment removed
props.setProperty("parse.maxlen", "20");
props.setProperty("tokenize.options", "untokenizable=noneDelete");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// initiation of annotators to be run on sentences
ParserAnnotator parserAnnotator = new ParserAnnotator("pa", props);
BinarizerAnnotator  binarizerAnnotator = new BinarizerAnnotator("ba", props);
SentimentAnnotator sentimentAnnotator = new SentimentAnnotator("sa", props);

Annotation annotation = pipeline.process(text); // takes <100 ms
List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
for (int i=0; i<sentences.size(); i++) {
    CoreMap sentence = sentences.get(i);
    if (sentenceContainsKeywords(sentence, keywords)) {
        // code required to perform annotation on one sentence
        List<CoreMap> listWithSentence = new ArrayList<CoreMap>();
        listWithSentence.add(sentence);
        Annotation sentenceAnnotation  = new Annotation(listWithSentence);

        parserAnnotator.annotate(sentenceAnnotation); // takes 50 ms up to 2 seconds!!!!
        binarizerAnnotator.annotate(sentenceAnnotation);
        sentimentAnnotator.annotate(sentenceAnnotation);
        sentence = sentenceAnnotation.get(CoreAnnotations.SentencesAnnotation.class).get(0);

        int sentiment = RNNCoreAnnotations.getPredictedClass(sentence.get(SentimentCoreAnnotations.SentimentAnnotatedTree.class));
        sentenceSentiment.put(i, sentiment);
    }
}

Questions

  1. Why is parsing in CoreNLP not "lazy"? (In my example that would mean: performed only when the sentiment of a sentence is requested.) Is it for performance reasons?

  2. How come parsing one sentence can take almost as long as parsing the entire article (my article had 7 sentences)? Is it possible to configure it so that it works faster?

1 Answer

Answered by Jon Gauthier (accepted answer):

If you're looking to speed up constituency parsing, the single best improvement is to use the new shift-reduce constituency parser. It is orders of magnitude faster than the default PCFG parser.
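For reference, here is a minimal sketch of Approach 1 with the shift-reduce model swapped in. It assumes the separate shift-reduce models jar is on your classpath; the only change from the original pipeline is pointing parse.model at the SR model:

import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
// use the shift-reduce constituency parser instead of the default PCFG model
props.setProperty("parse.model", "edu/stanford/nlp/models/srparser/englishSR.ser.gz");
props.setProperty("parse.maxlen", "20");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);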

Answers to your two questions:

  1. Why is CoreNLP parsing not lazy? This is certainly possible, but not something that we've implemented yet in the pipeline. We likely haven't seen many use cases in-house where this is necessary. We will happily accept a contribution of a "lazy annotator wrapper" if you're interested in making one!
  2. How come a parser for one sentence can work almost as long as a parser for an entire article? The default Stanford PCFG parser is cubic time complexity with respect to the sentence length. This is why we usually recommend restricting the maximum sentence length for performance reasons. The shift-reduce parser, on the other hand, runs in linear time with respect to the sentence length.
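A rough, untested sketch of how the shift-reduce model could also be plugged into the per-sentence ParserAnnotator from Approach 2 (the model path again assumes the SR models jar is available; as far as I can tell, ParserAnnotator reads properties prefixed with the name passed to its constructor, so the name "parse" is used here so that parse.model and parse.maxlen are actually picked up):

import java.util.Properties;
import edu.stanford.nlp.pipeline.BinarizerAnnotator;
import edu.stanford.nlp.pipeline.ParserAnnotator;
import edu.stanford.nlp.pipeline.SentimentAnnotator;

Properties parseProps = new Properties();
// shift-reduce model instead of the default PCFG model
parseProps.setProperty("parse.model", "edu/stanford/nlp/models/srparser/englishSR.ser.gz");
parseProps.setProperty("parse.maxlen", "20");

// "parse" as the annotator name so the "parse.*" properties above are read
ParserAnnotator parserAnnotator = new ParserAnnotator("parse", parseProps);
BinarizerAnnotator binarizerAnnotator = new BinarizerAnnotator("ba", parseProps);
SentimentAnnotator sentimentAnnotator = new SentimentAnnotator("sa", parseProps);

// these could then replace the per-sentence annotators used in Approach 2 above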