I am looking for ways to optimize the performance of my Stanford CoreNLP sentiment pipeline. Specifically, I want to get the sentiment of sentences, but only of those that contain specific keywords given as input.
I have tried two approaches:
Approach 1: StanfordCoreNLP pipeline annotating entire text with sentiment
I defined a pipeline with the annotators tokenize, ssplit, parse, sentiment. I ran it on the entire article, then looked for keywords in each sentence and, if they were present, ran a method returning the keyword's value. I was not satisfied, though, that the processing took a couple of seconds.
This is the code:
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

List<String> keywords = ...;
String text = ...;
Map<Integer, Integer> sentenceSentiment = new HashMap<>();

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
props.setProperty("parse.maxlen", "20");
props.setProperty("tokenize.options", "untokenizable=noneDelete");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

Annotation annotation = pipeline.process(text); // takes 2 seconds!!!!
List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
for (int i = 0; i < sentences.size(); i++) {
    CoreMap sentence = sentences.get(i);
    if (sentenceContainsKeywords(sentence, keywords)) {
        int sentiment = RNNCoreAnnotations.getPredictedClass(
                sentence.get(SentimentCoreAnnotations.SentimentAnnotatedTree.class));
        sentenceSentiment.put(i, sentiment); // key by sentence index, matching Map<Integer,Integer>
    }
}
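For reference, here is a minimal sketch of how the sentenceContainsKeywords helper used above might be implemented; the actual implementation is not shown in this post, so this case-insensitive token match is only an assumption:

// Hypothetical sketch of the sentenceContainsKeywords helper used above;
// the real implementation may differ. Requires import edu.stanford.nlp.ling.CoreLabel.
private static boolean sentenceContainsKeywords(CoreMap sentence, List<String> keywords) {
    // Compare each token of the sentence against the keyword list, ignoring case.
    for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        if (keywords.contains(token.word().toLowerCase())) {
            return true;
        }
    }
    return false;
}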
Approach 2: StanfordCoreNLP pipeline annotating entire text with sentences, separate annotators running on sentences of interest
Because of the weak performance of the first solution, I defined a second one. I defined a pipeline with only the annotators tokenize, ssplit, looked for keywords in each sentence and, if they were present, created an annotation for that sentence alone and ran the remaining annotators on it: ParserAnnotator, BinarizerAnnotator and SentimentAnnotator.
The results were really unsatisfying because of the ParserAnnotator, even though I initialized it with the same properties: sometimes it took even more time than the entire pipeline run on the document in Approach 1.
// additional imports needed on top of those in Approach 1
import java.util.ArrayList;
import edu.stanford.nlp.pipeline.BinarizerAnnotator;
import edu.stanford.nlp.pipeline.ParserAnnotator;
import edu.stanford.nlp.pipeline.SentimentAnnotator;

List<String> keywords = ...;
String text = ...;
Map<Integer, Integer> sentenceSentiment = new HashMap<>();

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit"); // parse, sentiment removed
props.setProperty("parse.maxlen", "20");
props.setProperty("tokenize.options", "untokenizable=noneDelete");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// initialization of the annotators to be run on individual sentences
ParserAnnotator parserAnnotator = new ParserAnnotator("pa", props);
BinarizerAnnotator binarizerAnnotator = new BinarizerAnnotator("ba", props);
SentimentAnnotator sentimentAnnotator = new SentimentAnnotator("sa", props);

Annotation annotation = pipeline.process(text); // takes <100 ms
List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
for (int i = 0; i < sentences.size(); i++) {
    CoreMap sentence = sentences.get(i);
    if (sentenceContainsKeywords(sentence, keywords)) {
        // wrap the single sentence in its own Annotation so the
        // standalone annotators can be run on it
        List<CoreMap> listWithSentence = new ArrayList<CoreMap>();
        listWithSentence.add(sentence);
        Annotation sentenceAnnotation = new Annotation(listWithSentence);

        parserAnnotator.annotate(sentenceAnnotation); // takes 50 ms up to 2 seconds!!!!
        binarizerAnnotator.annotate(sentenceAnnotation);
        sentimentAnnotator.annotate(sentenceAnnotation);

        sentence = sentenceAnnotation.get(CoreAnnotations.SentencesAnnotation.class).get(0);
        int sentiment = RNNCoreAnnotations.getPredictedClass(
                sentence.get(SentimentCoreAnnotations.SentimentAnnotatedTree.class));
        sentenceSentiment.put(i, sentiment); // key by sentence index, matching Map<Integer,Integer>
    }
}
Questions
I wonder why parsing in CoreNLP is not "lazy"? (In my example that would mean: performed only when the sentiment of a sentence is requested.) Is it for performance reasons?
How can parsing a single sentence take almost as long as parsing the entire article (my article had 7 sentences)? Is it possible to configure it so that it works faster?
If you're looking to speed up constituency parsing, the single best improvement is to use the new shift-reduce constituency parser. It is orders of magnitude faster than the default PCFG parser.
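A minimal sketch of how you could switch the Approach 1 pipeline over to the shift-reduce parser, simply by pointing parse.model at the SR model. This assumes the shift-reduce model jar, which is distributed separately from the main CoreNLP models, is on your classpath:

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
// Swap the default PCFG model for the shift-reduce parser model.
// Assumes the separately distributed shift-reduce models jar is on the classpath.
props.setProperty("parse.model", "edu/stanford/nlp/models/srparser/englishSR.ser.gz");
// The sentiment annotator needs binarized trees; setting this explicitly
// should be safe (it may already be implied when "sentiment" is in the pipeline).
props.setProperty("parse.binaryTrees", "true");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

The rest of the code from Approach 1 stays unchanged.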
Answers to your later questions: