How to annotate long text more efficiently with Stanford CoreNLP Server?


I'm trying to annotate 200k documents, one document at a time, with Stanford CoreNLP. Each document contains about 200 sentences on average, or roughly 6k tokens.

I'm not familiar with Java, so I'm using pycorenlp. I start the server with the following command, as suggested (I added extra arguments to it later):

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer

I'm using Java 1.8 and Python 3.6. Below are the problems I've encountered, the ways I've tried to solve them, and my questions:

1. Java OutOfMemoryError: GC overhead limit exceeded

I did: increased the Java heap size and added -XX:+UseG1GC -XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics to the launch command.
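For reference, the launch command now looks roughly like this (8g is just an example; I raised the heap above the original 4g):

java -mx8g -cp "*" -XX:+UseG1GC -XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics edu.stanford.nlp.pipeline.StanfordCoreNLPServer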

Effect: So far so good, but I haven't processed all of the text yet, so I'm not sure whether the error will come back.

Question: None so far.

2. Connection/broken pipe issues

I did: restart the server every four documents in my code. (Four documents is the most I can process in one go; sometimes I can't even process a single document.)
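Roughly, my restart loop looks like the sketch below (the helper names, heap size, and annotator list are just illustrative; the script is run from the CoreNLP directory so the "*" classpath resolves):

import subprocess
import time
from pycorenlp import StanfordCoreNLP

SERVER_CMD = ['java', '-mx8g', '-cp', '*',
              'edu.stanford.nlp.pipeline.StanfordCoreNLPServer', '-port', '9000']

def start_server():
    # Launch the CoreNLP server as a child process and give it time to come up.
    proc = subprocess.Popen(SERVER_CMD)
    time.sleep(20)
    return proc

def process_all(documents):
    # Annotate each document, restarting the server every four documents.
    nlp = StanfordCoreNLP('http://localhost:9000')
    proc = start_server()
    results = []
    try:
        for i, doc in enumerate(documents):
            results.append(nlp.annotate(doc, properties={
                'annotators': 'tokenize,ssplit,pos,lemma,ner',
                'outputFormat': 'json'}))
            if (i + 1) % 4 == 0:
                # Shut the server down and bring up a fresh one.
                proc.terminate()
                proc.wait()
                proc = start_server()
    finally:
        proc.terminate()
    return results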

Effect: It seems to work, but processing is rather slow right after the server restarts.

Question: Are there better or smarter solutions? If I keep restarting the server like this, will I get myself into trouble in terms of server usage, such as being blocked from using it?

3. Low Speed

I did: increased the number of threads to 12, and later 18, when calling the server.
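By "threads" I mean sending several requests to the server in parallel from Python, roughly as in this sketch (the worker count and annotator list are just examples):

from concurrent.futures import ThreadPoolExecutor
from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('http://localhost:9000')
PROPS = {'annotators': 'tokenize,ssplit,pos,lemma,ner', 'outputFormat': 'json'}

def annotate_doc(text):
    # Each call is an independent HTTP request, so the server can handle them concurrently.
    return nlp.annotate(text, properties=PROPS)

def annotate_all(documents, workers=12):
    # Send up to `workers` documents to the server at the same time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(annotate_doc, documents))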

Effect: Works a lot better than a single thread.

Question: Are there any suggestions for speeding this up? Because of the length of the documents, it takes almost half an hour to process even one document, even though I'm only calling a few annotators. (I understand that more annotators take more time, but I still need the ones I'm using.)

4. No response from the server at all, not even an error

This is the most painful problem. Since I don't really have an IT background, it's very hard for me to figure out where the problem lies. Below is where the program gets stuck: no warnings, no errors. Sometimes it continues after an hour or so; other times it stays stuck until I kill the program.

[pool-1-thread-2] INFO CoreNLP - [/127.0.0.1:56846] API call w/annotators tokenize,ssplit,pos,depparse,lemma,ner,parse,dcoref,natlog,openie
the same field as the previous one is detected with magnitude @xmath107 photometric redshift, like borg_ 0240- 1857_ 129, is peaked at @xmath111, with a broad higher- redshift wing......(further content omitted)
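For completeness, the call that produces the log line above and then hangs is essentially the following (the file path is just a placeholder; the annotator list is taken straight from the server log):

from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('http://localhost:9000')
doc_text = open('one_document.txt').read()  # placeholder path for one raw-text document
output = nlp.annotate(doc_text, properties={
    'annotators': 'tokenize,ssplit,pos,depparse,lemma,ner,parse,dcoref,natlog,openie',
    'outputFormat': 'json'})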

Any prompt response would be much appreciated, especially for the third and fourth issues. I've looked thoroughly through the official documentation and GitHub, but I couldn't find any solutions. The official documentation does say to limit the size of a document, e.g. to a chapter rather than a whole novel, so I presume the length of a single document in my dataset is fine.

