I'm using the Stanford CoreNLP system with the following command:
java -cp stanford-corenlp-3.5.2.jar:stanford-chinese-corenlp-2015-04-20-models.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -annotators segment,ssplit -file input.txt
This works well on small Chinese texts. However, I need to train an MT system, which only requires segmented input, so I should only need -annotators segment. With that parameter alone, though, the system outputs an empty file. I could run the tool with the ssplit annotator as well, but I don't want to: my input is a parallel corpus that already contains one sentence per line, and ssplit will probably not split sentences perfectly, which would misalign the parallel data.
Is there a way to tell the system to do segmentation only, or to tell it that the input already contains exactly one sentence per line?
Using Stanford Segmenter instead:
Besides the Stanford Segmenter, there are many other segmenters that might be more suitable; see "Is there any good open-source or freely available Chinese segmentation algorithm available?"
To continue using the Stanford NLP tools for POS tagging:
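One possible setup, given as a sketch rather than a tested recipe: if the text has already been segmented into space-separated words with one sentence per line, CoreNLP can be told to tokenize on whitespace and to split sentences only at line breaks, so the line alignment of the parallel corpus is preserved. The properties below would go in a file passed with -props. The model path is an assumption about the layout of the Chinese models jar and should be checked against the jar's contents.

```properties
# Sketch of a properties file for POS tagging pre-segmented Chinese text.
# Assumes the input has one sentence per line and words separated by spaces.
annotators = tokenize, ssplit, pos

# Treat each whitespace-separated token as a word (no re-segmentation).
tokenize.whitespace = true

# Split sentences only at end of line, never inside a line.
ssplit.eolonly = true

# Assumed location of the Chinese tagger model inside the models jar.
pos.model = edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger
```

The ssplit.eolonly property is also relevant to the original question: setting it to true alongside the segment,ssplit annotators should make ssplit treat every input line as exactly one sentence.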