I am using the OpenNLP Token Name Finder to parse unstructured data. I have created a corpus (training set) of 4MM records, but building a model from this corpus with the OpenNLP API in Eclipse takes around 3 hours, which is far too long. The model is built with the default parameters: 100 iterations and a cutoff of 5.
So my question is: how can I reduce the time it takes to build the model?
The size of the corpus could be the reason, but I wanted to know whether anyone has come across this kind of problem and, if so, how they solved it.
Please provide some clue.
Thanks in advance!
Usually the first approach to such issues is to split the training data into several chunks and let each chunk produce a model of its own; afterwards you merge the models. I am not sure this is valid in this case (I'm not an OpenNLP expert), so there is another solution below. Also, since the OpenNLP API appears to provide only single-threaded train() methods, I would file an issue requesting a multi-threaded option.
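To make the chunking idea concrete, here is a minimal sketch in plain Java. It only shows the split-and-train-in-parallel part. The class name `ChunkedTraining`, the `split` and `trainAll` helpers, and the `trainer` function are all hypothetical names of mine, not OpenNLP API; the trainer in `main` is a stand-in that just counts records. Whether the resulting per-chunk models can be merged meaningfully is exactly the part I am unsure about, so the merge step is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ChunkedTraining {

    // Split the corpus into at most `parts` contiguous chunks of near-equal size.
    public static <T> List<List<T>> split(List<T> corpus, int parts) {
        if (parts < 1) {
            throw new IllegalArgumentException("parts must be >= 1");
        }
        List<List<T>> chunks = new ArrayList<>();
        int size = (corpus.size() + parts - 1) / parts; // ceiling division
        for (int start = 0; start < corpus.size(); start += size) {
            int end = Math.min(start + size, corpus.size());
            chunks.add(new ArrayList<>(corpus.subList(start, end)));
        }
        return chunks;
    }

    // Run the (hypothetical) trainer on every chunk using a fixed thread pool.
    // Results are returned in the same order as the chunks.
    public static <T, M> List<M> trainAll(List<List<T>> chunks,
                                          Function<List<T>, M> trainer,
                                          int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<M>> futures = new ArrayList<>();
            for (List<T> chunk : chunks) {
                Callable<M> task = () -> trainer.apply(chunk);
                futures.add(pool.submit(task));
            }
            List<M> models = new ArrayList<>();
            for (Future<M> future : futures) {
                models.add(future.get()); // blocks until that chunk is done
            }
            return models;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("training interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("training failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in corpus; in practice these would be the annotated training records.
        List<String> corpus = Arrays.asList("s1", "s2", "s3", "s4", "s5",
                                            "s6", "s7", "s8", "s9", "s10");
        List<List<String>> chunks = split(corpus, 3);
        // Placeholder trainer: the "model" is just the number of records in the chunk.
        List<Integer> models = trainAll(chunks, chunk -> chunk.size(), 3);
        System.out.println(chunks.size() + " chunks, placeholder models: " + models);
    }
}
```

In real use, the `trainer` function would wrap whatever call builds a model from one chunk, and the number of threads would be chosen to match the available cores.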
For a slow single-threaded operation, the two main limiting factors are I/O and CPU, and both can be handled separately:
An option you may want to consider is to get a high-CPU server from Amazon Web Services or Google Compute Engine and run the training there; you can download the model afterwards. Both offer high-CPU servers with Xeon (Sandy Bridge or Ivy Bridge) CPUs and local SSD storage.