I have downloaded the latest update of the text classification template. I created a new app and imported stopwords.json and emails.json by specifying the app ID:
$ pio import --appid <appID> --input data/stopwords.json
$ pio import --appid <appID> --input data/emails.json
Then I changed engine.json and set my app name in it:
{
  "id": "default",
  "description": "Default settings",
  "engineFactory": "org.template.textclassification.TextClassificationEngine",
  "datasource": {
    "params": {
      "appName": "<myapp>",
      "evalK": 3
    }
But the next step, i.e. evaluation, fails with an empty.maxBy error. A part of the error is pasted below:
[INFO] [Engine$] Preparator: org.template.textclassification.Preparator@79a13920
[INFO] [Engine$] AlgorithmList: List(org.template.textclassification.LRAlgorithm@420a8042)
[INFO] [Engine$] Serving: org.template.textclassification.Serving@faea4da
Exception in thread "main" java.lang.UnsupportedOperationException: empty.maxBy
at scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:223)
at scala.collection.AbstractTraversable.maxBy(Traversable.scala:105)
at org.template.textclassification.PreparedData.<init>(Preparator.scala:160)
at org.template.textclassification.Preparator.prepare(Preparator.scala:39)
at org.template.textclassification.Preparator.prepare(Preparator.scala:35)
at io.prediction.controller.PPreparator.prepareBase(PPreparator.scala:34)
at io.prediction.controller.Engine$$anonfun$25.apply(Engine.scala:758)
at scala.collection.MapLike$MappedValues.get(MapLike.scala:249)
at scala.collection.MapLike$MappedValues.get(MapLike.scala:249)
at scala.collection.MapLike$class.apply(MapLike.scala:140)
at scala.collection.AbstractMap.apply(Map.scala:58)
Then I tried pio train, but training also fails after showing some observations. The error shown is java.lang.OutOfMemoryError: Java heap space. A part of the error is pasted below:
[INFO] [Engine$] Data santiy check is on.
[INFO] [Engine$] org.template.textclassification.TrainingData supports data sanity check. Performing check.
Observation 1 label: 1.0
Observation 2 label: 0.0
Observation 3 label: 0.0
Observation 4 label: 1.0
Observation 5 label: 1.0
[INFO] [Engine$] org.template.textclassification.PreparedData does not support data sanity check. Skipping check.
[WARN] [BLAS] Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
[WARN] [BLAS] Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
[INFO] [Engine$] org.template.textclassification.NBModel does not support data sanity check. Skipping check.
[INFO] [Engine$] EngineWorkflow.train completed
[INFO] [Engine] engineInstanceId=AU3g4XyhTrUUakX3xepP
[INFO] [CoreWorkflow$] Inserting persistent model
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:36)
at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:33)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at com.twitter.chill.TraversableSerializer$$anonfun$write$1.apply(Traversable.scala:29)
Is this because of a memory shortage? I have run the previous version of the same template with text classification data of more than 40 MB without issues. Is evaluation a must for training? Also, could you please explain how the evaluation is performed?
So I was just able to run the evaluation without the former issue, and the latter issue is related to memory usage.
Again, the empty.maxBy error occurs when your data isn't being read in via the DataSource. My first guess is that if you're using an appName other than MyTextApp, you should make sure you also reflect that change in the EngineParamsList object in the Evaluation.scala script. You'll see that you are creating a DataSourceParams object there for evaluation.
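To make that concrete, the EngineParamsList object in Evaluation.scala looks roughly like the sketch below; the algorithm parameter names (e.g. LRAlgorithmParams and its value) are only indicative here, so use whatever your copy of the template actually defines. The important part is that the appName passed to DataSourceParams matches the app you imported your data into, just as in engine.json:

object EngineParamsList extends EngineParamsGenerator {
  // Data source parameters used during evaluation. If appName does not match
  // the app your events were imported into, the DataSource reads nothing and
  // the Preparator fails with empty.maxBy.
  private[this] val baseEP = EngineParams(
    dataSourceParams = DataSourceParams(appName = "<myapp>", evalK = Some(3))
  )

  // One entry per algorithm/parameter setting you want to compare.
  engineParamsList = Seq(
    baseEP.copy(algorithmParamsList = Seq(("lr", LRAlgorithmParams(0.5)))),
    baseEP.copy(algorithmParamsList = Seq(("lr", LRAlgorithmParams(1.5))))
  )
}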
For the OutOfMemoryError, you should increase your driver memory prior to training/evaluation. This is done as follows:

pio train -- --driver-memory xG --executor-memory yG
pio eval org.template.textclassification.AccuracyEvaluation org.template.textclassification.EngineParamsList -- --driver-memory xG --executor-memory yG
Setting --driver-memory to 1G or 2G should suffice.
As for how the evaluation is carried out, PredictionIO performs k-fold cross-validation by default. For this, your data is split into roughly k equally sized parts. Let's say k is 3 for illustration purposes. A model is then trained on 2/3 of the data, and the remaining 1/3 is used as a test set to estimate prediction performance. This process is repeated for each third of the data, and the average of the 3 performance estimates obtained is used as the final estimate of prediction performance (in a general setting you must decide yourself what metric is appropriate to measure this). The whole procedure is repeated for each parameter setting and model that you specify for testing.
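If it helps to see the idea in code, here is a minimal, generic sketch of k-fold cross-validation; it is not the template's evaluation code, just an illustration of the split/train/score/average loop:

// Generic k-fold cross-validation sketch (illustration only).
def crossValidate[D, M](data: Vector[D], k: Int)
                       (train: Vector[D] => M)
                       (score: (M, Vector[D]) => Double): Double = {
  // Split the data into k roughly equal folds.
  val folds = data.indices.groupBy(_ % k).values.toVector.map(_.map(data).toVector)
  // Train on k-1 folds and score on the held-out fold, once per fold.
  val scores = folds.indices.map { i =>
    val test     = folds(i)
    val trainSet = folds.patch(i, Nil, 1).flatten
    score(train(trainSet), test)
  }
  // The final estimate is the average of the k fold scores.
  scores.sum / scores.size
}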
Evaluation is not a necessary step for training and deploying; however, it is a way to select which parameters/algorithms should be used for training and deployment. This is known as model selection in machine learning/statistics.
Edit: As for the text vectorization, each document is vectorized in the following way:
Say my document is:
"I am Marco."
The first step is to tokenize this, which would result in the following Array/List output:
["I", "am", "Marco"]
Then, you go through a bigram extraction, which stores the following set of token arrays/lists:
["I", "am"], ["am", "Marco"], ["I"], ["am"], ["Marco"]
Each one of these is used as a feature to build vectors of bigram and word counts, to which a tf-idf transformation is then applied. Note that to build a vector we must extract the bigrams from every single document, so these feature vectors can turn out to be quite large. You can cut out a lot of this by increasing/decreasing the inverseIdfMin/inverseIdfMax values in the Preparator stage.
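For illustration only (this is not the Preparator's exact code), the tokenize-then-extract-bigrams step described above looks roughly like this:

// Rough sketch of the tokenization and unigram/bigram extraction described above.
val doc = "I am Marco."
// Strip punctuation and split on whitespace: List("I", "am", "Marco")
val tokens = doc.replaceAll("""[^\p{L}\p{Nd}\s]""", "").split("\\s+").toList
// Bigrams: List(List("I", "am"), List("am", "Marco"))
val bigrams = tokens.sliding(2).toList
// Unigrams: List(List("I"), List("am"), List("Marco"))
val unigrams = tokens.map(List(_))
// Each of these features is then counted per document and re-weighted with tf-idf.
val features = bigrams ++ unigrams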