PredictionIO train error tokens must not be empty


I am tinkering with PredictionIO to build a custom classification engine. I have done this before without issues, but for my current dataset pio train fails with the error "tokens must not be empty". I have edited DataSource.scala so that the engine reads the fields in my dataset. A line from my dataset looks like this:

{"event": "ticket", "eventTime": "2015-02-16T05:22:13.477+0000", "entityType": "content","entityId": 365,"properties":{"text": "Request to reset svn credentials","label": "Linux/Admin Task" }}

I can import the data and build the engine without any issues, and I am getting a set of observations too. The error is pasted below:

[INFO] [Remoting] Starting remoting
[INFO] [Remoting] Remoting started; listening on addresses :[akka.tcp://[email protected]:50713]
[INFO] [Engine$] EngineWorkflow.train
[INFO] [Engine$] DataSource: org.template.textclassification.DataSource@4fb64e14
[INFO] [Engine$] Preparator: org.template.textclassification.Preparator@5c4cc644
[INFO] [Engine$] AlgorithmList: List(org.template.textclassification.NBAlgorithm@62b6c045)
[INFO] [Engine$] Data sanity check is off.
[ERROR] [Executor] Exception in task 0.0 in stage 2.0 (TID 2)
[WARN] [TaskSetManager] Lost task 0.0 in stage 2.0 (TID 2, localhost): java.lang.IllegalArgumentException: tokens must not be empty
at org.template.textclassification.PreparedData$$anonfun$2.apply(Preparator.scala:113)
at org.template.textclassification.PreparedData$$anonfun$2.apply(Preparator.scala:113)
at scala.collection.Iterator$$anon$
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:202)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.executor.Executor$
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$

[ERROR] [TaskSetManager] Task 0 in stage 2.0 failed 1 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): java.lang.IllegalArgumentException: tokens must not be empty
at org.template.textclassification.PreparedData$$anonfun$2.apply(Preparator.scala:113)
at org.template.textclassification.PreparedData$$anonfun$2.apply(Preparator.scala:113)
at scala.collection.Iterator$$anon$
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:202)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.executor.Executor$
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$

The problem is with the dataset. I split the dataset into parts and trained on them; training completed on that data and no errors were reported. How can I find out which line in the dataset produces the error? It would be very helpful if PredictionIO had this feature.


1 Answer

Marco Vivero (best answer)

This happens when you pass an empty Array[String] to the constructor of OpenNLP's StringList. Try modifying the hash function in PreparedData as follows:

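private def hash (tokenList : Array[String]): HashMap[String, Double] = {
  // Initialize an NGramModel from OpenNLP tools library,
  // and add the list of allowable tokens to the n-gram model.
  // Relies on the file's existing imports, including a Java-to-Scala
  // collection conversion so that model.iterator supports map.
  try {
    val model : NGramModel = new NGramModel()
    model.add(new StringList(tokenList: _*), nMin, nMax)

    // The model.iterator.map( line is reconstructed from context.
    val map : HashMap[String, Double] = HashMap(
      model.iterator.map(
        x => (x.toString, model.getCount(x).toDouble)
      ).toSeq : _*
    )

    val mapSum = map.values.sum

    // Divide by the total number of n-grams in the document
    // to obtain n-gram frequency.
    map.map(e => (e._1, e._2 / mapSum))
  } catch {
    // StringList rejects an empty token array ("tokens must not be empty");
    // fall back to a placeholder feature map for such documents.
    case (e : IllegalArgumentException) => HashMap("" -> 0.0)
  }
}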
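
The same protection can also be written as an explicit emptiness check instead of a try/catch. The sketch below is only an illustration of that idea: NGramSketch is a hypothetical stand-in that uses plain Scala collections rather than OpenNLP's NGramModel, and the default values for nMin and nMax are assumptions, not the template's settings.

```scala
// Hypothetical stand-in for the template's hash function. It computes
// n-gram frequencies with plain Scala collections, not OpenNLP.
object NGramSketch {
  // nMin and nMax mirror the names in the snippet above; the defaults
  // chosen here are assumptions.
  def hash(tokens: Array[String], nMin: Int = 1, nMax: Int = 2): Map[String, Double] =
    if (tokens.isEmpty) {
      // Explicit guard: an empty document yields an empty feature map
      // instead of an IllegalArgumentException.
      Map.empty[String, Double]
    } else {
      // Every n-gram of length nMin..nMax, joined with single spaces.
      // The length filter drops the short window that sliding returns
      // when the document has fewer than n tokens.
      val nGrams: Seq[String] = for {
        n <- nMin to nMax
        gram <- tokens.sliding(n).filter(_.length == n).toList
      } yield gram.mkString(" ")

      // Count each n-gram, then divide by the total number of n-grams.
      val total = nGrams.size.toDouble
      nGrams.groupBy(identity).map {
        case (gram, occurrences) => gram -> occurrences.size / total
      }
    }
}
```

Note that this sketch returns an empty map for an empty document, whereas the try/catch version returns HashMap("" -> 0.0); which placeholder is appropriate depends on what the downstream vectorizer expects.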

I've only encountered this issue in the prediction stage, and so you can see this is actually implemented in the models' predict methods. I'll update this right now, and put it in a new version release. Thank you for the catch and feedback!
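
To find which records trigger the error, one option is to run the tokenizer over the data on its own and list the entities whose token array comes back empty. The following is a minimal sketch under stated assumptions: Doc, tokenize, and emptyDocs are hypothetical names, and the tokenizer shown (lowercase, split on non-word characters, drop stop words) is only a guess at what a Preparator might do, so substitute your own.

```scala
object EmptyTokenFinder {
  // Hypothetical record shape: the entityId and the "text" property of an event.
  final case class Doc(entityId: String, text: String)

  // Assumed tokenizer: lowercase, split on non-word characters, then drop
  // blank tokens and stop words. Replace with your Preparator's logic.
  def tokenize(text: String, stopWords: Set[String]): Array[String] =
    text.toLowerCase.split("\\W+").filter(t => t.nonEmpty && !stopWords.contains(t))

  // Returns the ids of documents that would produce an empty token array.
  def emptyDocs(docs: Seq[Doc], stopWords: Set[String]): Seq[String] =
    docs.filter(d => tokenize(d.text, stopWords).isEmpty).map(_.entityId)
}
```

For example, calling EmptyTokenFinder.emptyDocs on your documents with your stop-word set returns the ids of records whose text is blank or consists only of stop words. The same predicate could be applied as a filter in the DataSource before training.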