PredictionIO train error tokens must not be empty

Question

PredictionIO train error tokens must not be empty

351 views Asked by cutteeth At 23 June 2015 at 10:19

I am tinkering with predictioIO to build a custom classification engine. I have done this before without issues. But for current dataset pio train is giving me an error tokens must not be empty.I have edited Datasource.scala to mention fields in dataset to engine. A line from my dataset is as below

{"event": "ticket", "eventTime": "2015-02-16T05:22:13.477+0000", "entityType": "content","entityId": 365,"properties":{"text": "Request to reset svn credentials","label": "Linux/Admin Task" }}

I can import data and build engine without any issues. I am getting a set of observations too. The error is pasted below

[INFO] [Remoting] Starting remoting
[INFO] [Remoting] Remoting started; listening on addresses :[akka.tcp://[email protected]:50713]
[INFO] [Engine$] EngineWorkflow.train
[INFO] [Engine$] DataSource: org.template.textclassification.DataSource@4fb64e14
[INFO] [Engine$] Preparator: org.template.textclassification.Preparator@5c4cc644
[INFO] [Engine$] AlgorithmList: List(org.template.textclassification.NBAlgorithm@62b6c045)
[INFO] [Engine$] Data sanity check is off.
[ERROR] [Executor] Exception in task 0.0 in stage 2.0 (TID 2)
[WARN] [TaskSetManager] Lost task 0.0 in stage 2.0 (TID 2, localhost): java.lang.IllegalArgumentException: tokens must not be empty
at opennlp.tools.util.StringList.<init>(StringList.java:61)
at org.template.textclassification.PreparedData.org$template$textclassification$PreparedData$$hash(Preparator.scala:71)
at org.template.textclassification.PreparedData$$anonfun$2.apply(Preparator.scala:113)
at org.template.textclassification.PreparedData$$anonfun$2.apply(Preparator.scala:113)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:202)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

[ERROR] [TaskSetManager] Task 0 in stage 2.0 failed 1 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted   due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent  failure: Lost task 0.0 in stage 2.0 (TID 2, localhost):    java.lang.IllegalArgumentException: tokens must not be empty
at opennlp.tools.util.StringList.<init>(StringList.java:61)
at    org.template.textclassification.PreparedData.org$template$textclassification$PreparedData$$hash(Preparator.scala:71)
at  org.template.textclassification.PreparedData$$anonfun$2.apply(Preparator.scala:113)
at  org.template.textclassification.PreparedData$$anonfun$2.apply(Preparator.scala:113)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at  org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:202)
at  org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56)
at  org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at  org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at  java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

The problem is with dataset. I did splitting up dataset into parts and trained. Training was completed for that dataset and no errors were reported. How can I know which line in the dataset produce error? It should be very very helpful if this feature is in PredictionIO .

Original Q&A

There are 1 answers

**Marco Vivero** · Accepted Answer · 2015-06-23T18:24:22+00:00

So this is something that happens when you feed in an empty Array[String] to OpenNLP's StringList constructor. Try modifying the function hash in Prepared Data as follows:

private def hash (tokenList : Array[String]): HashMap[String, Double] = {
// Initialize an NGramModel from OpenNLP tools library,
// and add the list of allowable tokens to the n-gram model.
try {
  val model : NGramModel = new NGramModel()
  model.add(new StringList(tokenList: _*), nMin, nMax)

  val map : HashMap[String, Double] = HashMap(
    model.iterator.map(
      x => (x.toString, model.getCount(x).toDouble)
    ).toSeq : _*
  )

  val mapSum = map.values.sum

  // Divide by the total number of n-grams in the document
  // to obtain n-gram frequency.
  map.map(e => (e._1, e._2 / mapSum))
} catch {
  case (e : IllegalArgumentException) => HashMap("" -> 0.0)
}

I've only encountered this issue in the prediction stage, and so you can see this is actually implemented in the models' predict methods. I'll update this right now, and put it in a new version release. Thank you for the catch and feedback!

TechQA.

PredictionIO train error tokens must not be empty

There are 1 answers

Related Questions in TOKEN

Related Questions in TEXT-CLASSIFICATION

Related Questions in TRAINING-DATA

Related Questions in PREDICTIONIO

Popular Questions

Popular Tags

Trending Questions