google data fusion xml parsing - 'parse-xml-to-json' : Mismatched close tag note at 6


I am new to Google Cloud Data Fusion. I was able to successfully process a CSV file and load it into BigQuery. My requirement is to process an XML file and load it into BigQuery. To try this out, I took a very simple XML file:

XML File:

{<?xml version="1.0" encoding="UTF-8"?> <note> <to>Tove</to <from>Jani</from>  <heading>Reminder</heading>  <body>Don't forget me this weekend!</body> </note> }
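The failure can be reproduced outside Data Fusion: the sample above is not well-formed XML (the `</to` close tag is missing its `>`), so any strict XML parser rejects it before the enclosing curly braces even matter. A minimal check, sketched here with the Python standard library (braces stripped so the parser sees only the XML itself):

```python
import xml.etree.ElementTree as ET

# The sample from the question, braces removed; note the broken "</to" tag.
broken = ('<?xml version="1.0" encoding="UTF-8"?>'
          '<note> <to>Tove</to <from>Jani</from>'
          '<heading>Reminder</heading>'
          "<body>Don't forget me this weekend!</body> </note>")

try:
    ET.fromstring(broken)
    print("well-formed")
except ET.ParseError as exc:
    # A strict parser rejects this input, just as the Wrangler directive does.
    print("not well-formed:", exc)
```

The exact wording differs from the org.json message in the stack trace below, but the cause is the same: the document cannot be tokenized past the malformed close tag.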

Error Message 1

java.lang.Exception: Stage:Wrangler - Reached error threshold 1, terminating processing due to error : Error encountered while executing 'parse-xml-to-json' : Mismatched close tag note at 6 [character 7 line 1]
at io.cdap.wrangler.Wrangler.transform(Wrangler.java:404) ~[1601903767453-0/:na]
at io.cdap.wrangler.Wrangler.transform(Wrangler.java:83) ~[1601903767453-0/:na]
at io.cdap.cdap.etl.common.plugin.WrappedTransform.lambda$transform$5(WrappedTransform.java:90) ~[cdap-etl-core-6.2.0.jar:na]
at io.cdap.cdap.etl.common.plugin.Caller$1.call(Caller.java:30) ~[cdap-etl-core-6.2.0.jar:na]
at io.cdap.cdap.etl.common.plugin.StageLoggingCaller.call(StageLoggingCaller.java:40) ~[cdap-etl-core-6.2.0.jar:na]
at io.cdap.cdap.etl.common.plugin.WrappedTransform.transform(WrappedTransform.java:89) ~[cdap-etl-core-6.2.0.jar:na]
at io.cdap.cdap.etl.common.TrackedTransform.transform(TrackedTransform.java:74) ~[cdap-etl-core-6.2.0.jar:na]
at io.cdap.cdap.etl.spark.function.TransformFunction.call(TransformFunction.java:50) ~[hydrator-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.etl.spark.Compat$FlatMapAdapter.call(Compat.java:126) ~[hydrator-spark-core2_2.11-6.2.0.jar:na]
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:125) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:125) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) ~[scala-library-2.11.8.jar:na]
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) ~[scala-library-2.11.8.jar:na]
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:128) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:127) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1415) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:139) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.Task.run(Task.scala:109) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) [spark-core_2.11-2.3.3.jar:2.3.3]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_252]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_252]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_252]

Caused by: io.cdap.wrangler.api.RecipeException: Error encountered while executing 'parse-xml-to-json' : Mismatched close tag note at 6 [character 7 line 1]
at io.cdap.wrangler.executor.RecipePipelineExecutor.execute(RecipePipelineExecutor.java:149) ~[wrangler-core-4.2.0.jar:na]
at io.cdap.wrangler.executor.RecipePipelineExecutor.execute(RecipePipelineExecutor.java:97) ~[wrangler-core-4.2.0.jar:na]
at io.cdap.wrangler.Wrangler.transform(Wrangler.java:376) ~[1601903767453-0/:na]
... 26 common frames omitted
Caused by: io.cdap.wrangler.api.DirectiveExecutionException: Error encountered while executing 'parse-xml-to-json' : Mismatched close tag note at 6 [character 7 line 1]
at io.cdap.directives.xml.XmlToJson.execute(XmlToJson.java:106) ~[na:na]
at io.cdap.directives.xml.XmlToJson.execute(XmlToJson.java:49) ~[na:na]
at io.cdap.wrangler.executor.RecipePipelineExecutor.execute(RecipePipelineExecutor.java:129) ~[wrangler-core-4.2.0.jar:na]
... 28 common frames omitted
Caused by: org.json.JSONException: Mismatched close tag note at 6 [character 7 line 1]
at org.json.JSONTokener.syntaxError(JSONTokener.java:505) ~[org.json.json-20090211.jar:na]
at org.json.XML.parse(XML.java:311) ~[org.json.json-20090211.jar:na]
at org.json.XML.toJSONObject(XML.java:520) ~[org.json.json-20090211.jar:na]
at org.json.XML.toJSONObject(XML.java:548) ~[org.json.json-20090211.jar:na]
at org.json.XML.toJSONObject(XML.java:472) ~[org.json.json-20090211.jar:na]
at io.cdap.directives.xml.XmlToJson.execute(XmlToJson.java:96) ~[na:na]
... 30 common frames omitted

Error Message 2:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): UnknownReason

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1661) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1649) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1648) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) ~[scala-library-2.11.8.jar:na]
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) ~[scala-library-2.11.8.jar:na]
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1648) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at scala.Option.foreach(Option.scala:257) ~[scala-library-2.11.8.jar:na]
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1882) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1831) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1820) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034) ~[na:2.3.3]
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055) ~[na:2.3.3]
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087) ~[na:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:78) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1083) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1081) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1081) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1081) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.api.java.JavaPairRDD.saveAsNewAPIHadoopDataset(JavaPairRDD.scala:831) [spark-core_2.11-2.3.3.jar:2.3.3]
at io.cdap.cdap.etl.spark.batch.SparkBatchSinkFactory.writeFromRDD(SparkBatchSinkFactory.java:98) [hydrator-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.etl.spark.batch.RDDCollection$1.run(RDDCollection.java:179) [hydrator-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.etl.spark.SparkPipelineRunner.runPipeline(SparkPipelineRunner.java:350) [hydrator-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.etl.spark.batch.BatchSparkPipelineDriver.run(BatchSparkPipelineDriver.java:148) [hydrator-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.app.runtime.spark.SparkTransactional$2.run(SparkTransactional.java:236) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.app.runtime.spark.SparkTransactional.execute(SparkTransactional.java:208) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.app.runtime.spark.SparkTransactional.execute(SparkTransactional.java:138) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.app.runtime.spark.AbstractSparkExecutionContext.execute(AbstractSparkExecutionContext.scala:228) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.app.runtime.spark.SerializableSparkExecutionContext.execute(SerializableSparkExecutionContext.scala:61) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.app.runtime.spark.DefaultJavaSparkExecutionContext.execute(DefaultJavaSparkExecutionContext.scala:89) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.api.Transactionals.execute(Transactionals.java:63) [na:na]
at io.cdap.cdap.etl.spark.batch.BatchSparkPipelineDriver.run(BatchSparkPipelineDriver.java:116) [hydrator-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.app.runtime.spark.SparkMainWrapper$.main(SparkMainWrapper.scala:86) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.app.runtime.spark.SparkMainWrapper.main(SparkMainWrapper.scala) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:na]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_252]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_252]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_252]
at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_252]
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:56) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:2.3.3]
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894) [na:2.3.3]
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198) [na:2.3.3]
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228) [na:2.3.3]
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137) [na:2.3.3]
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) [spark-core_2.11-2.3.3.jar:2.3.3]
at io.cdap.cdap.app.runtime.spark.submit.AbstractSparkSubmitter.submit(AbstractSparkSubmitter.java:172) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.app.runtime.spark.submit.AbstractSparkSubmitter.access$000(AbstractSparkSubmitter.java:54) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.app.runtime.spark.submit.AbstractSparkSubmitter$5.run(AbstractSparkSubmitter.java:111) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:na]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_252]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_252]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_252]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_252]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_252]


There are 2 answers

Answer from rmesteves:

It seems that your XML is not correct. Try the XML below:

<?xml version="1.0" encoding="UTF-8"?>
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>
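With the close tag fixed, the document parses cleanly and can be converted into the kind of JSON object that `parse-xml-to-json` produces. A minimal sketch using the Python standard library (the one-level dict conversion here is illustrative only, not the Wrangler implementation):

```python
import json
import xml.etree.ElementTree as ET

# The corrected XML: every open tag now has a matching ">...</tag>".
fixed = ('<?xml version="1.0" encoding="UTF-8"?>'
         '<note><to>Tove</to><from>Jani</from>'
         '<heading>Reminder</heading>'
         "<body>Don't forget me this weekend!</body></note>")

root = ET.fromstring(fixed)  # no ParseError this time

# Illustrative conversion: map one level of child elements to a JSON object.
note = {child.tag: child.text for child in root}
print(json.dumps({root.tag: note}, indent=2))
```

Running this prints a `{"note": {...}}` object with the four child fields, which is the shape you would then map to BigQuery columns.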
Answer from docmauro001:

Your XML is probably not correct; in your sample, the `</to` close tag is missing its `>`. I agree that a clearer error message from Data Fusion would have been better. I did the same exercise with XML files of up to 151 MB and it works.

However, I got an error for bigger files.

Here is a similar problem I had on xml-to-json conversion with Data Fusion: Data Fusion for xml-to-json transformation: "+ExitOnOutOfMemoryError" and "exited with a non-zero exit code 3. Error file: prelaunch.err"