Reading and Writing XML files through Apache Beam/Google Cloud DataFlow


I have tried to follow the documentation below to read XML files from a GCS location:

https://beam.apache.org/documentation/sdks/javadoc/2.0.0/org/apache/beam/sdk/io/xml/XmlIO.html

It seems there is some problem in my configuration, or I am missing some required pieces to make the code run. I have placed the XML file in a GCS location and used the code below to read it.

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.xml.XmlIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class XMLReaderWriter {

    private static final Logger LOG = LoggerFactory.getLogger(XMLReaderWriter.class);

    public static void main(String[] args) {

        DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
        options.setTempLocation("gs://xyz_test/staging");
        options.setProject("test-1-160106");

        Pipeline p = Pipeline.create(options);

        // Read the XML file from GCS and map each record element to a Record instance
        PCollection<Record> result = p.apply(XmlIO.<Record>read()
                .from("gs://xyz_test/sample.xml")
                .withRootElement("catalog")
                .withRecordElement("title")
                .withRecordClass(Record.class));

        // Print each parsed record, just to verify that reading works
        result.apply(ParDo.of(new DoFn<Record, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                System.out.println(c.element().toString());
            }
        }));

        p.run();
    }
}
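
For reference, Record is my own JAXB-annotated POJO, since XmlIO uses JAXB to unmarshal each record element. The class below is only an illustrative sketch; the real field names depend on the XML schema (here I assume each title element just carries text content):

import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlRootElement;
import javax.xml.bind.annotation.XmlValue;

// Illustrative sketch of the JAXB-annotated record class XmlIO needs.
// The @XmlRootElement name should match the value passed to withRecordElement().
@XmlRootElement(name = "title")
@XmlAccessorType(XmlAccessType.FIELD)
public class Record {

    // Text content of each <title> element (actual fields depend on the schema)
    @XmlValue
    private String value;

    public String getValue() {
        return value;
    }

    @Override
    public String toString() {
        return "Record{value=" + value + "}";
    }
}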

The code fails with an exception; below is part of the stack trace:

Exception in thread "main" java.lang.RuntimeException: java.lang.NoSuchMethodError: org.apache.avro.generic.GenericData$Record.<init>()
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:207)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:63)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:295)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:281)
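
One thing I notice in the trace is that the pipeline runs on the DirectRunner, because I never set a runner explicitly even though I am using DataflowPipelineOptions. In case it is relevant, explicitly targeting the Dataflow runner would look roughly like the sketch below (assuming beam-runners-google-cloud-dataflow-java is on the classpath; the helper method is my own, not part of my failing code):

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunnerConfig {

    // Hypothetical helper: builds options that explicitly target the Dataflow runner
    // instead of falling back to the DirectRunner.
    static DataflowPipelineOptions dataflowOptions() {
        DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
        options.setProject("test-1-160106");
        options.setTempLocation("gs://xyz_test/staging");
        options.setRunner(DataflowRunner.class);
        return options;
    }
}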

Has anyone done this before? Please let me know what code changes I need to make.
