Multiple file generation while writing to XML through Apache Beam

581 views Asked by At

I'm trying to write an XML file where the source is a text file stored in GCS. The code is running fine but instead of a single XML file, it is generating multiple XML files. (No. of XML files seem to follow total no. of records present in source text file). I have observed this scenario while using 'DataflowRunner'.

When I run the same code in local then two files get generated. First one contains all the records with proper elements and the second one contains only opening and closing root element.

Any idea about the occurrence of this unexpected behaviour? please find below the code snippet I'm using :

PCollection<String>input_records=p.apply(TextIO.read().from("gs://balajee_test/xml_source.txt"));

   PCollection<XMLFormatter> input_object= input_records.apply(ParDo.of(new DoFn<String,XMLFormatter>(){
        @ProcessElement

        public void processElement(ProcessContext c)
        {
            String elements[]=c.element().toString().split(",");

            c.output(new XMLFormatter(elements[0],elements[1],elements[2],elements[3],elements[4]));

            System.out.println("Values to be written have been provided to constructor ");

        }
    })).setCoder(AvroCoder.of(XMLFormatter.class));

   input_object.apply(XmlIO.<XMLFormatter>write()
              .withRecordClass(XMLFormatter.class)
              .withRootElement("library")
              .to("gs://balajee_test/book_output"));

Please let me know the way to generate a single XML file(book_output.xml) at output.

1

There are 1 answers

1
jkff On

XmlIO.write().to() is documented as follows:

/**
 * Writes to files with the given path prefix.
 *
 * <p>Output files will have the name {@literal {filenamePrefix}-0000i-of-0000n.xml} where n is
 * the number of output bundles.
 */

I.e. it is expected that it may produce multiple files: e.g. if the runner chooses to process your data parallelizing it into 3 tasks ("bundles"), you'll get 3 files. Some of the parts may turn out empty in some cases, but the total data written will always add up to the expected data.

Asking the IO to produce exactly one file is a reasonable request if your data is not particularly big. It is supported in TextIO and AvroIO via .withoutSharding(), but not yet supported in XmlIO. Please feel free to file a JIRA with the feature request.