SequenceFiles that map a single key to multiple values

I am trying to do some preprocessing on data that will be fed to LucidWorks Big Data (LWBD) for indexing. LWBD accepts SolrXML in the form of SequenceFiles. I want to create a Pig script that takes all the SolrXML files in a directory and outputs them in the format

filename_1 => <here goes some XML>
...
filename_N => <here goes some more XML>

Pig's native PigStorage() load function can automatically create a column that includes the name of the file from which the data was extracted, which ideally would look like this:

{"filename_1", "<here goes some XML>"}
...
{"filename_N", "<here goes some more XML>"}

However, PigStorage() also treats every '\n' as a record delimiter, so each line of a file becomes its own tuple, and what I actually end up with is a bag that looks like this:

{"filename_1", "<some partial XML from file 1>"}
{"filename_1", "<some more partial XML from file 1>"}
{"filename_1", "<the end of file 1>"}
...

I'm sure you get the picture. My question is, if I were to write this bag to a SequenceFile, how would it be read by other applications? Could it be combined as

"filename_1" => "<some partial XML from file 1>
                 <some more partial XML from file 1>
                 <the end of file 1>"

by the default handling of the application I feed it to? Or is there some post-processing I can do to get it into this format? Thank you for your help.

There is 1 answer

Answered by mr2ert

Since I can't find anything about a built-in SequenceFile writer, I'm assuming you are using a UDF (and if you aren't, you will need one).

You'll have to group the records by filename ahead of time, and then send the grouped result to the writer UDF.

DESCRIBE xml ;
-- xml: {filename: chararray, xml_data: chararray}

B = FOREACH (GROUP xml BY filename)
    GENERATE group AS filename, xml.xml_data AS all_xml_data ;
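
To make the shape of B concrete, here is a rough plain-Python illustration of what the grouping step produces. It is not Pig code and not part of the original script: the function name group_by_filename and the sample rows are invented for the example. Note also that a Python list keeps insertion order, whereas a Pig bag makes no ordering promise.

```python
from collections import defaultdict


def group_by_filename(rows):
    """Mimic GROUP xml BY filename for a list of (filename, xml_data) rows."""
    grouped = defaultdict(list)
    for filename, xml_data in rows:
        # Every fragment that shares a filename lands in the same "bag".
        grouped[filename].append(xml_data)
    return dict(grouped)


# Hypothetical rows, shaped like the tuples PigStorage() produced in the question.
rows = [
    ("filename_1", "<some partial XML from file 1>"),
    ("filename_1", "<some more partial XML from file 1>"),
    ("filename_2", "<all of file 2>"),
]
grouped = group_by_filename(rows)
```

Each key now maps to one collection of fragments, which is the single-key, multiple-values shape that still has to be flattened into one string before it is written out.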

Depending on how you have written the SequenceFile writer, it may be easier to convert the all_xml_data bag ahead of time to a chararray using a Python UDF like:

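# Under Jython, Pig makes the outputSchema decorator available automatically.
@outputSchema('xml_complete: chararray')
def stringify(bag):
    # Pig hands a bag to Python as a list of tuples, so take the first
    # (and only) field of each tuple before joining.
    delim = ''  # use '\n' instead to keep the original line breaks
    return delim.join(t[0] for t in bag)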

NOTE: Be aware that this approach does not preserve the order of the XML lines. Bags are unordered, so the fragments may be joined out of order. If your data allows it, stringify could be extended to reorder them.
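
One possible way to handle the ordering issue, sketched under a big assumption: each record carries a position field (a line number or byte offset) alongside the XML text. Nothing in the original script provides such a field, so both the (position, xml_data) tuple layout and the name stringify_ordered are hypothetical. The function is written as plain Python so it can be tried outside Pig; as a Pig UDF it would also need an @outputSchema decorator like the one on stringify.

```python
def stringify_ordered(bag, delim='\n'):
    """Join XML fragments in file order.

    `bag` is assumed to be a list of (position, xml_data) tuples, where
    `position` is a hypothetical line number or offset within the file.
    """
    # Sort on the position field so fragments come back in file order.
    ordered = sorted(bag, key=lambda t: t[0])
    return delim.join(t[1] for t in ordered)


# Fragments arriving out of order, as they might from an unordered bag.
sample = [(2, "</doc>"), (0, "<doc>"), (1, "<field/>")]
result = stringify_ordered(sample)
```

Joining with '\n' restores the line breaks that the loader split on; pass delim='' if the consumer does not need them.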