How do we work with large XML schema/XML messages on GreenPlum?

384 views Asked by At

How do we work with large XML schema(XSD)/XML messages on GreenPlum?

I reviwed the GreenPlum DBA guide. It talks about processing XML messages using STX tranformation.

If we have an XSD and we need to process the XML messages that was compliant with XSD.

How do we approach this situation in GreenPlum?

What are the best approaches?

Are there any other approaches to process XSD based /XML in Greenplum?

Thanks for your insights.

1

There are 1 answers

3
0x0FFF On BEST ANSWER

In general, there's no built-in functionality for processing XML files in Greenplum. The thing you are talking about is the gpfdist client-side transformation that allows you to convert your XML to CSV and then process it inside of the database.

The best option in processing the XMLs inside of the Greenplum would be to decompose them to the relational model and load as a set of tables. If you need to process just a general XML files, you can write a function in PL/Java or PL/Python that would parse XML for you and extract the information you need.

Here's an example of the function:

create or replace function parse (inxml varchar, node varchar) returns varchar as $BODY$
from xml.etree import ElementTree
def getText (node, subnode):
    res = None
    if node is not None:
        sn = node.find(subnode)
        if sn is not None:
            res = sn.text
        else:
            for el in node:
                res2 = getText(el, subnode)
                if res2 is not None:
                    res = res2
                    break
    return res
et = ElementTree.fromstring(inxml.strip())
return getText(et, node)
$BODY$
language plpythonu
volatile;

And here's an example of its invocation:

select parse (
'<?xml version="1.0" encoding="utf-8"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Dont forget me this weekend!</body>
<attachments>
<attachment>file1</attachment>
<attachment>file2</attachment>
<attachment>file3</attachment>
</attachments>
</note>',
'attachment');