XML::Simple returns "Out of memory" error for large XMLs


This might take a while to explain, but I have a file (XMLList.txt) that contains the paths to multiple IDOC XMLs, one path per line. The contents of XMLList.txt look like this:

/usr/local/sterlingcommerce/data/archive/SFGprdr/SFTPGET/2017/Dec/week_4/AU_DHL_PW_Inbound_Delivery_from_Pfizer_20171220071754.xml
/usr/local/sterlingcommerce/data/archive/SFGprdr/SFTPGET/2017/Dec/week_4/AU_DHL_PW_Inbound_Delivery_from_Pfizer_20171220083310.xml
/usr/local/sterlingcommerce/data/archive/SFGprdr/SFTPGET/2017/Dec/week_4/CCMastOut_MQ_GLB_1_20171220154826.xml

I'm attempting to create a Perl script that reads each XML and writes just the values of the tags DOCNUM, SNDPRN and RCVPRN from each XML file into a pipe-delimited file, "report.csv".

Another thing to note is that my XML files could be all on a single line, for example:

 <?xml version="1.0" encoding="UTF-8"?><ZDELVRY073PL><IDOC BEGIN="1">
    <EDI_DC40 SEGMENT="1"><TABNAM>EDI_DC40</TABNAM><MANDT>400</MANDT>
    <DOCNUM>0000000443474886</DOCNUM><DOCREL>731</DOCREL><STATUS>30</STATUS>
    <DIRECT>1</DIRECT><OUTMOD>4</OUTMOD><IDOCTYP>DELVRY07</IDOCTYP>
    <CIMTYP>ZDELVRY073PL</CIMTYP><MESTYP>ZIBDADV</MESTYP><MESCOD>IBG</MESCOD>
    <SNDPOR>SAPQ01</SNDPOR><SNDPRT>LS</SNDPRT><SNDPRN>Q01CLNT400</SNDPRN>
    <RCVPOR>XMLDIST_MT</RCVPOR><RCVPRT>LS</RCVPRT><RCVPFC>LS</RCVPFC>
    <RCVPRN>AU_DHL</RCVPRN>.... </EDI_DC40></IDOC>

or multiline XML:

  <?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
    <INVOIC02>
      <IDOC>
        <EDI_DC40>
      <TABNAM/>
      <DOCNUM>0000000658056255</DOCNUM>
      <DIRECT/>
      <IDOCTYP>INVOIC02</IDOCTYP>
      <MESTYP>INVOIC</MESTYP>
      <SNDPOR>SAPP01</SNDPOR>
      <SNDPRT/>
      <SNDPRN>ALE400</SNDPRN>
      <RCVPOR>XMLINVOICE</RCVPOR>
      <RCVPRT>KU</RCVPRT>
      <RCVPRN>C18BASWARE</RCVPRN>
      <CREDAT>20171220</CREDAT>
      <CRETIM>222323</CRETIM>
    </EDI_DC40>

The script I've used so far seems to work for small XMLs. However, some XMLs larger than 50 MB throw this error:

Out of memory!
Out of memory!
Callback called exit at /usr/opt/perl5/lib/site_perl/5.10.1/XML/SAX/Base.pm line 1941 (#1)
    (F) A subroutine invoked from an external package via call_sv()
    exited by calling exit.

Out of memory!

So, here's the code I'm using. I would like your help tweaking it:

#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
# use module
use XML::Simple;
use Data::Dumper;

# create object
my $xml = new XML::Simple; 

my $file_list = 'XMLList.txt';
open(my $fh_i, '<:encoding(UTF-8)', $file_list)
  or die "Could not open file '$file_list' $!";

my $csv_out = 'report.csv';
open(my $fh_o, '>', $csv_out)
  or die "Could not open file '$csv_out' $!"; 

while (my $row = <$fh_i>) {
  $row =~ s/\R//g;
  my $data = $xml->XMLin($row);
  print $fh_o "$data->{IDOC}->{EDI_DC40}->{DOCNUM}|";
  print $fh_o "$data->{IDOC}->{EDI_DC40}->{SNDPRN}|";
  print $fh_o "$data->{IDOC}->{EDI_DC40}->{RCVPRN}\n";
}

close $fh_o;

There are 2 answers

Curt Evans

First off, if the file contains newlines,

  while (my $row = <$fh_i>){
  $row =~ s/\R//g;
  my $data = $xml->XMLin($row);

is going to read one line at a time from the file and attempt to do an XML conversion on that line alone instead of the whole document. I would recommend that you slurp each file into a buffer and use a regex to eliminate newlines and carriage returns before the XMLin conversion. Also, XMLin will die unceremoniously if there are any XML errors in the file, so you want to run it in an eval block.
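
A minimal sketch of the eval advice, assuming XML::Simple is installed. The helper name `extract_header` is hypothetical and not part of the question's script. Note that XMLin accepts a file name as well as a string of XML, so when `$row` holds a path, as in the question, XMLin opens and parses that file as a whole. The eval only keeps one bad file from killing the run; it does not reduce memory use, because XML::Simple still builds the entire document in memory.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

# Hypothetical helper: return (DOCNUM, SNDPRN, RCVPRN) for one file,
# or an empty list if the file cannot be parsed or has no EDI_DC40 header.
sub extract_header {
    my ($path) = @_;

    # XMLin dies on malformed XML or a missing file, so trap the exception.
    # The hash lookups sit inside the eval as well, in case the parsed
    # structure is not shaped as expected (for example, repeated IDOC
    # elements are folded into an array rather than a hash).
    my $hdr;
    my $ok = eval { $hdr = XMLin($path)->{IDOC}{EDI_DC40}; 1 };
    unless ($ok) {
        warn "Skipping '$path': $@";
        return;
    }
    return unless ref $hdr eq 'HASH';

    my @fields;
    for my $tag (qw(DOCNUM SNDPRN RCVPRN)) {
        my $value = $hdr->{$tag};
        # By default an empty element such as <DOCNUM/> comes back as an
        # empty hash reference; report it as a blank field instead.
        push @fields, (defined $value && !ref $value) ? $value : '';
    }
    return @fields;
}
```

Inside the question's while loop, this could be called as `my @fields = extract_header($row) or next;` followed by `print $fh_o join('|', @fields), "\n";`.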

brian d foy

I recommend that people stop using XML::Simple when they have a problem using it. That module is nice to get started, but it's not meant to be a long-term solution. Even then, see Why is XML::Simple “Discouraged”?

XML::Twig is what I often use for these tasks. You can set up handlers for tags and get that part of the tree. You process it and move on. That might be as simple as something like this where I set up a subroutine to process each EDI_DC40 as I encounter it:

use Text::CSV_XS;
use XML::Twig;

my $csv = Text::CSV_XS->new;

my $twig = XML::Twig->new(   
    twig_handlers => { 
        'EDI_DC40' => \&process_EDI_DC40,
        },
    );

$twig->parsefile( $ARGV[0] );

sub process_EDI_DC40 {
    my( $twig, $thingy ) = @_;

    my @values = map { $thingy->first_child( $_ )->text } 
        qw(DOCNUM RCVPRN SNDPRN);

    $csv->say( *STDOUT, \@values );
    }
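
For files in the 50 MB range, a variant along the following lines may keep memory use lower. This is a sketch under assumptions, not a tested drop-in: it assumes XML::Twig and Text::CSV_XS are installed, and `write_report` is a hypothetical helper name. It swaps `twig_handlers` for `twig_roots`, which builds subtrees only for the listed elements, calls `purge` after each header to release what has been parsed so far, sets `sep_char` to `|` to match the pipe-delimited report the question asks for, and emits the fields in the question's order (DOCNUM, SNDPRN, RCVPRN).

```perl
use strict;
use warnings;
use Text::CSV_XS;
use XML::Twig;

# Hypothetical helper: write one pipe-delimited row per EDI_DC40 header
# found in each of @paths to the already-open output handle $out.
sub write_report {
    my ($out, @paths) = @_;

    my $csv = Text::CSV_XS->new({ sep_char => '|', eol => "\n", binary => 1 })
        or die 'Cannot create CSV writer: ' . Text::CSV_XS->error_diag;

    for my $path (@paths) {
        my $twig = XML::Twig->new(
            # Only EDI_DC40 subtrees are built; the rest of the (possibly
            # very large) document is skipped rather than kept as a tree.
            twig_roots => {
                EDI_DC40 => sub {
                    my ($t, $hdr) = @_;
                    # first_child_text returns '' when the child is absent,
                    # so a missing tag yields a blank field, not an error.
                    my @values = map { $hdr->first_child_text($_) }
                        qw(DOCNUM SNDPRN RCVPRN);
                    $csv->print($out, \@values);
                    $t->purge;    # release what has been parsed so far
                },
            },
        );

        # parsefile dies on malformed XML; skip that file and carry on.
        eval { $twig->parsefile($path); 1 }
            or warn "Skipping '$path': $@";
    }
    return;
}
```

With the paths from XMLList.txt read into `@paths` (one per line, chomped) and `report.csv` opened for writing as `$fh_o`, the call would be `write_report($fh_o, @paths);`.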