XML::Twig - set_text without clobbering structure

The XML::Twig documentation for the set_text method comes with a warning:

set_text ($string) Set the text for the element: if the element is a PCDATA, just set its text, otherwise cut all the children of the element and create a single PCDATA child for it, which holds the text.

So if I want to do something simple, like - say - changing the case of all the text in my XML document:

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

my $twig = XML::Twig->new(
    'pretty_print'  => 'indented_a',
    'twig_handlers' => {
        '_all_' => sub {
            my $newtext = $_->text_only;
            $newtext =~ tr/a-z/A-Z/;
            $_->set_text($newtext);
        }
    }
);
$twig->parse( \*DATA );
$twig->print;

__DATA__
<root>
    <some_content>fish
        <a_subnode>morefish</a_subnode>
    </some_content>
    <some_more_content>cabbage</some_more_content>
</root>

This - because of set_text replacing children - gets clobbered into:

<root></root>

But if I focus on just one (bottom level) node (e.g. a_subnode) then it works fine.

Is there an elegant way to replace/transform text within an element without clobbering the data structure below it? I mean, I can test for the presence of children or something similar, but ... it seems like there should be a better way of doing this. (A different library maybe?)

(And for the sake of clarity - this is my example of transliterating all the text in a document; my actual use case is rather more convoluted, but is still 'about' in-place text transformation.)

I'm considering perhaps a node cut-and-paste approach (cut all children, replace text, paste all children), but that seems inefficient.

There are 2 answers

mirod (best answer)

Instead of having the handler on _all_, try having it only on text elements: #TEXT, and change text_only to text. It should work.
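A minimal sketch of that suggestion, reusing the sample document from the question. The variable names $xml and $result are illustrative only, and pretty_print is left out to keep the example short. Because the handler fires on text nodes rather than on their parent elements, set_text has no children to cut:

```perl
#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

# Sample document copied from the question.
my $xml = <<'XML';
<root>
    <some_content>fish
        <a_subnode>morefish</a_subnode>
    </some_content>
    <some_more_content>cabbage</some_more_content>
</root>
XML

my $twig = XML::Twig->new(
    'twig_handlers' => {
        # Trigger on text nodes only, as suggested above.
        '#TEXT' => sub {
            my ( $t, $text_node ) = @_;

            # A text node has no child elements, so nothing is cut here.
            $text_node->set_text( uc $text_node->text );
        },
    },
);
$twig->parse($xml);

my $result = $twig->sprint;
print $result, "\n";
```

Unlike the cut-and-paste approach, this should leave text and child elements in their original order, since only the text nodes themselves are modified.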

update: Or use the char_handler option when you create the twig: char_handler => sub { uc shift }, instead of the handler.
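The char_handler variant might look like the sketch below, again against the question's sample document ($xml and $result are illustrative names). The callback receives each piece of character data as a string and returns the replacement string, so no element is ever touched directly:

```perl
#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

# Sample document copied from the question.
my $xml = <<'XML';
<root>
    <some_content>fish
        <a_subnode>morefish</a_subnode>
    </some_content>
    <some_more_content>cabbage</some_more_content>
</root>
XML

my $twig = XML::Twig->new(
    # Called for character data as it is parsed; the return value
    # replaces the original string.
    'char_handler' => sub {
        my ($text) = @_;
        return uc $text;
    },
);
$twig->parse($xml);

my $result = $twig->sprint;
print $result, "\n";
```

This suits transformations that only need the string itself; a handler on text nodes is probably the better fit when the change depends on where in the tree the text sits.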

Sobrique

My current approach is to:

  • iterate all the nodes.
  • cut all the children.
  • amend the text.
  • paste all the children.

This seems inefficient, but it does appear to work:

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

sub replace_text {
    my ( $twig, $element ) = @_;

    my $newtext = $element->text_only;
    my @children;
    foreach my $child ( $element->children ) {
        if ( not $child->tag eq "#PCDATA" ) {
            push( @children, $child->cut );
        }
    }
    $newtext =~ tr/a-z/A-Z/;
    $element->set_text($newtext);

    $_->paste( 'last_child', $element ) for @children;
}

my $twig =
    XML::Twig->new( 'twig_handlers' => { '_all_' => \&replace_text, } );
$twig->parse( \*DATA );

print "Result:\n";
$twig->print;

__DATA__
<root>
    <some_content>fish
        <a_subnode>morefish</a_subnode>
    </some_content>
    <some_more_content>cabbage</some_more_content>
</root>

This turns my output into:

<root><some_content>FISH
        <a_subnode>MOREFISH</a_subnode></some_content><some_more_content>CABBAGE</some_more_content></root>

So whilst it does transmogrify the nodes, it also, for some reason, breaks the output format.

Reparsing it:

XML::Twig->new( 'pretty_print' => 'indented_a' )->parse( $twig->sprint )->print;

This seems to do the trick (although double parsing just to reformat seems even less elegant):

<root>
  <some_content>FISH
        <a_subnode>MOREFISH</a_subnode></some_content>
  <some_more_content>CABBAGE</some_more_content>
</root>