What is the best way to replace the punctuation in tag names?

I have my reasons for replacing the punctuation character with an underscore in all tag names (please don't ask why; it is not relevant to the question).

What is relevant is the transformation I would like to perform. Given this input:

<data:data>
    <another:data>Content</another:data>
    <another:data>Content</another:data>
    <another:data>Content</another:data>
    <another:data attribute="attr : content">This content should : not be affected</another:data>
    <another:data><![CDATA[This content should : not be affected]]></another:data>
</data:data>

the result should be:

<data_data>
    <another_data>Content</another_data>
    <another_data>Content</another_data>
    <another_data>Content</another_data>
    <another_data attribute="attr : content">This content should : not be affected</another_data>
    <another_data><![CDATA[This content should : not be affected]]></another_data>
</data_data>

But what is the best way to do this in PHP?

I know that regex is not a proper way to parse HTML or XML, but I'm afraid I'm stuck with preg_replace() in my situation, because DOMDocument() can't read my ~250K rows of badly structured, namespaced XML content from the provider. The provided XSD files (~25 schemas) have been outdated for 6 years now, and the content provider is unwilling to fix this.

I found out that SimpleXMLElement() works after replacing the : with _.

There are 4 answers

3
Rodrigo López On BEST ANSWER

You can capture what is between < and > then replace : with _, like this:

$string = "<data:data>
<another:data:data>Content:</another:data>
<another:data>:Content</another:data>
<another:data>Content</another:data>
<another:data><![CDATA[This content should : not be affected]]>Content</another:data>
</data:data>";

$regex = '~<[^!][^>]*>~';
$replaced = preg_replace_callback(
    $regex,
    function($m) { return str_replace(':', '_', $m[0]);},
    $string);

echo $replaced;

Output:

<data_data>
<another_data_data>Content:</another_data>
<another_data>:Content</another_data>
<another_data>Content</another_data>
<another_data><![CDATA[This content should : not be affected]]>Content</another_data>
</data_data>
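
Note that the pattern above matches the whole tag, so a colon inside an attribute value would be replaced as well. A minimal sketch of a narrower variant follows, assuming tag names consist only of word characters, dots, hyphens and colons. The sample input and the variable names $xml and $fixed are hypothetical. It matches only the name directly after < or </, so attribute values, text and the CDATA opener are left alone. Namespaced attribute names are left as they are, and any literal tag-like text inside a CDATA section or comment would still be rewritten.

```php
<?php
// Hypothetical sample input, modelled on the XML in the question.
$xml = '<data:data><another:data attribute="attr : content">This : stays</another:data>'
     . '<another:data><![CDATA[This : stays]]></another:data></data:data>';

// "</?" consumes the opening "<" or "</", then \K drops it from the match,
// so $m[0] holds only the tag name. "<![CDATA[" never matches because "!"
// is not a name character.
$fixed = preg_replace_callback(
    '~</?\K[\w.-]+(?::[\w.-]+)+~',
    function ($m) {
        // Replace every colon inside the matched tag name only.
        return str_replace(':', '_', $m[0]);
    },
    $xml
);

echo $fixed, PHP_EOL;
```

This is an illustration under the stated assumptions rather than a general XML rewriter; for well-formed input a real parser remains the safer choice.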
1
Robo Robok On

If you don't use attributes, this code will work for you:

$string = preg_replace_callback(
    '#</?[\w:]+>#',
    function ($match) {
        list($tag) = $match;
        return str_replace(':', '_', $tag);
    },
    $string
);

If you do use attributes, check this out: How do I change XML tag names with PHP?
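
To illustrate the attribute caveat, here is a small sketch that runs the same pattern on a hypothetical input ($xml and $result are names made up for this example). A tag that carries attributes is skipped entirely, which leaves its opening and closing tags mismatched.

```php
<?php
// Hypothetical sample input; only the pattern comes from the answer above.
$xml = '<a:b><c:d x="1 : 2">text : here</c:d></a:b>';

$result = preg_replace_callback(
    '#</?[\w:]+>#',
    function ($match) {
        // $match[0] is the complete tag, e.g. "<a:b>" or "</c:d>".
        return str_replace(':', '_', $match[0]);
    },
    $xml
);

// The opening <c:d ...> tag has an attribute, so the pattern does not match it.
echo $result, PHP_EOL;
// prints: <a_b><c:d x="1 : 2">text : here</c_d></a_b>
```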

0
apgp88 On

You can try the following regex:

<\/?\w+(:)\w+>

Working Demo

You can use the capture group to replace the colon with _.
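
One way this could look in PHP is sketched below. It is a slight variation on the regex above: instead of capturing the colon, it captures the parts on either side so they can be put back around an underscore. The sample input and variable names are hypothetical, and like the original pattern it only handles a single colon and tags without attributes.

```php
<?php
// Hypothetical sample input.
$xml = '<another:data>Content : here</another:data>';

// Group 1 holds "<" or "</" plus the prefix, group 2 the local name plus ">".
// ${1} and ${2} are written with braces so the underscore between them is
// unambiguous in the replacement string.
$result = preg_replace('~(</?\w+):(\w+>)~', '${1}_${2}', $xml);

echo $result, PHP_EOL;
// prints: <another_data>Content : here</another_data>
```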

8
SuperDJ On

Do you mean the following?

$string = "<data:data>
<another:data>Content</another:data>
<another:data>Content</another:data>
<another:data>Content</another:data>
<another:data>Content</another:data>
</data:data>";

$string = str_replace(':', '_', $string);

or

$string = str_replace('another:data', 'another_data', $string);

Update

Maybe you can try the following:

$replace = array('another:data' => 'another_data', '/another:data' => '/another_data'); // So you can easily add more strings to replace
$string = strtr($string, $replace); // strtr() returns the new string, so assign the result

Link: http://php.net/strtr. I only just found this, so I don't know whether it fits your case.