PHP SimpleXMLElement not making quotes into entities


I'm writing a PHP script to generate some XML documents and I'm having trouble with SimpleXML and quotes.

If I have code like this:

$xml = new SimpleXMLElement('<myxml />');
$xml->addChild('title','My Feed');
$xml->addChild('description','Entity data here - &amp; &quot; &apos;');

If I print_r the $xml object, I get this:

print_r($xml);
SimpleXMLElement Object
(
  [title] => My Feed
  [description] => Entity data here - & " '
)

So it seems that once the data is in the object, the entities are converted back into their respective characters. However, when I call asXML() on the object to get the XML, it shows me this:

echo $xml->asXML();
<?xml version="1.0"?>
  <myxml>
  <title>My Feed</title>
  <description>Entity data here - &amp; " '</description>
</myxml>

It turns the & back into an entity, but it leaves the quotes as literal characters. Shouldn't it convert them all to entities?


There are 2 answers

IMSoP On

" and ' are only special characters in XML when they appear inside an attribute value. Within the text content of an element, there is no ambiguity about the meaning of " or ', because the only special characters the parser looks for there are < (which starts an opening or closing tag) and & (which starts an entity reference).

So while <foo bar="hello "world"" /> is not well-formed XML, <foo>hello "world"</foo> is, so no escaping is required there.
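To make the difference concrete, here is a minimal sketch that feeds those strings to SimpleXML's parser. The sample documents are the ones above; the variable names are only for illustration.

```php
<?php
// Collect parse errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Unescaped double quotes inside a double-quoted attribute: not well-formed.
$bad = simplexml_load_string('<foo bar="hello "world"" />');
var_dump($bad === false);

// The same quotes written as entities inside the attribute value parse fine.
$attr = simplexml_load_string('<foo bar="hello &quot;world&quot;" />');
echo $attr['bar'], "\n";

// In element text, literal quotes and &quot; give the same string.
$literal = simplexml_load_string('<foo>hello "world"</foo>');
$escaped = simplexml_load_string('<foo>hello &quot;world&quot;</foo>');
var_dump((string) $literal === (string) $escaped);
```

Since both text forms parse to the same value, a conforming consumer of the generated XML should not care which one asXML() writes.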

(Just because it's not required, doesn't mean it's not possible, so there may be a fuller answer as to why SimpleXML doesn't at least retain the entities you'd put there voluntarily.)

vandroid On

Seems like the problem is a bit deeper: https://bugs.php.net/bug.php?id=49437

It's a "feature" or a bug in the underlying libxml library: by default, all entities except &lt;, &gt; and &amp; are unescaped.
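A small sketch of that behaviour, assuming a stock PHP build with the SimpleXML extension. The sample string is made up for illustration. It uses property assignment rather than addChild() so that the value is stored as plain text.

```php
<?php
$xml = new SimpleXMLElement('<myxml />');

// Property assignment stores the string as literal text, so the bare
// ampersand here is not treated as the start of an entity reference.
$xml->description = 'lt < gt > amp & quot " apos \'';

// On output, only <, > and & are written as entities in element text;
// the double quote and the apostrophe stay as literal characters.
echo $xml->asXML();
```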

The irony here is that you can explicitly enable unescaping (LIBXML_NOENT), but you cannot disable it.

The only way I found to deal with it is to use CDATA nodes, where the data is left untouched. That's not an ideal solution, of course.
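SimpleXML itself has no method for creating CDATA sections, so one way to sketch the workaround is to go through the DOM extension. This assumes ext/dom is available, and it is only one possible approach, not necessarily the one used in the answer.

```php
<?php
$xml = new SimpleXMLElement('<myxml />');
$xml->addChild('title', 'My Feed');

// Create the element empty, then fill it through its DOM counterpart.
$description = $xml->addChild('description');
$node = dom_import_simplexml($description);

// The CDATA section keeps the string exactly as written.
$cdata = $node->ownerDocument->createCDATASection(
    'Entity data here - &amp; &quot; &apos;'
);
$node->appendChild($cdata);

echo $xml->asXML();
```

Note that entity references are not decoded inside CDATA, so a consumer reading this element gets the literal text `&quot;` rather than a quote character. That is part of why this is not an ideal solution.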