Is it correct to escape "&", ">" and "<" with &#38;, &#62; and &#60; in XML?

Will something "break" if I use numeric entities instead of the usual recommended alpha entities for reserved chars in XML?

This is part of a rather complex app that allows users to enter bibliographic metadata via XML, CSV or web-based forms. This data can then be extracted in XML (using the ONIX standard) with user-chosen encodings: utf-8, win-1252, etc.

The original programmers (long gone now...) decided to use numeric entities for all chars that cannot be represented in the chosen encoding. XML-reserved chars are considered as non-representable under any encoding. They are given the same treatment and are encoded using numeric entities.
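
For illustration, the behaviour described above could be sketched in Python roughly as follows. This is only a guess at the approach, not the app's actual code; the function name `to_xml_bytes` is hypothetical, and `cp1252` is Python's codec name for win-1252.

```python
def to_xml_bytes(text: str, encoding: str) -> bytes:
    """Hypothetical sketch: escape text for XML output in a given encoding."""
    # Reserved characters are always written as decimal character references.
    # "&" must be handled first so the references added for "<" and ">"
    # are not escaped a second time.
    for ch in "&<>":
        text = text.replace(ch, f"&#{ord(ch)};")
    # Characters the target encoding cannot represent also become
    # decimal character references (built-in "xmlcharrefreplace" handler).
    return text.encode(encoding, errors="xmlcharrefreplace")


win1252 = to_xml_bytes("A & B ł", "cp1252")  # "ł" is not in win-1252
utf8 = to_xml_bytes("A & B ł", "utf-8")      # "ł" is representable in UTF-8
```

With win-1252 both the ampersand and the "ł" come out as numeric references; with UTF-8 only the ampersand does.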

Some users have complained about &, <, >, etc. being encoded as &#38;, etc. instead of the usual alpha codes (&amp;, etc.), and I'd like to know whether these complaints have any substance.

If I can avoid digging through the legacy code to change this behaviour, it would save me a lot of resources.

Answer by Daniel Haley (best answer)

Yes, it's fine to escape using numeric character references.

From the spec (emphasis mine):

The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively. The right angle bracket (>) may be represented using the string "&gt;", and must, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.

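You could also use a hexadecimal character reference. Strictly speaking, &#38; and &#x26; are character references, not entity references; only &amp;, &lt; and &gt; are entity references. All three forms are equivalent: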

&amp; = &#38; = &#x26;

&lt; = &#60; = &#x3C;

&gt; = &#62; = &#x3E;
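
As a quick check, a conforming parser should resolve all three forms to the same characters. Here is a minimal sketch using Python's standard `xml.etree.ElementTree`; the element name `t` and the sample strings are made up for the demo.

```python
import xml.etree.ElementTree as ET

# The same text content, "a & b < c > d", written three ways.
named = "<t>a &amp; b &lt; c &gt; d</t>"
decimal = "<t>a &#38; b &#60; c &#62; d</t>"
hexadecimal = "<t>a &#x26; b &#x3C; c &#x3E; d</t>"

# Parse each document and collect the decoded text of the root element.
texts = [ET.fromstring(doc).text for doc in (named, decimal, hexadecimal)]

for text in texts:
    print(text)
```

Anything that reads the output with a real XML parser sees identical data. The difference is only visible to people or tools that inspect the raw text without parsing it.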