PCDATA vs CDATA in XML DTD


In XML DTDs, when defining an element we use #PCDATA to say that the element can contain any parseable text. When defining an attribute, we use CDATA to say that its value can be any character data.

CDATA, as used in XML content (a CDATA section), is text that the XML parser does not parse for markup; it acts as a multi-character escape sequence. To be consistent, when we declare an attribute as CDATA, the parser should not parse its value either. But it does!

Then why could PCDATA not have been used in place of CDATA for declaring attributes?

Update: this has been kept this way for backward compatibility with SGML. What is the reasoning behind this naming in SGML?

2

There are 2 answers

7
Daniel Haley On

A CDATA section, like you would use in an element, is different from the CDATA attribute type.

The parsing that you are most likely observing (like entity references being resolved) is from attribute-value normalization.
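The difference can be seen in a small Python sketch. It assumes the standard library's expat-based xml.etree.ElementTree parser, which is a non-validating XML parser and not an SGML one. The names elem, attr and foo are illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: elem, attr and foo are illustrative names.
DOC = """<!DOCTYPE elem [
  <!ELEMENT elem (#PCDATA)>
  <!ATTLIST elem attr CDATA #IMPLIED>
  <!ENTITY foo "bar">
]>
<elem attr="&foo;"><![CDATA[&foo;]]></elem>"""

root = ET.fromstring(DOC)

# CDATA attribute type: the entity reference in the value is still replaced.
attr_value = root.get("attr")

# CDATA section: the same characters are kept literally, no replacement.
section_text = root.text

print(repr(attr_value), repr(section_text))
```

The same five characters, &foo;, come out differently depending on where they appear: replaced inside the attribute value literal, untouched inside the CDATA section.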

6
Javier On

When used as the declared value of an attribute, CDATA refers to the actual value of the attribute (character data), not to the context in which it is parsed. When parsing elements, on the other hand, we need a distinction between character data with no markup (CDATA) and parsed character data in which delimiters are expected (PCDATA).

At first glance this seems arbitrary, but it is not (see here and here).

In SGML, an attribute value specification may either be quoted (attribute value literal) or unquoted (attribute value).

attribute value specification = attribute value literal | attribute value

When the attribute value is unquoted, only NAME characters are allowed, and that may be further restricted for some declared values such as NUMBER.

The content of an attribute value literal, on the other hand, is a sequence of replaceable character data surrounded by LIT/LITA delimiters (double and single quotes, respectively, in the reference concrete syntax).

attribute value literal =
   ( LIT , replaceable character data *, LIT) | 
   ( LITA , replaceable character data *, LITA)

Replaceable character data is "like CDATA except that entity references and character references are recognized" (Goldfarb, the SGML Handbook).

It follows that the replacement of entity references in attribute value literals does not depend on the declared value of the attribute. Therefore, if you have <!ENTITY foo "bar"> and <elem attr="&foo;">, the entity reference &foo; will be parsed in the context of replaceable character data (LIT recognition mode), yielding <elem attr=bar>. It does not matter whether attr is declared as CDATA, NAME or anything else.

Update

There is no need to say that entities in an attribute have to be parsed, because all attribute types have the same parsing rules: if the attribute value starts with a quote (LIT), then entities are recognized (replaceable character data) and the value ends when a matching end-quote is found.

Here CDATA means that a valid attribute value may contain arbitrary character data once entities have been expanded. Had the attribute been declared as NUMBER, it would have been required to contain numeric characters (or entities that expand to numeric characters).

In the example above, the CDATA attribute with value "&foo;" is equivalent to "bar", in the same way that a NUMBER attribute with value "&#48;" is equivalent to "0" (even though the sequence "&#48;" contains characters other than numeric).
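A rough way to check this behaviour in XML (not SGML) is sketched below. It assumes Python's standard xml.etree.ElementTree parser, which is non-validating. Because XML has no NAME or NUMBER declared values, NMTOKEN stands in for them here. The helper attr_value and the names elem, attr and foo are hypothetical.

```python
import xml.etree.ElementTree as ET

def attr_value(declared_type: str, literal: str) -> str:
    """Parse <elem attr="literal"/> with attr declared as declared_type."""
    doc = (
        "<!DOCTYPE elem [\n"
        "  <!ELEMENT elem EMPTY>\n"
        f"  <!ATTLIST elem attr {declared_type} #IMPLIED>\n"
        '  <!ENTITY foo "bar">\n'
        "]>\n"
        f'<elem attr="{literal}"/>'
    )
    return ET.fromstring(doc).get("attr")

# Try an entity reference and a character reference under two declared types.
# NMTOKEN is an assumption: it stands in for SGML's NAME/NUMBER.
results = {
    (declared, literal): attr_value(declared, literal)
    for declared in ("CDATA", "NMTOKEN")
    for literal in ("&foo;", "&#48;")
}

for key, value in results.items():
    print(key, "->", repr(value))
```

In this sketch the references are replaced the same way under both declarations; the declared value only constrains what the resulting string is allowed to be, which a validating parser would then check.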