Java: unescaped quotes terminate XML text node value


I'm writing an Android app in Java. The app emulates flashcards, with questions on one side and answers on the other.
I am currently reading a (I believe) well-formed .xml document, produced by a Qt-based program that has no problem reading its own output back in, using the following fairly standard code:

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    try
    {
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document dom = builder.parse(new File(diskLocation));
        Element pack = dom.getDocumentElement();
        NodeList flashCards = pack.getElementsByTagName("flashcard");
        for (int i=0; i < flashCards.getLength(); i++)
        {
            FlashCard flashCard = new FlashCard();

            Node cardNode = flashCards.item(i);
            NodeList cardProperties = cardNode.getChildNodes();
            for (int j=0;j<cardProperties.getLength();j++)
            {
                Node cardProperty = cardProperties.item(j);
                String propertyName = cardProperty.getNodeName();
                if (propertyName.equalsIgnoreCase("Question"))
                {
                    flashCard.setQuestion(cardProperty.getFirstChild().getNodeValue());
                }
                else if (propertyName.equalsIgnoreCase("Answer"))
                {
                    flashCard.setAnswer(cardProperty.getFirstChild().getNodeValue());
                }
                else if
                    ...etc.

Here is a flashcard for learning XML:

 <flashcard>
  <Question>What is the entity reference for ' " '?</Question>
  <Answer>&amp;quot;</Answer>
  <Info></Info>
  <Hint></Hint>
  <KnownLevel>1</KnownLevel>
  <LastCorrect>1</LastCorrect>
  <CurrentStreak>4</CurrentStreak>
  <LevelUp>4</LevelUp>
  <AnswerTime>0</AnswerTime>
 </flashcard>

As I understand the standard, '<' and '&' must be escaped ('>' probably should be), but quotes and apostrophes need not be (unless they're in attribute values). Yet when the question and answer for this card are parsed, they come out as "What is the entity reference for '" and "&" respectively.

The input seems to follow the standard. Is the Java XML DOM implementation really not standards-compliant, or am I missing something?

I find it very difficult to believe I'm the only one to have (had) this problem, yet I've searched both Google and Stack Overflow and found surprisingly little of direct relevance.

Thank you for any help!

Rob

Edit: I've just realised the file has a !DOCTYPE, but doesn't start with an <?xml ... ?> declaration.
I wonder if this makes any difference.


There is 1 answer

LJ2

From the standard:

In the content of elements, character data is any string of characters which does not contain the start-delimiter of any markup

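The start-delimiters of markup are '<' and '&', so those are the characters that must be escaped in the content of elements. Neither ' nor " starts any markup, which means that quotes and apostrophes may appear unescaped in the content of elements; they only need escaping inside an attribute value delimited by the same quote character.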
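
Independently of the escaping rules, the truncated values in the question are consistent with the element's text being stored as several adjacent Text nodes, which some DOM implementations may do around entity references or special characters. In that case getFirstChild().getNodeValue() returns only the first fragment. The following is a minimal sketch, not a confirmed diagnosis of the asker's parser: it reads the whole text of an element with Node.getTextContent() instead. The class name FlashCardText and the helpers parse and textOf are made up for illustration, and the XML is inlined as a string rather than read from diskLocation.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper class, not part of the original app.
class FlashCardText {

    // Parses an XML string into a DOM Document (the question parses a File instead).
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Returns the complete text of the first <tag> descendant of card.
    // getTextContent() concatenates every Text node under the element, so the
    // result does not depend on how the parser split the character data.
    // It returns "" for a missing or empty element such as <Info></Info>, where
    // getFirstChild() would be null.
    static String textOf(Element card, String tag) {
        NodeList matches = card.getElementsByTagName(tag);
        if (matches.getLength() == 0) {
            return "";
        }
        return matches.item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml =
            "<flashcard>"
            + "<Question>What is the entity reference for ' \" '?</Question>"
            + "<Answer>&amp;quot;</Answer>"
            + "<Info></Info>"
            + "</flashcard>";
        Element card = parse(xml).getDocumentElement();
        System.out.println(textOf(card, "Question"));
        System.out.println(textOf(card, "Answer"));
    }
}
```

In the question's loop this would amount to calling cardProperty.getTextContent() in place of cardProperty.getFirstChild().getNodeValue(). Calling normalize() on the document before walking it is another option worth trying, since it merges adjacent Text nodes.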