Unmarshalling with StAX - it skips elements if there's no whitespace between them

Context

I need to parse an XML document. It is large, so I use StAX to process only the elements I am interested in. I use the default implementation that ships with the JDK.

Problem

When an XML element is immediately followed by another element of the same type (for example <person>) with no characters between them, the second one is skipped. So if I have 10 elements one after another, I can only unmarshal 5 persons. For example:

<people><person>..</person><person>..</person></people>

I built a test to show this behaviour against a piece of code encapsulated in a method countUnmarshalledPersonEntities().

The thing is, when there is whitespace between the elements, like:

<people><person><id>1</id></person> <person><id>2</id></person></people>

It unmarshals two entities, and that's OK.

But when there is no whitespace between the nodes, like:

<people><person><id>1</id></person><person><id>2</id></person></people>

The first unmarshalling skips the next opening <person> tag, so the second person is ignored and only 1 entity is parsed.

Test

package org.opensource.lab.stream;

import static org.junit.Assert.assertEquals;

import java.io.InputStream;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlRootElement;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

import org.apache.commons.io.IOUtils;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class StreamParserProblemTest {
    private XMLInputFactory xmlif;
    private XMLStreamReader xmlStreamReader;
    private Unmarshaller personUnmarshaller;

    private final InputStream xmlStreamPersonsNoSeparated = IOUtils.toInputStream(
            "<people><person><id>1</id></person><person><id>2</id></person></people>"
            );
    private final InputStream xmlStreamWithPersonsWhitespaceSeparated = IOUtils.toInputStream(
            "<people><person><id>1</id></person> <person><id>2</id></person></people>"
            );

    @Before
    public void setUp() throws Exception {
        JAXBContext jaxbContext = JAXBContext.newInstance(Person.class);
        personUnmarshaller = jaxbContext.createUnmarshaller();
        xmlif = XMLInputFactory.newInstance();
    }

    @After
    public void cleanUp() throws Exception {
        if(xmlStreamReader != null) {
            xmlStreamReader.close();
        }
    }

    @XmlRootElement(name = "person")
    static class Person {
        String id;
    }

    @Test
    public void whenNoSpacesBetweenNodes_shouldFind2Persons_FAIL() throws Exception {
        xmlStreamReader = xmlif.createXMLStreamReader(xmlStreamPersonsNoSeparated, "UTF-8");

        int personTagsFound = countUnmarshalledPersonEntities();

        assertEquals(2, personTagsFound);
    }

    /**
     * I don't know why, but if there's at least one whitespace character between nodes of the same type, none is skipped.
     * 
     * @throws Exception in a test
     */
    @Test
    public void whenWithSpacesBetweenNodes_shouldFind2Persons_SUCCESS() throws Exception {
        xmlStreamReader = xmlif.createXMLStreamReader(xmlStreamWithPersonsWhitespaceSeparated, "UTF-8");

        int personTagsFound = countUnmarshalledPersonEntities();

        assertEquals(2, personTagsFound);
    }

    /**
     * CODE to test.
     * 
     * @return number of unmarshalled persons (people).
     * @throws Exception
     */
    private int countUnmarshalledPersonEntities() throws Exception {
        int personTagsFound = 0;

        while (xmlStreamReader.hasNext()) {
            int type = xmlStreamReader.next();

            if (type == XMLStreamConstants.START_ELEMENT && xmlStreamReader.getName().toString().equalsIgnoreCase("person")) {
                personUnmarshaller.unmarshal(xmlStreamReader, Person.class);
                personTagsFound++;
            }
        }

        return personTagsFound;
    }
}

Does anyone have an idea what is wrong with the code?

Thank you.

Answer by javahippie

Thank you for including the unit test, it made the problem much easier to understand!

When you call unmarshal on the xmlStreamReader, the unmarshaller advances the reader itself, calling next for as long as there are events belonging to your entity, and it stops on the event right after the closing person tag. With no whitespace between entities, that event is the opening person tag of the next entity. Your call to xmlStreamReader.next() in the next loop iteration then moves past it, so that entity is skipped. This does not happen if there is whitespace between your entities, because after unmarshalling the reader points at the whitespace instead, and the following call to next() lands on the opening person tag.

This modified code works for me; both of your unit tests succeed:

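    while (xmlStreamReader.hasNext()) {
        if (xmlStreamReader.isStartElement() && xmlStreamReader.getName().toString().equalsIgnoreCase("person")) {
            // unmarshal() already moves the reader past the closing person tag,
            // so do not call next() in this branch.
            personUnmarshaller.unmarshal(xmlStreamReader, Person.class);
            personTagsFound++;
        } else {
            // Only advance manually when the current event was not consumed by unmarshal().
            xmlStreamReader.next();
        }
    }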
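
The cursor behaviour described above can also be reproduced with plain StAX, without JAXB on the classpath. The following is only a sketch: the class name CursorSkipDemo and the helper consumeElement are made up for illustration, and consumeElement is a stand-in that assumes the unmarshaller leaves the reader on the event right after the closing tag. It compares the loop from the question (advance first, then check) with the loop from this answer (check first, advance only if nothing was consumed).

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class CursorSkipDemo {

    // Hypothetical stand-in for Unmarshaller.unmarshal(reader, type): reads the
    // current element to its matching end tag, then steps one event further.
    // (Assumption: this mirrors where the real unmarshaller leaves the cursor.)
    static void consumeElement(XMLStreamReader reader) throws XMLStreamException {
        int depth = 1;
        while (depth > 0) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                depth--;
            }
        }
        reader.next(); // move to the event right after the closing tag
    }

    // Loop shape from the question: always call next() before checking.
    static int countAdvancingFirst(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        int count = 0;
        while (reader.hasNext()) {
            int type = reader.next();
            if (type == XMLStreamConstants.START_ELEMENT
                    && "person".equals(reader.getLocalName())) {
                consumeElement(reader);
                count++;
            }
        }
        reader.close();
        return count;
    }

    // Loop shape from the answer: check the current event, advance only otherwise.
    static int countCheckingFirst(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        int count = 0;
        while (reader.hasNext()) {
            if (reader.isStartElement() && "person".equals(reader.getLocalName())) {
                consumeElement(reader);
                count++;
            } else {
                reader.next();
            }
        }
        reader.close();
        return count;
    }

    public static void main(String[] args) throws XMLStreamException {
        String noSpace =
                "<people><person><id>1</id></person><person><id>2</id></person></people>";
        String spaced =
                "<people><person><id>1</id></person> <person><id>2</id></person></people>";
        System.out.println(countAdvancingFirst(noSpace)); // 1: second person is skipped
        System.out.println(countAdvancingFirst(spaced));  // 2
        System.out.println(countCheckingFirst(noSpace));  // 2
        System.out.println(countCheckingFirst(spaced));   // 2
    }
}
```

Checking the current event before advancing means an opening tag that the reader already sits on is never stepped over, regardless of whether whitespace separates the elements.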