Context
I need to parse a large XML file, so I use StAX to process only the elements I'm interested in. I use the default implementation that comes with the JDK.
Problem
When an XML element is immediately followed by another element of the same type (for example <person>) and there is no character between them, the second one is skipped. So if I have 10 one after another, I can only unmarshal 5 persons. For example:
<people><person>..</person><person>..</person></people>
I built a test to show this behaviour against a piece of code encapsulated in the method countUnmarshalledPersonEntities().
The thing is, when there are spaces between the elements, like:
<people><person><id>1</id></person> <person><id>2</id></person></people>
It unmarshals two entities, and that's OK.
But when there are no spaces between the nodes, like:
<people><person><id>1</id></person><person><id>2</id></person></people>
The first unmarshalling skips the next opening tag <person>, and then the second person is ignored. I only parse 1 entity.
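As a sanity check, here is a minimal sketch that counts <person> start tags with the plain StAX cursor only, without any unmarshalling. The class name StaxCountSketch and the helper countPersonStartTags are made up for illustration and are not part of the test below. With both inputs it reports 2, which suggests the reader itself sees both elements and the skipping only shows up once unmarshal is called inside the loop.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StaxCountSketch {

    // Hypothetical helper: counts <person> start tags using only the StAX cursor (no JAXB).
    static int countPersonStartTags(String xml) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            int found = 0;
            while (reader.hasNext()) {
                // next() is the only call that moves the cursor in this loop
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "person".equals(reader.getLocalName())) {
                    found++;
                }
            }
            reader.close();
            return found;
        } catch (XMLStreamException e) {
            // Wrapped so that callers do not have to declare a checked exception
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String noSpace =
                "<people><person><id>1</id></person><person><id>2</id></person></people>";
        String withSpace =
                "<people><person><id>1</id></person> <person><id>2</id></person></people>";
        System.out.println(countPersonStartTags(noSpace));   // prints 2
        System.out.println(countPersonStartTags(withSpace)); // prints 2
    }
}
```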
Test
package org.opensource.lab.stream;

import static org.junit.Assert.assertEquals;

import java.io.InputStream;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlRootElement;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

import org.apache.commons.io.IOUtils;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class StreamParserProblemTest {

    private XMLInputFactory xmlif;
    private XMLStreamReader xmlStreamReader;
    private Unmarshaller personUnmarshaller;

    private final InputStream xmlStreamPersonsNoSeparated = IOUtils.toInputStream(
            "<people><person><id>1</id></person><person><id>2</id></person></people>"
    );
    private final InputStream xmlStreamWithPersonsWhitespaceSeparated = IOUtils.toInputStream(
            "<people><person><id>1</id></person> <person><id>2</id></person></people>"
    );

    @Before
    public void setUp() throws Exception {
        JAXBContext jaxbContext = JAXBContext.newInstance(Person.class);
        personUnmarshaller = jaxbContext.createUnmarshaller();
        xmlif = XMLInputFactory.newInstance();
    }

    @After
    public void cleanUp() throws Exception {
        if (xmlStreamReader != null) {
            xmlStreamReader.close();
        }
    }

    @XmlRootElement(name = "person")
    static class Person {
        String id;
    }

    @Test
    public void whenNoSpacesBetweenNodes_shouldFind2Persons_FAIL() throws Exception {
        xmlStreamReader = xmlif.createXMLStreamReader(xmlStreamPersonsNoSeparated, "UTF-8");
        int personTagsFound = countUnmarshalledPersonEntities();
        // JUnit expects (expected, actual)
        assertEquals(2, personTagsFound);
    }

    /**
     * I don't know why, but if there's at least one whitespace character between nodes of the same type it won't skip.
     *
     * @throws Exception in a test
     */
    @Test
    public void whenWithSpacesBetweenNodes_shouldFind2Persons_SUCCESS() throws Exception {
        xmlStreamReader = xmlif.createXMLStreamReader(xmlStreamWithPersonsWhitespaceSeparated, "UTF-8");
        int personTagsFound = countUnmarshalledPersonEntities();
        // JUnit expects (expected, actual)
        assertEquals(2, personTagsFound);
    }

    /**
     * CODE to test.
     *
     * @return number of unmarshalled persons (people).
     * @throws Exception
     */
    private int countUnmarshalledPersonEntities() throws Exception {
        int personTagsFound = 0;
        while (xmlStreamReader.hasNext()) {
            int type = xmlStreamReader.next();
            if (type == XMLStreamConstants.START_ELEMENT && xmlStreamReader.getName().toString().equalsIgnoreCase("person")) {
                personUnmarshaller.unmarshal(xmlStreamReader, Person.class);
                personTagsFound++;
            }
        }
        return personTagsFound;
    }
}
Does anyone have an idea what the problem with this code is?
Thank you.
Thank you for including the unit test, it really made the problem easier to understand!
When you perform unmarshal on the xmlStreamReader, the reader is implicitly advanced with next for as long as there are tags belonging to your entity. So after your closing person tag, next is called once more and the reader points at the opening person tag of the next entity. With your call to xmlStreamReader.next() in the next iteration, you skip it. This does not happen if there is whitespace between your entities, because after unmarshalling, your reader points at the whitespace instead.
This modified code works for me, both of your unit tests succeed: