I am doing some data cleaning on a series of XML documents using StAX. I want to read in each document and write out the same document with a few of the tags removed. The problem I'm having is that my output is not valid XML.
You can see my output on the left and the original document on the right [here](https://i.stack.imgur.com/aOptO.jpg). The image at the bottom shows the output of xmllint -valid: it reports that no DTD was found and that there is extra content at the end of the document.
My code to set up the writer is this:
public XMLEventWriter setUpWriter(File blah) throws FileNotFoundException, XMLStreamException {
newFileName = thef.getName().substring(0, thef.getName().indexOf("_") + 1);
try {
writer = outputFactory
.createXMLEventWriter(new FileOutputStream(newFileName + "mush.xml"), "UTF-8");
} catch (XMLStreamException ex) {
ex.printStackTrace();
System.out.println("There was an XML Stream Exception, whatever that means for writer");
}
//outputFactory.setProperty("escapeCharacters", false);
eventFactory = XMLEventFactory.newInstance();
StartDocument startDocument = eventFactory.createStartDocument();
writer.add(startDocument);
//writer.add("<!DOCTYPE DjVuXML>");
return writer;
}
This is my code that handles the actual writing.
if (event.isStartElement()) { //first it looks for start elements
StartElement se = event.asStartElement();
if ("OBJECT".equals(se.getName().getLocalPart())) {
writer.add(se);
} else if ("MAP".equals(se.getName().getLocalPart())) {
writer.add(se);
} else if ("PARAM".equals(se.getName().getLocalPart())) {
writer.add(se);
} else if ("LINE".equals(se.getName().getLocalPart())) {
writer.add(se);
} else if ("DjVuXML".equals(se.getName().getLocalPart())) {
writer.add(se);
}else if ("WORD".equals(se.getName().getLocalPart())) {
word.text = reader.getElementText();
EndElement wordEnd = eventFactory.createEndElement("", "", "WORD");
writer.add(se);
Characters characters = eventFactory.createCharacters(word.text);
writer.add(characters);
writer.add(wordEnd);
}
} else if (event.isEndElement()) {
EndElement ee = event.asEndElement();
if ("MAP".equals(ee.getName().getLocalPart())) {
writer.add(ee);
} else if ("DjVuXML".equals(ee.getName().getLocalPart())) {
writer.add(ee);
} else if ("LINE".equals(ee.getName().getLocalPart())) {
writer.add(ee);
}
else if ("BODY".equals(ee.getName().getLocalPart())) {
writer.add(ee);
}
}
}
writer.flush();
writer.close();
Now that that's out of the way, my question is twofold:
1) Is my output not valid because it lacks the DTD?
1a) If yes, how do I include the DTD? Even if the answer is no, please tell me; this has been bothering me.
2) If it's not the DTD, how do I get this output to validate?
Thanks for your help!
Short answer: in theory, maybe yes and maybe no; in practice, yes.
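In the XML spec, validity is defined thus:
"An XML document is valid if it has an associated document type declaration and if the document complies with the constraints expressed in it."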
Some readers take that to mean that a document is valid against a DTD if and only if the document obeys the constraints in the DTD. In that sense, a document without a document type declaration can be valid against some specified DTD, and a document with a document type declaration can be valid against the DTD specified in its document type declaration, or against any other specified DTD. Or not valid, as the case may be.
Other readers take this definition to mean that a document cannot be valid (at least, in the strict sense) unless it has a document type declaration, and that the question of validity only makes sense with respect to the document type definition specified by the document's document type declaration.
In practice, unless you tell a validating parser where to find the DTD to validate against, the parser has no choice but to take the second, more restrictive view. How can it validate the document if it can't find the DTD? (Some validating parsers accept run-time parameters for pointing to the DTD, others do not.)
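To make the practical point concrete, here is a minimal sketch using the JDK's DOM parser with validation switched on. The class name ValidationSketch, the helper validityErrors, and the toy note element with its one-line internal DTD are assumptions made up for illustration; they are not from the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class ValidationSketch {
    /** Parses the XML with DTD validation on and returns the validity errors reported. */
    static List<String> validityErrors(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // validate against the DTD named in the DOCTYPE, if any
        DocumentBuilder builder = factory.newDocumentBuilder();
        List<String> errors = new ArrayList<>();
        builder.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) { /* ignored in this sketch */ }
            @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
            @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        builder.parse(new InputSource(new StringReader(xml)));
        return errors;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical toy documents: one carries its own DTD, the other has none.
        String withDoctype = "<!DOCTYPE note [<!ELEMENT note (#PCDATA)>]><note>hi</note>";
        String withoutDoctype = "<note>hi</note>";
        System.out.println("with DOCTYPE:    " + validityErrors(withDoctype));
        System.out.println("without DOCTYPE: " + validityErrors(withoutDoctype));
    }
}
```

With the parser bundled in the JDK, the document without a DOCTYPE should come back with at least one validity error even though it is well-formed, which is the same kind of complaint xmllint makes when it finds no DTD.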
From the JavaDocs for the StAX reference implementation, it looks as if
writeDTD(string)
were your friend.
If you're getting a message about "extra content", it seems likely that your output is not only not valid but also not well-formed. Check and fix that first.
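One way to check well-formedness without any DTD is to run the output back through a StAX reader and see whether it gets to the end of the document. This is only a sketch: the class name WellFormedCheck and the sample strings are made up for illustration, and only the element names are borrowed from the question.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class WellFormedCheck {
    /** Returns true if the whole document can be read without a parse error. */
    static boolean isWellFormed(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                reader.next(); // throws XMLStreamException on a well-formedness error
            }
            reader.close();
            return true;
        } catch (XMLStreamException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Balanced tags under a single root element.
        System.out.println(isWellFormed("<DjVuXML><BODY><MAP/></BODY></DjVuXML>"));
        // A second top-level element after the root has been closed.
        System.out.println(isWellFormed("<DjVuXML></DjVuXML><MAP/>"));
        // An end tag whose start tag was never written.
        System.out.println(isWellFormed("<DjVuXML><MAP/></BODY></DjVuXML>"));
    }
}
```

The third sample mirrors a situation worth checking in the question's filter: an end tag being written for an element whose start tag was dropped, or the other way round.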
The likely cause of an "extra content" error message is that you either closed your root element prematurely, or you don't have a root element at all.
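Once the output is well-formed, the document type declaration can be added through the event API as well. Since the question uses XMLEventWriter rather than XMLStreamWriter, the counterpart of writeDTD is XMLEventFactory.createDTD. The sketch below assumes a placeholder system identifier, DjVuXML.dtd, which is not taken from the question; substitute whatever DTD location the original documents declare. The class and method names are also made up for illustration.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;

class DoctypeSketch {
    /** Writes a minimal document whose prolog contains a DOCTYPE declaration. */
    static String writeWithDoctype() throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        XMLEventFactory events = XMLEventFactory.newInstance();

        writer.add(events.createStartDocument());
        // Assumption: "DjVuXML.dtd" is a placeholder system identifier.
        writer.add(events.createDTD("<!DOCTYPE DjVuXML SYSTEM \"DjVuXML.dtd\">"));
        // The declaration must come before the root element.
        writer.add(events.createStartElement("", "", "DjVuXML"));
        writer.add(events.createEndElement("", "", "DjVuXML"));
        writer.add(events.createEndDocument());
        writer.flush();
        writer.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(writeWithDoctype());
    }
}
```

If the input documents already carry a DOCTYPE, another option is to pass the reader's DTD event straight to writer.add(event) instead of constructing a new one.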