While validating against XSD, find the exact missing element in the XML using any of DOM, StAX, SAX parsers

I have an XML file and its corresponding XSD file. While validating with a StAX parser, I've attached an error handler. I encounter two types of errors in a well-formed XML file.

1) Incorrect type of data inside an element, e.g. a string inside an element that is supposed to hold an integer.

2) Missing element: an element that must be present according to the XSD is not present in the XML.

Using a StAX parser and a custom error handler, I'm able to identify the first type of error. But for the second type, a CHARACTERS event is triggered and the text value is that of the immediately following element. I don't know how to figure out which element is missing. Also, why is the CHARACTERS event triggered while the missing element is completely ignored?

As the StAX parser is forward-only, is there a way to identify both kinds of error using other parsers?

import java.io.File;
import java.io.IOException;
import javax.xml.XMLConstants;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class XMLValidation {

    public static void main(String[] args) {

        XMLValidation xmlValidation = new XMLValidation();
        System.out.println(xmlValidation.validateXMLSchema("PHSHumanSubjectsAndClinicalTrialsInfo-V1.0.xsd", "FullPHSHuman.xml"));
    }

    public boolean validateXMLSchema(String xsdPath, String xmlPath){

        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new File(xsdPath));
            StreamSource xmlSource = new StreamSource(xmlPath);
            XMLStreamReader reader = XMLInputFactory.newFactory().createXMLStreamReader(xmlSource);
            Validator validator = schema.newValidator();
            validator.setErrorHandler(new MyErrorHandler(reader));
            validator.validate(new StAXSource(reader));
        } catch (IOException | SAXException | XMLStreamException e) {
            System.out.println("Exception: "+e.getMessage() + " local message " + e.getLocalizedMessage() + " cause " + e.getCause());
            return false;
        }
        return true;
    }
}

class MyErrorHandler implements ErrorHandler {

    private XMLStreamReader reader;

    public MyErrorHandler(XMLStreamReader reader) {
        this.reader = reader;
    }

    @Override
    public void error(SAXParseException e) throws SAXException {
        System.out.println("error");
        warning(e);
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        System.out.println("fatal error");
        warning(e);
    }

    @Override
    public void warning(SAXParseException e) throws SAXException {
        int eventType = reader.getEventType();
        if (eventType == XMLStreamConstants.START_ELEMENT || eventType == XMLStreamConstants.END_ELEMENT) {
            // The first type of error (wrong data type) is detected here.
            System.out.println(reader.getLocalName());
            System.out.println(reader.getNamespaceURI());
        }

        if (eventType == XMLStreamConstants.CHARACTERS) {
            // For a missing element, the reader is positioned on the text of the next element.
            int start = reader.getTextStart();
            int length = reader.getTextLength();
            System.out.println(new String(reader.getTextCharacters(), start, length));
        }
    }
}

Below is the snippet of the well-formed XML file:

<?xml version="1.0" encoding="UTF-8"?>
<PHSHumanSubjectsAndClinicalTrialsInfo:PHSHumanSubjectsAndClinicalTrialsInfo xmlns:PHSHumanSubjectsAndClinicalTrialsInfo="http://apply.grants.gov/forms/PHSHumanSubjectsAndClinicalTrialsInfo-V1.0" PHSHumanSubjectsAndClinicalTrialsInfo:FormVersion="1.0"
>
<!--    <PHSHumanSubjectsAndClinicalTrialsInfo:HumanSubjectsIndicator
    >Y: </PHSHumanSubjectsAndClinicalTrialsInfo:HumanSubjectsIndicator
    >-->
    <PHSHumanSubjectsAndClinicalTrialsInfo:HumanSubjectsIndicator1
    >Y: Yes</PHSHumanSubjectsAndClinicalTrialsInfo:HumanSubjectsIndicator1
    >
    <PHSHumanSubjectsAndClinicalTrialsInfo:HumanSubjectsIndicator2
    >Y: Yes</PHSHumanSubjectsAndClinicalTrialsInfo:HumanSubjectsIndicator2
    >

Here the HumanSubjectsIndicator element is commented out to provoke the second scenario. In this case a CHARACTERS event is triggered in 'MyErrorHandler'. The value 'Y: Yes' is obtained from reader.getTextCharacters(). This value corresponds to the HumanSubjectsIndicator1 element (I found this using the getLocation() method).

Is there a way to get the exact local name of the missing element? If not with StAX, then with another parser?

Thanks.

There is 1 answer

Michael Kay (best answer)

The Saxon XSD validator gives you a message like this when a required element is missing:

Validation error on line 12 column 17 of books.xml:
  FORG0001: In content of element <ITEM>: The content model does not allow element <PRICE>
  to appear immediately after element <PUB-DATE>. It must be preceded by <LANGUAGE>. 
  See http://www.w3.org/TR/xmlschema-1/#cvc-complex-type clause 2.4

You could try pattern-matching the error message and extracting the name of the missing element.
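
As a minimal sketch of that pattern-matching idea, the Java snippet below pulls the element name out of a message shaped like the Saxon example above. The class name `MissingElementExtractor` and the regular expression are illustrative assumptions. The wording of validator messages is not a stable API and may change between versions, so this approach is fragile.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of Saxon or the JDK.
final class MissingElementExtractor {

    // Assumes the message contains "must be preceded by <NAME>", as in the sample message.
    private static final Pattern PRECEDED_BY =
            Pattern.compile("must be preceded by <([^>]+)>");

    // Returns the name of the missing element, or empty if the message has a different shape.
    static Optional<String> missingElement(String message) {
        Matcher m = PRECEDED_BY.matcher(message);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String message = "FORG0001: In content of element <ITEM>: The content model does not allow "
                + "element <PRICE> to appear immediately after element <PUB-DATE>. "
                + "It must be preceded by <LANGUAGE>.";
        System.out.println(missingElement(message).orElse("(no match)"));
    }
}
```

Returning an `Optional` keeps the caller honest about messages that do not match, which is likely for other kinds of validation error.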

The reason most schema processors don't give you this information is the way they work internally. Typically the schema processor constructs a finite state machine (FSM) which indicates, for each element in the input, which elements are allowed to come next. If the next element isn't one of those permitted, it's not immediately obvious from the FSM why this is the case. Saxon does some extra analysis to try to improve the diagnostics: if the input contains a disallowed transition from A to C, then it searches the FSM to discover that there are permitted transitions from A to B and from B to C, and constructs an error message to say that B was missing.
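
The search described above can be illustrated with a toy state machine. This is a simplified sketch, not Saxon's implementation: the `ContentModelFsm` class, the `START` pseudo-state, and the choice to name each state after the last element seen are all assumptions. The content model (PUB-DATE, then LANGUAGE, then PRICE) is inferred from the sample message.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical toy model: each state is named after the last element seen.
final class ContentModelFsm {

    // allowed.get(a) lists the elements permitted immediately after a.
    private final Map<String, List<String>> allowed = new LinkedHashMap<>();

    void allow(String from, String to) {
        allowed.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    boolean permits(String from, String to) {
        return allowed.getOrDefault(from, List.of()).contains(to);
    }

    // If from -> to is disallowed, look for one element b such that
    // from -> b and b -> to are both permitted; b is the likely missing element.
    Optional<String> missingBetween(String from, String to) {
        if (permits(from, to)) {
            return Optional.empty();
        }
        return allowed.getOrDefault(from, List.of()).stream()
                .filter(b -> permits(b, to))
                .findFirst();
    }

    public static void main(String[] args) {
        ContentModelFsm fsm = new ContentModelFsm();
        fsm.allow("START", "PUB-DATE");
        fsm.allow("PUB-DATE", "LANGUAGE");
        fsm.allow("LANGUAGE", "PRICE");
        // PRICE directly after PUB-DATE is not permitted; the search reports LANGUAGE.
        System.out.println(fsm.missingBetween("PUB-DATE", "PRICE").orElse("(unknown)"));
    }
}
```

The sketch only looks one step ahead, so it cannot explain an input where two or more consecutive elements are missing; a fuller search would have to walk longer paths through the machine.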