How to get a specific element's value from a large XML file


I am a beginner with SAX in Java. I have a large XML file and want to extract some information from it. Below are an extract from the XML file, the output I want, and my code:

Extract from the XML file:

    ...
    <Synset baseConcept="3" id="mizaAj_n2AR">
      <SynsetRelations>
        <SynsetRelation relType="hyponym" targets="TaboE_n2AR"/>
        <SynsetRelation relType="hyponym" targets="TaboE_n2AR"/>
        <SynsetRelation relType="hypernym" targets="ragobap_n4AR"/>
        <SynsetRelation relType="hypernym" targets="ragobap_n4AR"/>
        <SynsetRelation relType="hypernym" targets="Tiybap_Aln~afos_n1AR"/>
        <SynsetRelation relType="hypernym" targets="Tiybap_Aln~afos_n1AR"/>
      </SynsetRelations>
      <MonolingualExternalRefs>
        <MonolingualExternalRef externalReference="04623612-n" externalSystem="PWN30"/>
      </MonolingualExternalRefs>
    </Synset>
    <Synset baseConcept="3" id="ragobap_n4AR">
      <SynsetRelations>
        <SynsetRelation relType="antonym" targets="mizaAj_n2AR"/>
        <SynsetRelation relType="antonym" targets="mizaAj_n2AR"/>
      </SynsetRelations>
      <MonolingualExternalRefs>
        <MonolingualExternalRef externalReference="04624826-n" externalSystem="PWN30"/>
      </MonolingualExternalRefs>
    </Synset>
    <Synset baseConcept="3" id="tasal~uT_n1AR">
      <SynsetRelations>
        <SynsetRelation relType="has_instance" targets="simap_n1AR"/>
        <SynsetRelation relType="is_instance" targets="simap_n1AR"/>
      </SynsetRelations>
      <MonolingualExternalRefs>
        <MonolingualExternalRef externalReference="04625882-n" externalSystem="PWN30"/>
      </MonolingualExternalRefs>
    </Synset>
    ...

I want the number of occurrences of each relType value:

hyponym: 2
hypernym: 4
antonym: 2
has_instance: 1
is_instance: 1

The code (the main class and my handler):

    import java.io.IOException;
    import org.xml.sax.SAXException;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.XMLReaderFactory;

    public class Main {

        public static void main(String[] args) throws SAXException, IOException{

            XMLReader p = XMLReaderFactory.createXMLReader();
            p.setContentHandler(new handler());
            p.parse("test1.xml");
        }
    }
    ----------------------------------------
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class handler extends DefaultHandler {

        @Override
        public void startElement(String SpacenameURI, String localName,
                String qName, Attributes attrs) {

            System.out.println("qname = " + qName);
            String node = qName;

            if (attrs != null) {
                for (int i = 0; i < attrs.getLength(); i++) {
                    // get the attribute's name
                    String aname = attrs.getLocalName(i);
                    // and print its value
                    System.out.println("Attribute " + aname + " value: " + attrs.getValue(i));
                }
            }
        }
    }

There is 1 answer

Answer by Geoffrey De Vylder (best answer):

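import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Put this method in your Main class; it is static so main can call it directly.
public static Map<String, Integer> countElements(File xmlFile) {

    Map<String, Integer> counts = new HashMap<>();

    // try-with-resources closes the file even if parsing fails
    try (FileInputStream fileInputStream = new FileInputStream(xmlFile)) {
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        XMLStreamReader reader = inputFactory.createXMLStreamReader(fileInputStream);

        while (reader.hasNext()) {
            reader.next();
            if (reader.isStartElement() && reader.getLocalName().equals("SynsetRelation")) {
                // a null namespace URI means the namespace is not checked
                String relTypeValue = reader.getAttributeValue(null, "relType");

                // first occurrence of this relType: start its counter at 0
                if (!counts.containsKey(relTypeValue)) {
                    counts.put(relTypeValue, 0);
                }

                counts.put(relTypeValue, counts.get(relTypeValue) + 1);
            }
        }

        reader.close();
    } catch (XMLStreamException | IOException e) {
        e.printStackTrace();
    }

    return counts;
}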

This code uses a StAX stream reader (XMLStreamReader), which reads the file one parsing event at a time instead of loading the whole document into memory. This keeps memory use low, even for large files.

A map keeps track of the counts, keyed by the relType value. Every time I encounter a "SynsetRelation" element, I first check whether its relType already has an entry in the map (adding one with a count of 0 if not), then I increment the counter.
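
The check-then-increment step can also be written with `Map.merge`, available since Java 8. This is only a minimal sketch, assuming the relType values have already been collected into a list; the names `RelTypeTally` and `tally` are made up for illustration and are not part of the answer's code.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, used only to illustrate the counting idiom.
class RelTypeTally {

    // Counts how often each value occurs, keeping first-seen order.
    static Map<String, Integer> tally(List<String> relTypes) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String relType : relTypes) {
            // Stores 1 for a new key, otherwise adds 1 to the existing count.
            counts.merge(relType, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // relType values of the first two Synset elements in the question's extract.
        List<String> relTypes = List.of(
                "hyponym", "hyponym",
                "hypernym", "hypernym", "hypernym", "hypernym",
                "antonym", "antonym");
        System.out.println(tally(relTypes)); // prints {hyponym=2, hypernym=4, antonym=2}
    }
}
```

A `LinkedHashMap` is used here so the output follows the order in which the values first appear; a plain `HashMap` counts the same way but gives no ordering guarantee.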

The result is a map containing the count for each relType value that was found.

You would use it like this in your main class:

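import java.io.File;
import java.util.Map;

public class Main {
    public static void main(String[] args) {
        // Assumes countElements is declared as a static method in this class.
        Map<String, Integer> results = countElements(new File("your file location here.xml"));
        results.forEach((relType, count) -> System.out.println(relType + ": " + count));
    }
}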
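
Since the question starts from SAX, the same counting can probably be done in a SAX handler as well. The following is a rough sketch under a few assumptions: the handler name `RelTypeCountHandler` and its `count` helper are invented for this example, the parser comes from `SAXParserFactory` rather than the `XMLReaderFactory` used in the question, and the XML is passed in as a string only to keep the demo self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler name, chosen for this sketch.
class RelTypeCountHandler extends DefaultHandler {

    private final Map<String, Integer> counts = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // A default SAXParserFactory is not namespace-aware, so match on qName.
        if ("SynsetRelation".equals(qName)) {
            String relType = attrs.getValue("relType");
            if (relType != null) {
                counts.merge(relType, 1, Integer::sum);
            }
        }
    }

    Map<String, Integer> getCounts() {
        return counts;
    }

    // Parses an XML string and returns the count per relType value.
    static Map<String, Integer> count(String xml) {
        RelTypeCountHandler handler = new RelTypeCountHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return handler.getCounts();
    }

    public static void main(String[] args) {
        // Third Synset of the question's extract, shortened to the relevant part.
        String xml = "<Synset baseConcept=\"3\" id=\"tasal~uT_n1AR\"><SynsetRelations>"
                + "<SynsetRelation relType=\"has_instance\" targets=\"simap_n1AR\"/>"
                + "<SynsetRelation relType=\"is_instance\" targets=\"simap_n1AR\"/>"
                + "</SynsetRelations></Synset>";
        count(xml).forEach((relType, n) -> System.out.println(relType + ": " + n));
    }
}
```

For a real file, `SAXParser.parse(File, DefaultHandler)` can take the place of the `InputSource` call. Like the StAX version, the handler only sees one event at a time, so it should also cope with large files.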