Most effective and efficient way to store tree with co-indexed nodes in XML

274 views Asked by At

I am picking up an older project of mine where effectiveness and efficiency are key. I have 100's of GB of XML data that I parse. For each XML tree (millions of them) some XML attributes are used from which patterns are deducted. For this question, though, I shall simplify things greatly - but it is important to remember that there is a lot of data and that fast processing, and tidy storing of the results in XML is important. In addition, the resulting XML tree will need to be traversed as well. In fact, it will serve as a custom indexing mechanism used in BaseX but I'll come back to that later on.

From every tree (and its subtrees, but that's not important now) a pattern is created that is based on the root node's direct children. As a basic example, take the following XML tree:

<node letter="X">
  <node letter="A"/>
  <node letter="B"/>
  <node letter="C"/>
  <node letter="D"/>
</node>

The pattern is created by taking all the children's letter attributes and sorting them and joining them. In this case, the pattern would be ABCD.

For each tree, all possible subtrees are generated as well (order not important, min. combination 2), i.e., all possible combinations of the children, and their patterns are generated. I'm not including the XML of the combination trees, but the possible patterns, in addition to ABCD, would thus be:

AB
ABC
ABD
AC
ACD
AD
BC
BCD
BD
CD

In a hierarchy, they look like this (note the duplication in the terminal nodes).

tree view of patterns

In addition, the full pattern that we started from should be indicated somehow in the generated XML to indicate that it was the 'main' pattern in a tree. Ultimately, I want to recover all patterns derived from a given pattern (cf. later) and filter those to only the ones being 'main' patterns.

From a query point of view, you couldargue that I am doing a bottom-up look-up. If I am looking for a generalized pattern, e.g. AB, the more specific patterns should also be found becauseABis part ofABCD. So if I were to look for pattern AB with the data above, I would want to find

ABCD
ABC
ABD

It may be clear that there are two steps here. First, generalize one level up: AB -> ABC,ABD and then ABC->ABCD (and again ABD->ABDC but each result should be found only once of course). The intermediate steps ABC and ABD are important for me as well, not only the final ABCD result.

The concern that I am faced with, is how to usefully store a tree such as the one presented in the image so that it is 1. easy to query in the way that I discussed; 2. as sparse as possible without losing any nodes; 3. efficient to build. The latter point is less important for this question, as I will implement the building script myself - which will be done in Python 3.6.

The idea that I had up to now is to have quite a flat structure that works by 'coindexing' the immediate parent node. It would look like this:

<tree>
  <node pattern="AB" index="1">
    <parentnode coindex="7"/> <!-- ABC -->
    <parentnode coindex="8"/> <!-- ABD -->
  </node>
  <node pattern="AC" index="2">
    <parentnode coindex="7"/> <!-- ABC -->
    <parentnode coindex="9"/> <!-- ACD-->
  </node>
  <node pattern="AD" index="3">
    <parentnode coindex="8"/> <!-- ABD -->
    <parentnode coindex="9"/> <!-- ACD -->
  </node>
  <node pattern="BC" index="4">
    <parentnode coindex="7"/> <!-- ABC -->
    <parentnode coindex="10"/> <!-- BCD -->
  </node>
  <node pattern="BD" index="5">    
      <parentnode coindex="8"/> <!-- ABD -->
      <parentnode coindex="10"/> <!-- BCD -->
  </node>
  <node pattern="CD" index="6">    
      <parentnode coindex="9"/> <!-- ACD -->
      <parentnode coindex="10"/> <!-- BCD -->
  </node>
  <node pattern="ABC" index="7">    
    <parentnode coindex="11"/> <!-- ABCD -->
  </node>
  <node pattern="ABD" index="8">    
    <parentnode coindex="11"/> <!-- ABCD -->    
  </node>
  <node pattern="ACD" index="9">   
    <parentnode coindex="11"/> <!-- ABCD -->    
  </node>
  <node pattern="BCD" index="10">    
    <parentnode coindex="11"/> <!-- ABCD -->    
  </node>
  <node pattern="ABCD" index="11" isMain="true"/>
</tree>

By doing this, I guess one could get all patterns that are linked together. For instance, if I'd now look for AB, I'd hope to get to that nodes children (parentnode) and take those indexes and use those to look for the immediate parents. The process should then repeat until no elements are left with a coindex (e.g. ABCD in this case).

Imagine that there are thousands of XML elements like this, and that the main ones are indicated with isMain. Note that isMain is not necessarily a pattern that has no parentnode children! The result is an accumulation of all original XML trees. Meaning that in some cases a pattern may be the main one, whereas in others it may be part of the combinations. In such cases the node is indicated with isMain because in the original XML there is 'some tree that has this pattern as its main pattern'.

This is just my idea so far, but I am not sure if there is a better way or, more importantly, if this can be easily queried in BaseX. Basically with a given input AB I want to retrieve all relevant patterns (ABC, ABD, ABCD) by using the indexes and then filter these results by only retrieving the one(s) where isMain="true". I am looking forward to seeing better ideas, and ways to query them in BaseX. If my idea is a good one, then I would like to see a good way to capture the query in a single xquery expression.

1

There are 1 answers

3
dirkk On BEST ANSWER

I don't quite get what you are trying to achieve by putting your hierarchical data into a flat structure and than still using BaseX, an efficient XML processor.

I think it is much more natural (and faster) to put the data into the logical structure it already represents and use indexes to query the data efficiently.

So you simply use a hierarchical structure as-is, e.g.:

<tree>
    <node pattern="ABCD">
      <node pattern="ABC">
        <node pattern="AB"/>
        <node pattern="AC"/>
        <node pattern="BC"/>
      </node>
      <node pattern="ABD">
        <node pattern="AB"/>
        <node pattern="AD"/>
        <node pattern="BD"/>
      </node>
    </node>
</tree>

So, I would guess you chose that format because of performance reasons. But when you build your database you should create an attribute index, i.e. all your attribute values will be indexed. Normally, for the more usual queries it should be rewritten, but you can use the attribute index directly. For example, I have a 50GB+ database containing a dump of the english wikipedia. For example, I choose to search for a page List of Dragon Ball characters:

db:attribute("wikipedia", "List of Dragon Ball characters")/parent::*

This took roughly 20ms to execute on my machine.

So similarly you could simply search for the pattern and go up your tree:

db:attribute("<YOUR_DB>", "AB")/ancestor::node/@pattern => distinct-values()

Given that you first utilize the index and then simply go up the parents this should be as fast as it gets.