Selecting triples from a specific graph in MarkLogic 7

I need to provide isolation between similar triples in different graphs (collections) in MarkLogic. For this to work I have to specify which graph I want the triples to be retrieved from, and my approach is this:

cts:triples(
  (),
  sem:iri("http://something/predicate#somepredicate"), "SomeObject", (), (),
  cts:collection-query("someCollection") )  

This works, but it performs poorly because of the collection-query. Are there any better ways to limit results to only those in a given graph?

There are 2 answers

1
mblakele On BEST ANSWER

I tried to create a test case for this, using 7.0-4 on my laptop. It seems pretty fast to me: take a look and see where it's different from what you're doing. My guess is that your query returns many triples, and that's the bottleneck. Matching triples is very fast, but returning large numbers of them can be relatively slow.

First let's use taskbot to generate some triples.

(: insert test documents with taskbot :)
import module namespace tb="ns://blakeley.com/taskbot"
  at "src/taskbot.xqm" ;
import module namespace sem="http://marklogic.com/semantics" 
  at "MarkLogic/semantics.xqy";

tb:list-segment-process(
  (: Total size of the job. :)
  1 to 1000 * 1000,
  (: Size of each segment of work. :)
  500,
  (: Label. :)
  "test/triples",
  (: This anonymous function will be called for each segment. :)
  function($list as item()+, $opts as map:map?) {
    (: Any chainsaw should have a safety. Check it here. :)
    tb:maybe-fatal(),
    let $triples := $list ! sem:triple(
      sem:iri("subject"||xdmp:random()),
      sem:iri("predicate"||xdmp:random(19)),
      "object"||xdmp:random(49),
      sem:iri('graph'||xdmp:random(9)))
    return sem:rdf-insert($triples)
    ,
    (: This is an update, so be sure to commit each segment. :)
    xdmp:commit() },
  (: options - not used in this example. :)
  map:new(map:entry('testing', '123...')),
  (: This is an update, so be sure to say so. :)
  $tb:OPTIONS-UPDATE)

Now, taskbot does most of the work on the Task Server, so monitor ErrorLog.txt or just wait for CPU usage to drop and the triple count to reach 1M. After that, let's see what we loaded:

count(cts:triples()),
count(cts:triples((), sem:iri("predicate0"))),
count(cts:triples((), (), "object0")),
count(
  cts:triples((), (), (), (), (), cts:collection-query("graph0")))
=>
1000000
49977
19809
100263

You might get different counts for the predicate, object, and collection: remember that the data was generated randomly. But let's try a query.

count(
  cts:triples(
    (), sem:iri("predicate0"), "object0",
    (), (), cts:collection-query("graph0")))
, xdmp:elapsed-time()

Results:

100
PT0.004991S

That seems pretty fast to me: about 5 ms. You might get a different count because the data was generated randomly, but it should be close.

Now, a larger result set will slow this down. For example:

count(
  cts:triples(
    (), (), (),
    (), (), cts:collection-query("graph0")))
, xdmp:elapsed-time()
=>
100263
PT0.371252S

count(cts:triples())
, xdmp:elapsed-time()
=>
1000000
PT2.906235S

count(cts:triples()[1 to 1000])
, xdmp:elapsed-time()
=>
1000
PT0.002707S

As you can see, the response time is roughly O(n) in the number of triples returned. Actually it's a little better than O(n), but in that ballpark. In any case the cts:collection-query doesn't look like the problem.

0
grtjn On

I'd be surprised if the collection-query were the poorly performing part. Don't be misled: merely returning many results can make the query seem slow. Wrap the expression in count or xdmp:estimate to exclude that cost.
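As a rough sketch of that suggestion, the query below times the match without returning the triples themselves. The collection name and predicate IRI are the placeholders from the question. Note one assumption: xdmp:estimate needs a searchable expression, so the sketch pairs it with cts:search and cts:triple-range-query, and the number it reports is matching fragments (documents), not triples.

```xquery
xquery version "1.0-ml";

(: Placeholder names taken from the question; substitute your own. :)
let $graph := cts:collection-query("someCollection")
let $pred := sem:iri("http://something/predicate#somepredicate")
return (
  (: Counts the matching triples without serializing them to the client. :)
  count(cts:triples((), $pred, "SomeObject", (), (), $graph)),
  (: Index-only estimate of matching fragments (documents), not triples. :)
  xdmp:estimate(
    cts:search(
      fn:collection(),
      cts:and-query((
        $graph,
        cts:triple-range-query((), $pred, "SomeObject"))))),
  (: Elapsed time for the whole request so far. :)
  xdmp:elapsed-time()
)
```

If both numbers come back quickly, the time is probably going into returning the result set rather than into the collection-query.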

Apart from cts:triples, I can only think of sem:sparql with FROM or GRAPH clauses.
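A minimal sketch of the SPARQL route, assuming the graph IRI is the same string as the collection name from the question ("someCollection") and that the object is stored as a plain string literal. The graph is passed in as a binding rather than written inline, to avoid relying on how a relative IRI would be parsed. Whether this is any faster than cts:triples is not established here; it is only an alternative way to scope the match to one graph.

```xquery
xquery version "1.0-ml";

(: "someCollection" and the predicate IRI are placeholders from the question. :)
sem:sparql('
  SELECT ?s
  WHERE {
    GRAPH ?g { ?s <http://something/predicate#somepredicate> "SomeObject" }
  }',
  (: Bind ?g to the graph IRI so that only that graph is matched. :)
  map:entry("g", sem:iri("someCollection")))
```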

HTH!