Jena ARQ/TDB Query Optimization

1.8k views Asked by At

I have a rather small graph containing roughly 500k triples. I've also generated the stats.opt file and running my code on a rather fast computer (quad core, 16gb ram, ssd drive.) But for the query I'm building with the help of the OP interface, it takes forever to iterate over the resultset. The resultset has about 15000 lines and the iteration takes 4s, which is unacceptable for endusers. Executing the query takes merely 90ms (I guess the real work is done by the cursor iteration?). Why is this so slow and what can I do to speed up the result set iteration?

Here is the query:

SELECT  ?apartment ?price ?hasBalcony ?lat ?long ?label ?hasImage ?park ?supermarket ?rooms ?area ?street
WHERE
  { ?apartment dssd:hasBalcony ?hasBalcony .
    ?apartment wgs84:lat ?lat .
    ?apartment wgs84:long ?long .
    ?apartment rdfs:label ?label .
    ?apartment dssd:hasImage ?hasImage .
    ?apartment dssd:hasNearby ?hasNearbyPark .
    ?hasNearbyPark dssd:hasNearbyPark ?park .
    ?apartment dssd:hasNearby ?hasNearbySupermarket .
    ?hasNearbySupermarket dssd:hasNearbySupermarket ?supermarket .
    ?apartment dssd:price ?price .
    ?apartment dssd:rooms ?rooms .
    ?apartment dssd:area ?area .
    ?apartment vcard:hasAddress ?address .
    ?address vcard:streetAddress ?street
    FILTER ( ?hasBalcony = true )
    FILTER ( ?price <= 1000.0e0 )
    FILTER ( ?price >= 650.0e0 )
    FILTER ( ?rooms <= 4.0e0 )
    FILTER ( ?rooms >= 3.0e0 )
    FILTER ( ?area <= 100.0e0 )
    FILTER ( ?area >= 60.0e0 )
  }    

(Is there a better way to query those bnodes: ?hasNearbyPark, ?hasNearbySupermarket)

And the code to execute the query:

dataset.begin(ReadWrite.READ);
Model model = dataset.getNamedModel("http://example.com");
QueryExecution queryExecution = QueryExecutionFactory.create(buildQuery(), model);
ResultSet resultSet = queryExecution.execSelect();
while ( resultSet.hasNext() ) {
    QuerySolution solution = resultSet.next(); ...
1

There are 1 answers

7
RobV On

On the ARQ Query Engine

First off you seem to be misunderstanding how the ARQ engine works:

ResultSet resultSet = queryExecution.execSelect();

All the above does is prepare a query plan for how the engine will evaluate the query, it does not actually evaluate the query hence why it is almost instantaneous.

Actual work on answering your question does not happen until you start calling hasNext() and next():

while ( resultSet.hasNext() ) {
   QuerySolution solution = resultSet.next(); ...

So the timings you quote are incorrect, the query takes 4s to evaluate because that is how long it takes to iterate over all results.

On your actual question

You haven't shown what your buildQuery() method does but you say you are building the query as a Op structure programmatically rather than as a string? If this is the case then the query engine may not actually be applying optimization though off the top of my head I don't think this will be the issue. You can try adding an op = Algebra.optimize(op); before you return the built Op but I don't know that this will make much difference.

It looks like the optimizer should do a good job just given the raw query (not that your query has much scope for optimization other than join reordering) but if you are building it programmatically then you may be building an unusual algebra which the optimizer struggles with.

Similarly I'm not sure if you stats.opt file will be honored because you query over a specific model rather than the TDB dataset so the query engine might be the general purpose rather than the TDB engine. I'm not an expert in TDB so I can't tell if this is the case or not.

Bottom Line

In general there is not enough information in your question to diagnose if there is an actual issue in your setup or if your query is just plain expensive. Reporting this as a minimal test case (minimal complete code plus sample data) to the [email protected] list for further analysis would be useful.

As a general comment on your query lots of range filters are expensive to perform which is likely where most of the time goes.