Proper usage of VALUES in federated queries

Note: possible GraphDB bug (see comments)

I have this knowledge base in GraphDB:

PREFIX : <http://my_awesome_cats_collection#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>


:foo a :cat ;
     :name 'Marble' ;
     owl:sameAs wd:Q27745011 .
# and many other cats

I tried this federated query:

select * where { 
    # remote service
    SERVICE <https://query.wikidata.org/sparql> {
        ?cat wdt:P463 ?membership
    }

    ?cat :name ?name .
    VALUES ?name {'Marble'}

} 

and I got the expected results from Wikidata (i.e., Marble is a member of Musashi's).

If I switch the order of the patterns like this:

select * where { 

    ?cat :name ?name .
    VALUES ?name {'Marble'}

    # remote service
    SERVICE <https://query.wikidata.org/sparql> {
        ?cat wdt:P463 ?membership
    }
} 

I get many false-positive results: data about other cats belonging to Musashi's, whereas I'd like to get just Marble. (A kind of cross product between local and remote patterns, I guess.)

In the official doc of SPARQL 1.1, they say:

Implementers of SPARQL 1.1 Federated Query may use the VALUES clause to constrain the results received from a remote endpoint based on solution bindings from evaluating other parts of the query.

(The excerpt is informative, i.e., non-normative. Thanks to @TallTed for pointing this out.)

So, when federating, can VALUES only be used as a final filter? What is going on?

EDIT:

  • Queries are performed with GraphDB
  • It seems to be a bug in the GraphDB query optimizer (thanks to Stanislav Kralin)

Answer by vassil_momtchev (best answer)

The example you posted demonstrates one of the corner cases of the SPARQL specification: it combines several related topics that are, in my opinion, highly ambiguous. The details below explain the assumptions and design decisions taken in the GraphDB engine. Please note that other implementations may read the following parts of the specification differently:

Interplay of SERVICE and VALUES

The SPARQL 1.1 Federated Query specification has a non-normative section describing the expected behavior in this case:

Implementers of SPARQL 1.1 Federated Query may use the VALUES clause to constrain the results received from a remote endpoint based on solution bindings from evaluating other parts of the query.

GraphDB's query optimizer cannot retrieve any statistics from the remote SPARQL endpoint, so it naively sends the query to the remote SERVICE and joins the results locally. Query optimization is therefore left to the user, who knows the schema of both repositories and can rearrange the query procedurally (see below).
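One way to constrain the remote call explicitly, without relying on the optimizer, is to place VALUES inside the SERVICE block so that it is part of the query sent to Wikidata. The following is only a sketch: it assumes the Wikidata IRI is known in advance (here hard-coded from the owl:sameAs statement in the question) and that the local ?cat joins with that IRI through owl:sameAs, as the question's first query suggests.

```sparql
PREFIX : <http://my_awesome_cats_collection#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT * WHERE {
    # local, selective pattern
    ?cat :name 'Marble' .

    SERVICE <https://query.wikidata.org/sparql> {
        # VALUES sits inside the SERVICE body, so it is sent along with the remote query
        # (assumption: the IRI is copied by hand from the local owl:sameAs statement)
        VALUES ?cat { wd:Q27745011 }
        ?cat wdt:P463 ?membership
    }
}
```

With this form the remote endpoint only has to return the memberships of the listed entities, whatever the order of the outer patterns.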

Federated queries are sub-queries

Every remote query is treated as a sub-query and sent as is to the external endpoint. Here is the equivalent syntax:

# remote service
SERVICE <https://query.wikidata.org/sparql> {
    SELECT ?cat ?membership {
        ?cat wdt:P463 ?membership
    }
    LIMIT <put any limit>
}

Sub-queries are evaluated first and all variables are propagated bottom-up

According to the SPARQL specification, no variable bindings should be pushed in the sub-query from the outside:

Subqueries are a way to embed SPARQL queries within other queries, normally to achieve results which cannot otherwise be achieved, such as limiting the number of results from some sub-expression within the query.

Due to the bottom-up nature of SPARQL query evaluation, the subqueries are evaluated logically first, and the results are projected up to the outer query.

Note that only variables projected out of the subquery will be visible, or in scope, to the outer query.
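The scoping rule quoted above can be seen with a purely local query. This is a minimal sketch using the question's :name property; the LIMIT 1 is there only to make the effect visible.

```sparql
PREFIX : <http://my_awesome_cats_collection#>

SELECT * WHERE {
    VALUES ?name { 'Marble' }
    {
        # Evaluated on its own: the outer ?name binding is not visible here,
        # so LIMIT 1 keeps one arbitrary cat, not necessarily Marble.
        SELECT ?cat ?name WHERE {
            ?cat :name ?name
        }
        LIMIT 1
    }
}
```

Under the specification's bottom-up semantics, the sub-query result is joined with the VALUES binding only afterwards, so this query can return no rows even though Marble exists in the data.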

At this point, it is no longer possible to efficiently execute queries with a very selective local clause. That is why GraphDB exposes a special configuration parameter that breaks compliance with the SPARQL specification:

./graphdb -Dreuse.vars.in.subselects

In this case, the query engine will ignore the SPARQL spec and push variable bindings from the outer query into the sub-select. With this parameter enabled, the correct version of your query is:

PREFIX : <http://my_awesome_cats_collection#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

select * where {
    
    ?cat :name ?name .
    VALUES ?name {
        'Marble'
    }
    
    # remote service
    SERVICE <https://query.wikidata.org/sparql> {
        ?cat wdt:P463 ?membership
    }
}

How the user should optimize the query execution plan of remote endpoints

VALUES/BIND are procedural, and their position in the query is significant according to the SPARQL specification:

The BIND form allows a value to be assigned to a variable from a basic graph pattern or property path expression. Use of BIND ends the preceding basic graph pattern. The variable introduced by the BIND clause must not have been used in the group graph pattern up to the point of use in BIND.

Another form of the same query, much less efficient in this particular case, is to execute the remote endpoint query first (i.e., download all results from Wikidata) and then join them with the smaller local dataset:

PREFIX : <http://my_awesome_cats_collection#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

select * where {
    
    # remote service
    SERVICE <https://query.wikidata.org/sparql> {
        ?cat wdt:P463 ?membership
    }

    ?cat :name ?name .
    VALUES ?name {
        'Marble'
    }
}

I hope this gives you the full picture of how GraphDB interprets the SPARQL specification and of the available ways to optimize federated queries.