Note: possible GraphDB bug (see comments)
I have this knowledge base in GraphDB:
PREFIX : <http://my_awesome_cats_collection#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
:foo a :cat ;
    :name 'Marble' ;
    owl:sameAs wd:Q27745011 .
# and many other cats
I tried this federated query:
PREFIX : <http://my_awesome_cats_collection#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
select * where {
  # remote service
  SERVICE <https://query.wikidata.org/sparql> {
    ?cat wdt:P463 ?membership
  }
  ?cat :name ?name .
  VALUES ?name {'Marble'}
}
and I got the expected result from Wikidata (i.e., Marble is a member of Musashi's).
If I switch the order of the patterns like this:
PREFIX : <http://my_awesome_cats_collection#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
select * where {
  ?cat :name ?name .
  VALUES ?name {'Marble'}
  # remote service
  SERVICE <https://query.wikidata.org/sparql> {
    ?cat wdt:P463 ?membership
  }
}
I get many false positives: data for other cats that belong to Musashi's, whereas I want only Marble. It looks like a kind of cross product between the local and remote patterns.
The SPARQL 1.1 Federated Query specification says:
Federated Query may use the VALUES clause to constrain the results received from a remote endpoint based on solution bindings from evaluating other parts of the query.
(This excerpt comes from an informative, i.e. non-normative, section. Thanks to @TallTed for pointing this out.)
So, when federating, can VALUES only be used as a final filter? What is going on?
EDIT:
- The queries are run with GraphDB
- This appears to be a bug in the GraphDB query optimizer (thanks to Stanislav Kralin)
The example you have posted demonstrates one of the corner cases of the SPARQL specification, which combines multiple related topics and is, in my opinion, highly ambiguous. The details below explain the assumptions and design decisions made in the GraphDB engine. Please note that other implementations may read the following parts of the specification differently:
Interplay of SERVICE and VALUES
The SPARQL 1.1 Federated Query specification has a non-normative section describing what the behavior should be in this case:
GraphDB's query optimizer cannot retrieve any statistics from the remote SPARQL endpoint, so it naively sends the query to the remote SERVICE and joins the results locally. Query optimization is therefore in the hands of the user, who knows the schema of both repositories and can rearrange the query in a procedural way (see below).
Federated queries are sub-queries
Every remote query is treated as a sub-query and sent as is to the external endpoint. Here is the equivalent syntax:
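As a minimal sketch of that equivalence, using the pattern from the question (the explicit sub-select and the prefix declarations are my illustration, not GraphDB's literal rewrite), the SERVICE block behaves as if it were written like this:

```sparql
PREFIX : <http://my_awesome_cats_collection#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT * WHERE {
  ?cat :name ?name .
  VALUES ?name { 'Marble' }
  # Conceptually a sub-select: the remote endpoint receives only this
  # pattern and knows nothing about ?name or the local values of ?cat.
  SERVICE <https://query.wikidata.org/sparql> {
    SELECT ?cat ?membership WHERE {
      ?cat wdt:P463 ?membership
    }
  }
}
```

Written this way, it is easier to see that the remote part asks Wikidata for every subject that has a wdt:P463 value, which is far more than the single cat selected locally.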
Sub-queries are evaluated first and all variables are propagated bottom-up
According to the SPARQL specification, no variable bindings should be pushed in the sub-query from the outside:
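A small local-only example can make this scoping rule visible. It is only a sketch that assumes the cats data from the question, and the LIMIT is added purely to expose the evaluation order:

```sparql
PREFIX : <http://my_awesome_cats_collection#>

SELECT * WHERE {
  VALUES ?name { 'Marble' }
  # Under the standard semantics the sub-select is evaluated on its own,
  # before the join with VALUES. LIMIT 1 keeps an arbitrary cat, which is
  # not necessarily Marble, so the final join may return no rows at all.
  {
    SELECT ?cat ?name WHERE {
      ?cat :name ?name
    }
    LIMIT 1
  }
}
```

If the outer binding of ?name were pushed into the sub-select, the LIMIT would apply to Marble's rows only and the query would always return Marble.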
Under this rule it is no longer possible to efficiently execute queries that have a very selective local clause. That is why GraphDB exposes a special configuration parameter that deliberately breaks compliance with the SPARQL specification:
./graphdb -Dreuse.vars.in.subselects
In this case, the query engine will ignore the SPARQL spec and will push the variable bindings from the outer query into the sub-select. Your correct version of the query after enabling this parameter is:
How the user should optimize the query execution plan for remote endpoints
VALUES/BIND are procedural and their place is significant according to the SPARQL specification
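One way to use this, sketched here under the assumptions that wd:Q27745011 is the Wikidata entity for Marble (as in the question's data) and that owl:sameAs reasoning lets the local :name pattern match it, is to place VALUES inside the SERVICE block so that the restriction travels with the remote query:

```sparql
PREFIX : <http://my_awesome_cats_collection#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT * WHERE {
  SERVICE <https://query.wikidata.org/sparql> {
    # VALUES is part of the block sent to Wikidata, so the remote
    # endpoint evaluates the pattern for this one entity only.
    VALUES ?cat { wd:Q27745011 }
    ?cat wdt:P463 ?membership
  }
  # Local pattern; assumes owl:sameAs reasoning maps wd:Q27745011 to :foo.
  ?cat :name ?name .
}
```

This form does not rely on any non-standard engine setting, because the constraint is already inside the pattern that the remote endpoint receives.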
Another form of the same query, much less efficient in this particular case, is to first execute the remote endpoint query (i.e. download all results from Wikidata) and then join them with the smaller local dataset:
I hope this gives you the full picture of how GraphDB interprets the SPARQL specification and of the options available for optimizing federated queries.