I am experimenting with Blazegraph. I insert triples representing 'events'; each event consists of 3 triples and looks like this:
<%event-iri%> <http://predicates/timestamp> '2020-01-02T03:04:05.000Z'^^xsd:dateTime .
<%event-iri%> <http://predicates/a> %RANDOM_UUID% .
<%event-iri%> <http://predicates/b> %RANDOM_UUID% .
Timestamps represent consecutive moments in time: each event is 1 minute later than the previous one.
I ran two sets of tests: one with 1 million events (3 million triples) and one with 3 million events (9 million triples).
I run queries like the following:
select ?event ?a ?v
where {
?event <http://predicates/timestamp> ?timestamp .
filter (?timestamp >= '2020-01-02T03:04:05.000Z'^^xsd:dateTime && ?timestamp < '2020-01-02T03:05:05.000Z'^^xsd:dateTime)
?event ?a ?v .
}
I started with queries returning 1000 events (3000 triples) and then went down to queries that match only 1 event (and return 3 triples), to make sure that the size of the result set does not influence the range query performance itself too much.
I also tried adding a hint found here https://sourceforge.net/p/bigdata/discussion/676946/thread/2cf9a1e8/?limit=25 to tell Blazegraph that it should use range query optimization by adding the following
hint:Prior hint:rangeSafe "true" .
right after the filter clause.
It was also mentioned there that range queries work for some datatypes but not for others (for ints they worked for johpfe), so I ran another set of tests where timestamps are represented as ints (Unix timestamps):
<%event-iri%> <http://predicates/timestamp> 1606528746 .
<%event-iri%> <http://predicates/a> %RANDOM_UUID% .
<%event-iri%> <http://predicates/b> %RANDOM_UUID% .
The final query I tried was
select ?event ?a ?v
where {
?event <http://predicates/timestamp> ?timestamp .
filter (?timestamp >= 1606528746 && ?timestamp < 1606528806)
hint:Prior hint:rangeSafe "true" .
?event ?a ?v .
}
Whatever I try, I get the following results: for the smaller dataset (1 million timestamps/ints) queries take 1 second, sometimes more, but not less; for the bigger dataset (3 million timestamps/ints) queries take at least 3 seconds.
The difference is 3x, which perfectly correlates with 3x change of data volume. So it looks like the range optimization is not working.
I also compared against MongoDB. With an index on the 'timestamp' field, it always executes an analogous query in 30-50ms, regardless of the data size.
What am I doing wrong? Is there a way to make Blazegraph apply the optimization here?
PS. I also tried putting the hint right after the triple pattern rather than after the filter statement, as per https://github.com/blazegraph/database/wiki/QueryHints, which says the following about the rangeSafe hint:
Declare that the data touched by the query for a specific triple pattern is strongly typed, thus allowing a range filter to be pushed down onto an index.
So the query became
select ?event ?a ?v
where {
?event <http://predicates/timestamp> ?timestamp .
hint:Prior hint:rangeSafe "true" .
filter (?timestamp >= 1606528746 && ?timestamp < 1606528806)
?event ?a ?v .
}
But this query finds nothing, so the hint just breaks it.
Here are the queries where the optimization does work.
This one is for the case of integers:
And this one is for the case when timestamps are date-times:
What prevented the optimization from kicking in was writing ?timestamp >= 1606528746 instead of ?timestamp >= "1606528746"^^xsd:int. Strangely enough, this breaks the optimization.
Also, it turned out that it does not matter whether the hint contains true or "true": both options work.
Many thanks to @StanislavKralin for giving a working example, which I used to transform my queries into a working form.
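For reference, here is a minimal sketch of the integer query in the form described above: the filter bounds are written as explicitly typed xsd:int literals, and the hint sits right after the triple pattern, as the QueryHints wiki suggests. The hint placement is an assumption rather than a copy of the original working query, and the PREFIX lines are only there to make the query self-contained (Blazegraph may already predeclare both prefixes).

```sparql
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
PREFIX hint: <http://www.bigdata.com/queryHints#>

select ?event ?a ?v
where {
  ?event <http://predicates/timestamp> ?timestamp .
  # Assumed placement: hint:Prior attaches the hint to the triple pattern above.
  hint:Prior hint:rangeSafe true .
  # Bounds as explicitly typed literals instead of bare numbers.
  filter (?timestamp >= "1606528746"^^xsd:int && ?timestamp < "1606528806"^^xsd:int)
  ?event ?a ?v .
}
```

For the date-time case the same shape presumably applies, with both bounds written as '...'^^xsd:dateTime literals.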