GeoMesa Query Performance


GeoMesa is a spatio-temporal database; more details are available here: http://www.geomesa.org/

I am trying the example tutorial with an HBase database set up behind it. I am running the HBase Quick Start tutorial: http://www.geomesa.org/documentation/tutorials/geomesa-quickstart-hbase.html The tutorial runs fine. Below are some problems I noticed in the performance of bounding box queries.

I have inserted data with (lat, lng) ranging from (30,60) to (35,65).

With this setup, I am running queries on my local machine:
a) In my first query, the bounding box is (30,60) to (30.1,60.1). It runs in less than a second on average and returns correct results.
b) In the second query, I changed the bounding box to (10,10) to (30.1,60.1). This query returns the same results as query (a), which is expected, but it takes around 3-4 seconds per query on average.

Both queries should give me the same results, yet one runs much faster than the other. I notice similar behavior in time-domain queries too, where the performance is even worse (10x slower or more) if the time ranges do not match the data inserted. Below are some of my questions:
1) Is this expected behavior?
2) I know one solution is to reformat the query so that it maps to the actual spatial and temporal ranges of the data inserted into GeoMesa, but that would require me to maintain additional metadata about the data. I think a better solution might be designed at the GeoMesa layer?
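
For reference, the query-rewriting workaround from point 2 could look roughly like the sketch below. It assumes the data extent is tracked separately by the application; the class name `BBoxClip`, its methods, and the geometry attribute name `geom` are hypothetical and not part of GeoMesa. It only clips the requested box to the known extent and formats an ECQL `BBOX` predicate (which takes longitude before latitude).

```java
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper: boxes are {minLon, minLat, maxLon, maxLat} in degrees. */
final class BBoxClip {

    /** Intersects the query box with the known data extent; empty if they do not overlap. */
    static Optional<double[]> clip(double[] query, double[] extent) {
        double west = Math.max(query[0], extent[0]);
        double south = Math.max(query[1], extent[1]);
        double east = Math.min(query[2], extent[2]);
        double north = Math.min(query[3], extent[3]);
        if (west > east || south > north) {
            return Optional.empty(); // nothing to query at all
        }
        return Optional.of(new double[] {west, south, east, north});
    }

    /** Formats an ECQL BBOX predicate; the attribute name is supplied by the caller. */
    static String toEcql(String geomAttribute, double[] box) {
        return String.format(Locale.ROOT, "BBOX(%s, %s, %s, %s, %s)",
                geomAttribute, box[0], box[1], box[2], box[3]);
    }

    public static void main(String[] args) {
        double[] dataExtent = {60, 30, 65, 35};     // lng 60..65, lat 30..35 (from the question)
        double[] wideQuery = {10, 10, 60.1, 30.1};  // query (b)
        clip(wideQuery, dataExtent)
                .map(box -> toEcql("geom", box))    // "geom" is an assumed attribute name
                .ifPresent(System.out::println);    // prints BBOX(geom, 60.0, 30.0, 60.1, 30.1)
    }
}
```

With the extent from the question, query (b) is reduced to the same box as query (a), at the cost of keeping that extent metadata up to date.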

Do let me know if there are any settings that can affect this behavior. I have seen the same behavior on multiple other local machines and on cloud VMs running GeoMesa.

Emilio Lahr-Vivaz (best answer)

In general, GeoMesa still has to scan where there might be data, even if there isn't actually any data there. Opening a scan, even if it returns no data, takes some time. For temporal queries, the number of ranges tends to be even larger, hence the slower performance.
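
To make the temporal case concrete, here is a rough sketch of how the number of time bins a query touches grows with the query's time span. It is not GeoMesa's planner: it assumes fixed-length bins counted from the Unix epoch, and the class `TimeBins` and the 7-day and 30-day periods are illustrative choices only. Each touched bin would need its own set of scan ranges.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical illustration of time binning; not GeoMesa code. */
final class TimeBins {

    /** Number of fixed-length bins, counted from the Unix epoch, that [start, end] touches. */
    static long binsTouched(Instant start, Instant end, Duration period) {
        long periodMillis = period.toMillis();
        long firstBin = Math.floorDiv(start.toEpochMilli(), periodMillis);
        long lastBin = Math.floorDiv(end.toEpochMilli(), periodMillis);
        return lastBin - firstBin + 1;
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2017-01-01T00:00:00Z");
        Instant end = Instant.parse("2017-12-31T23:59:59Z");
        // A one-year query touches many more 7-day bins than 30-day bins.
        System.out.println(binsTouched(start, end, Duration.ofDays(7)));
        System.out.println(binsTouched(start, end, Duration.ofDays(30)));
    }
}
```

Under these assumptions, a query window much wider than the stored data still touches every bin in the window, whether or not the bin holds any rows.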

I believe that Accumulo handles this a bit better than HBase, as it has a concept of a batch scanner that accepts multiple ranges, and it has some knowledge of the data start/end. For HBase, GeoMesa has to run multiple scans using a thread pool, so it's not as efficient.

GeoMesa also has the concept of data statistics, but it hasn't been implemented for HBase yet, and it's not currently used in query planning.

To mitigate the issue, you can try increasing the "queryThreads" data store parameter to use more threads during queries. You can also enable "looseBoundingBox", if you have currently disabled it. For temporal queries, increasing the temporal binning period may cause fewer ranges to be scanned. However, this may result in slower queries for very small temporal ranges, so it should be tailored to your use case.
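
As a rough sketch, the two data store parameters could be added to the parameter map that the quick start passes to GeoTools' `DataStoreFinder.getDataStore(...)`. The keys below are copied verbatim from this answer; the exact key names may differ between GeoMesa versions, so check the documentation for yours. The class name `QueryTuningParams` and the values shown are illustrative assumptions, and the connection parameters (catalog table and so on) are left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper that builds only the tuning-related data store parameters. */
final class QueryTuningParams {

    static Map<String, String> build(int queryThreads, boolean looseBoundingBox) {
        Map<String, String> params = new HashMap<>();
        // Key names as quoted in the answer; verify them against your GeoMesa version.
        params.put("queryThreads", Integer.toString(queryThreads));
        params.put("looseBoundingBox", Boolean.toString(looseBoundingBox));
        return params;
    }

    public static void main(String[] args) {
        // Example values only; merge these with your connection parameters
        // before handing the map to DataStoreFinder.getDataStore(params).
        Map<String, String> params = build(16, true);
        params.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Whether more threads actually help depends on the machine and the cluster, so the thread count is worth measuring rather than guessing.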

As a final note, make sure that you have the distributed coprocessors installed and enabled, especially if you are not using loose bounding boxes.