Big data with spatial queries/indexing


Currently I am using SQL Server and having issues scaling an unrelated table of 100 million records that takes about a million writes every day; 32 GB of RAM is used up and the CPU sits at 80%-90% most of the time. This makes me wary of staying with SQL Server, but if possible I'd like to. I've been looking into MongoDB.

I have a new project that will require storing about 100 million spatial records, all polylines, each with 15 or so attributes associated with the geometry. I know SQL Server and MongoDB both support spatial indexing, with MongoDB using a geohash, which (from what I've read) is inferior to an R-tree index.

I think they balance out in terms of performance for the GIS data, because even though MongoDB has an inferior spatial index, I feel it will make up for it with the sheer read speed it has over SQL Server.

The real problem I have is that every single polyline will have time data tied to it, anywhere from 20 to 2,000 records per polyline depending on how well it scales. That's 2 billion records now, with the potential to grow to 200 billion.

The polyline data will not exceed 100 million records at about 1 KB per record (100 GB). If we store all the data unnormalized and don't care about duplicating the GIS data in an effort to avoid doing a JOIN, that is 2-200 TB, which is basically not manageable by me.

Therefore I think some normalization is needed: GIS data in one table/collection and the associated time data in another.

At a moment's notice, a map request will come in and request a tile, which should query all the GIS information for that bounding box (think spatial intersection); using that result, the AVERAGE of the timedata over a time range needs to be queried for the selected polylines. The dataset has to arrive together (JOINED) when it hits the map renderer. All of this will happen 12-20 times a second as the map is being panned, because the map colors the polylines according to the timedata.
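To make the shape of that per-tile query concrete, here is a rough T-SQL sketch of what needs to be answered on every request; the Polylines/TimeData table and column names are placeholders rather than my actual schema:

DECLARE @boundingBox geometry = geometry::STGeomFromText(
    'POLYGON((-97.80 30.20, -97.60 30.20, -97.60 30.40, -97.80 30.40, -97.80 30.20))', 4326);

SELECT p.Id, p.Geometry, agg.AvgMph
FROM Polylines p
INNER JOIN
(
    -- average speed per polyline over the requested time range
    SELECT PolylineId, AVG(Mph) AS AvgMph
    FROM TimeData
    WHERE RecordedAt >= DATEADD(MONTH, -3, SYSUTCDATETIME())
    GROUP BY PolylineId
) agg ON agg.PolylineId = p.Id
WHERE p.Geometry.STIntersects(@boundingBox) = 1;  -- the spatial intersection for the tile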

My question is, given MongoDB's spatial indexing performance, will narrowing 100 million records down to, say, 250,000 be an issue when using geoIntersects?

Then, once the 250,000 polylines are found, I would need to query the timedata associated with those 250,000 polylines with some WHERE clause, most likely a time range. Can MongoDB accomplish this sub-second, given that the collection will contain over 2 billion records?

Right now I am able to go from 2 million to 200,000 polylines in SQL Server 2012 in around 4 seconds using a spatial index. That is acceptable, but it doesn't take the timedata into consideration at all, and it's 50 times less data than there will be.

I feel like doing a JOIN operation in MongoDB would defeat the purpose of MongoDB and not yield better performance than SQL Server.

What database would you recommend for accomplishing this task?

Key Points:

  • Must support spatial indexes in order to query the GIS data correctly.

  • Data will only be written annually, basically 100% reads.

  • Most queries on timedata will need an AVERAGE within a time range.

  • Low load, only 2-10 users connected at any given time

The budget is around $1,000 a month for the server(s).

Edit:

The time data consists of the reported mph on that segment of road in 15-minute intervals. A user search would be, "I want to see the average speeds of this road on Fridays, in the past 3 months."

The map engine then renders the roads according to a legend based on average speeds. It needs to know what color each road/polyline will be, so if there are X roads in view on the map, it needs X values along with the X polylines.


There are 2 answers

Answer by AaronLS:

Data will only be written annually, basically 100% reads.

...most queries on timedata will need an AVERAGE within a time range

Those two things, 100% reads and an aggregation, sound like a data warehouse/star structure would be worth exploring. There are a lot of concepts to understand to build such a structure properly, but we can hit on a potential design.

The time data consists of the reported mph on that segment of road in 15-minute intervals. A user search would be, "I want to see the average speeds of this road on Fridays, in the past 3 months."

When you say 15-minute intervals, I am assuming we might have, say, 5 people who passed through that segment between 1:15 PM and 1:30 PM, and so have 5 records for that time period.

This is going to be an uncomfortable exercise for anyone who has never built a data warehouse before, but as someone who was very skeptical of these approaches and has since put them into practice, I've seen that you can get a huge performance boost. In other words, I am a skeptical convert. Usually you keep your normalized database for operational transactions, then nightly/weekly populate your data warehouse from it.

It is important to know what types of queries you will hit this with, because we design the star structure to lend itself to them. It doesn't severely constrain the queries, though, and you still have a lot of flexibility. There are lots of generic analytic/OLAP tools built on top of star structures, and their flexibility is evidence of this.

Date/Time Dimensions

The first thing we'll do is create a time dimension and a date dimension. Each row in the time dimension represents a 15-minute interval. I would document somewhere whether start/end are inclusive/exclusive, so any times that are on the fence are clearly included or excluded. The time dimension only has 96 rows, one for each 15-minute interval in a day.

Id,StartTime(inclusive),EndTime(exclusive)
 1, 0:00, 0:15
 2, 0:15, 0:30
...
95,23:30,23:45
96,23:45,24:00
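
If it helps, here is one way the 96 rows could be generated in T-SQL; the TimeDimension name and column types are illustrative rather than prescribed:

-- Build and fill the 96-row time dimension.
CREATE TABLE TimeDimension (
    Id        tinyint NOT NULL PRIMARY KEY,  -- 1..96
    StartTime time(0) NOT NULL,              -- inclusive
    EndTime   time(0) NOT NULL               -- exclusive; the final slot wraps to 00:00 (midnight)
);

;WITH Slots AS (
    SELECT TOP (96) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS Slot
    FROM sys.all_objects
)
INSERT INTO TimeDimension (Id, StartTime, EndTime)
SELECT Slot + 1,
       DATEADD(MINUTE, Slot * 15,       CAST('00:00' AS time(0))),
       DATEADD(MINUTE, (Slot + 1) * 15, CAST('00:00' AS time(0)))
FROM Slots;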

The date dimension can be designed a couple of different ways. To maximize analytical flexibility, we usually have a row for every real day that our data covers. This seems ridiculous to anyone with a normalized DB design background, but it's pretty standard practice in data warehouses, and there are literally entire chapters in data warehouse books that explain why. There are scripts out there that help you generate the entries in a date dimension. If your data starts in 2000 and you plan to keep loading the database over the next few years, then you would create entries for every single day from 2000 to 2020, which is only 7,300 rows (20 years × 365 days). Consider that this can easily be cached in a very small amount of memory.

Id,Date(date),Year(smallint),Month(tinyint),Day(tinyint),MonthName,MonthAbbreviation,DayOfWeekNumber(tinyint),DayOfWeekName....
1000,2015-05-15,2015,5,15,...
1001,2015-05-16,2015,5,16,...
1002,2015-05-17,2015,5,17,...

The reason for all of the extra columns, such as DayOfWeekNumber and DayOfWeekName, is to support very straightforward aggregations on those properties or combinations of them. It makes GROUP BY DayOfWeekNumber very easy, so you can do trend analysis in different ways.
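
As an illustration, here is roughly what a generation script might look like in T-SQL; the table name, column list, and ID scheme are assumptions, and the day-of-week formula is written to be Monday-based (Friday = 5) regardless of the server's DATEFIRST setting, to match the query further down:

CREATE TABLE DateDimension (
    Id                smallint    NOT NULL PRIMARY KEY,
    [Date]            date        NOT NULL,
    [Year]            smallint    NOT NULL,
    [Month]           tinyint     NOT NULL,
    [Day]             tinyint     NOT NULL,
    MonthName         varchar(10) NOT NULL,
    MonthAbbreviation char(3)     NOT NULL,
    DayOfWeekNumber   tinyint     NOT NULL,  -- 1 = Monday ... 5 = Friday ... 7 = Sunday
    DayOfWeekName     varchar(10) NOT NULL
);

;WITH Days AS (
    SELECT CAST('2000-01-01' AS date) AS d
    UNION ALL
    SELECT DATEADD(DAY, 1, d) FROM Days WHERE d < '2020-12-31'
)
INSERT INTO DateDimension
SELECT ROW_NUMBER() OVER (ORDER BY d),                      -- Id
       d, YEAR(d), MONTH(d), DAY(d),
       DATENAME(MONTH, d), LEFT(DATENAME(MONTH, d), 3),
       ((DATEPART(WEEKDAY, d) + @@DATEFIRST - 2) % 7) + 1,  -- Monday-based day of week
       DATENAME(WEEKDAY, d)
FROM Days
OPTION (MAXRECURSION 0);                                    -- the CTE recurses once per day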

Poly Dimension

For a poly dimension, you'd have one row per road segment. I made this choice because multiple time entries share a single poly segment, and thus we want polyIds in the fact table below to group on.

Speed Facts Table

This table is going to be the one with the huge number of records. Each row in your fact table should be as tiny as possible. This maximizes I/O throughput, speed of aggregation, and the ability to cache as much as possible in memory.

For example, DateId should be a smallint, since 2 bytes is enough to represent 32,767 IDs, far more than the 7,300 needed for 20 years of data, with room for 60 more years. TimeId would be a tinyint. People will say storage is cheap, but that's not the driving factor: I/O throughput and cache utilization are why a small row size (and thus more rows per page) is important.

RoadSegmentId, TimeId, DateId, Speed
1,1,1,45
1,1,1,47
1,1,1,92
1,2,1,55
1,2,1,67
1,2,1,91
2,1,1,55
2,2,1,67
2,2,1,91
...
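
A rough sketch of what the dimension and fact tables might look like follows; the names, types, and clustered index choice are assumptions to tune against your data, not a prescription:

CREATE TABLE RoadSegmentDimension (
    Id      int      NOT NULL PRIMARY KEY,  -- clustered PK, also required if you add a spatial index later
    Polygon geometry NOT NULL
    -- ...plus the ~15 descriptive attributes, which stay out of the fact table
);

CREATE TABLE SpeedFacts (
    RoadSegmentId int      NOT NULL REFERENCES RoadSegmentDimension (Id),
    TimeId        tinyint  NOT NULL,  -- 1..96 fifteen-minute slots
    DateId        smallint NOT NULL,  -- ~7,300 dates covers 20 years
    Speed         tinyint  NOT NULL   -- mph; each fact row is only 8 bytes wide
);

-- Ordering the clustered index by DateId first keeps a date-restricted read on contiguous pages.
CREATE CLUSTERED INDEX CIX_SpeedFacts ON SpeedFacts (DateId, RoadSegmentId, TimeId);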

Querying

"I want to see the average speeds of this road on fridays, in the past 3 months"

Select rsd.Id, rsd.Polygon, agg.AvgSpeed
From RoadSegmentDimension rsd
Inner Join
(
    Select f.RoadSegmentId, Avg(f.Speed) As AvgSpeed
    From SpeedFacts f
    Inner Join DateDimension dd on f.DateId = dd.Id
    Where dd.DayOfWeekNumber = 5 and dd.Date > '2015-03-06' --you could of course get the current date and subtract 3 months to make this more generic
    Group By f.RoadSegmentId
) agg on agg.RoadSegmentId = rsd.Id
Where rsd.Polygon.STIntersects(@boundingBox) = 1

(The aggregation is pushed into a derived table because SQL Server can't GROUP BY a geometry column, and STIntersects has to be compared to 1 in the WHERE clause.)

There are a few things to notice. How SQL Server chooses to optimize this is hard to say without testing, but the structure allows for a path that seems efficient:

  1. Select from DateDimension the rows with Date greater than 2015-03-06 and DayOfWeekNumber 5. These columns should have an index (a sketch follows this list), and page reads should have good utilization because the rows' physical ordering will place them in the same pages. This produces roughly 12 rows, one for each week.
  2. Now SQL Server has only about 12 DateIds, and can use the index on SpeedFacts.DateId to narrow down which rows/pages from SpeedFacts to read.
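
Here is one sketch of those supporting indexes; the names are illustrative, and the second index is only needed if SpeedFacts isn't already clustered on DateId:

CREATE NONCLUSTERED INDEX IX_DateDimension_DayOfWeek_Date
    ON DateDimension (DayOfWeekNumber, [Date]);

CREATE NONCLUSTERED INDEX IX_SpeedFacts_DateId
    ON SpeedFacts (DateId)
    INCLUDE (RoadSegmentId, Speed);  -- covers the aggregation without touching the base rows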

The SpeedFacts.DateId index is a lot smaller than if we had a SpeedFacts.Date column of an actual date type. So the index over those millions of rows is much smaller, and SQL Server can read through it much faster, narrowing down the number of rows it is interested in from the fact table.

The only thing that worries me is that, depending on the patterns in your data, this probably doesn't eliminate the need to perform STIntersects on every single polygon in the RoadSegmentDimension. If you have sparse speed measurements, where a lot of segments don't have any readings for certain timeframes, then it's possible the third step, joining to RoadSegmentDimension, will eliminate a lot of polygons, and applying STIntersects will only be necessary on the remaining ones.

Regardless, the hope is that it can narrow down the number of fact rows very efficiently before performing the spatial comparison.
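
A spatial index on the polygon column should also keep that STIntersects filter from having to scan every geometry; a minimal sketch, assuming the dimension table above and placeholder bounds you would replace with your data's actual coordinate extent:

CREATE SPATIAL INDEX SIX_RoadSegmentDimension_Polygon
    ON RoadSegmentDimension (Polygon)
    USING GEOMETRY_AUTO_GRID                      -- available in SQL Server 2012+
    WITH (BOUNDING_BOX = (-180, -90, 180, 90));   -- placeholder extent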

Either way, I think you will see a significant performance gain with this kind of structure over more traditional structures, and I would bet that gain is greater than any difference between the two engines.

I don't know a lot about MongoDB, but I've used other flavors of NoSQL/document/JSON databases, and I'm not really sure that kind of engine lends itself well to this type of analysis.

Answer by chaotic3quilibrium:

If you end up wanting to explore the Geohash route more deeply, here's a more fleshed-out implementation of Geohash-related functions for T-SQL that you might be interested in.

As to the debate between the performance of R-tree indexes and integer-based Geohashes, I have had varied experiences in big data scenarios. The trade-off is the same as the one in software engineering between indexed arrays, hash tables, and trees: each has use cases in which it is superior to the other two. The same holds for R-tree indexes versus Geohash clustering.