I have two datasets with spatial data.
Dataset 1 has approximately 15,000,000 records. Dataset 2 has approximately 16,000,000 records.
Both are using the data type geography (GPS coordinates) and all records are points.
Both tables have spatial indexes with cells_per_object = 1 and the levels are (HIGH, HIGH, HIGH, HIGH)
All points are located in a, globally speaking, small area (1 U.S. state). The points are spread out enough to warrant using geography rather than a projection to geometry.
DECLARE @g GEOGRAPHY
SET @g = (SELECT TOP 1 GPSPoint FROM Dataset1)
EXEC sp_help_spatial_geography_index 'Dataset1', 'Dataset1_SpatialIndex', 0, @g
Shows
propvalue-propname
1-Total_Number_Of_ObjectCells_In_Level0_For_QuerySample
28178-Total_Number_Of_ObjectCells_In_Level1_In_Index
1-Total_Number_Of_ObjectCells_In_Level4_For_QuerySample
14923330-Total_Number_Of_ObjectCells_In_Level4_In_Index
1-Total_Number_Of_Intersecting_ObjectCells_In_Level1_In_Index
1-Total_Number_Of_Intersecting_ObjectCells_In_Level4_For_QuerySample
14923330-Total_Number_Of_Intersecting_ObjectCells_In_Level4_In_Index
1-Total_Number_Of_Border_ObjectCells_In_Level0_For_QuerySample
28177-Total_Number_Of_Border_ObjectCells_In_Level1_In_Index
740-Number_Of_Rows_Selected_By_Primary_Filter
0-Number_Of_Rows_Selected_By_Internal_Filter
740-Number_Of_Times_Secondary_Filter_Is_Called
1-Number_Of_Rows_Output
99.99504-Percentage_Of_Rows_NotSelected_By_Primary_Filter
0-Percentage_Of_Primary_Filter_Rows_Selected_By_Internal_Filter
0-Internal_Filter_Efficiency
0.135135-Primary_Filter_Efficiency
Which means that the query
DECLARE @g GEOGRPAHY
SET @g = (SELECT TOP 1 GPSPoint FROM Dataset1)
SELECT TOP 1
*
FROM
Dataset2 D
WHERE
@g.Filter(D.GPSPoint.STBuffer(1)) = 1
Takes almost an hour to complete.
I have also tried doing
WITH TABLE1 AS (
SELECT
A.RecordID,
B.RecordID,
RANK() OVER (PARTITION BY A.RecordID ORDER BY A.GPSPoint.STDistance(B.GPSPoint) ASC) AS 'Ranking'
FROM
Dataset1 A
INNER JOIN
Dataset2 B
ON
B.GPSPoint.Filter(A.GPSPoint.STBuffer(1)) = 1
AND A.GPSPoint.STDistance(B.GPSPoint) <= 50
)
SELECT
*
FROM
TABLE1
WHERE
Ranking = 1
Which ends up being about 1,000 times faster, but at that rate what I am trying to do will take a query running for six months to complete. I honestly do no know what to do at this point. The end goal is to do a nearest neighbor search for every record in dataset1 to find the closest point in dataset2, but like this it seems impossible.
Does anyone have any ideas where I could improve the efficiency of this process?
Try this: It is based on recommendations on MSDN.
Note I have removed this, try the query above first, then add these predicates and see how it effects the indexing.
The following requirements must be met for a Nearest Neighbor query to use a spatial index: