Query Optimization Problems (spatial)

185 views Asked by At

I have two datasets with spatial data.

Dataset 1 has approximately 15,000,000 records. Dataset 2 has approximately 16,000,000 records.

Both are using the data type geography (GPS coordinates) and all records are points.

Both tables have spatial indexes with cells_per_object = 1 and the levels are (HIGH, HIGH, HIGH, HIGH)

All points are located in a, globally speaking, small area (1 U.S. state). The points are spread out enough to warrant using geography rather than a projection to geometry.

DECLARE @g GEOGRAPHY
SET @g = (SELECT TOP 1 GPSPoint FROM Dataset1)

EXEC sp_help_spatial_geography_index 'Dataset1', 'Dataset1_SpatialIndex', 0, @g

Shows

propvalue-propname
       1-Total_Number_Of_ObjectCells_In_Level0_For_QuerySample
   28178-Total_Number_Of_ObjectCells_In_Level1_In_Index
       1-Total_Number_Of_ObjectCells_In_Level4_For_QuerySample
14923330-Total_Number_Of_ObjectCells_In_Level4_In_Index
       1-Total_Number_Of_Intersecting_ObjectCells_In_Level1_In_Index
       1-Total_Number_Of_Intersecting_ObjectCells_In_Level4_For_QuerySample
14923330-Total_Number_Of_Intersecting_ObjectCells_In_Level4_In_Index
       1-Total_Number_Of_Border_ObjectCells_In_Level0_For_QuerySample
   28177-Total_Number_Of_Border_ObjectCells_In_Level1_In_Index
     740-Number_Of_Rows_Selected_By_Primary_Filter
       0-Number_Of_Rows_Selected_By_Internal_Filter
     740-Number_Of_Times_Secondary_Filter_Is_Called
       1-Number_Of_Rows_Output
99.99504-Percentage_Of_Rows_NotSelected_By_Primary_Filter
       0-Percentage_Of_Primary_Filter_Rows_Selected_By_Internal_Filter
       0-Internal_Filter_Efficiency
0.135135-Primary_Filter_Efficiency  

Which means that the query

DECLARE @g GEOGRPAHY
SET @g = (SELECT TOP 1 GPSPoint FROM Dataset1)

SELECT TOP 1
  *
FROM
  Dataset2 D
WHERE
  @g.Filter(D.GPSPoint.STBuffer(1)) = 1

Takes almost an hour to complete.

I have also tried doing

WITH TABLE1 AS (
  SELECT
    A.RecordID,
    B.RecordID,
    RANK() OVER (PARTITION BY A.RecordID ORDER BY A.GPSPoint.STDistance(B.GPSPoint) ASC) AS 'Ranking'
  FROM
    Dataset1 A
  INNER JOIN
    Dataset2 B
  ON
    B.GPSPoint.Filter(A.GPSPoint.STBuffer(1)) = 1
    AND A.GPSPoint.STDistance(B.GPSPoint) <= 50
)

SELECT 
  *
FROM
  TABLE1
WHERE
  Ranking = 1

Which ends up being about 1,000 times faster, but at that rate what I am trying to do will take a query running for six months to complete. I honestly do no know what to do at this point. The end goal is to do a nearest neighbor search for every record in dataset1 to find the closest point in dataset2, but like this it seems impossible.

Does anyone have any ideas where I could improve the efficiency of this process?

1

There are 1 answers

0
g2server On BEST ANSWER

Try this: It is based on recommendations on MSDN.

  SELECT TOP(1)
    A.RecordID,
    B.RecordID,
    A.GPSPoint.STDistance(B.GPSPoint) AS Distance
  FROM
    Dataset1 A
  INNER JOIN
    Dataset2 B
  ON
    A.GPSPoint.STDistance(B.GPSPoint) <= 50
    AND B.GPSPoint IS NOT NULL
  ORDER BY BY A.GPSPoint.STDistance(B.GPSPoint) ASC

Note I have removed this, try the query above first, then add these predicates and see how it effects the indexing.

B.GPSPoint.Filter(A.GPSPoint.STBuffer(1)) = 1 
    AND 
//or try B.GPSPoint.STIntersects(A.GPSPoint.STBuffer(1)) = 1

The following requirements must be met for a Nearest Neighbor query to use a spatial index:

  • A spatial index must be present on one of the spatial columns and the STDistance() method must use that column in the WHERE and ORDER BY clauses.
  • The TOP clause cannot contain a PERCENT statement.
  • The WHERE clause must contain a STDistance() method
  • If there are multiple predicates in the WHERE clause then the predicate containing STDistance() method must be connected by an AND conjunction to the other predicates. The STDistance() method cannot be in an optional part of the WHERE clause.
  • The first expression in the ORDER BY clause must use the STDistance() method.
  • Sort order for the first STDistance() expression in the ORDER BY clause must be ASC.
  • All the rows for which STDistance returns NULL must be filtered out.