Find Distance to Nearest GPS Coordinates (Nearest Neighbors Search)

1.9k views Asked by At

I have a dataframe with tuples of latitudes and longitudes as below (sample of actual coordinates):

    id    latlon             
67  79    (39.1791764701497, -96.5772313693982)
68  17    (39.1765194942359, -96.5677757455844)
69  76    (39.1751440428827, -96.5772939901891)
70  58    (39.175359525189, -96.5691986655256)
71  50    (39.1770962912298, -96.5668107589661)

I want to find the id and the distance of the nearest latlon in the same dataframe (for illustration, I'm just making up numbers below in nearest_id and nearest_dist columns):

    id    latlon                                  nearest_id  nearest_dist
67  79    (39.1791764701497, -96.5772313693982)   17          37          
68  17    (39.1765194942359, -96.5677757455844)   58          150           
69  76    (39.1751440428827, -96.5772939901891)   50          900          
70  58    (39.175359525189, -96.5691986655256)    17          12          
71  50    (39.1770962912298, -96.5668107589661)   79          4      

I have a large number (45K+) of coordinates on which I want to perform this operation.

Here is my attempted solution below, using great_circle from geopy.distances:

def great_circle_dist(latlon1, latlon2):
    """Uses geopy to calculate distance between coordinates"""
    return great_circle(latlon1, latlon2).meters

def find_nearest(x):
        """Finds nearest neighbor """
        df['distances'] = df.latlon.apply(great_circle_dist, args=(x,))
        df_sort = df.sort_values(by='distances')
        return (df_sort.values[1][0], df_sort.values[1][2])

df['nearest'] = df['latlon'].apply(find_nearest)
df['nearest_id'] = df.nearest.apply(lambda x: x[0])
df['nearest_dist'] = df.nearest.apply(lambda x: x[1])
del df['nearest']
del df['distances']

What can be done to make this calculation efficiently?

3

There are 3 answers

0
daphshez On BEST ANSWER

Spatial indexing should help.

You can achieve spatial indexing using a database (e.g. Postgres with PosGIS extension), but you can also have an in-memory solution.

Have a look at the Rtree library. You will need to create an index, add all your points to the index, and then query the index using the nearest method.

0
Bootstrap On

You can do this with PostGIS/PostgreSQL efficiently but then you would have to get your data into an sql table which might be difficult. You can issue postgresql commands from python but you still have to set up the backend. Hopefully someone will be able to give you tips on how to use this just using python.

4
Alz On

'scipy.spatial' has many useful (and extremely fast) algorithm for spatial search. The one that seems to be the right tool for your problem is 'cKDTree'.

tree = cKDTree(data)

Data should be a numpy array of shape n*2 (it can calculate distance in n dimentional space, but in this case we have two dimensions)

Then you can query the tree for k nearest neighbors:

dist, idx = tree.query(x, k=1)

Using the index, it should be trivial to get the id. I answered a similar question here. Also check out comments for information about projection.