I have a dataframe with tuples of latitudes and longitudes as below (sample of actual coordinates):
id latlon
67 79 (39.1791764701497, -96.5772313693982)
68 17 (39.1765194942359, -96.5677757455844)
69 76 (39.1751440428827, -96.5772939901891)
70 58 (39.175359525189, -96.5691986655256)
71 50 (39.1770962912298, -96.5668107589661)
I want to find the id
and the distance of the nearest latlon
in the same dataframe (for illustration, I'm just making up numbers below in nearest_id
and nearest_dist
columns):
id latlon nearest_id nearest_dist
67 79 (39.1791764701497, -96.5772313693982) 17 37
68 17 (39.1765194942359, -96.5677757455844) 58 150
69 76 (39.1751440428827, -96.5772939901891) 50 900
70 58 (39.175359525189, -96.5691986655256) 17 12
71 50 (39.1770962912298, -96.5668107589661) 79 4
I have a large number (45K+) of coordinates on which I want to perform this operation.
Here is my attempted solution below, using great_circle
from geopy.distances
:
def great_circle_dist(latlon1, latlon2):
"""Uses geopy to calculate distance between coordinates"""
return great_circle(latlon1, latlon2).meters
def find_nearest(x):
"""Finds nearest neighbor """
df['distances'] = df.latlon.apply(great_circle_dist, args=(x,))
df_sort = df.sort_values(by='distances')
return (df_sort.values[1][0], df_sort.values[1][2])
df['nearest'] = df['latlon'].apply(find_nearest)
df['nearest_id'] = df.nearest.apply(lambda x: x[0])
df['nearest_dist'] = df.nearest.apply(lambda x: x[1])
del df['nearest']
del df['distances']
What can be done to make this calculation efficiently?
Spatial indexing should help.
You can achieve spatial indexing using a database (e.g. Postgres with PosGIS extension), but you can also have an in-memory solution.
Have a look at the Rtree library. You will need to create an index, add all your points to the index, and then query the index using the
nearest
method.