I am trying to find the minimum distance from each customer to the nearest store. Currently, there are ~1,500 stores and ~670K customers in my data, so I have to calculate the geographic distance for 670K customers x 1,500 stores and take the minimum for each customer.
I have created the haversine function below:
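
    import numpy as np

    def haversine_np(lon1, lat1, lon2, lat2):
        """Great-circle distance in miles between points given in decimal degrees."""
        # Convert degrees to radians before applying the haversine formula
        lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
        c = 2 * np.arcsin(np.sqrt(a))
        # 6367 is the Earth radius in km; dividing by 1.609 converts km to miles
        miles = 6367 * c/1.609
        return miles
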
My data set looks like the tables below: one data frame for the customers (`cst_geo`) and one data frame for the stores (`store_geo`). The numbers below are made up, as I can't share a snippet of the real data:

| Customer ID | Latitude | Longitude |
|---|---|---|
| A123 | 39.342 | -40.800 |
| B456 | 38.978 | -41.759 |
| C789 | 36.237 | -77.348 |

| Store ID | Latitude | Longitude |
|---|---|---|
| S1 | 59.342 | -60.800 |
| S2 | 28.978 | -71.759 |
| S3 | 56.237 | -87.348 |

I wrote a for loop below to attempt this calculation, but it took >8 hours to run. I have tried to use `deco` but wasn't able to optimize it any further.

    mindist = []
    for i in cst_geo.index:
        dist = []
        for j in store_geo.index:
            dist.append(haversine_np(cst_geo.longitude[i], cst_geo.latitude[i],
                                     store_geo.longitude[j], store_geo.latitude[j]))
        mindist.append(min(dist))

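
The nested loop above calls `haversine_np` once per customer-store pair, which is roughly 670K x 1,500, or about 1 billion, scalar calls. Because the function is built from NumPy ufuncs, it also accepts arrays, so one possible way to cut the runtime is to broadcast a chunk of customers against all stores at once. The following is only a sketch under assumptions: `nearest_store_distance` and its `chunk` parameter are hypothetical names, not part of the original post, and the sample values are the made-up ones from the tables.

```python
import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    """Great-circle distance in miles; inputs are degrees and may be broadcastable arrays."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return 6367 * 2 * np.arcsin(np.sqrt(a)) / 1.609

def nearest_store_distance(cst_lon, cst_lat, store_lon, store_lat, chunk=2_000):
    """Hypothetical helper: minimum distance (miles) from each customer to any store."""
    cst_lon = np.asarray(cst_lon, dtype=float)
    cst_lat = np.asarray(cst_lat, dtype=float)
    # Row vectors of shape (1, n_stores) so they broadcast against customer columns
    store_lon = np.asarray(store_lon, dtype=float)[None, :]
    store_lat = np.asarray(store_lat, dtype=float)[None, :]
    out = np.empty(cst_lon.shape[0])
    # Process customers in chunks to keep the (chunk, n_stores) matrix small in memory
    for start in range(0, cst_lon.shape[0], chunk):
        stop = start + chunk
        d = haversine_np(cst_lon[start:stop, None], cst_lat[start:stop, None],
                         store_lon, store_lat)
        out[start:stop] = d.min(axis=1)
    return out

# Made-up sample values from the tables in the question
cst_lat = [39.342, 38.978, 36.237]
cst_lon = [-40.800, -41.759, -77.348]
store_lat = [59.342, 28.978, 56.237]
store_lon = [-60.800, -71.759, -87.348]

mindist = nearest_store_distance(cst_lon, cst_lat, store_lon, store_lat)
print(mindist)
```

With the data frames from the question, the inputs would presumably be `cst_geo.longitude.to_numpy()`, `cst_geo.latitude.to_numpy()`, and the matching `store_geo` columns. The chunk size trades memory for fewer Python-level iterations and may need tuning.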
This can be done with `geopy`.

You can also get the distance in other units with `kilometers`, `miles`, `feet`, ...