The Idea. I would like to build a function like:
location_affinity(user_a, user_b)
which establishes a location affinity between two users. In particular, this function will return a float between 0 (no affinity) and 1 (maximum affinity) indicating how closely the places user_a has been to correspond to the places user_b has been to. For example, if user_a ALWAYS stays with user_b and follows him to every place he goes, I expect a result of 1. If user_a lives far away from user_b and they have never even come close to each other, I expect a result of 0.
The Data. Each user has a list of points (latitude, longitude) where he has been; those points have already been extracted from the user's Facebook geotags. To visualize this: IMAGE
- Red "X"s are points (lat, lng) where user_a has been.
- Green "X"s are points (lat, lng) where user_b has been.
- The blue area represents the overlap.
The Question. Are there any known algorithms which, given two users' lists of map points, can establish the affinity (which I gather depends on the overlap area)? If not, which keywords should I search for?
Additional. I'm trying to build Python functions with Spark. Are there any integrations?
Thank you.
How about something like this:
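A minimal sketch of the one-sided version, assuming each user's points are given as an (n, 2) array-like and that plain Euclidean distance on the raw coordinates is good enough (for real latitude/longitude data you would probably want to project the points or use a great-circle distance). The name one_sided_affinity and the exact form exp(-d / c) are illustrative choices, not necessarily the original code:

```python
import numpy as np
from scipy.spatial.distance import cdist


def one_sided_affinity(points_a, points_b, c=1.0):
    """Mean of exp(-d / c), where d is the distance from each point
    of A to its closest point in B. Returns a float in [0, 1]."""
    a = np.asarray(points_a, dtype=float)
    b = np.asarray(points_b, dtype=float)
    # Pairwise Euclidean distances, shape (len(a), len(b)).
    d = cdist(a, b)
    # For every point of A, the distance to the nearest point of B.
    nearest = d.min(axis=1)
    # Exponentially suppress large distances, then average over A.
    return float(np.exp(-nearest / c).mean())


user_a = [[0.0, 0.0]]
user_b = [[0.0, 0.0], [5.0, 5.0]]
print(one_sided_affinity(user_a, user_b))  # A's only point is also in B
print(one_sided_affinity(user_b, user_a))  # B has a point far from A
```

The two printed values differ, because the result depends on which user's points are taken as the starting set.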
First we use scipy.spatial.distance.cdist to determine the distances from each point of user_a to each point of user_b, and from those find the closest point for each. We then use the exponential function to exponentially suppress higher distances. The constant c determines how large this suppression is: a smaller c suppresses large distances more strongly (you will need to scale it to make sense in your actual units). Then we just look at the mean of that metric.
This has the nice property that if the two sets of points are exactly equal, it returns 1.
It has a small problem, though, as you can see above: this function is not symmetric. However, we can make it symmetric by considering both directions equally:
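One way to symmetrize, sketched under the same assumptions as before (points as (n, 2) array-likes, Euclidean distance, fall-off exp(-d / c)), is to average the A-to-B and B-to-A scores. Taking the plain average is an assumption here; other combinations such as the minimum would also be symmetric:

```python
import numpy as np
from scipy.spatial.distance import cdist


def location_affinity(user_a, user_b, c=1.0):
    """Symmetric affinity in [0, 1]: average of the two one-sided scores."""
    a = np.asarray(user_a, dtype=float)
    b = np.asarray(user_b, dtype=float)
    # One distance matrix serves both directions, shape (len(a), len(b)).
    d = cdist(a, b)
    # Row-wise minimum: nearest point of B for each point of A.
    a_to_b = np.exp(-d.min(axis=1) / c).mean()
    # Column-wise minimum: nearest point of A for each point of B.
    b_to_a = np.exp(-d.min(axis=0) / c).mean()
    return float((a_to_b + b_to_a) / 2)


user_a = [[0.0, 0.0], [1.0, 0.0]]
user_b = [[0.0, 0.0], [4.0, 3.0]]
# Swapping the arguments no longer changes the result.
print(location_affinity(user_a, user_b), location_affinity(user_b, user_a))
```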
Of course you can use many different metrics to determine the fall-off of larger distances. Here I chose exp(-x), but you could also use 1 - tanh(x) or tanh(1/(x+epsilon)) (the epsilon is needed to avoid a division by zero in case two points are exactly identical). This results in different behaviour:
Actually, you could use 1 - any function defined in this post.
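To compare the three fall-offs mentioned above, here is a small sketch that evaluates each one at a few distances. The function names and the default epsilon of 1e-9 are illustrative assumptions:

```python
import numpy as np


def falloff_exp(x):
    # Fastest decay of the three for large x.
    return np.exp(-x)


def falloff_tanh(x):
    # Also equals 1 at x = 0 and decays exponentially for large x.
    return 1.0 - np.tanh(x)


def falloff_inv_tanh(x, epsilon=1e-9):
    # Stays close to 1 for small x and decays only like 1/x for large x;
    # epsilon avoids a division by zero when two points coincide.
    return np.tanh(1.0 / (x + epsilon))


distances = np.array([0.0, 0.5, 1.0, 2.0, 5.0])
for f in (falloff_exp, falloff_tanh, falloff_inv_tanh):
    print(f.__name__, np.round(f(distances), 3))
```

All three map a distance of 0 to 1 and go to 0 as the distance grows, so any of them can be dropped into the affinity function in place of the exponential; they differ in how quickly mid-range distances are penalized.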