Find closest line to each point on big dataset, possibly using shapely and rtree

5.8k views Asked by At

I have a simplified map of a city that has streets in it as linestrings and addresses as points. I need to find closest path from each point to any street line. I have a working script that does this, but it runs in polynomial time as it has nested for loop. For 150 000 lines (shapely LineString) and 10 000 points (shapely Point), it takes 10 hours to finish on 8 GB Ram computer.

The function looks like this (sorry for not making it entirely reproducible):

import pandas as pd
import shapely
from shapely import Point, LineString

def connect_nodes_to_closest_edges(edges_df , nodes_df,
                                   edges_geom,
                                   nodes_geom):
    """Finds closest line to points and returns 2 dataframes:
        edges_df
        nodes_df
    """
    for i in range(len(nodes_df)):
        point = nodes_df.loc[i,nodes_geom]
        shortest_distance = 100000
        for j in range(len(edges_df)):
            line = edges_df.loc[j,edges_geom]
            if line.distance(point) < shortest_distance:
                shortest_distance = line.distance(point)
                closest_street_index = j
                closest_line = line
                ...

Then I save the results in a table as a new column that adds the shortest path from point to line as a new column.

Is there a way to make it faster with some addition to the function?

If I could for example filter out lines for every point that are 50m away or so, it would help to speed every iteration?

Is there a way to make this faster using rtree package? I was able to find an answer that makes the script for finding intersection of polygons faster, but I can't seem to make it working for closest point to line.

Faster way of polygon intersection with shapely

https://pypi.python.org/pypi/Rtree/

Sorry if this was already answered, but I didn't find an answer here nor on gis.stackexchange

thank you for advice!

2

There are 2 answers

6
eguaio On BEST ANSWER

Here you have a solution using rtree library. The idea is to build boxes that contain the segments in the diagonal, and use that boxes to build the rtree. This will be the most time-expensive operation. Later, you query the rtree with a box centered in the point. You get several hits that you need to check for the minimum, but the number of hits will be (hopefuly) orders of magnitud lower than checking against all the segments.

In the solutions dict you will get, for each point, the line id, the nearest segment, the nearest point (a point of the segment), and the distance to the point.

There are some comments in the code to help you. Take into account you can serialize the rtree for later use. In fact, I would recomend to build the rtree, save it, and then use it. Because the exceptions for the adjustments of the constants MIN_SIZE and INFTY will probably raise, and you would not want to lose all the computation you did building the rtree.

A too small MIN_SIZE will mean you could have errors in the solutions because if the box around the point does not intersect a segment, it could intersect a box of a segment that is not the nearest segment (it is easy to think a case).

A too big MIN_SIZE would mean to have too much false positives, that in the extreme case would make the code to try with all the segments, and you will be in the same position as before, or worst, because you are now building an rtree you don't really use.

If the data is real data from a city, I imagine you know that any address will be intersecting a segment with a distance smaller than a few blocks. That will make the search practically logaritmic.

One more comment. I'm assuming that there are not segments that are too large. Since we are using the segments as the diagonals of the boxes in the rtree, if you have some large segments in a line, this would mean a huge box will be assigned to that segment, and all addresses boxes would intersect it. To avoid this, you can always increase artificially the resolution of the LineStrins by adding more intermediate points.

import math
from rtree import index
from shapely.geometry import Polygon, LineString

INFTY = 1000000
MIN_SIZE = .8
# MIN_SIZE should be a vaule such that if you build a box centered in each 
# point with edges of size 2*MIN_SIZE, you know a priori that at least one 
# segment is intersected with the box. Otherwise, you could get an inexact 
# solution, there is an exception checking this, though.


def distance(a, b):
    return math.sqrt( (a[0]-b[0])**2 + (a[1]-b[1])**2 ) 

def get_distance(apoint, segment):
    a = apoint
    b, c = segment
    # t = <a-b, c-b>/|c-b|**2
    # because p(a) = t*(c-b)+b is the ortogonal projection of vector a 
    # over the rectline that includes the points b and c. 
    t = (a[0]-b[0])*(c[0]-b[0]) + (a[1]-b[1])*(c[1]-b[1])
    t = t / ( (c[0]-b[0])**2 + (c[1]-b[1])**2 )
    # Only if t 0 <= t <= 1 the projection is in the interior of 
    # segment b-c, and it is the point that minimize the distance 
    # (by pitagoras theorem).
    if 0 < t < 1:
        pcoords = (t*(c[0]-b[0])+b[0], t*(c[1]-b[1])+b[1])
        dmin = distance(a, pcoords)
        return pcoords, dmin
    elif t <= 0:
        return b, distance(a, b)
    elif 1 <= t:
        return c, distance(a, c)

def get_rtree(lines):
    def generate_items():
        sindx = 0
        for lid, l in lines:
            for i in xrange(len(l)-1):
                a, b = l[i]
                c, d = l[i+1]
                segment = ((a,b), (c,d))
                box = (min(a, c), min(b,d), max(a, c), max(b,d)) 
                #box = left, bottom, right, top
                yield (sindx, box, (lid, segment))
                sindx += 1
    return index.Index(generate_items())

def get_solution(idx, points):
    result = {}
    for p in points:
        pbox = (p[0]-MIN_SIZE, p[1]-MIN_SIZE, p[0]+MIN_SIZE, p[1]+MIN_SIZE)
        hits = idx.intersection(pbox, objects='raw')    
        d = INFTY
        s = None
        for h in hits: 
            nearest_p, new_d = get_distance(p, h[1])
            if d >= new_d:
                d = new_d
                s = (h[0], h[1], nearest_p, new_d)
        result[p] = s
        print s

        #some checking you could remove after you adjust the constants
        if s == None:
            raise Exception("It seems INFTY is not big enough.")

        pboxpol = ( (pbox[0], pbox[1]), (pbox[2], pbox[1]), 
                    (pbox[2], pbox[3]), (pbox[0], pbox[3]) )
        if not Polygon(pboxpol).intersects(LineString(s[1])):  
            msg = "It seems MIN_SIZE is not big enough. "
            msg += "You could get inexact solutions if remove this exception."
            raise Exception(msg)

    return result

I tested the functions with this example.

xcoords = [i*10.0/float(1000) for i in xrange(1000)]
l1 = [(x, math.sin(x)) for x in xcoords]
l2 = [(x, math.cos(x)) for x in xcoords]
points = [(i*10.0/float(50), 0.8) for i in xrange(50)]

lines = [('l1', l1), ('l2', l2)]

idx = get_rtree(lines)

solutions = get_solution(idx, points)

And got: Result of test

0
lenhhoxung On

I've been searching for a solution, and I found this, which uses Geopandas. Basically, it is a straightforward approach, which considers the overlapping of the bounding boxes of points and lines. However, thanks to the spatial index, the computational cost is significantly reduced.