Optimisation in R of the calculation of geographical distances between a large number of polygons (>11,000)


How can I optimise, in R, the calculation of geographical distances between millions of pairs of polygon centroids?

The polygons represent 111 km x 111 km grid cells covering the entire Earth. I'm using the st_distance function from the sf package, but the high number of polygons (>11,000) poses a computational challenge. Any suggestions on how to optimise it? In terms of accuracy, it does not need to be overly precise.

Toy code:

library(sf)

# Create an sf object with five square polygons
polygons <- st_as_sfc(list(
  st_polygon(list(cbind(c(0, 0, 1, 1, 0), c(0, 1, 1, 0, 0)))),
  st_polygon(list(cbind(c(1, 1, 2, 2, 1), c(0, 1, 1, 0, 0)))),
  st_polygon(list(cbind(c(2, 2, 3, 3, 2), c(0, 1, 1, 0, 0)))),
  st_polygon(list(cbind(c(0, 0, -1, -1, 0), c(0, -1, -1, 0, 0)))),
  st_polygon(list(cbind(c(-1, -1, -2, -2, -1), c(0, -1, -1, 0, 0))))
))
st_crs(polygons) <- 4326
data <- data.frame(ID = 1:5, Name = c("A", "B", "C", "D", "E"))
polygons <- st_sf(polygons, data)

# Get the centroids of the polygons and calculate the distance
centroids <- st_centroid(st_geometry(polygons))
distance <- st_distance(centroids)

Thanks in advance


There are 2 answers

Answer by I_O:

Depending on scale and required accuracy, you could st_transform your coordinates to an equidistant / equal-area projection.

Then round your centroid coordinates and convert them to integers (this gives you coordinates in metres; for finer resolution, scale to decimetres or similar before rounding). The expected performance gain comes from using integers together with dist.

Finally use dist to obtain a distance matrix. Using your example data polygons:

library(dplyr)

df <- 
  polygons |>
  st_transform(3035) |> ## Lambert equal area, picked randomly
  rowwise() |>
  mutate(coords = polygons |> 
           st_centroid() |> 
           st_coordinates(),
         x = coords[1], y = coords[2]
  ) |>
  as.data.frame() |>
  select(Name, x, y) |>
  mutate(across(x:y, ~ round(.x, 0) |> as.integer()))

Set unique row names to identify the centroids in the distance matrix later on:

rownames(df) <- df$Name
df
#>   Name       x        y
#> A    A 3150682 -2248929
#> B    B 3273461 -2261293
#> C    C 3396377 -2272283
#> D    D 3022547 -2334767
#> E    E 2899563 -2319670

Calculate the distances:

df |> select(x:y) |> dist()
#>          A        B        C        D
#> B 123400.0                           
#> C 246802.4 123406.3                  
#> D 154229.5 261450.3 379016.0         
#> E 260892.8 378427.8 499068.8 123907.2
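
If you need to look up the distance between specific centroids later, a minimal sketch building on the df above converts the dist object into a full matrix keyed by the row names:

dist_mat <- df |> select(x:y) |> dist() |> as.matrix()
dist_mat["A", "E"]  ## ~260893 m, matching the matrix printed above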
Answer by nukubiho:

Several R packages can compute a distance matrix using various distance functions (e.g. Haversine, Vincenty, geodesic). Here is a comparison of four packages; {geodist} appears to be the fastest. Note that the distance results differ between the methods.

library("sf")
library("terra")
library("geodist")
library("geosphere")

# 4000 random points in lon/lat (EPSG:4326)
n = 4000
df = data.frame(x = runif(n, -180, 180), y = runif(n, -90, 90))
pts_sf = st_as_sf(df, coords = c("x", "y"), crs = "epsg:4326")
pts_terra = vect(df, geom = c("x", "y"), crs = "epsg:4326")

# benchmark all four approaches, 5 iterations each (results not compared)
t = bench::mark(
  iterations = 5, check = FALSE,
  sf = st_distance(pts_sf),
  terra = as.matrix(terra::distance(pts_terra)),
  geodist = geodist(df, measure = "haversine"),
  geosphere = distm(df, fun = distHaversine)
)
t[, 1:5]
#>   expression      min   median `itr/sec` mem_alloc
#> 1 sf           20.57s   20.76s    0.0480  125.77MB
#> 2 terra        13.02s   13.11s    0.0764  579.85MB
#> 3 geodist    791.53ms 806.79ms    1.24    244.34MB
#> 4 geosphere     2.68s    2.83s    0.351     2.81GB
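
To hook the fastest option up to the question's data, here is a minimal sketch (assuming the centroids object from the question's toy code): extract the lon/lat coordinates with st_coordinates() and pass them to geodist(), which, as in the benchmark above, accepts a data frame with x/y columns.

# Sketch: haversine distance matrix (in metres) between the question's centroids.
# Assumes `centroids` from the toy code above; for >11,000 centroids the full
# matrix holds ~121 million doubles (~1 GB), so check that it fits in memory.
library(sf)
library(geodist)

coords <- as.data.frame(st_coordinates(centroids))  # columns X (lon) and Y (lat)
names(coords) <- c("x", "y")                        # x/y names, as in the benchmark
dmat <- geodist(coords, measure = "haversine")      # n x n matrix of pairwise distances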