How can I optimise in R the calculation of the geographical distance between millions of pairs of centroids of polygons?
The polygons represent 111 km x 111 km grid cells covering the entire Earth.
I'm using the `st_distance` R function, but the high number of polygons (>11,000, i.e. millions of pairs) poses a computational challenge. Any suggestions on how to optimize it? In terms of accuracy, it does not need to be overly precise.
Toy code:
```r
library(sf)

# Create an sf data frame with five polygons
polygons <- st_as_sfc(list(
  st_polygon(list(cbind(c(0, 0, 1, 1, 0), c(0, 1, 1, 0, 0)))),
  st_polygon(list(cbind(c(1, 1, 2, 2, 1), c(0, 1, 1, 0, 0)))),
  st_polygon(list(cbind(c(2, 2, 3, 3, 2), c(0, 1, 1, 0, 0)))),
  st_polygon(list(cbind(c(0, 0, -1, -1, 0), c(0, -1, -1, 0, 0)))),
  st_polygon(list(cbind(c(-1, -1, -2, -2, -1), c(0, -1, -1, 0, 0))))
))
st_crs(polygons) <- 4326
data <- data.frame(ID = 1:5, Name = c("A", "B", "C", "D", "E"))
polygons <- st_sf(polygons, data)

# Get the centroids of the polygons and calculate the distance
centroids <- st_centroid(st_geometry(polygons))
distance <- st_distance(centroids)
```
Thanks in advance
Depending on scale and required accuracy, you could `st_transform` your coordinates to an equidistant / equal-area projection. Then, round your centroid coordinates and convert to integer (this will return your coordinates in meters; for finer resolution, convert to dm or similar beforehand). The expected performance increase comes from using integers together with `dist`. Finally, use `dist` to obtain a distance matrix.

Using your example data `polygons`: set unique rownames to identify centroids in the distance matrix later on:
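A minimal sketch of the transform-and-round step, reusing `centroids` and `polygons` from the toy code in the question; EPSG:4087 (WGS 84 / World Equidistant Cylindrical) is just one possible choice of projection here, pick whatever suits your extent:

```r
library(sf)

# Project the centroids to an equidistant CRS so that planar distances
# approximate real distances in meters (EPSG:4087 is one option).
centroids_proj <- st_transform(centroids, 4087)

# Extract the coordinates, round to whole meters, and convert to integer.
coords_int <- apply(round(st_coordinates(centroids_proj)), 2, as.integer)

# Unique rownames identify the centroids in the distance matrix later on.
rownames(coords_int) <- polygons$Name
```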
calculate distance:
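Assuming the projected centroid coordinates sit in an integer matrix `coords_int` with one named row per centroid, base R's `dist` computes plain Euclidean distances, which on a projected CRS is exactly the fast approximation you want:

```r
# Euclidean distance on projected integer coordinates; dist() returns a
# compact "dist" object holding only the lower triangle, which is also
# more memory-efficient than a full square matrix.
distance <- dist(coords_int)

# Convert to a full matrix if you need to look up specific pairs by name
# (names here come from the toy data in the question).
distance_matrix <- as.matrix(distance)
distance_matrix["A", "B"]
```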