Pairwise distance calculation nested data frame

513 views Asked by At

I am looking for a way to calculate the separation distance between points in a pairwise fashion and store the results for each individual point in an accompanying nested data frame.

For example, I have this data frame (from the maps package) that contains information about us cities including their physical locations. I have discarded the rest of the information and nested the coordinates in a nested data frame. I intend to use distHaversine() from the geosphere package to calculate these distances.

library(tidyverse)

df <- maps::us.cities %>% 
  slice(1:20) %>% 
  group_by(name) %>% 
  nest(long, lat, .key = coords)

                   name            coords
                  <chr>           <list>
 1           Abilene TX <tibble [1 x 2]>
 2             Akron OH <tibble [1 x 2]>
 3           Alameda CA <tibble [1 x 2]>
 4            Albany GA <tibble [1 x 2]>
 5            Albany NY <tibble [1 x 2]>
 ...(With 15 more rows)

I have looked into using the map family of functions coupled with mutate, but I am having a difficult time. The desired results are in the form as follows:

                   name            coords            sep_dist
                  <chr>           <list>            <list>
 1           Abilene TX <tibble [1 x 2]> <tibble [19 x 2]>
 2             Akron OH <tibble [1 x 2]> <tibble [19 x 2]>
 3           Alameda CA <tibble [1 x 2]> <tibble [19 x 2]>
 4            Albany GA <tibble [1 x 2]> <tibble [19 x 2]>
 5            Albany NY <tibble [1 x 2]> <tibble [19 x 2]>
 ...(With 15 more rows)

With the sep_dist tibbles looking something like this:

               location  distance
                  <chr>     <dbl> 
 1             Akron OH      1003
 2           Alameda CA       428
 3            Albany GA      3218
 4            Albany NY      3627
 5            Albany OR        97
 ...(With 14 more rows)                       -distances completely made up

Where location is the point that is being compared to name (in this case Abilene).

2

There are 2 answers

0
www On BEST ANSWER

We can expand a "grid" with all the combination of location name and coordinates, but remove the combination with the same location name. After that, use map2_dbl to apply the distHaversine function.

library(tidyverse)
library(geosphere)

df2 <- df %>%
  # Create the grid
  mutate(name1 = name) %>%
  select(starts_with("name")) %>%
  complete(name, name1) %>%
  filter(name != name1) %>%
  left_join(df, by = "name") %>%
  left_join(df, by = c("name1" = "name")) %>%
  # Grid completed. Calcualte the distance by distHaversine
  mutate(distance = map2_dbl(coords.x, coords.y, distHaversine))

df2
# A tibble: 380 x 5
         name          name1         coords.x         coords.y  distance
        <chr>          <chr>           <list>           <list>     <dbl>
 1 Abilene TX       Akron OH <tibble [1 x 2]> <tibble [1 x 2]> 1881904.4
 2 Abilene TX     Alameda CA <tibble [1 x 2]> <tibble [1 x 2]> 2128576.9
 3 Abilene TX      Albany GA <tibble [1 x 2]> <tibble [1 x 2]> 1470577.2
 4 Abilene TX      Albany NY <tibble [1 x 2]> <tibble [1 x 2]> 2542025.1
 5 Abilene TX      Albany OR <tibble [1 x 2]> <tibble [1 x 2]> 2429367.3
 6 Abilene TX Albuquerque NM <tibble [1 x 2]> <tibble [1 x 2]>  702287.5
 7 Abilene TX  Alexandria LA <tibble [1 x 2]> <tibble [1 x 2]>  700093.2
 8 Abilene TX  Alexandria VA <tibble [1 x 2]> <tibble [1 x 2]> 2161594.6
 9 Abilene TX    Alhambra CA <tibble [1 x 2]> <tibble [1 x 2]> 1718967.5
10 Abilene TX Aliso Viejo CA <tibble [1 x 2]> <tibble [1 x 2]> 1681868.8
# ... with 370 more rows

To create the final output, we can group_by based on name and nest all the other desired columns.

df3 <- df2 %>%
  select(-starts_with("coord")) %>%
  group_by(name) %>%
  nest()

df3
# A tibble: 20 x 2
                   name              data
                  <chr>            <list>
 1           Abilene TX <tibble [19 x 2]>
 2             Akron OH <tibble [19 x 2]>
 3           Alameda CA <tibble [19 x 2]>
 4            Albany GA <tibble [19 x 2]>
 5            Albany NY <tibble [19 x 2]>
 6            Albany OR <tibble [19 x 2]>
 7       Albuquerque NM <tibble [19 x 2]>
 8        Alexandria LA <tibble [19 x 2]>
 9        Alexandria VA <tibble [19 x 2]>
10          Alhambra CA <tibble [19 x 2]>
11       Aliso Viejo CA <tibble [19 x 2]>
12             Allen TX <tibble [19 x 2]>
13         Allentown PA <tibble [19 x 2]>
14             Aloha OR <tibble [19 x 2]>
15          Altadena CA <tibble [19 x 2]>
16 Altamonte Springs FL <tibble [19 x 2]>
17           Altoona PA <tibble [19 x 2]>
18          Amarillo TX <tibble [19 x 2]>
19              Ames IA <tibble [19 x 2]>
20           Anaheim CA <tibble [19 x 2]>

And each data frame in the data now looks like this.

df3$data[[1]]
# A tibble: 19 x 2
                  name1  distance
                  <chr>     <dbl>
 1             Akron OH 1881904.4
 2           Alameda CA 2128576.9
 3            Albany GA 1470577.2
 4            Albany NY 2542025.1
 5            Albany OR 2429367.3
 6       Albuquerque NM  702287.5
 7        Alexandria LA  700093.2
 8        Alexandria VA 2161594.6
 9          Alhambra CA 1718967.5
10       Aliso Viejo CA 1681868.8
11             Allen TX  296560.4
12         Allentown PA 2342363.5
13             Aloha OR 2457938.8
14          Altadena CA 1719207.6
15 Altamonte Springs FL 1805480.9
16           Altoona PA 2102993.0
17          Amarillo TX  361520.0
18              Ames IA 1194234.7
19           Anaheim CA 1694698.9
2
CPak On

geosphere provides the ability to compare all-to-all distances with distm

Reproducible data

set.seed(1)
df <- data.frame(name=letters[1:4],
                 lon=runif(4)*10,
                 lat=runif(4)*10)

distm

library(geosphere)
ans <- as.data.frame(distm(df[,2:3], df[,2:3], fun=distHaversine))

         # a        b        c        d
# 1      0.0 784506.1 894320.6 877440.5
# 2 784506.1      0.0 226504.3 647666.7
# 3 894320.6 226504.3      0.0 486290.8
# 4 877440.5 647666.7 486290.8      0.0

Tidy into desired format

colnames(ans) <- df$name
library(dplyr)
library(tidyr)
desired <- ans %>%
             gather(pos1, distance) %>%
             mutate(pos2 = rep(df$name, nrow(df))) %>%
             filter(pos1!=pos2) %>%
             select(pos1, pos2, distance)

   # pos1 pos2 distance
# 1     a    b 784506.1
# 2     a    c 894320.6
# 3     a    d 877440.5
# 4     b    a 784506.1
# 5     b    c 226504.3
# 6     b    d 647666.7
# 7     c    a 894320.6
# 8     c    b 226504.3
# 9     c    d 486290.8
# 10    d    a 877440.5
# 11    d    b 647666.7
# 12    d    c 486290.8