Merging a Shapefile and a dataframe

4.5k views Asked by At

I am working in R with a regular dataframe (df) and a shapefile (map2), the share a common column called CD116FP. df has 103552 lines while map2 has 444 .I am loading the shapefile in the following way:

map2 <- read_sf("D:/Data/tl_2019_us_cd116.shp")

My end-goal is to use the function mapview() to view the map included in map2 with the "intensity" that is described in df under the column np_scores. I hence do not want observations of df that do not appear on map2.

Here are my thoughts and failures:

  1. If these two objects were regular dataframes, a reasonable candidate would be to use merge() to combine both objects, however if you apply that function in this case, the resulting object looses the spatial properties and mapview does not know how to read it.

  2. Another approach that I used was trying this line of code:

map2m<-data.frame(map2, df[match(map2$CD116FP, df$CD116FP),])

But the result has dimensions that are too big (much bigger that 444 lines) and hence mapview crashes when trying to plot the desired map.

  1. At last, I went full-on brute force and just constructed a loop to add the column np to map2:
map2$np=10

for (i in c(1:nrow(map2)))
{  
for (j in c(1:nrow(df)))
 {
if (identical(map2$CD116FP[i],df$CD116FP[j]))
{map2$np[i]=df$np_score[j]}
else {map2$np[i]=0}  
}
}  

However, this approach just takes way too much time given the dimensions of my dataframe.

Do you have any suggestions?

2

There are 2 answers

6
Allan Cameron On BEST ANSWER

I'm a bit puzzled by the structure of your data. Your df has over 100,000 rows, so I'm guessing that the same CD116FP occurs multiple times in df, and the npscore will presumably vary across these instances. If you want to merge these to map2 you will need to aggregate them first.

Let's try to recreate a similar setup:

library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1

map2 <- read_sf("C:/users/administrator/documents/shape/tl_2019_us_cd116.shp")

set.seed(69)

df <- data.frame(CD116FP = sprintf("%02d", sample(0:99, 103552, TRUE)),
                 npscores = runif(103552))

head(df)
#>   CD116FP  npscores
#> 1      95 0.6927742
#> 2      80 0.8543845
#> 3      90 0.5220353
#> 4      01 0.1449647
#> 5      76 0.9876543
#> 6      38 0.5629950

I have made df have the same number of rows that your data has to show this solution will scale to your problem.

Let's aggregate the npscores with dplyr:

library(dplyr)
df_sum <- df %>% 
  filter(CD116FP %in% map2$CD116FP) %>%
  group_by(CD116FP) %>%
  summarise(npscores = mean(npscores))

map2$npscores <- df_sum$npscores[match(map2$CD116FP, df_sum$CD116FP)]

Now map2 has the aggregated npscores we can plot - for example, in ggplot:

library(ggplot2)

ggplot(map2) + 
  geom_sf(aes(fill = npscores)) +
  coord_sf(xlim = c(-180, -60),
            ylim = c(15, 70)) +
  scale_fill_gradient(low = "red", high = "gold")

Or in mapview:

library(mapview)
mapView(map2, zcol = "npscores")

enter image description here Created on 2020-09-19 by the reprex package (v0.3.0)

0
Alexa On

I've had some luck using plain old merge from the base package. Here's an excerpt from my own work should this be valuable to you :-)

my_data <- read_excel("TraderDataRaw.xlsx", 
      sheet = "fsa", 
      col_types= c("text","text","text","logical","numeric","numeric")) %>% 
      mutate(resp_rate=mailed/responses)

my_map <- st_read("lfsa000b16a_e.shp", stringsAsFactors = FALSE) 

my_merged_data <- merge(my_map, my_data, 
      by.x=c("CFSAUID","PRUID","PRNAME"), 
      by.y=c("CFSAUID","PRUID","PRNAME"))