Find maximum point with in each polygon for a set of polygons R

609 views Asked by At

I'm sure this question has been answered elsewhere, but I have not been able to come up with it by searching.

I have points representing cities within a country along with population for each city. I also have a polygon file of counties. I want to find the location of the largest city within each county.

How can this be done?

Here is some data

structure(list(Country = c("us", "us", "us", "us", "us", "us", "us", "us", "us", "us", "us",
"us", "us", "us", "us", "us", "us", "us", "us", "us", "us", "us", "us", "us", "us"), City = c("cabarrus", "cox store", "cal-vel", "briarwood townhouses", "barker heights", "davie
crossroads", "crab point village", "azalea", "chesterfield", "charlesmont", "connor", "clover garden", "corriher heights", "callisons", "crestview acres", "clegg", "canaan park", "chantilly", "belgrade", "brices crossroads", "bluff", "butner", "bottom", "bandy", "bostian heights"), AccentCity = c("Cabarrus", "Cox Store", "Cal-Vel", "Briarwood Townhouses", "Barker Heights", "Davie Crossroads", "Crab Point Village", "Azalea", "Chesterfield", "Charlesmont", "Connor", "Clover Garden", "Corriher Heights", "Callisons", "Crestview Acres", "Clegg", "Canaan Park", "Chantilly", "Belgrade", "Brices Crossroads", "Bluff", "Butner", "Bottom", "Bandy", "Bostian Heights"), Region = c("NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC", "NC"), Population = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, A_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_), Latitude = (35.2369444, 35.275, 36.4291667, 35.295, 35.3111111, 35.8319444, 34.7602778, 35.58, 35.81, 5.9341667, 35.7419444, 36.1883333, 35.5605556, 35.0841667, 35.0213889, 35.8655556, 36.2761111, 36.3016667, 34.88, 34.8186111, 35.8377778, 36.1319444, 36.4747222, 35.6419444, 35.7544444), Longitude = c(-80.5419444, -82.0352778, -78.9694444, -81.5238889, -82.4441667, -80.535, -76.7305556, -82.4713889, -81.6611111, -81.5127778, -78.1486111, -79.4630556, -80.635, -76.7255556, -80.5427778, -78.8497222, -79.7852778, -76.1711111, -77.2352778, -78.1016667, -82.8580556, -78.7569444, -80.7741667, -81.09, -80.9294444)), .Names = c("Country", "City", "AccentCity", "Region", "Population", "Latitude", "Longitude"), row.names = c(544L, 889L, 551L, 434L, 190L, 975L, 894L, 147L, 717L, 700L, 831L, 773L, 862L, 559L, 915L, 753L, 584L, 695L, 262L, 437L, 372L, 537L, 406L, 178L, 02L), class = "data.frame")

and some code to read in north carolina

xx <- readShapePoly(system.file("shapes/sids.shp", package="maptools")[1],
                IDvar="FIPSNO", proj4string=CRS("+proj=longlat +ellps=clrk66"))

plot(xx)

I want to find the city with the maximum population within each county. i'm sorry I don't have a reproducible example. If I did, I would have the answer!

1

There are 1 answers

4
jlhoward On

The short answer is that you should use gContains(...) in package rgeos.

Here is the long answer.

In the code below, we grab a high resolution shapefile of North Carolina counties from the GADM database, and a geocoded dataset of North Carolina cities from from the US Geological Survey database. The latter already has county information but we ignore that. Then we map cities to their appropriate county using gContains(...), add that information to the cities data frame, and identify the largest city in each county using the data.table package. Most of the work is in 4 lines of code near the end.

library(raster)   # for getData(...);   you may not need this
library(foreign)  # for read.dbf(...);  you may not need this
library(rgeos)    # for gContains(...); loads package sp as well

setwd("< directory for downloaded data >")
# get North Carolina Counties shapefile from GADM database
USA         <- getData("GADM",country="USA",level=2)   # level 2 is counties
NC.counties <- USA[USA$NAME_1=="North Carolina",]      # North Carolina Counties
# get North Carolina Cities data from USGS database
url <- "http://dds.cr.usgs.gov/pub/data/nationalatlas/citiesx010g_shp_nt00962.tar.gz"
download.file(url,"cities.tar.gz")
untar("cities.tar.gz")
data      <- read.dbf("citiesx010g.dbf",as.is=TRUE)
NC.data   <- data[data$STATE=="NC",c("NAME","COUNTY","LATITUDE","LONGITUDE","POP_2010")]
## --- evverything up to here is just to set up the example

# convert cities data.frame to SpatialPointsDataFrame
NC.cities <- SpatialPointsDataFrame(NC.data[,c("LONGITUDE","LATITUDE")],
                                    data=NC.data,
                                    proj4string=CRS(proj4string(NC.counties)))
# map cities to counties
city.cnty   <- gContains(NC.counties,NC.cities,byid=TRUE)
# add county information to cities data
NC.data$county <- apply(city.cnty,1,function(cnty)ifelse(any(cnty),NC.counties@data[cnty,]$NAME_2,NA))
# identify largest city in each county
library(data.table)
result <- setDT(NC.data)[,.SD[which.max(POP_2010)],by="county"]
head(result)
#      county             NAME   COUNTY LATITUDE LONGITUDE POP_2010
# 1:  Jackson        Cullowhee  Jackson 35.31371 -83.17653     6228
# 2:   Graham     Robbinsville   Graham 35.32287 -83.80740      620
# 3:   Wilkes North Wilkesboro   Wilkes 36.15847 -81.14758     4245
# 4:    Rowan        Salisbury    Rowan 35.67097 -80.47423    33662
# 5: Buncombe        Asheville Buncombe 35.60095 -82.55402    83393
# 6:    Wayne        Goldsboro    Wayne 35.38488 -77.99277    36437

The workhorse here is the line:

city.cnty   <- gContains(NC.counties,NC.cities,byid=TRUE)

This compares every point in the SpatialPointsDataFrame NC.Cities to every Polygon in the SpatialPolygonsDataFrame NC.counties and returns a logical matrix where tthe rows represent cities and the columns represent counties, and the [i,j] element is TRUE if city i is in county j, FALSE otherwise. We process the matrix row-wise in the next statement:

NC.data$county <- apply(city.cnty,1,function(cnty)ifelse(any(cnty),NC.counties@data[cnty,]$NAME_2,NA))

using each row in succession to index the attributes table for NC.counties to extract the county name.

The data you provided in your question has some problems which are nevertheless instructive. First, the NC shapefile in the maptools package is relatively low resolution. In particular this means that some of the coastal islands are completely missing, so any city on one of those islands will not map to a county. You might have the same problem with your real data so watch out for it.

Second, comparing the COUNTY column in the original USGS dataset with the county column which we added, there are 3 (out of 865) counties that do not agree. It turns out that, in those cases, the USGS database was wrong (or out of date). You might have the same problem so watch out for that too.

Third, an additional three cities did not map to any county. These were all coastal cities and probably reflect small inaccuracies in the North Carolina shapefile. You night have this problem as well.