Discretization of zip codes to US regions in R

225 views Asked by At

I would like to discretize data with zip codes into regions

I have character data

sample:

zip_code
'45654'
'12321'
'99453'

etc

I have 6 categories with rules:

region 1 - NE: 01000-19999

region 2 - SE: 20000-39999

region 3 - MW: 40000-58999,60000-69999

region 4 - SW: 70000-79999,85000-88499

region 5 - MT: 59000-59999,80000-84999,88900-89999

region 6 - PC: 90000-99999

I would like my output to be factor data:

region
'MW'
'NE'
'PC'

etc

Obviously, I know many ways to discretize the data, but none are clean and elegant (like loops, ifelse, etc)

Is there an elegant way to apply a case with 6 categories to discretize this data?

3

There are 3 answers

2
sckott On BEST ANSWER

Okay, messy but this can work. I assume you'll have to use character objects since some zip codes start with 0. Obs. replace these numbers with your zip codes.

zip_code <- c('1','6','15')
regions <- list(NE = as.character(1:3), 
        SE = as.character(4:6), 
        MW = as.character(7:9), 
        SW = as.character(10:12), 
        MT = as.character(13:15), 
        PC = as.character(16:19))
sapply(zip_code, function(x) names(regions[sapply(regions, function(y) x %in% y)]))

 1    6   15 
"NE" "SE" "MT" 
0
akrun On

You could also try (Using @Scott Chamberlain's data)

  with(stack(regions), unique(ind[ave(values %in% zip_code, ind, FUN=I)]))
 #[1] NE SE MT
 #Levels: MT MW NE PC SE SW

Or

 library(dplyr)
 library(tidyr)
 unnest(regions, region) %>%
                     group_by(region) %>%
                     filter(x %in% zip_code)

 # region x
 #1   NE  1
 #2   SE  6
 #3   MT 15

Or

 r1 <- vapply(regions, function(x) any(x %in% zip_code), logical(1))
 names(r1)[r1]
 #[1] "NE" "SE" "MT"
0
jlhoward On

Here is a data.table solution using foverlaps(...) and the full US zip code database in package zipcode for the example. Note that your definitions of the ranges are deficient: for instance there are zip codes in NH that are outside the NE range, and PR is completely missing.

library(data.table)  # 1.9.4+
library(zipcode)
data(zipcode)        # database of US zip codes (a data frame)

zips    <- data.table(zip_code=zipcode$zip)
regions <- data.table(region=c("NE" , "SE", "MW", "MW", "SW", "SW", "MT", "MT", "MT", "PC"),
                      start =c(01000,20000,40000,60000,70000,85000,59000,80000,88900,90000),
                      end   =c(19999,39999,58999,69999,79999,88400,59999,84999,89999,99999))
setkey(regions,start,end)
zips[,c("start","end"):=list(as.integer(zip_code),as.integer(zip_code))]
result <- foverlaps(zips,regions)[,list(zip_code,region)]
result[sample(1:nrow(result),10)]   # random sample of the result
#    zip_code region
#  1:    27113     SE
#  2:    36101     SE
#  3:    55554     MW
#  4:    91801     PC
#  5:    20599     SE
#  6:    90250     PC
#  7:    95329     PC
#  8:    63435     MW
#  9:    60803     MW
# 10:    07040     NE

foverlaps(...) works this way: suppose a data.table x has columns a and b that represent a range (e.g., a <= b for all rows), and a data.table y has columns c and d that similarly represent a range. Then foverlaps(x,y) finds, for each row in x, all the rows in y which have overlapping ranges.

In your case we set up the y argument as the regions, where the ranges are the beginning and ending zipcodes for each (sub) region. Then we set up x as the original zip code database using the actual zip codes (converted to integer) for both the beginning and end of the range.

foverlaps(...) is extremely fast. In this case the full US zip code database (>44,000 zipcodes) was processed in about 23 milliseconds.