R dplyr, using mutate with na.omit causes error incompatible size (%d)

I'm doing data cleaning. I use mutate in dplyr a lot, since it builds new columns step by step and lets me check each intermediate result.

Here are two examples that produce this error:

Error: incompatible size (%d), expecting %d (the group size) or 1

Example 1: Get town name from zipcode. Data is simply like this:

    Zip
1 02345
2 02201

I noticed that it fails when the data contains NA.

Without NA it works:

library(dplyr)
library(zipcode)
data(zipcode)

test = data.frame(Zip=c('02345','02201'),stringsAsFactors=FALSE)

test %>%
  rowwise() %>%
  mutate( Town1 = zipcode[zipcode$zip==na.omit(Zip),'city'] )

resulting in

Source: local data frame [2 x 2]
Groups: <by row>

    Zip   Town1
1 02345 Manomet
2 02201  Boston

With NA it doesn't work:

library(dplyr)
library(zipcode)
data(zipcode)

test = data.frame(Zip=c('02345','02201',NA),stringsAsFactors=FALSE)

test %>%
  rowwise() %>%
  mutate( Town1 = zipcode[zipcode$zip==na.omit(Zip),'city'] )

resulting in

Error: incompatible size (%d), expecting %d (the group size) or 1
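
A likely explanation, sketched in base R with a two-row stand-in for the zipcode table (the `lookup` data frame is an assumption for illustration, not the package data): under rowwise(), Zip is a single value, so na.omit(Zip) on the NA row has length 0, the comparison selects no rows, and mutate() receives a zero-length result where it expects length 1. match() returns NA for a missing or unknown key, so a lookup built on it keeps one value per row:

```r
# Hypothetical two-row stand-in for the zipcode table
lookup <- data.frame(zip = c("02345", "02201"),
                     city = c("Manomet", "Boston"),
                     stringsAsFactors = FALSE)
test <- data.frame(Zip = c("02345", "02201", NA), stringsAsFactors = FALSE)

# What the NA row evaluates to: na.omit() drops the only element
na_row <- lookup[lookup$zip == na.omit(NA_character_), "city"]
length(na_row)  # zero-length, so it cannot fill a one-row group

# match() yields NA for a missing or unknown zip, one result per input
test$Town1 <- lookup$city[match(test$Zip, lookup$zip)]
```

The same expression should also work inside mutate() without rowwise(), along the lines of mutate(Town1 = zipcode$city[match(Zip, zipcode$zip)]).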

Example 2: I want to remove the redundant state name that appears in the Town column of the following data.

         Town State
1   BOSTON MA    MA
2 NORTH AMAMS    MA
3  CHICAGO IL    IL

This is how I do it: (1) split the string in Town into words, e.g. 'BOSTON' and 'MA' for row 1; (2) check whether any of these words matches the State of that row; (3) delete the matched word.

library(dplyr)
test = data.frame(Town=c('BOSTON MA','NORTH AMAMS','CHICAGO IL'), State=c('MA','MA','IL'), stringsAsFactors=FALSE)

test %>%
  mutate(Town.word = strsplit(Town, split=' ')) %>%
  rowwise() %>% # rowwise ensures every calculation only considers the current row
  mutate(is.state = match(State,Town.word ) ) %>%
  mutate(Town1 = Town.word[-is.state])

This results in:

         Town State Town.word is.state   Town1
1   BOSTON MA    MA  <chr[2]>        2  BOSTON
2 NORTH AMAMS    MA  <chr[2]>       NA      NA
3  CHICAGO IL    IL  <chr[2]>        2 CHICAGO

Meaning: in row 1, for example, is.state == 2 says that the 2nd word in Town is the state name. After removing that word, Town1 is the correct town name.

Now I want to fix the NA in row 2, but adding na.omit causes an error:

test %>%
  mutate(Town.word = strsplit(Town, split=' ')) %>%
  rowwise() %>% # rowwise ensures every calculation only considers the current row
  mutate(is.state = match(State,Town.word ) ) %>%
  mutate(Town1 = Town.word[-na.omit(is.state)]) 

results in:

Error: incompatible size (%d), expecting %d (the group size) or 1

I checked the data type and size:

test %>%
  mutate(Town.word = strsplit(Town, split=' ')) %>%
  rowwise() %>% # rowwise ensures every calculation only considers the current row
  mutate(is.state = match(State,Town.word ) ) %>%
  mutate(length(is.state) ) %>%       
  mutate(class(na.omit(is.state)))

results in:

         Town State Town.word is.state length(is.state) class(na.omit(is.state))
1   BOSTON MA    MA  <chr[2]>        2                1                  integer
2 NORTH AMAMS    MA  <chr[2]>       NA                1                  integer
3  CHICAGO IL    IL  <chr[2]>        2                1                  integer

So is.state has length 1 in every row. Can somebody tell me what's wrong? Thanks
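
One plausible reading of the second error: for row 2, na.omit(is.state) is a zero-length integer, and indexing with a negated zero-length vector selects nothing rather than everything, so Town1 comes back with length 0 instead of 1. Even when a word is dropped, the remainder can hold more than one word, so it has to be pasted back into a single string. A minimal base R sketch, where `drop_state` is a hypothetical helper name (note that it drops every word equal to the state, not only the first match):

```r
words <- c("NORTH", "AMAMS")
idx <- na.omit(match("MA", words))  # no match, so idx has length 0
length(words[-idx])                 # 0: x[-integer(0)] selects nothing

# Hypothetical helper: drop words equal to the state, rejoin the rest
drop_state <- function(town, state) {
  w <- strsplit(town, " ", fixed = TRUE)[[1]]
  paste(w[w != state], collapse = " ")
}

test <- data.frame(Town = c("BOSTON MA", "NORTH AMAMS", "CHICAGO IL"),
                   State = c("MA", "MA", "IL"), stringsAsFactors = FALSE)
# mapply() applies the helper pairwise; unname() drops the auto-generated names
test$Town1 <- unname(mapply(drop_state, test$Town, test$State))
```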

Answer by r2evans:

Can you just sub it out?

test %>%
    rowwise() %>%
    mutate(Town=sub(sprintf('[, ]*%s$', State), '', Town))
## Source: local data frame [3 x 2]
## Groups: <by row>
##
##          Town State
## 1      BOSTON    MA
## 2 NORTH AMAMS    MA
## 3     CHICAGO    IL

(This way also catches commas after the town, if that happens.)
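
One caveat that may be worth checking on real data: `[, ]*` also matches zero separators, so a town whose last letters happen to equal the state code would be trimmed as well. A hedged variant that requires at least one separator (the ALABAMA input is made up for illustration, and `strip_state` is a hypothetical helper name):

```r
# Hypothetical helper: remove the state only when a comma or space precedes it
strip_state <- function(town, state) {
  sub(sprintf("[, ]+%s$", state), "", town)
}

sub("[, ]*MA$", "", "ALABAMA")   # "ALABA": the zero-separator pattern over-trims
strip_state("ALABAMA", "MA")     # "ALABAMA": left alone
strip_state("BOSTON, MA", "MA")  # "BOSTON"
```

Like the original call, the helper takes one town and one state at a time, so it would still sit inside a rowwise() mutate().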

NB: if you use ungroup() here on a rowwise_df (as this is), it will also strip the tbl_df class and return a plain data.frame. That is fine for your data, but it will flood your console if you print a large data set without thinking (as I've done countless times). (GitHub references #936 and #553.)