Using Table to Group: invalid 'type' (character) of argument

I recently asked a question about how, in R, to take the contents of a column and use them as column headers in a new data frame, with a value of 1 or 0 indicating whether the row contained that value.

An example would be

Id    Event
A     Wc
B     Df
C     Df
A     Df

Needs to be converted to

    Wc  Df
A   1   1
B   0   1
C   0   1

I have since been playing around with it and it seems to work fine; however, recently I have been getting the following error:

Error in FUN(X[[1L]], ...) : invalid 'type' (character) of argument

# get the totals by counting factors for SMS Type and number of replies 
cols <- c("SMS.Type", "Replied")
setDT(train)[, paste0(cols, ".count") := 
       lapply(.SD, function(x) length(unique(na.omit(x)))), 
     .SDcols = cols, 
     by = awb_no]


# Summarize a column and convert it to boolean column headers
lst <- train$SMS.Type
lvl <- unique(unlist(lst))      
train.agg.chkpt <- data.frame(ID_no=train$ID_no,
          do.call(rbind,lapply(lst, function(x) table(factor(x,levels=lvl)))), 
          stringsAsFactors=FALSE)

train.agg.chkpt <- aggregate (train.agg.chkpt,by=list(ID_no=train.agg.chkpt$ID_no), FUN = "sum")
train.agg.chkpt <- train.agg.chkpt[c(-1)]

The column ID_no is just an ID number, and it is the ID around which the booleans are grouped. It's a number stored as character type (I assume this is what the error message is referring to).

Each ID should be unique. Below is the structure of the dataset

str(train.agg.chkpt)
'data.frame':   823462 obs. of  12 variables:
 $ ID_no   : chr  "AAAAAAA75465" "BBBBB175465" "CCCCCC75476" "DDDDD75476" ...
 $ WC      : int  1 0 0 1 0 0 0 1 0 1 ...
 $ DF1     : int  0 1 1 0 0 0 0 0 0 0 ...
 $ DF2     : int  0 0 0 0 1 1 1 0 1 0 ...
 $ WCB14   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ WCA13   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ HN      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ WCB13   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ WCA12   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ WCA14   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ WCB12   : int  0 0 0 0 0 0 0 0 0 0 ...

Below is the traceback()

lapply(X = split(e, grp), FUN = FUN, ...)
4: FUN(X[[1L]], ...)
3: lapply(x, function(e) {
   ans <- lapply(X = split(e, grp), FUN = FUN, ...)
   if (simplify && length(len <- unique(sapply(ans, length))) == 
       1L) {
       if (len == 1L) {
           cl <- lapply(ans, oldClass)
           cl1 <- cl[[1L]]
           ans <- unlist(ans, recursive = FALSE)
           if (!is.null(cl1) && all(sapply(cl, function(x) identical(x, 
               cl1)))) 
               class(ans) <- cl1
       }
       else if (len > 1L) 
           ans <- matrix(unlist(ans, recursive = FALSE), nrow = nry, 
               ncol = len, byrow = TRUE, dimnames = {
                 if (!is.null(nms <- names(ans[[1L]]))) 
                   list(NULL, nms)
                 else NULL
               })
   }
   ans
   })
2: aggregate.data.frame(train.agg.chkpt, by = list(ID_no = train.agg.chkpt$ID_no), 
   FUN = "sum")
1: aggregate(train.agg.chkpt, by = list(ID_no = train.agg.chkpt$ID_no), 
   FUN = "sum")

Can anyone help me understand the error message?

Thank you for your time

There is 1 answer

David Arenburg On BEST ANSWER
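
On the error message itself: the traceback in the question suggests that aggregate() applies FUN to every column of the data frame it is given, including the character column ID_no, and sum() does not accept character input. The following is a minimal sketch of that reading, not a diagnosis of the real data; chk is a hypothetical miniature of train.agg.chkpt made up for illustration.

```r
# Hypothetical miniature of train.agg.chkpt: a character ID plus 0/1 columns.
chk <- data.frame(ID_no = c("A1", "B2", "A1"),
                  WC    = c(1L, 0L, 0L),
                  DF1   = c(0L, 1L, 1L),
                  stringsAsFactors = FALSE)

# Passing the whole data frame means sum() is also called on ID_no (character),
# which fails with: invalid 'type' (character) of argument
res <- try(aggregate(chk, by = list(ID_no = chk$ID_no), FUN = "sum"),
           silent = TRUE)
failed <- inherits(res, "try-error")

# Aggregating only the non-ID columns avoids calling sum() on characters.
value_cols <- setdiff(names(chk), "ID_no")
agg <- aggregate(chk[value_cols], by = list(ID_no = chk$ID_no), FUN = sum)
```

Under this assumption, leaving the grouping column out of the data passed to aggregate() should be enough to make the original call run.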

Your desired output can be reached easily by applying table to each Id. Here's a possible data.table implementation (you are already using data.table):

library(data.table)
setDT(df)[, as.list(table(Event)), by = Id]
#    Id Df Wc
# 1:  A  1  1
# 2:  B  1  0
# 3:  C  1  0

Or alternatively, (as suggested) you could use a simple dcast

dcast(setDT(df), Id ~ Event, fun.aggregate = length, value.var = "Event")
#    Id Df Wc
# 1:  A  1  1
# 2:  B  1  0
# 3:  C  1  0

Or similarly

library(reshape2)
dcast(df, Id ~ Event, fun.aggregate = length, value.var = "Event")

Or using tidyr (See Note below)

library(tidyr)
df$indx <- 1
spread(df, Event, indx, fill = 0) 
#   Id Df Wc
# 1  A  1  1
# 2  B  1  0
# 3  C  1  0

Or using reshape from base R (See Note below)

reshape(df, idvar = "Id", timevar = "Event", direction = "wide", v.names = "indx")
#   Id indx.Wc indx.Df
# 1  A       1       1
# 2  B      NA       1
# 3  C      NA       1

  • Note: spread and reshape won't work here if the same Id has the same Event more than once, because they have no fun.aggregate argument and so cannot tell how to combine the duplicates.
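
To illustrate the duplicate case from the note: when an Id has the same Event more than once, the counting approaches return a count greater than 1 rather than a 0/1 flag. The following base R sketch shows one way to turn counts into flags; the extra A/Df row and the names counts and flags are assumptions added for illustration.

```r
# Example data from the question, plus one duplicated (A, Df) row.
df <- data.frame(Id    = c("A", "B", "C", "A", "A"),
                 Event = c("Wc", "Df", "Df", "Df", "Df"),
                 stringsAsFactors = FALSE)

# Cross-tabulate: one row per Id, one column per Event, cells hold counts.
counts <- as.data.frame.matrix(table(df$Id, df$Event))

# Collapse counts to 0/1 indicators: 1 if the Event occurred at least once.
flags <- as.data.frame((counts > 0) + 0L)
```

Here counts["A", "Df"] is 2 because of the duplicate, while flags["A", "Df"] is 1, which matches the 1-or-0 layout asked for in the question.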

Benchmarks

library(microbenchmark)
set.seed(123)
n <- 1e7
df <- data.frame(Id = sample(LETTERS, n, replace  = TRUE),
                 Event = sample(outer(LETTERS, letters, paste0), n, replace = TRUE))
dt <- copy(df)

DT1 <- function(x) setDT(x)[, as.list(table(Event)), by = Id]
DT2 <- function(x) dcast.data.table(setDT(x), Id ~ Event, fun = length, value.var = "Event")
RESHAPE2 <- function(x) dcast(x, Id ~ Event, fun = length, value.var = "Event")

microbenchmark(DT1(dt), DT2(dt), RESHAPE2(df))
# Unit: milliseconds
#         expr       min        lq      mean    median        uq       max neval
#      DT1(dt)  965.5181  987.8140 1017.8237 1007.1197 1030.7272 1285.9206   100
#      DT2(dt)  406.7124  420.6203  446.8026  434.2489  455.4364  592.4333   100
# RESHAPE2(df) 2969.0057 3035.5817 3190.6514 3099.3221 3240.4642 4384.6316   100