R - ddply summarise using nlevels() does not work

When using the plyr package to summarise my data, it seems impossible to use the nlevels() function.

The structure of my data set is as follows:

> aer <- read.xlsx("XXXX.xlsx", sheetIndex=1)
> aer$ID <- as.factor(aer$ID)
> aer$description <- as.factor(aer$description)

  ID SOC      start        end days count severity relation
1  1 410 2015-04-21 2015-04-28    7     1        1        3
2  1 500 2015-01-30 2015-05-04   94     1        1        3
3  1 600 2014-11-25 2014-11-29    4     1        1        3
4  1 600 2015-01-02 2015-01-07    5     1        1        3
5  1 600 2015-01-26 2015-03-02   35     1        1        3
6  1 600 2015-04-14 2015-04-17    3     1        1        3

> dput(head(aer,4))
structure(list(ID = structure(c(1L, 1L, 1L, 1L), .Label = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "12", "13", "14", 
"15"), class = "factor"), SOC = c(410, 500, 600, 600),  
start = structure(c(16546, 16465, 16399, 16437), class = "Date"), 
end = structure(c(16553, 16559, 16403, 16442), class = "Date"), 
days = c(7, 94, 4, 5), count = c(1, 1, 1, 1), severity = c(1, 
1, 1, 1), relation = c(3, 3, 3, 3)), .Names = c("ID", "SOC", 
"start", "end", "days", "count", "severity", "relation"
), row.names = c(NA, 4L), class = "data.frame")

What I would like to know is how many distinct "ID" values occur in each subset that is created when the data set is split by the variable "SOC". I want to summarise this information, together with some other variables, in a new data set. Therefore, I would like to use the plyr package like so:

summaer2 <- ddply(aer, c("SOC"), summarise,
    participants    = nlevels(ID), 
    events          = sum(count),
    min_duration    = min(days), 
    max_duration    = max(days),
    max_severity    = max(severity))

This returns the following error:

Error in Summary.factor(c(4L, 5L, 11L, 11L, 14L, 14L), na.rm = FALSE) : 
‘max’ not meaningful for factors

Could someone advise me on how to reach my goal, or tell me what I'm doing wrong?

Many thanks in advance!


There is 1 answer

RmyjuloR On


Replacing nlevels(ID) with length(unique(ID)) seems to give me the desired output:

> head(summaer2)
   SOC participants events min_duration max_duration max_severity
1  100            4      7            1           62            2
2  410            9     16            1           41            2
3  431            2      2          109          132            1
4  500            5      9           23          125            2
5  600            8     19            1           35            1
6 1040            1      1           98           98            2
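
For completeness, here is a minimal sketch of the full call with that replacement. The `toy` data frame and its values are made up for illustration (they are not the data from the question); only the column names follow the question. One reason nlevels(ID) cannot give a per-group count is that a subset of a factor keeps all of the original levels, so it reports the same number for every SOC, whereas length(unique(ID)) counts only the values actually present in the subset:

```r
library(plyr)

# Toy data: hypothetical values, only the column names follow the question
toy <- data.frame(
  ID       = factor(c("1", "1", "2", "3", "3", "3")),
  SOC      = c(410, 410, 410, 600, 600, 600),
  days     = c(7, 94, 4, 5, 35, 3),
  count    = c(1, 1, 1, 1, 1, 1),
  severity = c(1, 2, 1, 1, 1, 1)
)

# Subsetting a factor keeps every level of the original factor
ids_600 <- toy$ID[toy$SOC == 600]
nlevels(ids_600)          # 3, although only ID "3" occurs in this subset
length(unique(ids_600))   # 1

# Same summary as in the question, with participants counted per subset
summary_toy <- ddply(toy, "SOC", summarise,
    participants    = length(unique(ID)),
    events          = sum(count),
    min_duration    = min(days),
    max_duration    = max(days),
    max_severity    = max(severity))

print(summary_toy)
```

nlevels(droplevels(ID)) should give the same count, if the factor-based form is preferred. Separately, the "'max' not meaningful for factors" message is what R raises whenever max() is called on a factor, so if it still shows up, str(aer) will show which of the summarised columns is stored as a factor.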