How to summarize date data by groups in R


I would like to summarize the following sample data into a new dataframe as follows:

Population, Sample Size (N), Percent Completed (%)

Sample Size is a count of all records for each population; I can get this with the table command or tapply. Percent Completed is the percentage of records that have an 'End Date' (records without an 'End Date' are assumed not to have completed). This is where I am lost!

Sample Data

 sample <- structure(list(Population = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 
    2L, 2L, 3L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 
    1L, 2L, 2L, 3L, 3L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 3L), .Label = c("Glommen", 
    "Kaseberga", "Steninge"), class = "factor"), Start_Date = structure(c(16032, 
    16032, 16032, 16032, 16032, 16036, 16036, 16036, 16037, 16038, 
    16038, 16039, 16039, 16039, 16039, 16039, 16039, 16041, 16041, 
    16041, 16041, 16041, 16041, 16044, 16044, 16045, 16045, 16045, 
    16045, 16048, 16048, 16048, 16048, 16048, 16048), class = "Date"), 
        End_Date = structure(c(NA, 16037, NA, NA, 16036, 16043, 16040, 
        16041, 16042, 16042, 16042, 16043, 16043, 16043, 16043, 16043, 
        16043, 16045, 16045, 16045, 16045, 16045, NA, 16048, 16048, 
        16049, 16049, NA, NA, 16052, 16052, 16052, 16052, 16052, 
        16052), class = "Date")), .Names = c("Population", "Start_Date", 
    "End_Date"), row.names = c(NA, 35L), class = "data.frame")

There are 2 answers

josliber (best answer)

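You can do this with split/apply/combine: split the data frame by Population, build a one-row summary for each piece, and bind the rows back together. Note that PctComplete below is a proportion between 0 and 1; multiply it by 100 if you want a percentage: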

spl = split(sample, sample$Population)
new.rows = lapply(spl, function(x) data.frame(Population=x$Population[1],
                                              SampleSize=nrow(x),
                                              PctComplete=sum(!is.na(x$End_Date))/nrow(x)))
combined = do.call(rbind, new.rows)
combined

#           Population SampleSize PctComplete
# Glommen      Glommen         13   0.6923077
# Kaseberga  Kaseberga          7   1.0000000
# Steninge    Steninge         15   0.8666667

One word of warning: sample is the name of a base function, so you should pick a different name for your data frame.
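
As a rough illustration of that warning, here is a minimal base R sketch that uses a differently named data frame together with tapply, which the question already mentions for the counts. The name pop_data and the toy values are made up for this example and are not the asker's data:

```r
# Hypothetical toy data (not the asker's data); named pop_data so that it
# does not mask base::sample().
pop_data <- data.frame(
  Population = factor(c("A", "A", "A", "B", "B")),
  End_Date   = as.Date(c("2013-12-01", NA, "2013-12-03",
                         "2013-12-04", "2013-12-05"))
)

# Number of records per population.
n_per_pop <- tapply(pop_data$End_Date, pop_data$Population, length)

# Share of records with a non-missing End_Date, expressed as a percentage.
# mean() of a logical vector is the proportion of TRUE values.
pct_per_pop <- tapply(!is.na(pop_data$End_Date), pop_data$Population, mean) * 100

result <- data.frame(
  Population        = names(n_per_pop),
  Sample_Size       = as.vector(n_per_pop),
  Percent_Completed = as.vector(pct_per_pop)
)
result
```

The same two tapply calls should work unchanged on the original data once it is stored under a name other than sample.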

Sven Hohenstein

It's easy with the plyr package:

library(plyr)
ddply(sample, .(Population), summarize, 
      Sample_Size = length(End_Date),
      Percent_Completed = mean(!is.na(End_Date)) * 100)

#   Population Sample_Size Percent_Completed
# 1    Glommen          13          69.23077
# 2  Kaseberga           7         100.00000
# 3   Steninge          15          86.66667
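
plyr has since been retired in favour of dplyr, so readers on a current setup may prefer an equivalent there. This is a sketch of the same summary with dplyr rather than part of the original answer; it assumes dplyr is installed, and pop_data with its toy values is made up for illustration:

```r
library(dplyr)

# Hypothetical toy data (not the asker's data).
pop_data <- data.frame(
  Population = c("A", "A", "A", "B", "B"),
  End_Date   = as.Date(c("2013-12-01", NA, "2013-12-03",
                         "2013-12-04", "2013-12-05"))
)

summary_tbl <- pop_data %>%
  group_by(Population) %>%
  summarise(
    Sample_Size       = n(),                          # records per group
    Percent_Completed = mean(!is.na(End_Date)) * 100  # % with an End_Date
  )
summary_tbl
```

As in the ddply call, mean(!is.na(End_Date)) gives the fraction of non-missing end dates in each group, and multiplying by 100 turns it into a percentage.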