dplyr code "df %>% group_by(date = cut(date, breaks = "1 hour"))" no longer produces the desired result?


I have been using the following dplyr code to generate hourly averages from 1-minute time-series data. The code has been working for months, but has recently been producing some problematic results. Has something changed with any of the following functions: group_by(), cut(), or summarise()?

df <- structure(list(date = structure(c(1505187300, 1505187360, 1505187420, 1505187480, 1505187540, 1505187600, 1505187660, 1505187720, 1505201580, 1505201640), class = c("POSIXct", "POSIXt"), tzone = "UTC"), co = c(0.149,0.149,0.149, 0.106, 0.149, 0.149, 0.192, 0.149, 0.149, 0.149), co2 = c(544L, 545L, 544L, 543L, 546L, 546L, 548L, 547L, 549L, 554L), VOC = c(22.55, 22.55, 22.8198, 23.2602, 22.9501, 23.2154, 23.4262, 23.0231, 23.0525, 22.7911), RH = c(77.02, 76.9, 77.2, 76.6, 76.99, 76.83, 77.13, 77.81, 77.48, 77.1), ugm3 = c(12.862, 13.408, 14.188, 12.342, 13.278, 12.81, 10.834, 13.018, 12.992, 12.498), temp = c(62.06, 62.02, 62.02, 61.98, 61.94, 61.9, 61.86, 61.78, 61.8, 61.8)), .Names = c("date", "co", "co2", "VOC", "RH", "ugm3", "temp"), row.names = c(NA, 10L), class = "data.frame")

new_df <- df %>% 
    group_by(date = cut(date, breaks = "1 hour")) %>%
    summarize(co = mean(co), co2 = mean(co2), VOC = mean(VOC), RH = mean(RH), ugm3 = mean(ugm3), temp = mean(temp))

new_df

Expected output:

expected_output <- structure(list(date = structure(c(1L, 5L), .Label = c("2017-09-12 03:00:00", "2017-09-12 04:00:00", "2017-09-12 05:00:00", "2017-09-12 06:00:00", "2017-09-12 07:00:00"), class = "factor"), co = c(0.149, 0.149), co2 = c(545.375, 551.5), VOC = c(22.97435, 22.9218), RH = c(77.06, 77.29), ugm3 = c(12.8425, 12.745), temp = c(61.945, 61.8)), class = c("tbl_df", "tbl", "data.frame"), .Names = c("date", "co", "co2", "VOC", "RH", "ugm3", "temp"), row.names = c(NA, -2L))

Actual output:

actual_output <- structure(list(co = 0.149, co2 = 546.6, VOC = 22.96384, RH = 77.106, ugm3 = 12.823, temp = 61.916), .Names = c("co", "co2", "VOC", "RH", "ugm3", "temp"), class = "data.frame", row.names = c(NA, -1L))

Prior to this week, this code would have generated a new df with two observations, one for the 03:00:00 hour, and one for the 07:00:00 hour. While the group_by() function appears to be assigning the new hourly timestamps correctly, the summarize() function is not behaving properly. Any insight is appreciated. Thanks!

If there are more robust alternatives to aggregating time-series data into specific intervals, I'm all ears!

Best answer, by moodymudskipper:

You loaded plyr after dplyr.

library(dplyr)
# ...
library(plyr)
# ------------------------------------------------------------------------------
# 
# Attaching package: ‘plyr’
# 
# The following objects are masked from ‘package:dplyr’:
# 
#     arrange, count, desc, failwith, id, mutate, rename, summarise, summarize

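We should always read those messages :). Once plyr is attached after dplyr, a bare summarize() call resolves to plyr::summarize(), which ignores the grouping set by group_by() and collapses the whole data frame into a single row. Now let's see what happens: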

df %>% 
   group_by(date = cut(date, breaks = "1 hour")) %>%
   summarize(co = mean(co), co2 = mean(co2), VOC = mean(VOC), RH = mean(RH), ugm3 = mean(ugm3), temp = mean(temp))
#     co   co2      VOC     RH   ugm3   temp
# 1 0.149 546.6 22.96384 77.106 12.823 61.916
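
To confirm which package a bare summarize() call resolves to, a quick check along these lines may help. It is only a sketch, assuming both dplyr and plyr are installed and attached in the problematic order; the variable names `winner` and `providers` are made up for illustration.

```r
# Assumption: both packages are installed; attach them in the problematic order.
library(dplyr)
library(plyr)

# Package whose summarize() a bare call will use
winner <- environmentName(environment(summarize))
print(winner)

# Every attached package that provides summarize(), in search-path order
providers <- find("summarize")
print(providers)
```

The first entry returned by find() is the one R picks, so if plyr is listed ahead of dplyr, the grouped summary will be collapsed as shown above.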

If you load dplyr after plyr, or call dplyr::summarize() explicitly, you get the expected behavior:

df %>% 
   group_by(date = cut(date, breaks = "1 hour")) %>%
   dplyr::summarize(co = mean(co), co2 = mean(co2), VOC = mean(VOC), RH = mean(RH), ugm3 = mean(ugm3), temp = mean(temp))
# # A tibble: 2 x 7
#                  date    co     co2      VOC    RH    ugm3   temp
#                <fctr> <dbl>   <dbl>    <dbl> <dbl>   <dbl>  <dbl>
# 1 2017-09-12 03:00:00 0.149 545.375 22.97435 77.06 12.8425 61.945
# 2 2017-09-12 07:00:00 0.149 551.500 22.92180 77.29 12.7450 61.800
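
On the question's request for alternatives: one possible variation, not part of the original answer, is to truncate the timestamps to the hour with base R's trunc() instead of cut(), so the grouping column stays a POSIXct rather than becoming a factor. The sketch below assumes dplyr is installed and uses a small stand-in data frame with four rows and one column taken from the question's data; the names `hour` and `hourly` are illustrative.

```r
library(dplyr)

# Small stand-in for the question's data (first two and last two rows, co2 only)
df <- data.frame(
  date = as.POSIXct(c(1505187300, 1505187360, 1505201580, 1505201640),
                    origin = "1970-01-01", tz = "UTC"),
  co2  = c(544, 545, 549, 554)
)

# Truncate each timestamp to the start of its hour, then average per hour.
# The dplyr:: prefixes keep the call safe even if plyr is attached later.
hourly <- df %>%
  dplyr::group_by(hour = as.POSIXct(trunc(date, units = "hours"))) %>%
  dplyr::summarise(co2 = mean(co2))

print(hourly)
```

Because `hour` remains a date-time, it can be plotted or joined against other time series without converting factor labels back to timestamps.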