Calculating mean for every second value in a dataframe

I would like to average each consecutive pair of cell values, working down a column of the data frame. To be more precise, see the following extract:

    X         Y             Z
1   FRI 200101010000    -6.72
2   FRI 200101010030    -6.30
3   FRI 200101010100    -6.26
4   FRI 200101010130    -5.82
5   FRI 200101010200    -5.64
6   FRI 200101010230    -5.29
7   FRI 200101010300    -5.82
8   FRI 200101010330    -5.83
9   FRI 200101010400    -5.83
10  FRI 200101010430    -6.04
11  FRI 200101010500    -5.80
12  FRI 200101010530    -6.09

I would like to calculate the mean of Z for each pair of Y values ending in 00 and 30, that is, the mean of rows 1 and 2, rows 3 and 4, rows 5 and 6, and so on. This is what I expect:

    X         Y             Z
1   FRI 200101010100    -6.51
2   FRI 200101010200    -6.04
3   FRI 200101010300    -5.47
...

Explanation: Y is a timestamp in the format YYYYMMDDhhmm, and I would like to average the 30-minute measurements into hourly values.

There are 4 answers

David Arenburg (BEST ANSWER)

Here's a possible data.table solution:

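library(data.table)
# Start a new group at every full hour (Y ending in "00"),
# then keep the first Y and the mean of Z within each group
setDT(df)[, .(Y = Y[1L], Z = mean(Z)),
          by = .(X, indx = cumsum(substr(Y, 11, 12) == '00'))]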
#      X indx            Y      Z
# 1: FRI    1 200101010000 -6.510
# 2: FRI    2 200101010100 -6.040
# 3: FRI    3 200101010200 -5.465
# 4: FRI    4 200101010300 -5.825
# 5: FRI    5 200101010400 -5.935
# 6: FRI    6 200101010500 -5.945

Or, per @akrun's comment, using aggregate from base R (though the output will probably need some additional tweaking):

aggregate(Z ~ X + indx, transform(df, indx = cumsum(substr(Y, 11, 12) == '00')), mean)
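One possible way to do that tweaking is sketched below. It assumes `df` holds the question's 12 rows for a single station, with Y stored as character (the data frame is rebuilt here so the example runs on its own). The first Y of each group is attached to the aggregate result with `match`, and the columns are then put back in the question's order.

```r
# Rebuild the question's data (assumption: Y is stored as character)
df <- data.frame(
  X = "FRI",
  Y = c("200101010000", "200101010030", "200101010100", "200101010130",
        "200101010200", "200101010230", "200101010300", "200101010330",
        "200101010400", "200101010430", "200101010500", "200101010530"),
  Z = c(-6.72, -6.30, -6.26, -5.82, -5.64, -5.29,
        -5.82, -5.83, -5.83, -6.04, -5.80, -6.09),
  stringsAsFactors = FALSE
)

# Group counter that increases at every full hour (Y ending in "00")
df$indx <- cumsum(substr(df$Y, 11, 12) == "00")

# Mean of Z per station and group
res <- aggregate(Z ~ X + indx, data = df, FUN = mean)

# match() returns the first row of each group, i.e. the ":00" timestamp
res$Y <- df$Y[match(res$indx, df$indx)]

# Keep only the columns used in the question, in the same order
res <- res[, c("X", "Y", "Z")]
res
```

Like the data.table version, this labels each hour by its first timestamp rather than by the end of the hour as in the question's expected output.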
grrgrrbla

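A base R solution: I first split the vector into consecutive pairs and calculate the mean of each one. This of course assumes that the rows always come in the order you specified (each :00 row followed by its :30 row). Finally, I combine the means with the X and Y values from the first row of each pair to give your result: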

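# Pair index 1, 1, 2, 2, 3, 3, ... then the mean of Z within each pair
Z <- unlist(lapply(split(df$Z, ceiling(seq_along(df$Z) / 2)), mean))
# Take X and Y from the first row of each pair and attach the means
new_df <- cbind(df[seq(1, nrow(df), 2), c("X", "Y")], Z)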

Output:

     X            Y      Z
1  FRI 200101010000 -6.510
3  FRI 200101010100 -6.040
5  FRI 200101010200 -5.465
7  FRI 200101010300 -5.825
9  FRI 200101010400 -5.935
11 FRI 200101010500 -5.945
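Because this approach relies on row order, it may be worth checking the assumption before pairing rows. The following is a minimal sketch, assuming Y is available as character strings in YYYYMMDDhhmm format; the variable names (`pairs_ok` and so on) are only illustrative.

```r
# Timestamps from the question (assumption: stored as character)
Y <- c("200101010000", "200101010030", "200101010100", "200101010130",
       "200101010200", "200101010230", "200101010300", "200101010330",
       "200101010400", "200101010430", "200101010500", "200101010530")

hour   <- substr(Y, 1, 10)   # YYYYMMDDhh
minute <- substr(Y, 11, 12)  # mm

odd  <- seq(1, length(Y), by = 2)  # first row of each pair
even <- odd + 1                    # second row of each pair

# TRUE only if rows come in complete (:00, :30) pairs within the same hour
pairs_ok <- length(Y) %% 2 == 0 &&
  all(minute[odd] == "00") &&
  all(minute[even] == "30") &&
  all(hour[odd] == hour[even])
pairs_ok
```

If the check fails (for example because a measurement is missing), grouping by the hour part of Y is likely the safer choice.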
JohannesNE

A dplyr version:

library(dplyr)

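# Work with Y as a character string
df$Y <- as.character(df$Y)

# Group by the first 10 characters of Y (YYYYMMDDhh) and average Z
means <- df %>%
  group_by(hour = substr(Y, start = 1, stop = 10)) %>%
  summarise(Z = mean(Z))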

> means
Source: local data frame [6 x 2]

        hour      Z
1 2001010100 -6.510
2 2001010101 -6.040
3 2001010102 -5.465
4 2001010103 -5.825
5 2001010104 -5.935
6 2001010105 -5.945

This groups the data by Y with its last two digits (the minutes) removed, i.e. by hour.

RubenLaguna

Although this doesn't address the OP's exact question, in general you can aggregate by second, if you have a POSIXct column, like this:

library(lubridate)
library(tidyverse)
s <- seq(from=Sys.time(), length.out = 100, by=0.4)
df  = data.frame(time=s,v=rnorm(length(s)))
df %>% 
  group_by(time=floor_date(time, "1 second")) %>% 
  summarize(v=mean(v)) # any other interval works too, e.g. "5 minutes"
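To connect this with the question's data, here is a minimal base R sketch (so it runs without extra packages). It assumes Y is a character timestamp in YYYYMMDDhhmm format and that UTC is an acceptable time zone. Y is parsed into POSIXct, each time is truncated to the start of its hour, and Z is averaged per hour. With lubridate loaded, `floor_date(time, "1 hour")` should be able to replace the `trunc()` step.

```r
# Rebuild the question's data (assumption: Y is stored as character)
df <- data.frame(
  X = "FRI",
  Y = c("200101010000", "200101010030", "200101010100", "200101010130",
        "200101010200", "200101010230", "200101010300", "200101010330",
        "200101010400", "200101010430", "200101010500", "200101010530"),
  Z = c(-6.72, -6.30, -6.26, -5.82, -5.64, -5.29,
        -5.82, -5.83, -5.83, -6.04, -5.80, -6.09),
  stringsAsFactors = FALSE
)

# Parse the YYYYMMDDhhmm strings (the time zone is an assumption)
df$time <- as.POSIXct(df$Y, format = "%Y%m%d%H%M", tz = "UTC")

# Truncate to the start of the hour; trunc() returns POSIXlt, so convert back
df$hour <- as.POSIXct(trunc(df$time, units = "hours"))

# Mean of Z per station and hour
hourly <- aggregate(Z ~ X + hour, data = df, FUN = mean)
hourly
```

Working with real date-times rather than substrings means the same pattern also handles other intervals or irregular timestamps.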