Why do I get wrong dates after the second year when use aggregate?

40 views Asked by At

I'm trying to group my data according to 8-day periods, I found a code that works pretty well for the first year, but then the second one doesn't work well. I'm stuck with this problem.

df$date<-as.Date(df$date)
z <-aggregate(df$variable, 
               list(Dts = cut(as.Date(df$date, format="%d/%m/%Y"), 
                              breaks="8 day")), FUN=sum)

I have different years and my data looks like this:

structure(list(date = structure(c(5583, 5584, 5585, 5586, 5587, 
5588, 5589, 5590, 5591, 5592, 5593, 5594, 5595, 5596, 5597, 5598, 
5599, 5600, 5601, 5602, 5603, 5604, 5605, 5606, 5607, 5608, 5609, 
5610, 5611, 5612, 5613, 5614, 5615, 5616, 5617, 5618, 5619, 5620, 
5621, 5622, 5623, 5624, 5625, 5626, 5627, 5628, 5629, 5630, 5631, 
5632, 5633, 5634, 5635, 5636, 5637, 5638, 5639, 5640, 5641, 5642, 
5643, 5644, 5645, 5646, 5948, 5949, 5950, 5951, 5952, 5953, 5954, 
5955, 5956, 5957, 5958, 5959, 5960, 5961, 5962, 5963, 5964, 5965, 
5966, 5967, 5968, 5969, 5970, 5971, 5972, 5973, 5974, 5975, 5976, 
5977, 5978, 5979, 5980, 5981, 5982, 5983, 5984, 5985, 5986, 5987, 
5988, 5989, 5990, 5991, 5992, 5993, 5994, 5995, 5996, 5997, 5998, 
5999, 6000, 6001, 6002, 6003, 6004, 6005, 6006, 6007, 6008, 6009, 
6010, 6011, 6313, 6314, 6315, 6316, 6317, 6318, 6319, 6320, 6321, 
6322, 6323, 6324, 6325, 6326, 6327, 6328, 6329, 6330, 6331, 6332, 
6333, 6334, 6335, 6336, 6337, 6338, 6339, 6340, 6341, 6342, 6343, 
6344, 6345, 6346, 6347, 6348, 6349, 6350, 6351, 6352, 6353, 6354, 
6355, 6356, 6357, 6358, 6359, 6360, 6361, 6362, 6363, 6364, 6365, 
6366, 6367, 6368, 6369, 6370, 6371, 6372, 6373, 6374, 6375, 6376
), class = "Date"), variable = c(2, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0)), .Names = c("date", "variable"), row.names = c(105L, 
106L, 107L, 108L, 109L, 110L, 111L, 112L, 113L, 114L, 115L, 116L, 
117L, 118L, 119L, 120L, 121L, 122L, 123L, 124L, 125L, 126L, 127L, 
128L, 129L, 130L, 131L, 132L, 133L, 134L, 135L, 136L, 137L, 138L, 
139L, 140L, 141L, 142L, 143L, 144L, 145L, 146L, 147L, 148L, 149L, 
150L, 151L, 152L, 153L, 154L, 155L, 156L, 157L, 158L, 159L, 160L, 
161L, 162L, 163L, 164L, 165L, 166L, 167L, 168L, 470L, 471L, 472L, 
473L, 474L, 475L, 476L, 477L, 478L, 479L, 480L, 481L, 482L, 483L, 
484L, 485L, 486L, 487L, 488L, 489L, 490L, 491L, 492L, 493L, 494L, 
495L, 496L, 497L, 498L, 499L, 500L, 501L, 502L, 503L, 504L, 505L, 
506L, 507L, 508L, 509L, 510L, 511L, 512L, 513L, 514L, 515L, 516L, 
517L, 518L, 519L, 520L, 521L, 522L, 523L, 524L, 525L, 526L, 527L, 
528L, 529L, 530L, 531L, 532L, 533L, 835L, 836L, 837L, 838L, 839L, 
840L, 841L, 842L, 843L, 844L, 845L, 846L, 847L, 848L, 849L, 850L, 
851L, 852L, 853L, 854L, 855L, 856L, 857L, 858L, 859L, 860L, 861L, 
862L, 863L, 864L, 865L, 866L, 867L, 868L, 869L, 870L, 871L, 872L, 
873L, 874L, 875L, 876L, 877L, 878L, 879L, 880L, 881L, 882L, 883L, 
884L, 885L, 886L, 887L, 888L, 889L, 890L, 891L, 892L, 893L, 894L, 
895L, 896L, 897L, 898L), class = "data.frame")

This the output I get, and line 9 should start with 1986-04-15 again:

          Dts  x
1  1985-04-15  2
2  1985-04-23  0
3  1985-05-01  0
4  1985-05-09  0
5  1985-05-17  0
6  1985-05-25  0
7  1985-06-02  0
8  1985-06-10  0
9  1986-04-10  0
10 1986-04-18  0
11 1986-04-26  0
12 1986-05-04 NA
13 1986-05-12 NA
14 1986-05-20 NA
15 1986-05-28 NA
16 1986-06-05 NA
17 1986-06-13 NA
18 1987-04-13  0
19 1987-04-21  0
20 1987-04-29  0
21 1987-05-07  0
22 1987-05-15  0
23 1987-05-23  0
24 1987-05-31  0
25 1987-06-08  0
26 1987-06-16  0

The data is there but I don't know what's wrong. Thanks a lot for your help.

1

There are 1 answers

1
eipi10 On BEST ANSWER

If you go in 8-day increments starting from April 15, 1985, then April 10, 1986 will be the beginning of one of the 8-day periods. Here's a way to use the same days of the year as the breaks for other years as were used for 1985:

To set the date breaks for the cut function, get the remainder when dividing the day of the year represented by 1985-04-15 by 8. Then use this to select all dates within the range of df$date with the same remainder. This will give the same days of the year for all years (except that we need an adjustment for leap years, due to the extra day). These are the date breaks we'll use for the cut function.

Note that this approach results in the last interval of each year being five days instead of eight days. Also, in leap years, the interval around February 29th is nine days long.

library(lubridate)
library(dplyr)

index = yday(df$date[1]) %% 8 
dates = seq(min(df$date), max(df$date), "1 day")
breaks = dates[yday(dates) %% 8 == index]

# Ensure that leap years have same breaks as other years
breaks = if_else(year(breaks) %in% seq(1980,2020,4) & yday(breaks) > yday(as.Date("1980-02-28")), 
                 breaks + 1, breaks)

aggregate(df$variable, list(Dts = cut(df$date, breaks=breaks)), FUN=sum)
          Dts  x
1  1985-04-15  2
2  1985-04-23  0
3  1985-05-01  0
4  1985-05-09  0
5  1985-05-17  0
6  1985-05-25  0
7  1985-06-02  0
8  1985-06-10  0
9  1986-04-15  0
10 1986-04-23  0
11 1986-05-01  0
12 1986-05-09 NA
13 1986-05-17 NA
14 1986-05-25 NA
15 1986-06-02 NA
16 1986-06-10 NA
17 1987-04-15  0
18 1987-04-23  0
19 1987-05-01  0
20 1987-05-09  0
21 1987-05-17  0
22 1987-05-25  0
23 1987-06-02  0