I've got a tsibble where timestamped observational data has been aggregated to 30-minute intervals. The data is in several groups, and I'd like to make sure that each 30-minute interval appears in the tsibble for every group, even when there were no observations in that time period.
Let's return to the birdwatching example from my previous question about tsibbles. Suppose I'm watching ducks and geese at a certain location from 8:00 to 18:00 each day and recording, for each observation, a) the time, b) the type of bird observed, and c) the number of birds in the flock observed.
library(tidyverse) # includes lubridate
library(tsibble)
N <- 10
set.seed(42)
# suppose we're observing ducks and geese between 8:00 and 18:00.
d <- as_datetime("2023-03-08 08:00:00")
times <- d + seconds(unique(round(sort(runif(N, min = 0, max = 36e3)))))
nObs <- 1 + rpois(length(times), lambda = 1)
birdIdx <- 1 + round(runif(length(times)))
birds <- c("Duck", "Goose")[birdIdx]
# Tibble of observations
waterfowl <- tibble(Timestamp = times, Count = nObs, Bird = as_factor(birds))
# Convert to tsibble (time series tibble) and aggregate on a 30-minute basis
waterfowl |>
  as_tsibble(index = Timestamp) |>
  group_by(Bird) |>
  index_by(Interval = floor_date(Timestamp, "30 minute")) |>
  summarize(`Total birds` = sum(Count)) -> waterfowlSumm
waterfowlSumm |> print(n = Inf)
This gives
# A tsibble: 10 x 3 [30m] <UTC>
# Key:       Bird [2]
   Bird  Interval            `Total birds`
   <fct> <dttm>                      <dbl>
 1 Goose 2023-03-08 09:00:00             2
 2 Goose 2023-03-08 13:00:00             4
 3 Goose 2023-03-08 14:00:00             1
 4 Goose 2023-03-08 15:00:00             4
 5 Goose 2023-03-08 16:00:00             1
 6 Goose 2023-03-08 17:00:00             2
 7 Duck  2023-03-08 10:30:00             2
 8 Duck  2023-03-08 14:30:00             2
 9 Duck  2023-03-08 15:00:00             4
10 Duck  2023-03-08 17:00:00             2
What I'd like to do is fill missing intervals. I can use fill_gaps for this:
> waterfowlSumm |> fill_gaps(`Total birds` = 0) |> print(n = Inf)
# A tsibble: 31 x 3 [30m] <UTC>
# Key:       Bird [2]
   Bird  Interval            `Total birds`
   <fct> <dttm>                      <dbl>
 1 Goose 2023-03-08 09:00:00             2
 2 Goose 2023-03-08 09:30:00             0
 3 Goose 2023-03-08 10:00:00             0
...
15 Goose 2023-03-08 16:00:00             1
16 Goose 2023-03-08 16:30:00             0
17 Goose 2023-03-08 17:00:00             2
18 Duck  2023-03-08 10:30:00             2
19 Duck  2023-03-08 11:00:00             0
20 Duck  2023-03-08 11:30:00             0
...
29 Duck  2023-03-08 16:00:00             0
30 Duck  2023-03-08 16:30:00             0
31 Duck  2023-03-08 17:00:00             2
However, since I start watching birds at 8:00 and stop at 18:00, I'd like to fill in missing intervals beyond the times where I actually observed birds. So I might do
> waterfowlSumm |> fill_gaps(`Total birds` = 0, .start = d, .end = d + hours(9) + minutes(30)) |> print(n = Inf)
# A tsibble: 40 x 3 [30m] <UTC>
# Key:       Bird [2]
   Bird  Interval            `Total birds`
   <fct> <dttm>                      <dbl>
 1 Goose 2023-03-08 08:00:00             0
 2 Goose 2023-03-08 08:30:00             0
 3 Goose 2023-03-08 09:00:00             2
...
18 Goose 2023-03-08 16:30:00             0
19 Goose 2023-03-08 17:00:00             2
20 Goose 2023-03-08 17:30:00             0
21 Duck  2023-03-08 08:00:00             0
22 Duck  2023-03-08 08:30:00             0
23 Duck  2023-03-08 09:00:00             0
...
38 Duck  2023-03-08 16:30:00             0
39 Duck  2023-03-08 17:00:00             2
40 Duck  2023-03-08 17:30:00             0
This works. However, now suppose that my data has additional grouping variables; say, I'm observing birds at several sites. Of course, since I can't be in two places at the same time, each site has a different observer. And different observers have different working hours, so .start and .end must be set on a per-group basis.
The start/end times are available in my data, but .start and .end apparently can't be pulled from the tsibble being operated on:
> waterfowlSumm |> mutate(Start = d, End = d + hours(9) + minutes(30)) |> fill_gaps(`Total birds` = 0, .start = Start, .end = End)
Error in scan_gaps.tbl_ts(.data, .full = !!enquo(.full), .start = .start, :
object 'Start' not found
So my question is: how do I do this? I'd really like to be able to use grouping (in this example I only have one group to begin with, but in reality there are many) so I only have to invoke fill_gaps once, with the correct start/end being pulled from the tsibble.
Thanks!
The fill_gaps() function converts the implicit missing values into explicit missing values, based on either the local (per series) or global (per dataset) start and end dates and the index class.
Using fill_gaps() without specifying the .start and .end dates will compute the time range for each series, and fill in any missing time points based on the data's time interval. This should work for your problem of different counting ranges for sites and birds.
However, if you are working with multiple days, the fill_gaps() function will also add in the overnight hours between working days (as the interval is 30 minutes, and data is missing overnight). So you might want to instead fill implicit missing values with NA, and then maintain a working hours dataset that can be joined onto your observations data and used to convert NA to 0 if someone was working. For example:
Created on 2023-03-10 with reprex v2.0.2
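The working-hours approach described above might look roughly like the following. This is only a minimal sketch under stated assumptions: the `Site` key, the `workHours` table (one row per site with `Start` and `End`), and all times and counts are hypothetical and made up for illustration. It reuses the `.start`/`.end` arguments from the question to pad every series to the widest working window, fills with NA, and then turns NA into 0 only where the site's observer was on duty.

```r
library(tidyverse) # includes lubridate
library(tsibble)

# Hypothetical 30-minute summaries for two sites (illustrative values only)
obs <- tibble(
  Site = c("A", "A", "A", "B", "B"),
  Bird = c("Duck", "Duck", "Goose", "Duck", "Duck"),
  Interval = as_datetime(c(
    "2023-03-08 09:00:00", "2023-03-08 09:30:00", "2023-03-08 10:30:00",
    "2023-03-08 12:00:00", "2023-03-08 13:30:00"
  )),
  `Total birds` = c(2, 1, 1, 3, 4)
) |>
  as_tsibble(key = c(Site, Bird), index = Interval)

# Hypothetical working hours per site; End is exclusive
workHours <- tibble(
  Site = c("A", "B"),
  Start = as_datetime(c("2023-03-08 08:00:00", "2023-03-08 11:00:00")),
  End = as_datetime(c("2023-03-08 12:00:00", "2023-03-08 15:00:00"))
)

filled <- obs |>
  # Pad every series to the widest working window; new rows get NA
  fill_gaps(
    .start = min(workHours$Start),
    .end = max(workHours$End) - minutes(30)
  ) |>
  # Attach each site's working hours
  left_join(workHours, by = "Site") |>
  # NA becomes 0 only inside that site's working hours
  mutate(`Total birds` = if_else(
    is.na(`Total birds`) & Interval >= Start & Interval < End,
    0, `Total birds`
  )) |>
  select(-Start, -End)

filled |> print(n = Inf)
```

Rows that remain NA mark intervals when nobody was watching at that site, so they can be dropped or kept depending on the analysis. If working hours differ by day, the `workHours` table would need a date column and the join would use both site and date. Only the site and bird combinations already present in the data are filled.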