Expand one column in a tibble and have the descriptive columns follow


I want to use zoo::na.approx (though I'm not tied to this function) to fill in a response variable for the missing days in my dataframe. I'm having trouble working out how to add rows with NA to the original dataframe so that na.approx can fill them in.

My dataframe looks something like this:

df <- data.frame(trt = c("A", "A", "A", "A", "B", "B", "B", "B"),
                 day = c(1, 3, 7, 9, 1, 5, 8, 9),
                 value = c(7, 12, 5, 7, 5, 6, 11, 8),
                 stringsAsFactors = FALSE)

I want every day to be in the dataframe for each treatment, with NA in value for each day where I don't have data.

I have used something like this to expand my dataset:

library(dplyr)

days_possible <- expand.grid(
  day = seq(from=min(df$day), max(df$day), by=1),
  trt = c("A", "B"), 
  stringsAsFactors = FALSE
   )

new_df<- df %>%
   right_join(days_possible, by = c("trt", "day"))

My problem is that my real data has many sites, years, and a few treatment columns, and I can't build the days_possible dataframe correctly for all of those combinations.

Is there a function to avoid this mess, expand one column, and have all the other columns expand in a tidy fashion? I'm looking at modelr::data_grid, but by itself I am not sure how to get the final desired result - an ordered dataframe I can group by treatments and use approximation to fill in missing days.


There are 2 answers

www On BEST ANSWER

We can use the complete and full_seq functions from the tidyr package. The final as.data.frame() is not required. I just added it to print the output as a data frame.

library(tidyr)

df2 <- df %>% 
  complete(trt, day = full_seq(day, period = 1)) %>%
  as.data.frame()

df2
#    trt day value
# 1    A   1     7
# 2    A   2    NA
# 3    A   3    12
# 4    A   4    NA
# 5    A   5    NA
# 6    A   6    NA
# 7    A   7     5
# 8    A   8    NA
# 9    A   9     7
# 10   B   1     5
# 11   B   2    NA
# 12   B   3    NA
# 13   B   4    NA
# 14   B   5     6
# 15   B   6    NA
# 16   B   7    NA
# 17   B   8    11
# 18   B   9     8
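
To reach the interpolated result the question asks for, the completed data can then be grouped and passed to zoo::na.approx. The following is only a sketch under some assumptions: the second half uses a hypothetical df_sites with an invented site column to illustrate the "many sites and treatments" case, and tidyr::nesting() is one way (not necessarily the only one) to keep complete() from creating site/treatment pairs that never occur in the data.

```r
library(dplyr)
library(tidyr)
library(zoo)

df <- data.frame(trt = c("A", "A", "A", "A", "B", "B", "B", "B"),
                 day = c(1, 3, 7, 9, 1, 5, 8, 9),
                 value = c(7, 12, 5, 7, 5, 6, 11, 8),
                 stringsAsFactors = FALSE)

# Add the missing days, then interpolate within each treatment.
# na.rm = FALSE keeps the vector length unchanged even if a group
# starts or ends with NA.
filled <- df %>%
  complete(trt, day = full_seq(day, period = 1)) %>%
  group_by(trt) %>%
  mutate(value = na.approx(value, x = day, na.rm = FALSE)) %>%
  ungroup()

# Hypothetical data with an extra `site` column (not in the question).
df_sites <- data.frame(site = c("s1", "s1", "s1", "s2", "s2", "s2"),
                       trt = c("A", "A", "A", "B", "B", "B"),
                       day = c(1, 3, 5, 1, 2, 5),
                       value = c(2, 4, 8, 10, 20, 50),
                       stringsAsFactors = FALSE)

# nesting() limits the expansion to site/trt pairs present in the data.
filled_sites <- df_sites %>%
  complete(nesting(site, trt), day = full_seq(day, period = 1)) %>%
  group_by(site, trt) %>%
  mutate(value = na.approx(value, x = day, na.rm = FALSE)) %>%
  ungroup()
```

Note that full_seq(day, period = 1) spans the overall range of day, so every group is expanded to the same set of days.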
G. Grothendieck On

We have added a value2 column to df in the Note at the end to show that this works with additional columns.

Note that df is not a time series, and na.approx is intended to operate on time series. To convert it to one, read it into a zoo object with one column per variable and treatment, then merge it with a zero-width zoo series covering the full range of days so that the missing days appear as NA rows. We can then apply na.approx directly.

library(magrittr)
library(zoo)

wide <- df %>% 
   read.zoo(index = "day", split = "trt") %>%
   merge(zoo(, start(.):end(.) + 0)) %>%
   na.approx

giving:

> wide
  value.A value2.A   value.B value2.B
1    7.00     1.00  5.000000 5.000000
2    9.50     1.50  5.250000 5.250000
3   12.00     2.00  5.500000 5.500000
4   10.25     2.25  5.750000 5.750000
5    8.50     2.50  6.000000 6.000000
6    6.75     2.75  7.666667 6.333333
7    5.00     3.00  9.333333 6.666667
8    6.00       NA 11.000000 7.000000
9    7.00       NA  8.000000 8.000000 

The NAs above are due to the fact that one cannot interpolate without values on both sides; however, na.approx does have additional arguments to fill those in if you want to.
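
As one illustration of those additional arguments: na.approx passes extra arguments on to stats::approx, and rule = 2 tells approx to use the nearest observed value outside the observed range. A minimal sketch, assuming the df with the value2 column from the Note, and written without the pipe for clarity:

```r
library(zoo)

df <- data.frame(trt = c("A", "A", "A", "A", "B", "B", "B", "B"),
                 day = c(1, 3, 7, 9, 1, 5, 8, 9),
                 value = c(7, 12, 5, 7, 5, 6, 11, 8),
                 stringsAsFactors = FALSE)
df$value2 <- c(1:3, NA, 5:8)

# One zoo column per variable/treatment, indexed by day.
z <- read.zoo(df, index.column = "day", split = "trt")

# Merge with a zero-width series so that every day gets a row.
full <- merge(z, zoo(, seq(start(z), end(z), by = 1)))

# rule = 2 carries the last observed value into the trailing gap
# instead of leaving NA there.
filled <- na.approx(full, rule = 2)
```

With rule = 2 the trailing value2.A entries for days 8 and 9 repeat the last observed value rather than being extrapolated along a line, which may or may not be what you want.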

The wide form shown above, with a separate column for each variable/group combination, may be the most convenient. If not, we can convert it back to long form using fortify.zoo and then spread the variables out again so that each gets its own column.

library(tidyr)

# wide is supplied by the pipe, so it is not passed again as an argument
wide %>%
     fortify.zoo(melt = TRUE, sep = ".", 
         names = list("day", c("variable", "group"), "value")) %>%
     spread(variable, value)

giving:

   day group     value   value2
1    1     A  7.000000 1.000000
2    1     B  5.000000 5.000000
3    2     A  9.500000 1.500000
4    2     B  5.250000 5.250000
5    3     A 12.000000 2.000000
6    3     B  5.500000 5.500000
7    4     A 10.250000 2.250000
8    4     B  5.750000 5.750000
9    5     A  8.500000 2.500000
10   5     B  6.000000 6.000000
11   6     A  6.750000 2.750000
12   6     B  7.666667 6.333333
13   7     A  5.000000 3.000000
14   7     B  9.333333 6.666667
15   8     A  6.000000       NA
16   8     B 11.000000 7.000000
17   9     A  7.000000       NA
18   9     B  8.000000 8.000000

Note

The input in reproducible form. We have added a value2 column to show it still works.

df<-data.frame(trt=c("A", "A", "A", "A", "B", "B", "B", "B"),
                day = c(1,3,7,9,1,5,8,9),
                value = c(7,12,5,7,5,6,11,8),
                stringsAsFactors=FALSE)
df$value2 <- c(1:3, NA, 5:8)