Linear interpolation of time series data across columns


I would like to use linear interpolation to replace NA values in a data frame. The columns represent a daily time series, so the data frame is in wide format: days are in columns and values of a variable (temperature) are in rows. The data frame looks something like the one below (greatly simplified):

Tempdf <- data.frame ("Day1"= c(20, 22, 19, 28, NA),
"Day2" = c(NA, 24, NA, NA, 28) , "Day3"=c(23, 26, NA, NA, 29), "Day4"= c(25, 24, NA, 29, 30),"Day5"=c(24, NA, 22, 28, 29))

I have a long time series (more than 1000 days, some of which are NA), so I would like to interpolate those NA values from the data within the series. I am not sure whether I need to specify a range: say, for each NA, look at the values in the same row up to 3 columns before and after it, and interpolate from those. I think I need something like this because, if the interpolation were based on all columns, it would draw on daily values across years, and the result could be very different from the temperatures on the days closest to the missing record.

So let's say temperature on Day1:Day5 was as in my example: 20, NA, 23,25,24, I would expect that NA to be around 23.

I have tried the zoo package but haven't had much luck.

Temp2 <- na.approx(Temp1)

Thanks!


There are 2 answers

14
Allan Cameron

You could use approx inside apply. Note that for missing values in the first or last column, this will simply copy over the adjacent value rather than trying to extrapolate a trend. For missing values in the inner columns, the value will be interpolated between adjacent non-missing columns.

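Tempdf |>
  apply(1, \(x) {
    # A single observed value cannot be interpolated, so repeat it across the row
    if (sum(!is.na(x)) == 1) return(rep(x[!is.na(x)], length(x)))
    # Interpolate linearly; rule = 2 copies the nearest value into leading/trailing NAs
    approx(seq_along(x), x, xout = seq_along(x), rule = 2)$y
  }) |>
  t() |>
  as.data.frame() |>
  setNames(names(Tempdf))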
#>   Day1     Day2     Day3  Day4 Day5
#> 1   20 21.50000 23.00000 25.00   24
#> 2   22 24.00000 26.00000 24.00   24
#> 3   19 19.75000 20.50000 21.25   22
#> 4   28 28.33333 28.66667 29.00   28
#> 5   28 28.00000 29.00000 30.00   29
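
To make the edge behaviour concrete, here is a small base R sketch on a single made-up row (the vector and the names inner_only and with_edges are illustrative, not from the question). It compares approx with rule = 1, the default, against rule = 2 as used above.

```r
# One hypothetical row with a leading NA, an interior NA and a trailing NA
x <- c(NA, 20, NA, 23, NA)
idx <- seq_along(x)

# rule = 1 (default): positions outside the observed range stay NA
inner_only <- approx(idx, x, xout = idx, rule = 1)$y

# rule = 2: positions outside the observed range take the nearest observed value
with_edges <- approx(idx, x, xout = idx, rule = 2)$y

print(inner_only)
print(with_edges)
```

In both cases the interior NA is filled from its two nearest observed neighbours only, so values from distant days do not influence it.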
5
G. Grothendieck

na.approx does linear approximation so if you want the NA in

x <- c(20, NA, 23, 25, 24)

to be near 23 then you need something different.

Try this, where width is the width of the rolling window, so the number of points averaged is width minus the number of NAs in the window. It can optionally be omitted, in which case it is set to 2 + the maximum number of consecutive NAs.

library(zoo)

interp.vec <- function(x, width) {
  if (missing(width)) width <- with(rle(is.na(x)), max(lengths[values])) + 2
  ifelse(is.na(na.fill(x, c(NA, 0, NA))),
    na.fill(x, c("extend", NA, "extend")),
    rollapply(x, width, mean, na.rm = TRUE, partial = TRUE, fill = NA)
  )
}

interp.vec(x, 5)
## [1] 21.50000 22.66667 23.00000 24.00000 24.00000
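
As a rough check of what the rolling mean is doing, the result of interp.vec(x, 5) can be reproduced by hand in base R. This is only a sketch: it assumes an odd width, no leading or trailing NAs, and at least one non-missing value in every window. The names half, window_mean and by_hand are made up for illustration.

```r
x <- c(20, NA, 23, 25, 24)
width <- 5                    # odd window width, as in interp.vec(x, 5)
half <- (width - 1) %/% 2     # number of neighbours on each side

# Mean of the non-missing values in a centered window clipped to the vector's ends
window_mean <- function(i) {
  lo <- max(1, i - half)
  hi <- min(length(x), i + half)
  mean(x[lo:hi], na.rm = TRUE)
}

by_hand <- sapply(seq_along(x), window_mean)
print(by_hand)
```

For example, position 2 averages 20, 23 and 25 (the window covers positions 1 to 4), which gives 22.66667.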

nms <- names(Tempdf)
Tempdf |>
  apply(1, interp.vec, width = 5) |>
  t() |>
  as.data.frame() |>
  setNames(nms)
##   Day1     Day2     Day3     Day4     Day5
## 1 21.5 22.66667 23.00000 24.00000 24.00000
## 2 24.0 24.00000 24.00000 24.66667 24.00000
## 3 19.0 19.00000 20.50000 22.00000 22.00000
## 4 28.0 28.50000 28.33333 28.50000 28.50000
## 5 28.0 29.00000 29.00000 29.00000 29.33333
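
If long runs of NAs should stay missing rather than be bridged (the question's concern about a range), one possible approach is to interpolate and then blank out any gap longer than a chosen limit. The sketch below uses base R only; fill_short_gaps and max_gap are hypothetical names, and leading or trailing NAs are deliberately left as NA. zoo's na.approx also has a maxgap argument for this purpose.

```r
# Interpolate linearly, but keep runs of more than max_gap consecutive NAs missing
fill_short_gaps <- function(x, max_gap = 3) {
  idx <- seq_along(x)
  ok <- !is.na(x)
  if (sum(ok) < 2) return(x)   # approx needs at least two observed points

  # rule = 1 leaves leading/trailing NAs untouched
  filled <- approx(idx[ok], x[ok], xout = idx, rule = 1)$y

  # Flag every position that belongs to an NA run longer than max_gap
  runs <- rle(is.na(x))
  too_long <- rep(runs$values & runs$lengths > max_gap, runs$lengths)
  filled[too_long] <- NA
  filled
}

long_gap  <- fill_short_gaps(c(19, NA, NA, NA, 22), max_gap = 2)
short_gap <- fill_short_gaps(c(20, NA, 23, 25, 24), max_gap = 2)
print(long_gap)
print(short_gap)
```

With max_gap = 2, the three consecutive NAs in the first vector remain NA, while the single NA in the second is filled with 21.5. The function can be applied to each row with apply(Tempdf, 1, fill_short_gaps) followed by t(), as in the code above.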