Calculate lags across neighboring column with many columns in dplyr

41 views Asked by At

Start with this code:

set.seed(0)

the_df <- tibble(date=seq.Date(ymd('20230101'),ymd('20230101')+days(9), by='days'),
                 lead1=rep(1:10),
                 lead2=runif(10),
                 lead3=runif(10),
                 lead4=runif(10)) 

the_df %>% 
  mutate(lag2=lead2-dplyr::lag(lead1,1),
         lag3=lead3-dplyr::lag(lead2,1),
         lag4=lead4-dplyr::lag(lead3,1),)

The output will look like:

enter image description here

Now imagine a tibble with hundreds of columns. How can this lagged difference from one column the next be replicated in vector form for all columns?

I.e., if the tibble has lead1 to lead 100, the result would be lag2 to lag100.

The statement

the_df %>%
  mutate(across(lead1:lead4, ~ dplyr::lag(.x,1), .names="d_{.col}"))

calculates the lag of each column within the column and creates a new column of data with that result. How do I calculate the difference between two columns within the across function as in the simple four-column example but for all columns?

1

There are 1 answers

1
Mark On BEST ANSWER

Update: a one-liner:

the_df |> mutate(across(lead2:last_col(), ~ .x - lag(get(names(the_df)[which(cur_column() == names(the_df)) - 1])), .names = "lag_{.col}"))

Original:

Make the data long, then get the name of the previous column::

the_df |>
  tidyr::pivot_longer(-date, names_to='lead')
  dplyr::mutate(prev_col = lag(lead)) -> df

Left join this intermediate df with itself, getting the previous column's value as a new column. Then calculate the value minus the previous column's lagged value. Then remove the superfluous columns

df |>
  dplyr::left_join(df, by = c("date", "prev_col" = "lead")) |>
  dplyr::mutate(value = value.x - lag(value.y), .by = lead) |>
  dplyr::select(date, lead, value) -> output

To make the data wide again:

# make the lag data wide
tidyr::pivot_wider(output, names_from = lead, names_glue = "{paste0('lag', lead)}", values_from = value, id_cols = date) |>
# remove the useless lag1 column
  select(-lag1) |>
  # join it with the lead data
  full_join(the_df) |>
# then get the columns in the right order
  select(date, starts_with("lead"), starts_with("lag"))

Output:

# A tibble: 10 × 8
   date       lead1  lead2   lead3   lead4    lag2     lag3   lag4
   <date>     <int>  <dbl>   <dbl>   <dbl>   <dbl>    <dbl>  <dbl>
 1 2023-01-01     1 0.0271 0.973   0.908   NA      NA       NA    
 2 2023-01-02     2 0.985  0.520   0.603   -0.0149  0.493   -0.369
 3 2023-01-03     3 0.838  0.0891  0.0151  -1.16   -0.896   -0.505
 4 2023-01-04     4 0.199  0.00293 0.776   -2.80   -0.835    0.687
 5 2023-01-05     5 0.891  0.846   0.283   -3.11    0.647    0.280
 6 2023-01-06     6 0.337  0.902   0.272   -4.66    0.0110  -0.574
 7 2023-01-07     7 0.787  0.939   0.303   -5.21    0.602   -0.599
 8 2023-01-08     8 0.747  0.159   0.133   -6.25   -0.628   -0.805
 9 2023-01-09     9 0.704  0.753   0.00196 -7.30    0.00534 -0.157
10 2023-01-10    10 0.908  0.976   0.876   -8.09    0.272    0.123