Calculate lags across neighboring column with many columns in dplyr

Question

Calculate lags across neighboring column with many columns in dplyr

101 views Asked by JFD At 26 October 2023 at 01:23

Start with this code:

set.seed(0)

the_df <- tibble(date=seq.Date(ymd('20230101'),ymd('20230101')+days(9), by='days'),
                 lead1=rep(1:10),
                 lead2=runif(10),
                 lead3=runif(10),
                 lead4=runif(10)) 

the_df %>% 
  mutate(lag2=lead2-dplyr::lag(lead1,1),
         lag3=lead3-dplyr::lag(lead2,1),
         lag4=lead4-dplyr::lag(lead3,1),)

The output will look like:

Now imagine a tibble with hundreds of columns. How can this lagged difference from one column the next be replicated in vector form for all columns?

I.e., if the tibble has lead1 to lead 100, the result would be lag2 to lag100.

The statement

the_df %>%
  mutate(across(lead1:lead4, ~ dplyr::lag(.x,1), .names="d_{.col}"))

calculates the lag of each column within the column and creates a new column of data with that result. How do I calculate the difference between two columns within the across function as in the simple four-column example but for all columns?

Original Q&A

There are 1 answers

**Mark** · Accepted Answer · 2023-10-26T02:25:51+00:00

Update: a one-liner:

the_df |> mutate(across(lead2:last_col(), ~ .x - lag(get(names(the_df)[which(cur_column() == names(the_df)) - 1])), .names = "lag_{.col}"))

Original:

Make the data long, then get the name of the previous column::

the_df |>
  tidyr::pivot_longer(-date, names_to='lead')
  dplyr::mutate(prev_col = lag(lead)) -> df

Left join this intermediate df with itself, getting the previous column's value as a new column. Then calculate the value minus the previous column's lagged value. Then remove the superfluous columns

df |>
  dplyr::left_join(df, by = c("date", "prev_col" = "lead")) |>
  dplyr::mutate(value = value.x - lag(value.y), .by = lead) |>
  dplyr::select(date, lead, value) -> output

To make the data wide again:

# make the lag data wide
tidyr::pivot_wider(output, names_from = lead, names_glue = "{paste0('lag', lead)}", values_from = value, id_cols = date) |>
# remove the useless lag1 column
  select(-lag1) |>
  # join it with the lead data
  full_join(the_df) |>
# then get the columns in the right order
  select(date, starts_with("lead"), starts_with("lag"))

Output:

# A tibble: 10 × 8
   date       lead1  lead2   lead3   lead4    lag2     lag3   lag4
   <date>     <int>  <dbl>   <dbl>   <dbl>   <dbl>    <dbl>  <dbl>
 1 2023-01-01     1 0.0271 0.973   0.908   NA      NA       NA    
 2 2023-01-02     2 0.985  0.520   0.603   -0.0149  0.493   -0.369
 3 2023-01-03     3 0.838  0.0891  0.0151  -1.16   -0.896   -0.505
 4 2023-01-04     4 0.199  0.00293 0.776   -2.80   -0.835    0.687
 5 2023-01-05     5 0.891  0.846   0.283   -3.11    0.647    0.280
 6 2023-01-06     6 0.337  0.902   0.272   -4.66    0.0110  -0.574
 7 2023-01-07     7 0.787  0.939   0.303   -5.21    0.602   -0.599
 8 2023-01-08     8 0.747  0.159   0.133   -6.25   -0.628   -0.805
 9 2023-01-09     9 0.704  0.753   0.00196 -7.30    0.00534 -0.157
10 2023-01-10    10 0.908  0.976   0.876   -8.09    0.272    0.123

TechQA.

Calculate lags across neighboring column with many columns in dplyr

There are 1 answers

Update: a one-liner:

Original:

Related Questions in DPLYR

Related Questions in ACROSS

Popular Questions

Popular Tags

Trending Questions