Return value from multiple columns if the column is the most recent previous year


I'd just like to have a table that contains the ID and a new categorical variable (called e.g. Actual_landcover, having 40 categories) which is derived from the variables LandcoverXXXX and Sub_landcoverXXXX.

I have a dataset that contains the following variables:

ID,
Year (2014-2023),
Landcover2013 (categorical variable 1-4),
Landcover2015,
Landcover2017,
Landcover2019,
Landcover2021,
Sub_landcover2013 (subcategories 1-10),
Sub_landcover2015,
Sub_landcover2017,
Sub_landcover2019,
Sub_landcover2021

I'd like to add a new column to my dataset. It would contain a new categorical variable describing the actual landcover. With 4 categories and 10 subcategories, there are in theory 4 x 10 = 40 possible values. Because the landcover values vary over time, I want to use the most recent data that was already available in the given 'Year'. For example, if 'Year' is 2014, the new value should come from 'Landcover2013' and 'Sub_landcover2013'. If 'Year' is 2015, I still want to use the 2013 data, because the landcover data is published at the end of the years named in the existing columns.
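The year-selection rule can be sketched in base R as below. This is only an illustration: `data_years` and `latest_data_year()` are hypothetical names, and the sketch assumes 'Year' is never earlier than 2014, as stated in the variable list.

```r
# Years for which landcover data exists (taken from the column names).
data_years <- c(2013L, 2015L, 2017L, 2019L, 2021L)

# Hypothetical helper: the most recent data year strictly before `year`.
# "Strictly before" reflects that each dataset is published at the end of its year.
# Assumes every `year` is later than 2013; earlier years would have no data to use.
latest_data_year <- function(year) {
  year <- as.integer(year)
  vapply(year, function(y) max(data_years[data_years < y]), integer(1))
}

latest_data_year(c("2014", "2015", "2016", "2017", "2023"))
# [1] 2013 2013 2015 2015 2021
```

Note that 2015 and 2017 map to 2013 and 2015 respectively, not to their own year.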

df <- data.frame(ID = c("1", "2", "3", "4", "5"),
             Year = c("2014", "2014", "2016", "2017", "2023"),
             Landcover2013 = c("1", "1", "2", "1", "4"),
             Landcover2015 = c("1", "1", "3", "2", "4"),
             Landcover2017 = c("1", "1", "2", "2", "3"),
             Landcover2019 = c("2", "1", "2", "2", "4"),
             Landcover2021 = c("2", "1", "3", "1", "4"),
             Sub_landcover2013 = c("4", "7", "5", "9", "1"),
             Sub_landcover2015 = c("5", "7", "6", "9", "2"),
             Sub_landcover2017 = c("4", "6", "6", "9", "1"),
             Sub_landcover2019 = c("4", "6", "6", "9", "2"),
             Sub_landcover2021 = c("4", "6", "6", "10", "1"))

The following two tables are just examples of rules which could derive new values. The new values could be formed like:

    LandcoverXXXX Sub_landcoverXXXX New_value
1              1                 1         1
2              1                 2         2
3              1                 3         3
4              1                 4         4
5              1                 5         5
6              1                 6         6
7              1                 7         7
8              1                 8         8
9              1                 9         9
10             1                10        10
11             2                 1        11
12             2                 2        12
13             2                 3        13
14             2                 4        14
15             2                 5        15
16             2                 6        16
17             2                 7        17
18             2                 8        18
19             2                 9        19
20             2                10        20
21             3                 1        21

and so on...

OR like:

   LandcoverXXXX Sub_landcoverXXXX New_value
1              1                 1        11
2              1                 2        12
3              1                 3        13
4              1                 4        14
5              1                 5        15
6              1                 6        16
7              1                 7        17
8              1                 8        18
9              1                 9        19
10             1                10       110
11             2                 1        21
12             2                 2        22
13             2                 3        23
14             2                 4        24
15             2                 5        25
16             2                 6        26
17             2                 7        27
18             2                 8        28
19             2                 9        29
20             2                10       210
21             3                 1        31

and so on...
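Both example rules can be written as one-line formulas. The sketch below builds all 40 combinations and applies each rule; the column names `Sequential` and `Concatenated` are made up for illustration, and the sketch assumes the categories are the integers 1-4 and 1-10.

```r
# All 4 x 10 combinations; Sub_landcover varies fastest, as in the tables above.
combos <- expand.grid(Sub_landcover = 1:10, Landcover = 1:4)

# First rule: consecutive codes 1..40.
combos$Sequential <- (combos$Landcover - 1L) * 10L + combos$Sub_landcover

# Second rule: paste the two codes together, e.g. 1 and 10 -> 110.
combos$Concatenated <- as.integer(paste0(combos$Landcover, combos$Sub_landcover))

# Spot-check a few rows against the example tables.
combos[c(1, 10, 11, 20, 21),
       c("Landcover", "Sub_landcover", "Sequential", "Concatenated")]
```

Because Landcover is always a single digit here, the concatenated codes stay unique even though they vary in length.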

But those two tables are just examples of how the new values could be formed. The final output table could look like this:

  ID Actual_landcover
1  1               14
2  2               17
3  3               36
4  4               29
5  5               41
1 answer

L Tyrone

Edit: This answer was posted prior to the OP's update and is for replicating the pattern highlighted in the data image only.

Using tidyverse packages, here's a way to achieve your goal. The workflow:

  1. Get all your landcover types columns into a single column using pivot_longer()
  2. Create two new group columns, one for landcover type and one for the year of the landcover data
  3. Group data by id and landcover type (i.e. the "group" column)
  4. slice() each group by the maximum year in "group1" column that is less than the value in the "Year" column
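  5. pivot_wider() to get two separate landcover type columns

library(dplyr)
library(tidyr)
library(stringr)

result <- df %>%
  # Stack every landcover column into name/value pairs
  # (id, Year and NEW_COLUMN are the non-landcover columns in the data this answer was written for)
  pivot_longer(-c(id, Year, NEW_COLUMN)) %>%
  # group: landcover type; group1: the year embedded in the column name
  mutate(group = case_when(str_detect(name, "^Land") ~ "landcover",
                           TRUE ~ "sub_landcover"),
         group1 = as.integer(str_sub(name, start = -4))) %>%
  # Keep only data from years before the value in "Year"
  filter(group1 < as.integer(Year)) %>%
  group_by(id, group) %>%
  # Of those, take the most recent year
  slice(which.max(group1)) %>%
  ungroup() %>%
  pivot_wider(id_cols = c(id, Year),
              names_from = group,
              values_from = value)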

result
# A tibble: 5 × 4
  id    Year  landcover sub_landcover
  <chr> <chr> <chr>     <chr>        
1 1     2014  1         4            
2 2     2014  1         7            
3 3     2016  3         6            
4 4     2017  2         9            
5 5     2023  4         1
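
For the data frame posted in the question (column `ID`, no `NEW_COLUMN`), the same workflow might be adapted as sketched below. This is an assumption-laden variant, not a tested solution for the full dataset: it takes the second example rule (pasting the two codes together) as the definition of Actual_landcover, and `data_type` and `data_year` are names chosen here for illustration.

```r
library(dplyr)
library(tidyr)

# Example data as posted in the question.
df <- data.frame(ID = c("1", "2", "3", "4", "5"),
                 Year = c("2014", "2014", "2016", "2017", "2023"),
                 Landcover2013 = c("1", "1", "2", "1", "4"),
                 Landcover2015 = c("1", "1", "3", "2", "4"),
                 Landcover2017 = c("1", "1", "2", "2", "3"),
                 Landcover2019 = c("2", "1", "2", "2", "4"),
                 Landcover2021 = c("2", "1", "3", "1", "4"),
                 Sub_landcover2013 = c("4", "7", "5", "9", "1"),
                 Sub_landcover2015 = c("5", "7", "6", "9", "2"),
                 Sub_landcover2017 = c("4", "6", "6", "9", "1"),
                 Sub_landcover2019 = c("4", "6", "6", "9", "2"),
                 Sub_landcover2021 = c("4", "6", "6", "10", "1"))

result <- df %>%
  # Split each column name into its type and its four-digit year.
  pivot_longer(-c(ID, Year),
               names_to = c("data_type", "data_year"),
               names_pattern = "^(Landcover|Sub_landcover)(\\d{4})$") %>%
  mutate(data_year = as.integer(data_year)) %>%
  # Only data published before the observation year is eligible.
  filter(data_year < as.integer(Year)) %>%
  group_by(ID, data_type) %>%
  # Keep the most recent eligible year per ID and type.
  slice_max(data_year, n = 1) %>%
  ungroup() %>%
  pivot_wider(id_cols = ID, names_from = data_type, values_from = value) %>%
  # Assumed rule: concatenate category and subcategory (e.g. 1 and 4 -> 14).
  mutate(Actual_landcover = as.integer(paste0(Landcover, Sub_landcover))) %>%
  select(ID, Actual_landcover)

result
```

On the example data this gives 14, 17, 36, 29 and 41 for IDs 1 to 5, matching the output table in the question. To use the consecutive 1-40 coding instead, the last mutate() could compute (as.integer(Landcover) - 1) * 10 + as.integer(Sub_landcover).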