analyzing set of many dummy columns based on date ranges

Question

analyzing set of many dummy columns based on date ranges

24 views Asked by Ben At 27 June 2023 at 17:28

I am analyzing staff turnover and need to create (1) a count of the number of staff hired and exited in a given year and (2) compute a cumulative "total staff" count across years. I have hire and exit dates like this:

ssh<-structure(list(HireDate = structure(c(1358, 4291, 5121, 6923, 9678, 12037, 16353, 17003, 18976, 19312, 19312, 19011), class = "Date"),     ExitDate = structure(c(15861, 15401, 17140, 17347, NA, NA,     16911, 18856, 19193, NA, NA, NA), class = "Date"), id = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12")), row.names = c(NA, -12L), class = c("tbl_df", "tbl", "data.frame"))

ssh$hireyear<-lubridate::year(ssh$HireDate)
ssh$exityear<-lubridate::year(ssh$ExitDate)

ssh$group<-c("a","b","c","a","","b","a","","","b","b","c")

For the simple accounting of hires and exits, I'd like to have a dummy variable for EACH year. So for the data above, if staff was hired in 2014, create a new column hired2014 equal to 1, else 0, like this:

 ssh$hire1984<-ifelse(ssh$hireyear==1984,1,0)

or

 ssh$exit2012<-ifelse(ssh$exityear==2012,1,0)

My full dataset ranges between 1972 and 2023, so I'd like an efficient method to compute all possible variables for any date range. This would yield a dataframe with many columns -- one for each year.

Next, I'd like to format the resulting dataframe by year, something like this:

Year  NumberHired  NumberExited  NetChange  CurrentTotal
1972  4            0             4          4
1973  2            1             1          5
1974  3            4             -1         4
.
.
2010  25           11            14         541   ...etc

I experimented creating a lookup table of years that I might populate with an aggregation of the dummy variables but am hitting a wall. Also, other solutions I've found on stackoverflow tend to focus on creating only one dummy variable.

Ideas? Thx!

Original Q&A

There are 1 answers

**Gregor Thomas** · Accepted Answer · 2023-06-27T18:06:22+00:00

We can use tidyr::pivot_wider to make your dummies:

library(tidyr)
library(dplyr)

ssh |>
  mutate(dummy = 1) |>
  pivot_wider(
    names_from = hireyear, 
    values_from = dummy,
    names_prefix = "hired",
    values_fill = 0
  ) |>
  mutate(dummy = 1) |> 
  pivot_wider(
    names_from = exityear, 
    values_from = dummy,
    names_prefix = "exit",
    values_fill = 0
  )
# # A tibble: 12 × 20
#    HireDate   ExitDate   id    hired1973 hired1981 hired1984 hired1988 hired1996 hired2002
#    <date>     <date>     <chr>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
#  1 1973-09-20 2013-06-05 1             1         0         0         0         0         0
#  2 1981-10-01 2012-03-02 2             0         1         0         0         0         0
#  3 1984-01-09 2016-12-05 3             0         0         1         0         0         0
#  4 1988-12-15 2017-06-30 4             0         0         0         1         0         0
#  5 1996-07-01 NA         5             0         0         0         0         1         0
#  6 2002-12-16 NA         6             0         0         0         0         0         1
#  7 2014-10-10 2016-04-20 7             0         0         0         0         0         0
#  8 2016-07-21 2021-08-17 8             0         0         0         0         0         0
#  9 2021-12-15 2022-07-20 9             0         0         0         0         0         0
# 10 2022-11-16 NA         10            0         0         0         0         0         0
# 11 2022-11-16 NA         11            0         0         0         0         0         0
# 12 2022-01-19 NA         12            0         0         0         0         0         0
# # ℹ 11 more variables: hired2014 <dbl>, hired2016 <dbl>, hired2021 <dbl>, hired2022 <dbl>,
# #   exit2013 <dbl>, exit2012 <dbl>, exit2016 <dbl>, exit2017 <dbl>, exitNA <dbl>,
# #   exit2021 <dbl>, exit2022 <dbl>

And the summary table is fairly easily computed from your original data:

hires = count(ssh, hireyear, name = "NumberHired") |>
  rename(year = hireyear)
exits = count(ssh, exityear, name = "NumberExited", ) |>
  rename(year = exityear) |>
  filter(!is.na(year))
full_join(hires, exits, by = "year") %>%
  replace(is.na(.), 0) |>
  arrange(year) |>
  mutate(
    NetChange = NumberHired - NumberExited,
    CurrentTotal = cumsum(NetChange)
  )
# # A tibble: 13 × 5
#     year NumberHired NumberExited NetChange CurrentTotal
#    <dbl>       <int>        <int>     <int>        <int>
#  1  1973           1            0         1            1
#  2  1981           1            0         1            2
#  3  1984           1            0         1            3
#  4  1988           1            0         1            4
#  5  1996           1            0         1            5
#  6  2002           1            0         1            6
#  7  2012           0            1        -1            5
#  8  2013           0            1        -1            4
#  9  2014           1            0         1            5
# 10  2016           1            2        -1            4
# 11  2017           0            1        -1            3
# 12  2021           1            1         0            3
# 13  2022           3            1         2            5

TechQA.

analyzing set of many dummy columns based on date ranges

There are 1 answers

Related Questions in R

Related Questions in DPLYR

Related Questions in DUMMY-VARIABLE

Popular Questions

Trending Questions