Creating Dummy Variables by Date Range in R

707 views Asked by At

I would like to create dummy variables that indicate which product version was in effect based on the date range, i.e. between the release date and the following version release date without doing it manually. I have a few hundred versions with their corresponding release dates and when the next version was released, and I will merge/join this data with a Purchases data frame.

Currently I have two data frames (Version and Purchases) that look like this:

View(Version)   
Type Version    Release_Date    Next_Release
A       1.2.3   2013-11-14     2014-01-11
B       1.3.1   2014-01-11     2014-02-20
A       1.5.1   2014-02-20     2014-03-08
A       1.5.2   2014-03-08     2014-04-06
B       1.5.3   2014-04-06     2014-04-12
A       1.5.4   2014-04-12     2014-04-15
B       1.5.5   2014-05-15     2014-05-20
B       1.6.1   2014-05-20     2014-06-26
A       1.6.2   2014-06-26     2014-07-14

View(Purchases)
TIMESTAMP   Amount 
2013-11-14   15.44
2013-11-14   13.39
2013-11-14   15.35
2013-11-15   86.43
2014-01-15   12.30
2014-01-17   23.55

I would like to create dummy variables for each version in the data frame 'Purchases' according to the date range which the version was in effect. i.e. if the timestamp of the purchase is within the date range that that version was in effect, that version dummy = 1 otherwise 0.

View(Purchases)
TIMESTAMP   Amount Version_1.2.3  Version_1.3.1  ....
2013-11-14   15.44      1              0
2013-11-14   13.39      1              0
2013-11-14   15.35      1              0
2013-11-15   86.43      1              0
2014-01-15   12.30      0              1
2014-01-17   23.55      0              1 
....

Thanks you in advance for any advise or help.

1

There are 1 answers

0
Uwe On BEST ANSWER

The task can be divided in two sub-tasks:

  1. find the most recent Release_Date before the TIMESTAMP of the purchase,
  2. reshape from long to wide format.

For finding the most recent release, the data.table package offers two possibilities,

a rolling join

library(data.table)
setDT(Version)
setDT(Purchases)

Version[Purchases, on = .(Release_Date = TIMESTAMP), roll = TRUE,
        .(TIMESTAMP = i.TIMESTAMP, Amount, Version)]
    TIMESTAMP Amount Version
1: 2013-11-14  15.44   1.2.3
2: 2013-11-14  13.39   1.2.3
3: 2013-11-14  15.35   1.2.3
4: 2013-11-15  86.43   1.2.3
5: 2014-01-15  12.30   1.3.1
6: 2014-01-17  23.55   1.3.1

or a non-equi join

Version[Purchases, on = .(Release_Date <= TIMESTAMP), mult = "last",
        .(TIMESTAMP = i.TIMESTAMP, Amount, Version)]

which produces the same result.

For reshaping, the dcast() function is used with length() for aggregation:

# rolling join
Version[Purchases, on = .(Release_Date = TIMESTAMP), roll = TRUE,
        .(TIMESTAMP = i.TIMESTAMP, Amount, Version)][
  , dcast(.SD, TIMESTAMP + Amount ~ Version, length)]
    TIMESTAMP Amount 1.2.3 1.3.1
1: 2013-11-14  13.39     1     0
2: 2013-11-14  15.35     1     0
3: 2013-11-14  15.44     1     0
4: 2013-11-15  86.43     1     0
5: 2014-01-15  12.30     0     1
6: 2014-01-17  23.55     0     1

or, in case the columns shall be renamed during reshaping

# non-equi join
Version[Purchases, on = .(Release_Date <= TIMESTAMP), mult = "last",
        .(TIMESTAMP = i.TIMESTAMP, Amount, Version)][
  , dcast(.SD, TIMESTAMP + Amount ~ paste0("Version_", Version), length)]
    TIMESTAMP Amount Version_1.2.3 Version_1.3.1
1: 2013-11-14  13.39             1             0
2: 2013-11-14  15.35             1             0
3: 2013-11-14  15.44             1             0
4: 2013-11-15  86.43             1             0
5: 2014-01-15  12.30             0             1
6: 2014-01-17  23.55             0             1