cumulative sum over dummy variable by date in R


I was able to do this data ETL work in Python. However, since I need to integrate with R and I am new to R, I am posting the question here. I want to explode the dates based on start_date and end_date and do a cumulative frequency sum over the dummy variables derived from the variable "type".

The original data has 3 columns with the variable names start_date, end_date and type:

start_date, end_date, type
 1/1/2016,  1/3/2016,   A
 1/2/2016,  1/2/2016,   B
 1/2/2016,  1/3/2016,   A

Here is an explanation of what I tried to achieve.

For the 1st row, type A appears on every day from 1/1 to 1/3 (both start and end dates are included).

Now on the 2nd row, type B appears on 1/2 only.

So far, 1/1 has one 'A', 1/2 has one 'A' and one 'B', 1/3 has one 'A'.

This process repeats for the rest of the records. In reality, I have a lot of such rows and a lot of different values in the variable "type".

Basically, I need an efficient algorithm to do the frequency count of every value of "type" for each day, yielding a data frame with date as the index column and one column of frequency counts for each unique value of "type". Hope that clarifies.

I need the data frame in the following format, where the first row is the header:

 date,      A,       B
 1/1/2016,  1,       0
 1/2/2016,  2,       1
 1/3/2016,  2,       0

It seems @tiffany's solution did not work as expected. The nested-loop part of their code breaks down on the following sample data:

start_date  end_date    type
1/1/16  1/3/16  A
1/1/16  1/3/16  A
1/1/16  1/8/16  B
1/1/16  1/14/16 B
1/5/16  1/19/16 B
1/7/16  1/13/16 C
1/9/16  1/18/16 A
1/13/16 1/19/16 D
1/13/16 1/19/16 A
1/14/16 1/22/16 B
1/15/16 1/29/16 B
1/16/16 1/22/16 D

The relevant part of that answer is:

results <- data.frame(date = dates)

for(t in unique(df$type)) {
  for(d in dates) {
    results[results$date == d, t] <- 
      length(df[df$start_date <= d & df$end_date >= d & df$type == t])
  }
}

Thanks for the help in advance. To show I am not lazy, in the spirit of the Stack Overflow community, this is the Python version I wrote:

import pandas as pd

df = pd.read_csv("dates.csv")

factor_type = list(df['type'].unique())

columns = ['date']
columns.extend(factor_type)

result = []       # one row per date: [date_str, count_type1, count_type2, ...]
dates_dict = {}   # maps a date string to its row index in result

for index, row in df.iterrows():
    start_end = pd.date_range(row['start_date'], row['end_date'])
    factor = row['type']
    factor_index = factor_type.index(factor)
    for x in start_end:
        date_obj = x.date()
        date_str = '%s/%s/%s' % (date_obj.month, date_obj.day, date_obj.year)
        if date_str in dates_dict:
            # date already seen: bump the count in the matching type column
            row_index = dates_dict[date_str]
            result[row_index][factor_index + 1] += 1
        else:
            # first occurrence of this date: start a fresh row of counts
            dummy_row = [0] * len(factor_type)
            dummy_row[factor_index] = 1
            result.append([date_str] + dummy_row)
            dates_dict[date_str] = len(result) - 1

result_df = pd.DataFrame(result, columns=columns)

There are 3 answers

David Arenburg (BEST ANSWER)

Here are two ways using data.table: one is efficient but harder to read, the other is less efficient but easier to read.
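Throughout, I assume df already holds the three-row sample from the question; a minimal way to build it, for anyone reproducing the answer:

# Assumed setup: the three-row sample from the question as a data.frame
df <- data.frame(
  start_date = c("1/1/2016", "1/2/2016", "1/2/2016"),
  end_date   = c("1/3/2016", "1/2/2016", "1/3/2016"),
  type       = c("A", "B", "A")
)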

First, convert both columns to a proper date class (I'm using data.table's as.IDate function for its internal integer representation, rather than a numeric one).

library(data.table) 
cols <- c("start_date", "end_date")
setDT(df)[, (cols) := lapply(.SD, as.IDate, format = "%m/%d/%Y"), .SDcols = cols]

Less efficient solution

A simple (but not so efficient) way is to expand the dates by row (as was already offered), and then do a simple dcast, which is both very efficient and doesn't care how many levels you have in type:

res <- df[, .(Date = seq.int(start_date, end_date, 1L), type), by = 1:nrow(df)]
dcast(res, Date ~ type, length)
# Using 'type' as value column. Use 'value.var' to override
#          Date A B
# 1: 2016-01-01 1 0
# 2: 2016-01-02 2 1
# 3: 2016-01-03 2 0

More efficient solution

This solution doesn't involve by-row operations; instead it operates over the overall range of dates using the foverlaps function. The first step (again, as was already offered) is to create the overall date range, set it as both the start and end of each one-day interval, and set a key (needed for the join):

Intervals <- data.table(start_date = df[, seq.int(min(start_date), max(end_date), 1L)]) # overall range
Intervals[, end_date := start_date] # set start/end ranges as same values
setkey(Intervals, start_date, end_date) # key

Now all that is left is to run foverlaps and then convert to wide format using dcast again:

dcast(foverlaps(df, Intervals), start_date ~ type, length)
# Using 'type' as value column. Use 'value.var' to override
#    start_date A B
# 1: 2016-01-01 1 0
# 2: 2016-01-02 2 1
# 3: 2016-01-03 2 0
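As a side note, the "Using 'type' as value column" message can be silenced by naming the value column explicitly (a minor variation on the call above, not part of the original answer):

dcast(foverlaps(df, Intervals), start_date ~ type,
      fun.aggregate = length, value.var = "type")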
Dries

I want to provide a dplyr solution.

First, I happily borrow tiffany's work to construct the data frame df. Then:

  • Make a list with the dates from start to end

     library(dplyr)
     library(lubridate)

     df2 <- df %>%
       rowwise() %>%
       mutate(dates = list(as_date(start_date:end_date)))
    
  • List all those dates, attach the right types, and then group by date and sum (a more general variant that avoids naming each type is sketched after this list)

      df3 <- bind_rows(apply(df2, 1, function(x) {
               # convert the unlisted numeric days back to Date
               data.frame(Date = as_date(unlist(x$dates))) %>% mutate(type = x$type[1])
             })) %>%
             group_by(Date) %>%
             summarise(A = sum(type == "A"),
                       B = sum(type == "B"))
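For an arbitrary number of types, a possible generalisation is to count and then pivot, so no type has to be named explicitly. This is only a sketch, not part of the original answer, and it assumes tidyr is available:

library(dplyr)
library(tidyr)

# Expand each row into one row per covered date, count per date and type,
# then spread the counts into one column per type (filling gaps with 0).
df %>%
  rowwise() %>%
  mutate(date = list(seq(start_date, end_date, by = "day"))) %>%
  ungroup() %>%
  unnest(date) %>%
  count(date, type) %>%
  pivot_wider(names_from = type, values_from = n, values_fill = 0)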
    
tiffany

I'm not sure I completely understand what you're looking for (you say "cumulative sum", but I think you're really looking to count the number of open items).

If that's the case, here's some (relatively dirty) code that would give you what you want, for each date between your earliest start_date and your latest end_date.

library(lubridate)
start_date <- c("1/1/2016", "1/2/2016", "1/2/2016")
end_date <- c("1/3/2016", "1/2/2016", "1/3/2016")
type <- c("A", "B", "A")

Convert strings to dates to make what comes next easier.

df <- data.frame(start_date, end_date, type)
df$start_date <- as.Date(mdy(df$start_date))
df$end_date <- as.Date(mdy(df$end_date))

Make a vector of dates between your earliest start_date and latest end_date.

dates <- seq(from = min(c(df$start_date, df$end_date)),
             to = max(c(df$start_date, df$end_date)),
             by = 1)

Get the data in the format you want:

results <- data.frame(date = dates, openA = NA, openB = NA)
for(d in dates) {
  results$openA[results$date == d] <- 
    length(df[df$start_date <= d & df$end_date >= d & df$type == "A"])

  results$openB[results$date == d] <- 
    length(df[df$start_date <= d & df$end_date >= d & df$type == "B"])
}

For an arbitrary number of types, you could do:

results <- data.frame(date = dates)

for(t in unique(df$type)) {
  for(d in dates) {
    results[results$date == d, t] <- 
      length(df[df$start_date <= d & df$end_date >= d & df$type == t])
  }
}