I was able to do this data ETL work in Python. However, since I need to integrate with R and I am new to R, I am posting the question here. I want to explode the dates based on start_date and end_date and compute daily frequency counts over the dummy variables derived from the variable "type".
The original data has 3 columns, named start_date, end_date, and type:
start_date, end_date, type
1/1/2016, 1/3/2016, A
1/2/2016, 1/2/2016, B
1/2/2016, 1/3/2016, A
Here is an explanation of what I am trying to achieve.
For the 1st row, type A appears on every day from 1/1 to 1/3 (both start and end dates are included).
On the 2nd row, type B appears on 1/2 only.
So far, 1/1 has one 'A', 1/2 has one 'A' and one 'B', and 1/3 has one 'A'.
This process repeats for the rest of the records. In reality, I have many such rows and many different values in the variable "type".
Basically, I need an efficient algorithm that counts, for each day, the frequency of every unique value of "type", yielding a data frame with a date index column and one frequency-count column per unique value of "type". Hope that clarifies it.
I need the data frame in the following format, with the first row as the new header:
date, A, B
1/1/2016, 1, 0
1/2/2016, 2, 1
1/3/2016, 2, 0
It seems @tiffany's solution did not work as expected. The nested-loop part of their code breaks down on the following sample data:
start_date end_date type
1/1/16 1/3/16 A
1/1/16 1/3/16 A
1/1/16 1/8/16 B
1/1/16 1/14/16 B
1/5/16 1/19/16 B
1/7/16 1/13/16 C
1/9/16 1/18/16 A
1/13/16 1/19/16 D
1/13/16 1/19/16 A
1/14/16 1/22/16 B
1/15/16 1/29/16 B
1/16/16 1/22/16 D
The part in question is the nested loop, shown here corrected so it at least runs (the original called length(df[...], 'type'), which is not valid R, since length() takes a single argument):

# assumes df$start_date and df$end_date are already Date objects
dates <- seq(min(df$start_date), max(df$end_date), by = "day")
results <- data.frame(date = dates)
for (t in unique(df$type)) {
  for (d in dates) {
    # count rows of this type whose [start_date, end_date] interval covers day d
    results[results$date == d, t] <-
      sum(df$start_date <= d & df$end_date >= d & df$type == t)
  }
}
Thanks in advance for the help. To show I am not being lazy, in the spirit of the Stack Overflow community, here is the Python version I wrote:
import pandas as pd

df = pd.read_csv("dates.csv")
factor_type = list(df['type'].unique())
columns = ['date']
columns.extend(factor_type)
result = []        # rows of the output table: [date_str, count_A, count_B, ...]
dates_dict = {}    # maps a date string to its row index in result
for index, row in df.iterrows():
    start_end = pd.date_range(row['start_date'], row['end_date'])
    factor = row['type']
    factor_index = factor_type.index(factor)
    for x in start_end:
        date_obj = x.date()
        date_str = '%s/%s/%s' % (date_obj.month, date_obj.day, date_obj.year)
        if date_str in dates_dict:
            # date already seen: increment the count for this type
            row_index = dates_dict[date_str]
            result[row_index][factor_index + 1] += 1
        else:
            # first time this date appears: start a new row of zeros
            dummy_row = [0] * len(factor_type)
            dummy_row[factor_index] = 1
            result.append([date_str] + dummy_row)
            dates_dict[date_str] = len(result) - 1
result_df = pd.DataFrame(result, columns=columns)
Here are two ways of doing this using data.table: the first is less efficient but easier to read, the second is more efficient but harder to read.
First, convert both columns to proper date classes (I'm using data.table's as.IDate function for its internal integer representation rather than a numeric one).
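A minimal sketch of that first step, assuming the file and column names from the question (fread and the "%m/%d/%Y" format string are my assumptions, not from the original answer):

library(data.table)
df <- fread("dates.csv")
# as.IDate stores dates as integers, which keys and joins handle efficiently
df[, `:=`(start_date = as.IDate(start_date, format = "%m/%d/%Y"),
          end_date   = as.IDate(end_date,   format = "%m/%d/%Y"))]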
Less efficient solution
A simple (but not so efficient) way is to expand the dates by row (which was already offered), and then do a simple dcast, which is both very efficient and doesn't care how many levels you have in type.
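A sketch of how that might look, given df from above (the by = 1:nrow(df) row-expansion idiom is my assumption about the intended code):

# expand each row into one row per day it covers
long <- df[, .(type = type, date = seq(start_date, end_date, by = "day")),
           by = 1:nrow(df)]
# count occurrences per date/type and spread the types into columns
res <- dcast(long, date ~ type, fun.aggregate = length,
             value.var = "type", fill = 0)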
More efficient solution
This solution doesn't involve by-row operations; rather, it operates over the overall range of dates using the foverlaps function. The first step (also, as was already offered) is to create the overall date range, set it as both the start and the end of each one-day interval, and set a key (for the further operations). Then all that is left is to run foverlaps and convert to wide format using dcast again.
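A sketch of those two steps under the same assumptions (the variable names dates and res are mine):

# one-day [d, d] intervals covering the whole span; foverlaps needs a keyed table
dates <- df[, .(start_date = seq(min(start_date), max(end_date), by = "day"))]
dates[, end_date := start_date]
setkey(dates, start_date, end_date)
# overlap join: one row per (day, source row) pair, then count per day and type
res <- dcast(foverlaps(df, dates)[, .N, by = .(start_date, type)],
             start_date ~ type, value.var = "N", fill = 0)
setnames(res, "start_date", "date")  # match the desired output header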