I have data with a date, zip code and score. I would like to discretize the data such that all rows of the same month and same zip code above the mean for that same month and zip code get a 1, all others get a zero.
For example (the data frame is called score_df):
date zip score
2014-01-02 12345 10
2014-01-03 12345 20
2014-01-04 12345 2
2014-01-05 99885 15
2014-01-06 99885 12
output:
date zip score above_avg
2014-01-02 12345 10 0
2014-01-03 12345 20 1
2014-01-04 12345 2 0
2014-01-05 99885 15 1
2014-01-06 99885 12 0
So far I have been using inefficient solutions:
1. Looping through all months and applying the binary condition with an ifelse statement
score_df$above_avg <- rep(0, length(score_df$score))
for (month in 1:12) {
  in_month <- as.numeric(substring(score_df$date, 6, 7)) == month
  above <- ifelse(score_df$score > quantile(score_df$score[in_month], 0.5), 1, 0)
  score_df$above_avg <- ifelse(in_month, above, score_df$above_avg)
}
2. I also tried generating a table of averages with aggregate, joining the average column to the original data frame, and then applying a binary condition
avg_by_month_zip <- aggregate(score~month+zip,data=score_df,FUN=mean)
score_df <- sqldf("select score_df.*, avg_by_month_zip.score as mean from score_df join avg_by_month_zip on avg_by_month_zip.zip = score_df.zip and avg_by_month_zip.month = score_df.month")
score_df$discrete <- ifelse(score_df$score>score_df$mean,1,0)
I would like to do this functionally. I know how to do it functionally with one condition (just date or just zip) but not with two. I could concatenate the two fields to make one unique field. That would be a quick fix, but I was wondering if there is a way to do this simply and efficiently with an apply function or plyr.
Assuming you have your date values properly encoded as such (for example)
then you can do
We use ave() to calculate the value for all the month/zip combinations (we use strftime() to get the month from the date).
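A minimal sketch of this approach on the sample data from the question, assuming date is stored as a Date column. The intermediate name group_mean is only illustrative, and the "%m" format is an assumption that groups by calendar month regardless of year (use "%Y-%m" if the data spans several years):

```r
# Sample data from the question, with date encoded as a Date
score_df <- data.frame(
  date  = as.Date(c("2014-01-02", "2014-01-03", "2014-01-04",
                    "2014-01-05", "2014-01-06")),
  zip   = c(12345, 12345, 12345, 99885, 99885),
  score = c(10, 20, 2, 15, 12)
)

# ave() returns the group mean for each row, aligned with the original rows.
# Groups are defined by the month (extracted with strftime) and the zip code.
group_mean <- ave(score_df$score,
                  strftime(score_df$date, "%m"),
                  score_df$zip,
                  FUN = mean)

# 1 if the row's score is above its month/zip mean, 0 otherwise
score_df$above_avg <- as.integer(score_df$score > group_mean)
```

Because ave() accepts any number of grouping vectors, there is no need to concatenate month and zip into a single key.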