For a simple data frame:
client_id<-c("111","111","111","112","113","113","114")
transactions<-c(1,2,2,2,3,17,100)
transactions_sum<-c(5,5,5,2,20,20,100) ##precalculated sums of transaction counts for each client_id
segment<-c("low","low","low","low","low","low","high")
test<-data.frame(client_id,transactions,transactions_sum,segment)
client_id transactions transactions_sum segment
1 111 1 5 low
2 111 2 5 low
3 111 2 5 low
4 112 2 2 low
5 113 3 20 low
6 113 17 20 low
7 114 100 100 high
I'm trying to aggregate by segment and calculate segment means.
I'm expecting following results:
segment transactions_mean
1 low 9
2 high 100
Since calculating the mean should take into account repeating client_ids, we should sum individual transaction counts for each segment (1+2+2+2+3+17 for low segment) and divide by unique client_ids (3 for low segment), getting 27/3 = 9 for low segment. Using precalculated sums for each client_id: (5+2+20)/3 = 9
However, when I'm trying to run "dcast" or "aggregate" on this data, I'm getting wrong numbers, since apparently they treat each line as a unique observation:
dcast(test, segment ~ ., mean, value.var="transactions")
gives
segment .
1 low 4.5
2 high 100.0
which is effectively telling that it sums transaction counts per segment (1+2+2+2+3+17 for low segment) and divides by number of observations per segment (6 for low segment) and not unique client_ids.
What is a correct way of calculating means in this case?
We can use
data.table