Proper way of calculating means when aggregating data in long format

75 views Asked by At

For a simple data frame:

client_id<-c("111","111","111","112","113","113","114")
transactions<-c(1,2,2,2,3,17,100)
transactions_sum<-c(5,5,5,2,20,20,100) ##precalculated sums of transaction counts for each client_id
segment<-c("low","low","low","low","low","low","high")
test<-data.frame(client_id,transactions,transactions_sum,segment)


  client_id transactions transactions_sum segment
1       111            1                5     low
2       111            2                5     low
3       111            2                5     low
4       112            2                2     low
5       113            3               20     low
6       113           17               20     low
7       114          100              100    high

I'm trying to aggregate by segment and calculate segment means.

I'm expecting following results:

    segment transactions_mean
1   low     9
2   high    100     

Since calculating the mean should take into account repeating client_ids, we should sum individual transaction counts for each segment (1+2+2+2+3+17 for low segment) and divide by unique client_ids (3 for low segment), getting 27/3 = 9 for low segment. Using precalculated sums for each client_id: (5+2+20)/3 = 9

However, when I'm trying to run "dcast" or "aggregate" on this data, I'm getting wrong numbers, since apparently they treat each line as a unique observation:

 dcast(test, segment ~ ., mean, value.var="transactions")

gives

       segment     .
1          low   4.5
2         high 100.0

which is effectively telling that it sums transaction counts per segment (1+2+2+2+3+17 for low segment) and divides by number of observations per segment (6 for low segment) and not unique client_ids.

What is a correct way of calculating means in this case?

3

There are 3 answers

1
akrun On BEST ANSWER

We can use data.table

library(data.table)
setDT(test)[, .(transactions_mean = sum(transactions)/uniqueN(client_id)), by = segment]
#    segment transactions_mean
#1:     low                 9
#2:    high               100
0
mbiella On

You can use this:

meanLow <- mean(test$segment == "low")
meanHigh <- mean(test$segment == "high")
0
ggsdc On

You can also use dplyr

library(dplyr)
test_2 <- test %>%
  group_by(segment) %>%
  summarise (meanTransactions=sum(transactions)/n_distinct(client_id))

test_2

# A tibble: 2 × 2
  segment transactions
    <chr>        <dbl>
1    high          100
2     low            9