Proper way of calculating means when aggregating data in long format

Question

Proper way of calculating means when aggregating data in long format

75 views Asked by kkgery At 02 January 2017 at 12:17

For a simple data frame:

client_id<-c("111","111","111","112","113","113","114")
transactions<-c(1,2,2,2,3,17,100)
transactions_sum<-c(5,5,5,2,20,20,100) ##precalculated sums of transaction counts for each client_id
segment<-c("low","low","low","low","low","low","high")
test<-data.frame(client_id,transactions,transactions_sum,segment)


  client_id transactions transactions_sum segment
1       111            1                5     low
2       111            2                5     low
3       111            2                5     low
4       112            2                2     low
5       113            3               20     low
6       113           17               20     low
7       114          100              100    high

I'm trying to aggregate by segment and calculate segment means.

I'm expecting following results:

    segment transactions_mean
1   low     9
2   high    100

Since calculating the mean should take into account repeating client_ids, we should sum individual transaction counts for each segment (1+2+2+2+3+17 for low segment) and divide by unique client_ids (3 for low segment), getting 27/3 = 9 for low segment. Using precalculated sums for each client_id: (5+2+20)/3 = 9

However, when I'm trying to run "dcast" or "aggregate" on this data, I'm getting wrong numbers, since apparently they treat each line as a unique observation:

 dcast(test, segment ~ ., mean, value.var="transactions")

gives

       segment     .
1          low   4.5
2         high 100.0

which is effectively telling that it sums transaction counts per segment (1+2+2+2+3+17 for low segment) and divides by number of observations per segment (6 for low segment) and not unique client_ids.

What is a correct way of calculating means in this case?

Original Q&A

There are 3 answers

mbiella On 02 January 2017 at 12:32

You can use this:

meanLow <- mean(test$segment == "low")
meanHigh <- mean(test$segment == "high")

ggsdc On 02 January 2017 at 12:37

You can also use dplyr

library(dplyr)
test_2 <- test %>%
  group_by(segment) %>%
  summarise (meanTransactions=sum(transactions)/n_distinct(client_id))

test_2

# A tibble: 2 × 2
  segment transactions
    <chr>        <dbl>
1    high          100
2     low            9

**akrun** · Accepted Answer · 2017-01-02T12:25:49+00:00

akrun On 02 January 2017 at 12:25 BEST ANSWER

We can use data.table

library(data.table)
setDT(test)[, .(transactions_mean = sum(transactions)/uniqueN(client_id)), by = segment]
#    segment transactions_mean
#1:     low                 9
#2:    high               100

TechQA.

Proper way of calculating means when aggregating data in long format

There are 3 answers

Related Questions in R

Related Questions in AGGREGATE

Related Questions in DCAST

Popular Questions

Popular Tags

Trending Questions