I would like to create a new variable that:

  • summarises the count of one column
  • then assigns a rank to the group according to the highest count in ascending order
  • and if two (or more) cummulative counts are the same, it assigns it the next number.

Here's a test tibble:

df.test <- tibble(
  "group"  = c(1, 1, 2, 2, 2, 3, 3, 2, 2, 1, 4),
  "score" = c(0, 2, 1, 3, 2, 2, 1, 2, 0, 0, 2)
)

and this is what I would like it to look like:

# A tibble: 11 x 4
   group score cum.score  rank
   <dbl> <dbl>     <dbl> <dbl>
 1     1     0         0     1
 2     1     2         2     1
 3     2     1         1     4
 4     2     3         4     4
 5     2     2         6     4
 6     3     2         2     3
 7     3     1         3     3
 8     2     2         8     4
 9     2     0         8     4
10     1     0         2     1
11     4     2         2     2

Note that group 4 was assigned rank 2 as it had the same score as group 1 who was assigned rank 1.

This is the closest I got (which is not close):

df.test %>%
  group_by(group) %>%
  mutate(cum.score = cumsum(score),
         rank = last(cum.score))

but the rank just ascribes the total to them all instead of an actual rank.

3 Answers

2
MrFlick On Best Solutions

One way to do this would be to calculate the groups ranks in a sub-table and join that back to the original. You can do that with

df.test %>%
  group_by(group) %>%
  mutate(cum.score = cumsum(score)) %>% {
    x <- .
    x %>% summarize(max = max(cum.score)) %>% 
    mutate(rank=rank(max, ties.method ="first"), max=NULL) %>% 
    left_join(x, .)
  }

This seems necessary because you want values assigned within-group but you want the rank assigned across-groups.

2
IceCreamToucan On

I realise this is tagged with dplyr, but since there's already a good dplyr solution I'll add a data.table version.

In data.table you can join two tables a and b, adding a column of b (say b$x) to a with a[b, on = .(somevar), x := i.x]

library(data.table)
setDT(df.test)

df.test[, cum.score := cumsum(score), group]
df.test[df.test[, max(cum.score), group][, V1 := frank(V1, ties.method = 'first')]
        , on = .(group), rank := i.V1]

df.test
#     group score cum.score rank
#  1:     1     0         0    1
#  2:     1     2         2    1
#  3:     2     1         1    4
#  4:     2     3         4    4
#  5:     2     2         6    4
#  6:     3     2         2    3
#  7:     3     1         3    3
#  8:     2     2         8    4
#  9:     2     0         8    4
# 10:     1     0         2    1
# 11:     4     2         2    2
2
G. Grothendieck On

First define cum.score within group. Also define rank within group as a logical variable which is TRUE for the last value. Then outside of the grouping rank the last cum.score values using NA for the values other than last within group, i.e. those values assigned FALSE in the prior step. Finally in a second grouping fill the NAs with the rank for that group.

df.test %>%
  group_by(group) %>%
  mutate(cum.score = cumsum(score), rank = 1:n() == n()) %>%
  ungroup %>%
  mutate(rank = replace(NA * score, rank, rank(cum.score[rank], ties = "first"))) %>%
  group_by(group) %>%
  mutate(rank = na.omit(rank)) %>%
  ungroup