# Make new variable based on final cumulative sum of another column

Asked by At

I would like to create a new variable that:

• summarises the count of one column
• then assigns a rank to the group according to the highest count in ascending order
• and if two (or more) cummulative counts are the same, it assigns it the next number.

Here's a test tibble:

``````df.test <- tibble(
"group"  = c(1, 1, 2, 2, 2, 3, 3, 2, 2, 1, 4),
"score" = c(0, 2, 1, 3, 2, 2, 1, 2, 0, 0, 2)
)
``````

and this is what I would like it to look like:

``````# A tibble: 11 x 4
group score cum.score  rank
<dbl> <dbl>     <dbl> <dbl>
1     1     0         0     1
2     1     2         2     1
3     2     1         1     4
4     2     3         4     4
5     2     2         6     4
6     3     2         2     3
7     3     1         3     3
8     2     2         8     4
9     2     0         8     4
10     1     0         2     1
11     4     2         2     2
``````

Note that group 4 was assigned rank 2 as it had the same score as group 1 who was assigned rank 1.

This is the closest I got (which is not close):

``````df.test %>%
group_by(group) %>%
mutate(cum.score = cumsum(score),
rank = last(cum.score))
``````

but the `rank` just ascribes the total to them all instead of an actual rank.

## 3 Answers On Best Solutions

One way to do this would be to calculate the groups ranks in a sub-table and join that back to the original. You can do that with

``````df.test %>%
group_by(group) %>%
mutate(cum.score = cumsum(score)) %>% {
x <- .
x %>% summarize(max = max(cum.score)) %>%
mutate(rank=rank(max, ties.method ="first"), max=NULL) %>%
left_join(x, .)
}
``````

This seems necessary because you want values assigned within-group but you want the rank assigned across-groups. On

I realise this is tagged with `dplyr`, but since there's already a good `dplyr` solution I'll add a `data.table` version.

In `data.table` you can join two tables `a` and `b`, adding a column of `b` (say `b\$x`) to `a` with `a[b, on = .(somevar), x := i.x]`

``````library(data.table)
setDT(df.test)

df.test[, cum.score := cumsum(score), group]
df.test[df.test[, max(cum.score), group][, V1 := frank(V1, ties.method = 'first')]
, on = .(group), rank := i.V1]

df.test
#     group score cum.score rank
#  1:     1     0         0    1
#  2:     1     2         2    1
#  3:     2     1         1    4
#  4:     2     3         4    4
#  5:     2     2         6    4
#  6:     3     2         2    3
#  7:     3     1         3    3
#  8:     2     2         8    4
#  9:     2     0         8    4
# 10:     1     0         2    1
# 11:     4     2         2    2
`````` On

First define `cum.score` within group. Also define `rank` within group as a logical variable which is TRUE for the last value. Then outside of the grouping rank the last `cum.score` values using NA for the values other than last within group, i.e. those values assigned FALSE in the prior step. Finally in a second grouping fill the NAs with the rank for that group.

``````df.test %>%
group_by(group) %>%
mutate(cum.score = cumsum(score), rank = 1:n() == n()) %>%
ungroup %>%
mutate(rank = replace(NA * score, rank, rank(cum.score[rank], ties = "first"))) %>%
group_by(group) %>%
mutate(rank = na.omit(rank)) %>%
ungroup
``````