Cumulative count of unique values per group


I have a df with names and some eligibility dates. For each person, I would like to create an indicator of how many unique elig_end_date values they have accumulated over time (a running count of distinct values per person). Here is my df:

   names date_of_claim elig_end_date
1    tom    2010-01-01    2010-07-01
2    tom    2010-05-04    2010-07-01
3    tom    2010-06-01    2014-01-01
4    tom    2010-10-10    2014-01-01
5   mary    2010-03-01    2014-06-14
6   mary    2010-05-01    2014-06-14
7   mary    2010-08-01    2014-06-14
8   mary    2010-11-01    2014-06-14
9   mary    2011-01-01    2014-06-14
10  john    2010-03-27    2011-03-01
11  john    2010-07-01    2011-03-01
12  john    2010-11-01    2011-03-01
13  john    2011-02-01    2011-03-01
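
For anyone who wants to reproduce the example, the printed table above can be rebuilt roughly as follows. This is only a sketch: the column types are an assumption (the dates are kept as character strings here, since the post does not say how they are stored), and `as.Date()` could be applied to them if real dates are needed.

```r
# Rebuild the example data frame from the printed table.
# Assumption: names and both date columns are stored as character strings.
df <- data.frame(
  names = rep(c("tom", "mary", "john"), times = c(4, 5, 4)),
  date_of_claim = c(
    "2010-01-01", "2010-05-04", "2010-06-01", "2010-10-10",               # tom
    "2010-03-01", "2010-05-01", "2010-08-01", "2010-11-01", "2011-01-01", # mary
    "2010-03-27", "2010-07-01", "2010-11-01", "2011-02-01"                # john
  ),
  elig_end_date = rep(
    c("2010-07-01", "2014-01-01", "2014-06-14", "2011-03-01"),
    times = c(2, 2, 5, 4)
  ),
  stringsAsFactors = FALSE
)
```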

Here is my desired output:

   names date_of_claim elig_end_date obs
1    tom    2010-01-01    2010-07-01   1
2    tom    2010-05-04    2010-07-01   1
3    tom    2010-06-01    2014-01-01   2
4    tom    2010-10-10    2014-01-01   2
5   mary    2010-03-01    2014-06-14   1
6   mary    2010-05-01    2014-06-14   1
7   mary    2010-08-01    2014-06-14   1
8   mary    2010-11-01    2014-06-14   1
9   mary    2011-01-01    2014-06-14   1
10  john    2010-03-27    2011-03-01   1
11  john    2010-07-01    2011-03-01   1
12  john    2010-11-01    2011-03-01   1
13  john    2011-02-01    2011-03-01   1

I found the post "R: Count unique values by category" useful, but its answers return a separate table rather than adding a column to the df.

I have also tried this:

df$ob = ave(df$elig_end_date, df$elig_end_date, FUN=seq_along)

But this creates a count, and I really just want an indicator.
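
To illustrate what goes wrong, here is a small sketch on a toy vector (the name `end_dates` is made up for this example, and row indices are passed as the first argument to keep the result integer). Grouping by the end date itself and applying `seq_along` numbers the rows within each distinct date, which is a row counter rather than a count of distinct dates.

```r
# Toy vector standing in for one person's elig_end_date values (hypothetical).
end_dates <- c("2010-07-01", "2010-07-01", "2014-01-01", "2014-01-01")

# Group by the date itself and number the rows inside each group.
row_in_group <- ave(seq_along(end_dates), end_dates, FUN = seq_along)
row_in_group
# [1] 1 2 1 2   (the wanted indicator would be 1 1 2 2)
```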

Thank you in advance

OUTPUT OF STEPHEN'S CODE (which isn't the right code; just posting it as a learning point):

   names date_of_claim elig_end_date ob
1    tom    2010-01-01    2010-07-01  2
2    tom    2010-05-04    2010-07-01  2
3    tom    2010-06-01    2014-01-01  2
4    tom    2010-10-10    2014-01-01  2
5   mary    2010-03-01    2014-06-14  5
6   mary    2010-05-01    2014-06-14  5
7   mary    2010-08-01    2014-06-14  5
8   mary    2010-11-01    2014-06-14  5
9   mary    2011-01-01    2014-06-14  5
10  john    2010-03-27    2011-03-01  4
11  john    2010-07-01    2011-03-01  4
12  john    2010-11-01    2011-03-01  4
13  john    2011-02-01    2011-03-01  4

There is 1 answer

Henrik (best answer)

Another possibility using ave:

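# Within each name, flag the first occurrence of every elig_end_date
# (!duplicated), then take the running total of those flags (cumsum).
df$obs <- with(df, ave(elig_end_date, names,
                       FUN = function(x) cumsum(!duplicated(x))))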

#    names date_of_claim elig_end_date obs
# 1    tom    2010-01-01    2010-07-01   1
# 2    tom    2010-05-04    2010-07-01   1
# 3    tom    2010-06-01    2014-01-01   2
# 4    tom    2010-10-10    2014-01-01   2
# 5   mary    2010-03-01    2014-06-14   1
# 6   mary    2010-05-01    2014-06-14   1
# 7   mary    2010-08-01    2014-06-14   1
# 8   mary    2010-11-01    2014-06-14   1
# 9   mary    2011-01-01    2014-06-14   1
# 10  john    2010-03-27    2011-03-01   1
# 11  john    2010-07-01    2011-03-01   1
# 12  john    2010-11-01    2011-03-01   1
# 13  john    2011-02-01    2011-03-01   1
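
As a rough sketch of how the `cumsum(!duplicated(x))` idiom works, the block below breaks it into steps for one person. One thing to be aware of: `ave()` writes the function's results back into a copy of its first argument, so the result takes on that column's type (for instance, character if the dates are stored as strings). The variant at the end passes row indices to `ave()` instead, which is one possible way to get a plain integer column whatever the date column's type. The data frame `demo` and the helper names are made up for this illustration, and storing the dates as `Date` is an assumption.

```r
# Toy data mirroring the question (only the two columns the count needs).
# Assumption: elig_end_date is stored as Date here.
demo <- data.frame(
  names = rep(c("tom", "mary", "john"), times = c(4, 5, 4)),
  elig_end_date = as.Date(rep(
    c("2010-07-01", "2014-01-01", "2014-06-14", "2011-03-01"),
    times = c(2, 2, 5, 4)
  ))
)

# Step by step for one person: flag each first occurrence, then accumulate.
tom <- demo$elig_end_date[demo$names == "tom"]
first_seen <- !duplicated(tom)   # TRUE FALSE TRUE FALSE
cumsum(first_seen)               # 1 1 2 2

# Same idea per group, run over row indices so the result is an integer
# vector regardless of how elig_end_date is stored.
demo$obs <- ave(seq_len(nrow(demo)), demo$names,
                FUN = function(i) cumsum(!duplicated(demo$elig_end_date[i])))
```

Like the original one-liner, this assumes the rows are already in the desired time order within each person; if not, sorting by names and date_of_claim first would be needed.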