Efficiently finding the count of column values for distinct rows in a data frame in R


Suppose I have a data frame as:

id   value
1    "hi"
1    "hi"
1    "hi again"
1    "hi again"
2    "hello"
2    "hi"

Now I want to get the count of each value for each distinct value in the id column. The output would look like this:

id    value       Freq
1     "hi"        2
1     "hi again"  2
2     "hello"     1
2     "hi"        1   

I tried splitting the first data frame by each distinct id, getting the frequencies with the table() function on the value column, and appending the id column afterwards. However, this leaves me with a lot of data frames in memory. Is there a way to get the output above without using up memory on many intermediate data frames? I have almost 5 million rows.

grrgrrbla (best answer)

Assuming your data.frame is called df, using data.table:

library(data.table)
setDT(df)[ , .(Freq = .N), by = .(id, value)]
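To try this on the question's sample, the data can be rebuilt by hand. This is only a sketch: the name df follows the answer, while counts is a variable name introduced here for illustration.

```r
library(data.table)

# Sample data from the question, rebuilt by hand
df <- data.frame(
  id    = c(1, 1, 1, 1, 2, 2),
  value = c("hi", "hi", "hi again", "hi again", "hello", "hi"),
  stringsAsFactors = FALSE
)

# setDT() converts df to a data.table by reference (no copy is made);
# .N is the number of rows in each (id, value) group
counts <- setDT(df)[, .(Freq = .N), by = .(id, value)]
print(counts)
```

Since setDT() converts the data frame in place and the grouping happens in a single call, no per-id intermediate data frames are created, which is the memory concern raised in the question. Note that df itself is a data.table afterwards.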

Using dplyr:

library(dplyr)
group_by(df, id, value) %>% summarise(Freq = n())
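As a possible shorthand for the group-and-summarise call above, dplyr also provides count(). The sketch below assumes a reasonably recent dplyr version (the name argument sets the name of the count column) and rebuilds the question's sample by hand; df_counts is a name introduced here for illustration.

```r
library(dplyr)

# Sample data from the question, rebuilt by hand
df <- data.frame(
  id    = c(1, 1, 1, 1, 2, 2),
  value = c("hi", "hi", "hi again", "hi again", "hello", "hi"),
  stringsAsFactors = FALSE
)

# count() tallies the rows for each (id, value) combination;
# name = "Freq" names the count column to match the desired output
df_counts <- count(df, id, value, name = "Freq")
print(df_counts)
```

Either form returns one row per (id, value) pair that actually occurs in the data, so combinations with zero rows do not appear.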

You should choose one of those two packages (dplyr or data.table) and learn it really thoroughly. In the long run you will likely use both, but starting with one and really understanding it will help you tremendously. I use both pretty much every time I use R.

dplyr tends to be easier for beginners, so I would read a tutorial on it. This will help you forever. There is also a great video-tutorial which can be found on this site under The grammar and graphics of datascience.

I personally prefer data.table because it is faster and more flexible. Check the new HTML vignettes and the PDF vignettes here.