Grouping a data frame into 12 groups with the same column values


I have a large dataset with about 15 columns and more than 3 million rows.

Because the dataset is so big, I would like to use multidplyr on it.

Because of the structure of the data, I cannot simply split the data frame into 12 arbitrary parts. Let's say there are columns col1 and col2, each of which has several distinct values that repeat (in each column separately).

How can I make 12 (or n) similarly sized groups such that all rows sharing the same values in both col1 and col2 end up in the same group?

Example: Let's say one of the possible values in col1 is foo and one in col2 is bar. Then all rows with this pair of values would be placed in the same group.

For the question to make sense, assume there are always more than 12 unique combinations of col1 and col2.

If this were Python, I would try something with for and while loops, but as this is R, there is probably another way.


There are 2 answers

Answer by Roman:

Try this:

# As you provided no example data, I created some in which each (a, b) pair
# appears three times. I used dplyr (loaded via tidyverse), grouped by both
# columns, and then randomly sampled n = 2 rows from each group.
library(tidyverse)
df <- data.frame(a=rep(LETTERS,3), b=rep(letters,3))
# the data:
df %>%
   arrange(a,b) %>% 
   group_by(a,b) %>% 
   mutate(n=1:n())
# A tibble: 78 x 3
# Groups:   a, b [26]
        a      b     n
   <fctr> <fctr> <int>
 1      A      a     1
 2      A      a     2
 3      A      a     3
 4      B      b     1
 5      B      b     2
 6      B      b     3
 7      C      c     1
 8      C      c     2
 9      C      c     3
10      D      d     1
# ... with 68 more rows

Randomly sampling the data down to two rows per group:

set.seed(123)
df %>%
  arrange(a,b) %>% 
  group_by(a,b) %>% 
  mutate(n=1:n()) %>% 
  sample_n(2)
# A tibble: 52 x 3
# Groups:   a, b [26]
        a      b     n
   <fctr> <fctr> <int>
 1      A      a     1
 2      A      a     2
 3      B      b     2
 4      B      b     3
 5      C      c     3
 6      C      c     1
 7      D      d     2
 8      D      d     3
 9      E      e     2
10      E      e     1
# ... with 42 more rows
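
The grouping on a and b above can also be turned into a partition id, which is closer to what the question asks for. What follows is only a minimal sketch, assuming dplyr 1.0.0 or later (for cur_group_id()); the names n_parts, part, and df_parts are made up for illustration. It numbers the (a, b) combinations and deals them out round-robin over 12 partitions, so every row of a given combination lands in the same partition. It ignores how many rows each combination has, so the partitions are only similar in size when the combinations are.

```r
library(dplyr)

# Same toy data as above: 26 (a, b) combinations, three rows each
df <- data.frame(a = rep(LETTERS, 3), b = rep(letters, 3))

n_parts <- 12  # illustrative name: number of partitions wanted

df_parts <- df %>%
  group_by(a, b) %>%
  # cur_group_id() numbers the (a, b) combinations 1, 2, 3, ...;
  # the modulo folds those numbers onto 1..n_parts
  mutate(part = (cur_group_id() - 1) %% n_parts + 1) %>%
  ungroup()

# Number of rows that ended up in each partition
count(df_parts, part)
```

With real data, where some combinations are much larger than others, a size-aware assignment would balance the partitions better than this round-robin scheme.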
Answer by NOR:
# Create sample data
library(dplyr)
set.seed(123)  # make the random nobs values reproducible
df <- data.frame(a=rep(LETTERS,3), b=rep(letters,3), 
             nobs=sample(1:100, 26*3,replace=TRUE), stringsAsFactors=FALSE)

# Get all unique combinations of a and b (standing in for col1 and col2),
# together with the total nobs for each combination
combos <- df %>%
  group_by(a,b) %>% 
  summarize(n=sum(nobs)) %>% 
  as.data.frame(.) 

top12 <- combos %>% 
  arrange(desc(n)) %>% 
  top_n(12,n)
top12

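# Elements 1 to 11: one data frame per top combination
l <- list()
for(i in 1:11){
  l[[i]] <- combos[combos$a==top12[i,"a"] & combos$b==top12[i,"b"],]
}

# Element 12: every combination outside the top 11. The anti-join uses only
# the first 11 rows of top12; joining against all 12 would drop the 12th one.
l[[12]] <- combos %>% 
  anti_join(top12[1:11,],by=c("a","b")) 
l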

# This produces a list 'l' that contains twelve data frames: the 11 most commonly occurring pairs of col1 and col2 (here a and b, ranked by total nobs), one per element, and all remaining pairs in the 12th list element.
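
Giving the 11 largest pairs a group each can leave the 12th group much larger or smaller than the others. As a rough sketch of one alternative (a greedy heuristic swapped in here, not part of the answer above), each pair can be assigned, largest first, to whichever of the 12 groups currently holds the fewest rows. The names n_groups, n_rows, part, group_size, and df_parts are illustrative, and the toy data is rebuilt without nobs so the block runs on its own; row counts are used as the size of each pair.

```r
library(dplyr)

# Toy data: 26 (a, b) pairs, three rows each
df <- data.frame(a = rep(LETTERS, 3), b = rep(letters, 3),
                 stringsAsFactors = FALSE)

n_groups <- 12  # illustrative name: number of groups wanted

# One row per (a, b) pair with its row count, largest pairs first
combos <- df %>%
  count(a, b, name = "n_rows") %>%
  arrange(desc(n_rows))

# Greedy assignment: each pair goes to the group that is smallest so far
group_size <- numeric(n_groups)
combos$part <- NA_integer_
for (i in seq_len(nrow(combos))) {
  g <- which.min(group_size)
  combos$part[i] <- g
  group_size[g] <- group_size[g] + combos$n_rows[i]
}

# Attach the group id to every original row
df_parts <- df %>% left_join(combos, by = c("a", "b"))

# Rows per group
table(df_parts$part)
```

Because the assignment is made per pair and then joined back, all rows with the same a and b share one group id. The loop runs over the unique pairs only, not over the 3 million rows, so it should stay cheap as long as the number of pairs is moderate.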