Grouping by factor absent in dataset

88 views Asked by At

I'm using dplyr to calculate the grouping of a variable [mode], for a series of files. The goal is to build a data frame where you copy (cbind) the results for every file (any ideas on a better data structure welcome).

So files are like this:

FILE 1:
id |mode | no.cases
1    1      3
2    2      1
3    5      2

FILE 'n':
id |mode | no.cases
1    1      11
2    3      7
3    4      1

Some files may have entries for all 5 categories, others only for 2 or 3 of them. I can obtain all the existing categories from a complete file using: categories<-factor(mode)

Still, the problem appears when using group_by:

result <-group_by(file,mode,sum(no.cases))    
result (for file 1):
mode | no.cases
1      3
2      1
5      2

since you can't cbind on different no. of rows: in some cases the resulting data frame will have 5 rows, in some only 3 rows.

How can you use group_by so that it shows, for each file, all the existing categories, and the ones absent showing value 0?

E.g.

result (for file 1):
mode | no.cases
1      3
2      1
3      0
4      0
5      2

Thanks in advance, p.

1

There are 1 answers

1
Manohar Swamynathan On BEST ANSWER

Assuming that you have put all categories from a complete file into a data frame called categories.df

categories <- c(1,2,3,4,5)
# create data frame
categories.df <- data.frame(categories)
# rename column name
colnames(categories.df)[colnames(categories.df)=="categories"] <- "mode"

> categories.df
  mode
    1
    2
    3
    4
    5

Below is the sample code to merge categories.df with your file1

R Code:

# Method 1: merge function
df <- merge(x = categories.df, y = file1, by = "mode", all = TRUE)
df$id <- NULL # remove id
df[is.na(df)] <- 0 # replace na with zero

# Method 2: using dplyr
library(dplyr)
df <- left_join(categories.df, file1)
df$id <- NULL # remove id
df[is.na(df)] <- 0 # replace na with zero

# Method 2: using plyr
library(plyr)
df <- join(categories.df, file1, type="left")
df$id <- NULL # remove id
df[is.na(df)] <- 0 # replace na with zero

Output:

> df
  mode no.cases
    1        3
    2        1
    3        0
    4        0
    5        2

Hope this helps.