I have a dataset with 77 columns, some of which hold categorical values (for example, the column 'Sexual Orientation' can take one of the following values: 'Heterosexual', 'Homosexual', 'Bisexual', 'Other', 'NA'). I also have some NAs that I will impute after I have reshaped my data frame.

I want to transform this dataset into one that has only binary values. For example, I want the above column to be split into 4 different columns like this:

Heterosexual    Homosexual    Bisexual     Other
1               0             0            0

or, if I have an NA row, I want it to be represented as follows:

Heterosexual    Homosexual    Bisexual     Other
NA              NA            NA           NA

Also, I have 'binary' variables like "Gender" (I only have the values 'Male' and 'Female') and I want this column to be split into two different columns like this:

Male   Female
0      1

or, in the case of NA:

Male   Female
NA     NA

Is there a function that I can use to do this? My professor told me that the function 'reshape' could help me do it, but I have had some trouble using it and I don't think it would work.

Could you please give me any advice? Thank you in advance.

1 Answer

AkselA On Best Solutions

What you're trying to create are called dummy variables, and in R those are created using model.matrix(). Your specific application is a little special, however, because you want rows with a missing value to come out as NA in every dummy column, so some extra fiddling is required.
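To see why the extra fiddling is needed, here is a minimal sketch of what plain model.matrix() does with a single factor that contains an NA. The data frame `d` and the column name `orientation` are made up for illustration, and the behaviour shown assumes R's default `na.action` option (`na.omit`): the NA row is silently dropped instead of being kept as a row of NAs.

```r
# Toy data; 'd' and 'orientation' are illustrative names, not from the question's dataset
d <- data.frame(orientation = factor(c("Heterosexual", NA, "Bisexual", "Other")))

# '-1' removes the intercept, so every level gets its own 0/1 column
mm <- model.matrix(~ -1 + orientation, data = d)

colnames(mm)   # one column per level, prefixed with the variable name
nrow(mm)       # 3, not 4: the NA row was dropped by the default na.action
```

Because the row is dropped rather than filled with NA, the result can no longer be lined up with the other columns of the original data frame, which is what the loop below works around.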

dtf <- data.frame(id=20:24, 
                  f=c("a", "b", "c", "a", "b"), 
                  g=c("A", "C", NA, "B", "A"),
                  h=c("P", "R", "Q", NA, "Q"))

# (the first column is not a categorical variable, hence not included)
dtf2 <- dtf[-1]

# Pre-allocate a list of the appropriate length
l <- vector("list", ncol(dtf2))

# Loop over each column in dtf2 and build its matrix of dummy variables
for (j in seq_len(ncol(dtf2))) {
    # Make sure to include NA as a level 
    data <- dtf2[j]
    data[] <- factor(dtf2[,j], exclude=NULL)

    # Generate contrasts that include all levels
    cont <- contrasts(data[[1]], contrasts=FALSE)

    # Create dummy variables using the above contrasts, excluding intercept
    # Formula syntax is the same as in e.g. lm(), except the response
    # variable (term to the left of ~) is not included. 
    # '-1' means no intercept, '.' means all variables
    # contrasts.arg expects a list named after the factor column(s)
    modmat <- model.matrix(~ -1+., data=data,
                           contrasts.arg=setNames(list(cont), names(data)))

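    # Find the column(s) that represent the NA level
    nacols <- grep("NA$", colnames(modmat))

    # Only do the operations if an NA-column was found
    if (length(nacols) > 0) {
        # Rows that fall in the NA level are set to NA in every column
        narows <- rowSums(modmat[, nacols, drop=FALSE]) > 0
        modmat[narows,] <- NA
        # Drop the NA-level column itself, keeping the result a matrix
        modmat <- modmat[, -nacols, drop=FALSE]
    }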

    l[[j]] <- modmat
}

data.frame(dtf[1], do.call(cbind, l))
#   id fa fb fc gA gB gC hP hQ hR
# 1 20  1  0  0  1  0  0  1  0  0
# 2 21  0  1  0  0  0  1  0  0  1
# 3 22  0  0  1 NA NA NA  0  1  0
# 4 23  1  0  0  0  1  0 NA NA NA
# 5 24  0  1  0  1  0  0  0  1  0
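
As an alternative to the model.matrix() route above (not part of the original answer), the same layout can be sketched in base R by comparing a factor against each of its levels. Comparing NA with anything yields NA, so missing rows become NA in every column without any special handling. The helper name `to_dummies` and its `prefix` argument are hypothetical, and the `gender` vector is a made-up stand-in for the "Gender" column in the question.

```r
# Hypothetical helper: one 0/1 column per observed level; NA inputs stay NA
to_dummies <- function(x, prefix = "") {
    f <- factor(x)  # by default NA is not a level
    # For each level, TRUE/FALSE/NA comparison converted to 1/0/NA
    out <- vapply(levels(f), function(lv) as.integer(f == lv),
                  integer(length(f)))
    # Force a matrix shape even when x has a single element
    matrix(out, nrow = length(f),
           dimnames = list(NULL, paste0(prefix, levels(f))))
}

# Made-up example data mirroring the question's 'Gender' column
gender <- c("Female", "Male", NA, "Female")
g <- to_dummies(gender)
g
```

Columns come out in the sort order of the levels (here Female, then Male), and the result can be bound onto the rest of the data with cbind() or data.frame() in the same way as above.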