Data from multiple-response survey items often arrive in a structure that makes tidying difficult. Specifically, I have a survey question in which respondents pick one or more of 8 categorical items. Each cell of the resulting column holds up to 8 strings separated by commas; a cell might contain two, four, or none of the 8 options. The eighth item is "Other" and may be populated with custom text.
Incidentally, this is a typical format for Google Forms multiple-response data.
Below are example data. The third and last rows include a unique response for the eighth "other" option:
structure(list(actvTypes = c(NA, NA, "Data collection, Results / findings / learnings, ate ants and milkweed",
NA, "Discussion of our research question, Planning for data collection",
"Data analysis, Collected data, apples are yummy")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
I'd like to make a set of 8 new columns into which the responses are recorded as either 0 or 1. How can this be done efficiently?
I have a solution but it is cumbersome. I started by creating new columns for each of the response options:
atypes<- c("atype1","atype2","atype3","atype4","atype5","atype6","atype7","atype8")
log[atypes]<-NA
Next, I wrote eight ifelse statements; the format for the first seven is shown below:
library(stringr)
log$atype7 <- ifelse(str_detect(log$actvTypes, fixed("Met with non-DASA team member (not data collection)")), 1, 0)  # fixed(): match the parentheses literally, not as a regex group
For the "other" response option, I used a list of strings and a sapply solution:
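alloptions <- c('Discussion of our research question', 'Planning for data collection', 'Data analysis', 'Discussion of results | findings | learnings', 'Mid-course corrections to our project', 'Collected data', 'Met with non-DASA team member (not data collection)')
log$atype8 <- sapply(log$actvTypes, function(x) {
  if (is.na(x)) return(NA)
  # split the cell into its individual choices
  picked <- str_trim(str_split(x, ",")[[1]])
  # 1 if any choice is not one of the seven known options, i.e. free-text "other"
  as.integer(any(!picked %in% alloptions))
}, USE.NAMES = FALSE)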
How might this coding scheme be more elegant? Perhaps sapply and using an index?
Depending on what you're ultimately trying to do, the following could be helpful:
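A minimal sketch of this kind of pipeline, assuming the example data sit in a tibble called df (a hypothetical name) and that dplyr and tidyr are installed. It gives one 0/1 column per distinct response string rather than the eight fixed columns from the question, so treat it as a starting point:

```r
library(dplyr)
library(tidyr)

# Example data from the question; `df` is a hypothetical name.
df <- tibble(actvTypes = c(
  NA, NA,
  "Data collection, Results / findings / learnings, ate ants and milkweed",
  NA,
  "Discussion of our research question, Planning for data collection",
  "Data analysis, Collected data, apples are yummy"
))

wide <- df %>%
  mutate(id = row_number()) %>%                # remember which respondent each choice came from
  separate_rows(actvTypes, sep = ",\\s*") %>%  # one row per selected option
  count(id, actvTypes) %>%                     # n = 1 for each (respondent, option) pair
  pivot_wider(names_from = actvTypes, values_from = n, values_fill = 0) %>%
  select(-any_of("NA"))                        # drop the column created by empty responses
```

Respondents who left the item blank end up with 0 in every column here; if they should be NA instead, that needs an extra step.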
Take note of what this looks like right before the call to count(); grouping up the "other" category should be trivial if you know the "non-other" categories beforehand. You may also want to look at what this looks like after the call to separate_rows().
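As a rough sketch of that grouping step, assuming the known labels are the seven strings from the question's alloptions and again using df as a hypothetical name for the data: anything that is not an exact match for a known label is relabelled "Other" before summarising. Note that the match is exact, so in the example data "Data collection" and "Results / findings / learnings" are also counted as "Other" because they differ from the labels in alloptions.

```r
library(dplyr)
library(tidyr)

# Known (non-"other") labels, copied from the question's `alloptions`.
known <- c(
  "Discussion of our research question", "Planning for data collection",
  "Data analysis", "Discussion of results | findings | learnings",
  "Mid-course corrections to our project", "Collected data",
  "Met with non-DASA team member (not data collection)"
)

# Example data from the question; `df` is a hypothetical name.
df <- tibble(actvTypes = c(
  NA, NA,
  "Data collection, Results / findings / learnings, ate ants and milkweed",
  NA,
  "Discussion of our research question, Planning for data collection",
  "Data analysis, Collected data, apples are yummy"
))

long <- df %>%
  mutate(id = row_number()) %>%
  separate_rows(actvTypes, sep = ",\\s*") %>%
  # keep NA and known labels as they are; everything else becomes "Other"
  mutate(actvTypes = if_else(is.na(actvTypes) | actvTypes %in% known,
                             actvTypes, "Other"))

# One row per respondent: 1 if any of their choices was relabelled "Other".
other_flag <- long %>%
  group_by(id) %>%
  summarise(atype8 = as.integer(any(actvTypes == "Other", na.rm = TRUE)))
```

The column name atype8 simply mirrors the question; the same relabelled long table can be passed on to count() and pivot_wider() to build the other seven indicator columns.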