I have a dataframe that contains fields with strings, such as "fish, birds, animals", etc. I have collapsed them into a list, and iterate over them in order to create logical fields within same dataframe. Update: The question is now updated with a more elaborate example.
However, this is slow and does not feel optimal. It's not an operation I have to do more than once, so I wasn't that bothered, but think there might be a better way, using dplyr, perhaps.
This code does create new fields for every match of each element in my_list within the field items.
no <- seq(1:3)
items <- c('fish,cat,dog', 'horse,elephant,dog', 'hamster,pig')
df <- data.frame(no, items)
df$items <- as.character(df$items)
df
Will create the following data frame:
no items
1 1 fish,cat,dog
2 2 horse,elephant,dog
3 3 hamster,pig
Running this code will harvest the field items and expand it into logical fields
items <- paste(df$items, collapse = ",")
item_list <- unlist(unique(strsplit(items, ",")))
for (i in 1:length(item_list)) {
lt <- item_list[i]
df <- df %>% rowwise() %>% mutate(!!lt := grepl(lt, items))
}
data.frame(df)
Resulting in this data frame:
no items fish cat dog horse elephant hamster pig
1 1 fish,cat,dog TRUE TRUE TRUE FALSE FALSE FALSE FALSE
2 2 horse,elephant,dog FALSE FALSE TRUE TRUE TRUE FALSE FALSE
3 3 hamster,pig FALSE FALSE FALSE FALSE FALSE TRUE TRUE
This will be fairly fast
The trickiest part is that a matrix subset by a two-column matrix treats the first column of the two-column matrix as a row index, and the second column as the column index. So the rows and columns that you want to set to
TRUE
areand the matrix is updated with
There is no iteration (e.g.,
sapply()
), so thematch()
function is called once rather thannrow(df)
times.For 3M rows, I have
For the other solution (at time of writing) by Christoph:
and the data.table solution by Uwe (watch out, reference semantics change the value of dt! Also, I don't know how to pass the column name as a function argument):
with times
About 1/2 the time of
f1()
is used bystrsplit()
;stringr::str_split()
is about two times faster, but since the pattern used to split is fixed (not a regular expression) it makes sense to usestrsplit(fixed=TRUE)
, which is about 3x faster. Probably some data.table pro will come up with a very fast solution (but then you need to be a data.table pro...).It's tempting to do things like 'collapse them [words shared by an item] into a list [actually a vector!]', but it will often be sensible to leave words in a list
and save yourself the time / trouble required to re-split. The tidyverse way would be to create an extended version of the table
(or the other approaches in the so-called 'duplicate' question). This would probably cause one to re-think what the role of the columns of logicals is in subsequent manipulations.