I have this data frame
structure(list(rule.id = c(1, 2), rules = structure(1:2, .Label = c("Lamp1.1,Lamp1.2",
"Lamp2.1,Lamp2.2"), class = "factor")), .Names = c("rule.id",
"rules"), row.names = c(NA, -2L), class = "data.frame")
# rule.id rules
#1 1 Lamp1.1,Lamp1.2
#2 2 Lamp2.1,Lamp2.2
which I need to split on the "rules" column by separator comma (","), multiple commas occur (not only 2 like in example) and then transform this into a normalized format with keeping the relevant rule.id value from the original df. The result should look like this:
structure(list(rule.id = c(1, 1, 2, 2), lhs = c("Lamp1.1", "Lamp1.2",
"Lamp2.1", "Lamp2.1")), .Names = c("rule.id", "lhs"), row.names = c(NA,
-4L), class = "data.frame")
# rule.id lhs
#1 1 Lamp1.1
#2 1 Lamp1.2
#3 2 Lamp2.1
#4 2 Lamp2.1
I have a code that takes care of the str split and normalize (long) format, but not sure how to take care of the rule.id requirement
lhs.norm <- as.data.frame(
cbind(
rules.df$ruleid,
unlist(strsplit(
unlist(lapply(strsplit(unlist(lapply(as.character(rules.df$rules),function(x) substr(x,2,nchar(x)))), "} =>", fixed = T), function(x) x[1]))
,","))))
thanks to @acrun solution using
cSplit(rules.df.lhs, "lhs", ",", "long"))
I benchmarked 19 seconds for 1M rows (result was around 2M rows)
We can use
cSplit
fromsplitstackshape
If this is a huge dataset, we can use
stringi
to splitAnother option is
data.table