I've got a data.table DT with a string column and a numeric column that indicates how many words from the start of the string should be extracted.
> require(data.table)
> DT <- data.table(string_col = c("A BB CCC", "DD EEE FFFF GDG", "AB DFD EFGD ABC DBC", "ABC DEF")
, first_n_words = c(2, 3, 3, 1))
> DT
            string_col first_n_words
1:            A BB CCC             2
2:     DD EEE FFFF GDG             3
3: AB DFD EFGD ABC DBC             3
4:             ABC DEF             1
I'd like to add a new column containing the first n words of string_col, where n is that row's first_n_words, as follows:
> output_DT
            string_col first_n_words output_string_col
1:            A BB CCC             2              A BB
2:     DD EEE FFFF GDG             3       DD EEE FFFF
3: AB DFD EFGD ABC DBC             3       AB DFD EFGD
4:             ABC DEF             1               ABC
For a single value of first_n_words, this is the gsub call that can be used:
> gsub(paste0("^((\\w+\\W+){", first_n_words - 1, "}\\w+).*$"),"\\1", string_col)
I basically need to build this gsub pattern for every row, using that row's first_n_words, and then apply it to that row's string_col. I'm only interested in a data.table solution, as it's a very large data set. A gsub-based solution would be preferred.
Edit: I've tried the following and it doesn't work:
> DT[, output_string_col := gsub(paste0("^((\\w+\\W+){", first_n_words - 1, "}\\w+).*$"),"\\1", string_col)]
Warning message:
In gsub(paste0("^((\\w+\\W+){", first_n_words - 1, "}\\w+).*$"), :
argument 'pattern' has length > 1 and only the first element will be used
> ## This is not the desired output
> DT
            string_col first_n_words output_string_col
1:            A BB CCC             2              A BB
2:     DD EEE FFFF GDG             3            DD EEE
3: AB DFD EFGD ABC DBC             3            AB DFD
4:             ABC DEF             1           ABC DEF
One way to stay within data.table is to use a grouping operation, because gsub needs a single pattern value, not a vector of patterns.
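The grouping idea can be sketched as follows. This is a minimal sketch, assuming one group per row via by = seq_len(nrow(DT)); that grouping key is my assumption, and any key that makes first_n_words a single value within each group would work the same way.

```r
library(data.table)

DT <- data.table(
  string_col    = c("A BB CCC", "DD EEE FFFF GDG", "AB DFD EFGD ABC DBC", "ABC DEF"),
  first_n_words = c(2, 3, 3, 1)
)

# One group per row (assumed grouping key), so paste0() builds exactly
# one pattern per gsub() call and no "length > 1" warning is raised.
DT[, output_string_col := gsub(
       paste0("^((\\w+\\W+){", first_n_words - 1, "}\\w+).*$"),
       "\\1", string_col),
   by = seq_len(nrow(DT))]

DT
```

With one group per row, gsub is called once per row, which is correct but can be slow on a very large table.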
Edit: As @Franck remarked, the grouping should be on first_n_words to be more efficient. The benchmark with this modified version gives faster results.
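A sketch of the version grouped on first_n_words, using the example data from the question. Inside each group the grouping column is a single value, so only one pattern is built per distinct word count and gsub runs vectorised over all strings in that group.

```r
library(data.table)

DT <- data.table(
  string_col    = c("A BB CCC", "DD EEE FFFF GDG", "AB DFD EFGD ABC DBC", "ABC DEF"),
  first_n_words = c(2, 3, 3, 1)
)

# Group on first_n_words: within a group it is a length-1 value, so a
# single pattern is applied to every string_col value in that group.
DT[, output_string_col := gsub(
       paste0("^((\\w+\\W+){", first_n_words - 1, "}\\w+).*$"),
       "\\1", string_col),
   by = first_n_words]

DT
```

The number of gsub calls then equals the number of distinct first_n_words values rather than the number of rows, which is where the speed-up would come from on a large table.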