cSplit does not work when a field has embedded the separator

218 views Asked by At

I am using cSplit to split a column into three separate columns. The separator is " / "

However, one of my fields has embedded the "/" separator. The third element of the third line was supposed to be and stay as "f/j" after the split.

When I try it in the following example, it creates an extra (fourth) column

name <- c("abc / efg / hij", "abc / abc / hij", "efg / efg / f/j", "abd / efj / hij")
y <- c(1,1.2,3.4, 5)

dt <- data.frame(name,y)
dt
dt <- cSplit(dt,"name","/", drop=FALSE)
dt

When I try it in my original data set, which has over 5,000 lines, it produces the following error:

Error in fread(x, sep[i], header = FALSE):

Expecting 3 cols, but line 2307 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep='/' and/or '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.

2

There are 2 answers

0
A5C1D2H2I1M1N2O1R2T1 On BEST ANSWER

You should be able to just set fixed = FALSE:

cSplit(dt, "name", " / ", fixed = FALSE, drop = FALSE)
##               name   y name_1 name_2 name_3
## 1: abc / efg / hij 1.0    abc    efg    hij
## 2: abc / abc / hij 1.2    abc    abc    hij
## 3: efg / efg / f/j 3.4    efg    efg    f/j
## 4: abd / efj / hij 5.0    abd    efj    hij
0
Abdou On

If the data is structured the same way your name vector is structured, you could use the following which relies on the idea that the targeted / characters are surrounded by whitespace characters:

cSplit(dt,"name"," / ", drop=FALSE)

But as you mentioned, that has led to the following error:

Error in fread(x, sep[i], header = FALSE) : 'sep' must be 'auto' or a single character

While I fail to figure out the main cause of that, I think replacing the targeted / characters with an underscore (or anything else different from a /) and then split on the underscores. The following could serve as an illustration:

dt$name <- gsub("([^/]+)/([^/]+)/(.*)", "\\1_\\2_\\3", dt$name)
cSplit(dt, "name", "_", drop=F)

#           name   y name_1 name_2 name_3
# 1: abc_efg_hij 1.0    abc    efg    hij
# 2: abc_abc_hij 1.2    abc    abc    hij
# 3: efg_efg_f/j 3.4    efg    efg    f/j
# 4: abd_efj_hij 5.0    abd    efj    hij

I hope this helps.