Insert dots around substring using gsub in R


I would like to have a function in R that inserts dots (".") around a given substring (e.g. "alpha") if they are not already present. E.g. string="10-alpha-epoxy-Amorph-4-ene" should return

"10-.alpha.-epoxy-Amorph-4-ene"

string="alpha-cadolene" should return

".alpha.-cadolene"

but string=".alpha.-cadolene" should return

".alpha.-cadolene"

(the substring could occur multiple times)

What would be the easiest way to accomplish this using gsub in R?

cheers, Tom

There are 2 answers

agstudy (best answer)

I would do something like this:

gsub("[.]?(alpha)[.]?", ".\\1.",
     c("10-alpha-epoxy-Amorph-4-ene", ".alpha.-cadolene",
       "alpha.-cadolene", ".alpha-cadolene"))
[1] "10-.alpha.-epoxy-Amorph-4-ene" ".alpha.-cadolene"
[3] ".alpha.-cadolene"              ".alpha.-cadolene"
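
Since the question asks for a function, the call above can be wrapped as in the following minimal sketch. The name add_dots and its arguments are hypothetical, and the sketch assumes the term contains no regular-expression metacharacters, because it is pasted into the pattern unescaped.

```r
# Hypothetical helper; wraps the gsub() call shown above.
# Assumes `term` is plain text without regex metacharacters.
add_dots <- function(x, term = "alpha") {
  # optional dot, the captured term, optional dot
  pattern <- paste0("[.]?(", term, ")[.]?")
  # put the captured term back with exactly one dot on each side
  gsub(pattern, ".\\1.", x)
}

add_dots("10-alpha-epoxy-Amorph-4-ene")
# [1] "10-.alpha.-epoxy-Amorph-4-ene"
add_dots(c("alpha-cadolene", ".alpha.-cadolene"))
# [1] ".alpha.-cadolene" ".alpha.-cadolene"
add_dots("alpha-beta-alpha")   # several occurrences in one string
# [1] ".alpha.-beta-.alpha."
```

Because existing dots are consumed by the optional [.]? on either side and then written back, applying the function twice gives the same result as applying it once.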

EDIT: Generalization to many terms:

If you have a list of terms, you can build your regular expression using paste:

terms <- c('alpha','gamma','beta','delta')

gsub(paste0("[.]?(", paste0(terms, collapse = "|"), ")[.]?"), ".\\1.",
     c("10-alpha-epoxy-Amorph-4-ene", ".gamma.-cadolene",
       "beta.-cadolene", ".delta-cadolene"))
[1] "10-.alpha.-epoxy-Amorph-4-ene" ".gamma.-cadolene"
[3] ".beta.-cadolene"               ".delta.-cadolene"
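
One caveat with many terms: short ones such as "pi" can also match inside longer words. The sketch below is one possible guard, not part of the original answer: it adds word boundaries (\\b) around the alternation and uses perl = TRUE. The function name add_dots_multi is hypothetical, and the terms are assumed to be plain letters with no regex metacharacters.

```r
# Hypothetical helper; a variant of the pattern above with word boundaries.
# Assumes every element of `terms` is plain letters (no regex metacharacters).
add_dots_multi <- function(x, terms) {
  alternation <- paste(terms, collapse = "|")
  # \\b requires the term to start and end at a word boundary, so "pi"
  # does not match inside "pinene" and "alpha" does not match in "alphabet"
  pattern <- paste0("[.]?\\b(", alternation, ")\\b[.]?")
  gsub(pattern, ".\\1.", x, perl = TRUE)
}

terms <- c("alpha", "beta", "pi")
add_dots_multi(c("beta.-cadolene", "alpha-pinene", "alphabet"), terms)
# [1] ".beta.-cadolene" ".alpha.-pinene"  "alphabet"
```

Without the boundaries, "alpha-pinene" would come back as ".alpha.-.pi.nene". Note that digits and underscores count as word characters, so a term glued directly to a digit (e.g. "10alpha") is not matched by this variant.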

EDIT: Getting the list of Greek letter names, spelled out in full:

library(XML)
dat <- readHTMLTable("http://en.wikipedia.org/wiki/Greek_alphabet",
                     stringsAsFactors=FALSE)[[2]]

terms <- as.character(dat$V2[-c(1,2)])
 [1] "alpha"   "beta"    "gamma"   "delta"   "epsilon" "zeta"    "eta"     "theta"   "iota"    "kappa"   "lambda" 
[12] "mu"      "Name"    "Modern"  "nu"      "xi"      "omicron" "pi"      "rho"     "sigma"   "tau"     "upsilon"
[23] "phi"     "chi"     "psi"     "omega"  
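
The scraped vector above also contains "Name" and "Modern", which look like table header cells rather than letter names. A minimal sketch of one way to drop them, assuming every genuine term is entirely lowercase; the vector is copied by hand from the printed output so the example runs without network access:

```r
# Vector copied from the output above, for illustration only
terms <- c("alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta",
           "theta", "iota", "kappa", "lambda", "mu", "Name", "Modern",
           "nu", "xi", "omicron", "pi", "rho", "sigma", "tau", "upsilon",
           "phi", "chi", "psi", "omega")

# Keep only entries made up entirely of lowercase letters
terms <- terms[grepl("^[a-z]+$", terms)]
length(terms)
# [1] 24
```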

Dirk is no longer here

Here is one way:

R> gsub("-(alpha)-", "-.\\1.-", "10-.alpha.-epoxy-Amorph-4-ene")
[1] "10-.alpha.-epoxy-Amorph-4-ene"
R> gsub("-(alpha)-", "-.\\1.-", "10-alpha-epoxy-Amorph-4-ene")
[1] "10-.alpha.-epoxy-Amorph-4-ene"

The parenthesized (....) expression is captured, making it easy to recall it in the replacement part as \\1 (a second such expression would be \\2).

By explicitly naming the expression, you make sure no other match can occur. Note that this pattern requires a hyphen on both sides, so a term at the very start of the string (e.g. "alpha-cadolene") is left unchanged. You can of course generalize this:

gsub("-([a-z]*)-", "-.\\1.-", "10-.alpha.-epoxy-Amorph-4-ene")

would replace any expression of lowercase letters (but not punctuation, digits, ...).
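
To illustrate the numbered back-references mentioned above, here is a small sketch with two captured groups. The pattern and the swap are made up for demonstration and are not needed for the original task.

```r
# Two capture groups: \\1 refers to "alpha", \\2 to "epoxy"
swapped <- gsub("(alpha)-(epoxy)", "\\2-\\1", "10-alpha-epoxy-Amorph-4-ene")
swapped
# [1] "10-epoxy-alpha-Amorph-4-ene"
```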