How to separate a string based on specific requirements

I have a variable, a, that contains strings like:

DEVICE PRF .75MG 0.5ML
DEVICE PRF 1.5MG 0.5MLX4
CAP 12-25MG 30
CAP DR 60MG 100UD 3270-33 (32%)

I would like to split each of them into three parts (or variables):

x             y          z
DEVICE PRF    .75MG      0.5ML
DEVICE PRF    1.5MG      0.5MLX4
CAP           12-25MG    30
CAP DR        60MG       100UD 3270-33 (32%)

The first part is the description, the second is the strength, and the third is the volume. I think I can use gregexpr(), but I'm not sure how to implement it. Any suggestions are appreciated. Thank you!

There are 2 answers

C. Braun

Using the assumption that the middle part has no spaces and always starts with a . or digit, we can do this in base R like this:

a <- c("DEVICE PRF .75MG 0.5ML", "DEVICE PRF 1.5MG 0.5MLX4",
       "CAP 12-25MG 30", "CAP DR 60MG 100UD 3270-33 (32%)")

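# Rewrite each string as "x,y,z" by capturing the three parts
a_as_csv <- sub('([^.0-9]*) ([.0-9][^ ]+) (.*)', '\\1,\\2,\\3', a)

# Parse the comma-separated strings into a data frame
read.csv(textConnection(a_as_csv), col.names = c('x', 'y', 'z'), header = FALSE)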
#            x       y                   z
# 1 DEVICE PRF   .75MG               0.5ML
# 2 DEVICE PRF   1.5MG             0.5MLX4
# 3        CAP 12-25MG                  30
# 4     CAP DR    60MG 100UD 3270-33 (32%)
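
The CSV round trip above would misparse any string that already contains a comma. As a hedged alternative, here is a minimal sketch that applies the same pattern with strcapture() from the utils package (assuming R 3.4.0 or later) and fills a data frame directly. The ^ and $ anchors and the name parts are additions for illustration, not part of the original answer.

```r
a <- c("DEVICE PRF .75MG 0.5ML", "DEVICE PRF 1.5MG 0.5MLX4",
       "CAP 12-25MG 30", "CAP DR 60MG 100UD 3270-33 (32%)")

# proto gives the column names and types for the three capture groups
parts <- strcapture(
  "^([^.0-9]*) ([.0-9][^ ]+) (.*)$",
  a,
  proto = data.frame(x = character(), y = character(), z = character(),
                     stringsAsFactors = FALSE)
)

parts
```

Strings that do not match the pattern come back as a row of NA values, so they are easy to spot.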

Julius Vainora

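You could use str_match() from stringr:

library(stringr)
# a is the character vector from the question; drop the full match (column 1)
# and the inner MG/ML group (column 4)
str_match(a, "(.*)[ ]{1,}(.*(MG|ML))[ ]{1,}(.*)")[, -c(1, 4)]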
#      [,1]         [,2]      [,3]                 
# [1,] "DEVICE PRF" ".75MG"   "0.5ML"              
# [2,] "DEVICE PRF" "1.5MG"   "0.5MLX4"            
# [3,] "CAP"        "12-25MG" "30"                 
# [4,] "CAP DR"     "60MG"    "100UD 3270-33 (32%)"

This assumes that the second (middle) part always ends with MG or ML and contains no spaces.

The pattern (.*)[ ]{1,}(.*(MG|ML))[ ]{1,}(.*) can be read as: the first part (anything), then at least one space, then the second part (anything ending in MG or ML), then at least one space, then the third part (anything). The inner (MG|ML) group is captured as well, which is why column 4 is dropped together with the full match in column 1.
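
Since the question mentions gregexpr(), here is a minimal base R sketch of the same idea using regexec() and regmatches(), which return the capture groups directly. It is a variation rather than the pattern from the answer: the middle group is written as [^ ]* so that the "no spaces" assumption is stated in the pattern itself, and the names pattern, m and parts are illustrative.

```r
a <- c("DEVICE PRF .75MG 0.5ML", "DEVICE PRF 1.5MG 0.5MLX4",
       "CAP 12-25MG 30", "CAP DR 60MG 100UD 3270-33 (32%)")

# Middle part: a run of non-space characters ending in MG or ML
pattern <- "^(.*) +([^ ]*(MG|ML)) +(.*)$"

# Each list element holds: full match, part 1, part 2, inner MG/ML group, part 3
m <- regmatches(a, regexec(pattern, a))

# Keep the three parts of interest and stack them into a character matrix
parts <- do.call(rbind, lapply(m, function(v) v[c(2, 3, 5)]))
colnames(parts) <- c("x", "y", "z")

parts
```

This sketch assumes every string matches; an element that does not match yields an empty vector in m and would need to be handled before the rbind step.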