Extract all substrings in string


I want to extract all substrings that begin with M and are terminated by a *.

Take the string below as an example:

vec<-c("SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ")

Ideally, this would return:

MGMTPRLGLESLLE
MTPRLGLESLLE

I have tried the code below:

regmatches(vec, gregexpr('(?<=M).*?(?=\\*)', vec, perl=T))[[1]]

However, this drops the leading M and returns only the first substring, rather than every substring that starts at an M:

"GMTPRLGLESLLE"

There are 3 answers

Wiktor Stribiżew On BEST ANSWER

You can use

(?=(M[^*]*)\*)

See the regex demo. Details:

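  • (?= - start of a positive lookahead that matches a location immediately followed by:
  • (M[^*]*) - Group 1: M, then zero or more chars other than a * char
  • \* - a * char
  • ) - end of the lookahead.

Because the lookahead consumes no characters, the regex engine tries it at every position in the string, so overlapping substrings that start at different M chars are each captured in Group 1.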

See the R demo:

library(stringr)
vec <- c("SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ")
matches <- stringr::str_match_all(vec, "(?=(M[^*]*)\\*)")
unlist(lapply(matches, function(z) z[,2]))
## => [1] "MGMTPRLGLESLLE" "MTPRLGLESLLE" 

If you prefer a base R solution (gregexec() requires R 4.1.0 or later):

vec <- c("SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ")
matches <- regmatches(vec, gregexec("(?=(M[^*]*)\\*)", vec, perl=TRUE))
unlist(lapply(matches, tail, -1))
## => [1] "MGMTPRLGLESLLE" "MTPRLGLESLLE"

William Lafond On

This could instead be done with a for loop over a character array converted from your string.

When you encounter an M, start appending characters to a new string until you reach a *. At that point, push the new string onto an array of strings and start over from the first step, continuing until you reach the end of the loop.

It's not quite as interesting as using a regex, but it's fail-safe.
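A minimal sketch of the loop idea in base R, with one adjustment: instead of a single accumulator, it starts a candidate at every M, so a substring that begins inside another one (the second M in the example) is returned as well. The function name extract_m_to_star is hypothetical and used only for illustration.

```r
# Hypothetical helper: for every "M", return the text from that "M"
# up to (but not including) the next "*" that follows it.
extract_m_to_star <- function(x) {
  chars  <- strsplit(x, "", fixed = TRUE)[[1]]  # string -> vector of single chars
  starts <- which(chars == "M")                 # positions of every M
  stars  <- which(chars == "*")                 # positions of every *
  out <- character(0)
  for (s in starts) {
    later <- stars[stars > s]                   # terminators after this M
    if (length(later) > 0) {
      out <- c(out, substr(x, s, later[1] - 1))
    }
  }
  out
}

extract_m_to_star("SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ")
## => [1] "MGMTPRLGLESLLE" "MTPRLGLESLLE"
```

The final M (in MIRVASQ) is skipped because no * follows it.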

danlooo On

An ordinary consuming pattern cannot return nested matches on its own: each match consumes the characters it covers, so a later match cannot start inside an earlier one.

stringr::str_extract_all("abaca", "a[^a]*a") only gives you aba but not the surrounding abaca.

The first M was dropped because (?<=M) is a positive lookbehind, which by definition is not part of the match; it only asserts what comes immediately before it.
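To illustrate the lookbehind point, the sketch below compares the pattern from the question with a variant that matches M literally. It uses only base R; the variable names are illustrative.

```r
vec <- "SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ"

# Pattern from the question: the lookbehind asserts an M but does not include it.
with_lookbehind <- regmatches(vec, gregexpr("(?<=M).*?(?=\\*)", vec, perl = TRUE))[[1]]
with_lookbehind
## => [1] "GMTPRLGLESLLE"

# Matching M literally keeps it in the result.
with_literal_m <- regmatches(vec, gregexpr("M.*?(?=\\*)", vec, perl = TRUE))[[1]]
with_literal_m
## => [1] "MGMTPRLGLESLLE"
```

Matching M literally restores the leading M, but still yields a single result, because the second M lies inside the text already consumed by the first match. Capturing inside a lookahead, as in the accepted answer, is what allows the overlapping substring to be returned too.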