I have been trying to get the separate groups from the below string using regex in PCRE:
drop = blah blah blah something keep = bar foo nlah aaaa rename = (a=b d=e) obs=4 where = (foo > 45 and bar == 35)
The groups I am trying to extract are:
1. drop = blah blah blah something
2. keep = bar foo nlah aaaa
3. rename = (a=b d=e)
4. obs=4
5. where = (foo > 45 and bar == 35)
I have written a regex using recursion, but the recursion only partially works when selecting multiple words after drop: it selects just the first three words (blah blah blah) and not the fourth. I have looked through various Stack Overflow questions and have also tried a positive lookahead, but this is the closest I could get, and I am stuck because I cannot see what I am doing wrong.
The regex that I have written: (?i)(drop|keep|where|rename|obs)\s*=\s*((\w+|\d+)(\s+\w+)(?4)|(\((.*?)\)))
The same can be seen here: RegEx Demo.
Any help on this or understanding what I am doing wrong is appreciated.
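In your pattern, (?4) is a subroutine call that runs group 4, (\s+\w+), exactly once rather than repeatedly, so the first alternative always matches exactly three words. You can use a branch reset group solution instead:
(?i)\b(drop|keep|where|rename|obs)\s*=\s*(?|(\w+(?:\s+\w+)*)(?=\s+\w+\s+=|$)|\((.*?)\))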
See the PCRE regex demo
Details
(?i) - case insensitive mode on
\b - a word boundary
(drop|keep|where|rename|obs) - Group 1: any of the words in the group
\s*=\s* - a = char enclosed with 0+ whitespace chars
(?| - start of a branch reset group:
(\w+(?:\s+\w+)*) - Group 2: one or more word chars followed with zero or more repetitions of one or more whitespaces and one or more word chars
(?=\s+\w+\s+=|$) - up to one or more whitespaces, one or more word chars, one or more whitespaces, and =, or end of string
| - or
\((.*?)\) - a literal (, then Group 2 capturing any zero or more chars other than line break chars, as few as possible, and then a literal )
) - end of the branch reset group.
See Python demo:
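A minimal sketch using Python's standard re module. The built-in re does not support branch reset groups (the third-party regex package does), so this adaptation captures the two alternatives in separate groups (2 and 3) and merges them afterwards. The adaptation and the names PATTERN and parse_options are assumptions made for illustration, not the answer's original demo code.

```python
import re

# Adapted from the PCRE pattern: Python's built-in `re` has no branch reset
# group (?|...), so each alternative gets its own capturing group.
PATTERN = re.compile(
    r"(?i)\b(drop|keep|where|rename|obs)\s*=\s*"
    r"(?:(\w+(?:\s+\w+)*)(?=\s+\w+\s+=|$)|\((.*?)\))"
)


def parse_options(text):
    """Return (keyword, value) pairs; hypothetical helper for illustration."""
    pairs = []
    for m in PATTERN.finditer(text):
        # Exactly one of groups 2 and 3 participates in each match.
        value = m.group(2) if m.group(2) is not None else m.group(3)
        pairs.append((m.group(1), value))
    return pairs


text = ("drop = blah blah blah something keep = bar foo nlah aaaa "
        "rename = (a=b d=e) obs=4 where = (foo > 45 and bar == 35)")

for key, value in parse_options(text):
    print(key, "->", value)
```

Each keyword is paired with its value; for rename and where the value is captured without the surrounding parentheses, and m.group(0) still holds the full "keyword = value" text if that is what you need.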