Regex capture group with different quantifiers


The text I am parsing includes asterisks before and after the capture group, as well as within it. The pattern I have identified is basically:

The run of consecutive asterisks before the text is always at least 30 characters long. Similarly, the run of consecutive asterisks after the last word is always at least 15 characters long. Any run of consecutive asterisks inside the capture group is always shorter than 10.

The problem is that I am unsure how to give the asterisks in the capture group a different quantifier from the other characters in the group while keeping them in the same match. For example:

text = 'ÿÿÿÿ*************************************************CURRICULUM VITAE***Información *personal*********************ìÌ**Ì*Ì*Ì*'

So basically, I need to capture the text section only. Asterisks may remain before and after the actual text (I can remove them later), but the gibberish may not. So either of these outputs works:

#Output #1 
CURRICULUM VITAE***Información *personal
#output #2
**********CURRICULUM VITAE***Información *personal**********

Below is the code I have tried, which is unable to separate the capture group from the gibberish that follows it. It does correctly identify the asterisks before the text, though.

p=re.compile(r'(?<=[*]{30})([\x29{,10}|\u00c0-\u00d6|\u00d8-\u00f6|\u00f8-\u02af|\u1d00-\u1d25|\u1d62-\u1d65|\u1d6b-\u1d77|\u1d79-\u1d9a|\u1e00-\u1eff|\u2090-\u2094|\u2184-\u2184|\u2488-\u2490|\u271d-\u271d|\u2c60-\u2c7c|\u2c7e-\u2c7f|\ua722-\ua76f|\ua771-\ua787|\ua78b-\ua78c|\ua7fb-\ua7ff|\ufb00-\ufb06|\x20-\x2A|\x2B-\x7E]+)(?=[*]{,15})', re.MULTILINE)

print(re.findall(p, text)[0])

#output
*******************CURRICULUM VITAE***Información *personal*********************ìÌ**Ì*Ì*Ì*

As you can see, it successfully cuts off the gibberish before the actual capture group, but does not cut off the gibberish after the capture group. I am guessing the above regex is not written properly so that \x29{,10} is executed together with the rest of the characters, which can have + occurrences.

As a note, \x29 is meant to be the code point for * (the asterisk is actually \x2A; \x29 is the closing parenthesis). Changing the unicode characters as a way to parse the capture group is not an option; I need to keep the accented characters, which may exist in the gibberish section as well.
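The guess above about \x29{,10} can be checked directly. The following is a minimal sketch, using small made-up patterns rather than the full expression from the question: inside square brackets, the characters of {,10} and | are ordinary members of the character class, so no quantifier or alternation is applied there.

```python
import re

# \x29 is the closing parenthesis; the asterisk is \x2A.
print(repr('\x29'), repr('\x2a'))

# Inside [...] the characters { , 1 0 } | are plain class members,
# so this class matches any of: a, b, {, comma, 1, 0, }, |
char_class = re.compile(r'[a{,10}|b]+')
literal_hit = char_class.fullmatch('{,10}|')

# Outside a class, {,10} is a real quantifier: "at most 10 asterisks".
bounded = re.compile(r'\*{,10}')
run_len = len(bounded.match('*' * 25).group())

print(literal_hit is not None, run_len)
```

So a limit on the length of asterisk runs has to be written outside the character class, for example with a lookahead.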

EDIT PER MAX XAPI's COMMENT

There can be runs of 15+ consecutive asterisks after the gibberish as well, so your code seems to cut at the last run of 15+ consecutive asterisks and keeps the earlier ones. What I need is for the match either to cut at the first run of 15 consecutive asterisks (i.e., no asterisks after the capture group) OR to include only the first 15 asterisks after the capture group. For example:

p=re.compile(r'(?<=[*]{30})([^*][\x2A{,10}|\u00c0-\u00d6|\u00d8-\u00f6|\u00f8-\u02af|\u1d00-\u1d25|\u1d62-\u1d65|\u1d6b-\u1d77|\u1d79-\u1d9a|\u1e00-\u1eff|\u2090-\u2094|\u2184-\u2184|\u2488-\u2490|\u271d-\u271d|\u2c60-\u2c7c|\u2c7e-\u2c7f|\ua722-\ua76f|\ua771-\ua787|\ua78b-\ua78c|\ua7fb-\ua7ff|\ufb00-\ufb06|\x20-\x2A|\x2B-\x7E]+[^*])(?=[*]{15,})',re.MULTILINE)

text=t='ÿÿÿÿ*************************************************CURRICULUM VITAE***Información *personal**********************ìÌ**Ì*Ì*Ì*************************************(ìÌ**Ì*Ì*Ì***************'

#output
print(re.findall(p, text))
['CURRICULUM VITAE***Información *personal**********************ìÌ**Ì*Ì*Ì']

#desired output
['CURRICULUM VITAE***Información *personal']
The following is also acceptable
['CURRICULUM VITAE***Información *personal***************']

There are 2 answers

Answer by Booboo (best answer)

Try the following, which uses only one negative lookahead assertion:

\*{30,}((?:[^*]|\*(?!\*{9}))+?)\*{15,}

Regex Demo

  1. \*{30,} Matches 30 or more asterisks
  2. ( Start of capture group 1
  3. (?:[^*]|\*(?!\*{9}))+? Lazily matches one or more repetitions of a non-capture group that is either a non-asterisk or an asterisk that is not followed by 9 more asterisks (so the group can never enter a run of 10 or more asterisks)
  4. ) End of capture group 1
  5. \*{15,} Matches 15 or more asterisks

import re

text = 'ÿÿÿÿ*************************************************CURRICULUM VITAE***Información *personal*********************ìÌ**Ì*Ì*Ì*'

l = re.findall(r'\*{30,}((?:[^*]|\*(?!\*{9}))+?)\*{15,}', text)
print(l)

Prints:

['CURRICULUM VITAE***Información *personal']
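The same pattern can be tried against input shaped like the edited example in the question, where more gibberish and more long asterisk runs follow the text. This is only a sketch: the run lengths below (49, 22, 37, 15) are assumed values chosen for illustration, and the second pattern, which moves a fixed \*{15} inside the group, is a variation on the answer rather than part of it.

```python
import re

# Synthetic input with explicit run lengths (assumed, not the original data).
body = 'CURRICULUM VITAE***Información *personal'
junk = 'ìÌ**Ì*Ì*Ì'
text = ('ÿÿÿÿ' + '*' * 49 + body + '*' * 22 + junk
        + '*' * 37 + '(' + junk + '*' * 15)

# A non-asterisk, or an asterisk not followed by 9 more asterisks (lazy).
inner = r'(?:[^*]|\*(?!\*{9}))+?'

# Variant 1: capture the text only.
bare = re.search(r'\*{30,}(' + inner + r')\*{15,}', text)

# Variant 2: capture the text plus exactly the first 15 trailing asterisks.
padded = re.search(r'\*{30,}(' + inner + r'\*{15})', text)

print(bare.group(1))
print(padded.group(1))
```

re.search is used here so that only the first match is reported. With input like this, re.findall would also return later stretches that happen to sit between a run of 30+ asterisks and a run of 15+ asterisks.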
Answer by Max Xapi

You can use a trick: the capture group must start with something other than a * and must end the same way. So just add a [^*] at each end of the capture group:

(?<=[*]{30})([^*][\x29{,10}|\u00c0-\u00d6|\u00d8-\u00f6|\u00f8-\u02af|\u1d00-\u1d25|\u1d62-\u1d65|\u1d6b-\u1d77|\u1d79-\u1d9a|\u1e00-\u1eff|\u2090-\u2094|\u2184-\u2184|\u2488-\u2490|\u271d-\u271d|\u2c60-\u2c7c|\u2c7e-\u2c7f|\ua722-\ua76f|\ua771-\ua787|\ua78b-\ua78c|\ua7fb-\ua7ff|\ufb00-\ufb06|\x20-\x2A|\x2B-\x7E]+[^*])(?=[*]{15,})

I've added/changed:

  • added two occurrences of a "non *", one at the beginning and one at the end of your capturing group: ([^*] ... [^*])
  • changed the {,15} to {15,} at the end (so "at least 15 occurrences" instead of "at most 15 occurrences")

https://regex101.com/r/m6lqP3/3
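Why this approach can still overrun when more long asterisk runs follow (as described in the edit to the question) comes down to greedy versus lazy matching. The sketch below is a simplification, not the pattern above: the long character class is replaced by a plain dot, and the input uses assumed run lengths. It does not enforce the "fewer than 10 asterisks inside the text" rule; it only shows where each quantifier stops.

```python
import re

# Synthetic input with explicit run lengths (assumed, not the original data).
body = 'CURRICULUM VITAE***Información *personal'
junk = 'ìÌ**Ì*Ì*Ì'
text = ('ÿÿÿÿ' + '*' * 49 + body + '*' * 22 + junk
        + '*' * 37 + '(' + junk + '*' * 15)

# Greedy: runs as far as possible, then backs off to the LAST place
# where a non-asterisk is followed by 15+ asterisks.
greedy = re.search(r'(?<=\*{30})([^*].*[^*])(?=\*{15,})', text)

# Lazy: stops at the FIRST place where a non-asterisk is followed
# by 15+ asterisks.
lazy = re.search(r'(?<=\*{30})([^*].*?[^*])(?=\*{15,})', text)

print(len(greedy.group(1)), len(lazy.group(1)))
print(lazy.group(1))
```

With this input the lazy version returns only the text, while the greedy version also swallows the gibberish and the asterisk runs between it.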