Regex capture group with different quantifiers

Question

246 views Asked by Rizakha At 22 October 2020 at 04:35

The text I am parsing includes asterisks before and after the capture group, as well as within it. The pattern I have identified is basically this: the run of consecutive asterisks before the text is always at least 30 characters long. Similarly, the run of consecutive asterisks after the last word is always at least 15 characters long. Any run of consecutive asterisks inside the capture group is always fewer than 10 characters long.

The problem is that I am unsure how to give the asterisks in the capture group a different quantifier from the other characters in the group while still including them in the same match. For example:

text = 'ÿÿÿÿ*************************************************CURRICULUM VITAE***Información *personal*********************ìÌ**Ì*Ì*Ì*'

So basically, I need to capture the text section only. Asterisks can exist before and after the actual text (I can remove them later), but the gibberish can't. So either of these outputs works:

#Output #1 
CURRICULUM VITAE***Información *personal
#output #2
**********CURRICULUM VITAE***Información *personal**********

Below is the code I have tried, which is unable to separate the capture group from the subsequent gibberish. It does correctly identify the asterisks before the text, though.

p=re.compile(r'(?<=[*]{30})([\x29{,10}|\u00c0-\u00d6|\u00d8-\u00f6|\u00f8-\u02af|\u1d00-\u1d25|\u1d62-\u1d65|\u1d6b-\u1d77|\u1d79-\u1d9a|\u1e00-\u1eff|\u2090-\u2094|\u2184-\u2184|\u2488-\u2490|\u271d-\u271d|\u2c60-\u2c7c|\u2c7e-\u2c7f|\ua722-\ua76f|\ua771-\ua787|\ua78b-\ua78c|\ua7fb-\ua7ff|\ufb00-\ufb06|\x20-\x2A|\x2B-\x7E]+)(?=[*]{,15})', re.MULTILINE)

print(re.findall(p, text)[0])

#output
*******************CURRICULUM VITAE***Información *personal*********************ìÌ**Ì*Ì*Ì*

As you can see, it successfully cuts off the gibberish before the actual capture group, but does not cut off the gibberish after it. I am guessing the regex is written incorrectly, so that \x29{,10} is evaluated together with the rest of the characters in the class, which can have + occurrences.

As a note, \x29 is meant to be the code point for * (the actual code point for * is \x2A; \x29 is the closing parenthesis). Changing the Unicode ranges as a way to parse the capture group is not an option: I need to be able to keep the accented characters, which may exist in the gibberish section as well.
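A minimal sketch of why the guess above is on the right track: inside a character class, `{,10}` is not treated as a quantifier, so its characters simply become literal members of the set. The class below is cut down to the first few characters of the original for illustration; the variable names are hypothetical.

```python
import re

# Inside [...], "{,10}" is not a quantifier: its characters become
# literal members of the set, next to \x29, which is ")".
char_class = re.compile(r'[\x29{,10}]+')

braces_are_literal = char_class.fullmatch('{,10})') is not None
matches_asterisk = char_class.fullmatch('*') is not None

print(braces_are_literal)    # True
print(matches_asterisk)      # False
print(chr(0x29), chr(0x2A))  # ) *
```

So a per-character limit such as "fewer than 10 asterisks in a row" has to be expressed outside the class, for example with an alternation and a lookahead.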

EDIT PER MAX XAPI's COMMENT

There can also be runs of 15+ consecutive asterisks after the gibberish, so your code seems to cut at the last run of 15+ consecutive asterisks but keeps the earlier one(s). What I need is for the match to either cut at the first 15 consecutive asterisks (i.e., no asterisks after the capture group) OR include only the first 15 asterisks after the capture group. For example:

p=re.compile(r'(?<=[*]{30})([^*][\x2A{,10}|\u00c0-\u00d6|\u00d8-\u00f6|\u00f8-\u02af|\u1d00-\u1d25|\u1d62-\u1d65|\u1d6b-\u1d77|\u1d79-\u1d9a|\u1e00-\u1eff|\u2090-\u2094|\u2184-\u2184|\u2488-\u2490|\u271d-\u271d|\u2c60-\u2c7c|\u2c7e-\u2c7f|\ua722-\ua76f|\ua771-\ua787|\ua78b-\ua78c|\ua7fb-\ua7ff|\ufb00-\ufb06|\x20-\x2A|\x2B-\x7E]+[^*])(?=[*]{15,})',re.MULTILINE)

text=t='ÿÿÿÿ*************************************************CURRICULUM VITAE***Información *personal**********************ìÌ**Ì*Ì*Ì*************************************(ìÌ**Ì*Ì*Ì***************'

#output
print(re.findall(p, text))
['CURRICULUM VITAE***Información *personal**********************ìÌ**Ì*Ì*Ì']

#desired output
['CURRICULUM VITAE***Información *personal']
The following is also acceptable
['CURRICULUM VITAE***Información *personal***************']

There are 2 answers

Max Xapi On 22 October 2020 at 05:48

You can use a trick: the capture group must start with something other than a * and must end the same way. So just add a [^*] at each end of the group:

(?<=[*]{30})([^*][\x29{,10}|\u00c0-\u00d6|\u00d8-\u00f6|\u00f8-\u02af|\u1d00-\u1d25|\u1d62-\u1d65|\u1d6b-\u1d77|\u1d79-\u1d9a|\u1e00-\u1eff|\u2090-\u2094|\u2184-\u2184|\u2488-\u2490|\u271d-\u271d|\u2c60-\u2c7c|\u2c7e-\u2c7f|\ua722-\ua76f|\ua771-\ua787|\ua78b-\ua78c|\ua7fb-\ua7ff|\ufb00-\ufb06|\x20-\x2A|\x2B-\x7E]+[^*])(?=[*]{15,})

I've added/changed:

added two occurrences of a "non *" at the beginning and the end of your capturing group: ([^*] ... [^*])
changed the {,15} to {15,} at the end (so "at least 15 occurrences" instead of "at most 15 occurrences")

https://regex101.com/r/m6lqP3/3
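As a rough illustration of how greediness interacts with the trailing lookahead, the sketch below compares a greedy and a lazy quantifier inside the group. It makes two assumptions that are not in the answer: `.` stands in for the long character class, and the input is rebuilt with illustrative run lengths (49, 22, 37, 15), so results with the full class may differ.

```python
import re

# Illustrative input; the run lengths (49, 22, 37, 15) are assumptions.
text = ('ÿÿÿÿ' + '*' * 49 + 'CURRICULUM VITAE***Información *personal'
        + '*' * 22 + 'ìÌ**Ì*Ì*Ì' + '*' * 37 + '(ìÌ**Ì*Ì*Ì' + '*' * 15)

# "." stands in for the long character class (a simplification).
greedy = re.search(r'(?<=\*{30})([^*].+[^*])(?=\*{15,})', text)
lazy = re.search(r'(?<=\*{30})([^*].+?[^*])(?=\*{15,})', text)

print(greedy.group(1))  # runs on to the last run of 15+ asterisks
print(lazy.group(1))    # stops at the first run of 15+ asterisks
```

The lazy form stops at the first place where 15 or more asterisks follow, which is the cut described in the question's edit.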

**Booboo** · Accepted Answer · 2020-10-22T18:55:20+00:00

This only uses one negative lookahead assertion:

Try:

\*{30,}((?:[^*]|\*(?!\*{9}))+?)\*{15,}

\*{30,} Matches 30 or more asterisks
( Start of capture group 1
(?:[^*]|\*(?!\*{9}))+? Match one or more in a non-capture group of: either a non-asterisk or an asterisk that is not followed by 9 more asterisks
) End of capture group 1
\*{15,} Matches 15 or more asterisks

import re

text = 'ÿÿÿÿ*************************************************CURRICULUM VITAE***Información *personal*********************ìÌ**Ì*Ì*Ì*'

l = re.findall(r'\*{30,}((?:[^*]|\*(?!\*{9}))+?)\*{15,}', text)
print(l)

Prints:

['CURRICULUM VITAE***Información *personal']
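For input like the one in the question's edit, where a later stretch of gibberish is also framed by long asterisk runs, `re.findall` can return more than one group. A minimal sketch, assuming only the first section is wanted and using an input rebuilt with illustrative run lengths (49, 22, 37, 15); the variable names are hypothetical:

```python
import re

pattern = re.compile(r'\*{30,}((?:[^*]|\*(?!\*{9}))+?)\*{15,}')

# Illustrative input; the run lengths (49, 22, 37, 15) are assumptions.
text = ('ÿÿÿÿ' + '*' * 49 + 'CURRICULUM VITAE***Información *personal'
        + '*' * 22 + 'ìÌ**Ì*Ì*Ì' + '*' * 37 + '(ìÌ**Ì*Ì*Ì' + '*' * 15)

# findall also picks up the gibberish framed by the 37- and 15-asterisk runs.
all_matches = pattern.findall(text)

# search stops at the first match, which is the section of interest here.
first = pattern.search(text)
section = first.group(1) if first else None

print(all_matches)
print(section)
```

Using `re.search` (or taking element 0 of the `findall` result) keeps only the first framed section.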
