How to return whole non-latin strings matching a reduplication pattern, such as AAB or ABB


I am working with strings of non-Latin characters and want to match strings that follow reduplication patterns such as AAB, ABB, or ABAB. I tried the following code:

import re

patternAAB = re.compile(r'\b(\w)\1\w\b')
match = patternAAB.findall(rawtext)
print(match) 

However, it returns only the first character of each matched string. I know this happens because of the capturing parentheses around the first \w.

I tried adding capturing parentheses around the whole matched block, but Python raises

error: cannot refer to an open group at position 7

I also found this method, but it didn't work for me:

patternAAB = re.compile(r'\b(\w)\1\w\b')
match = patternAAB.search(rawtext)
if match:
    print(match.group(1))

How could I match the pattern and return the whole matching string?

# Ex. 哈哈笑 
# string matches AAB pattern so my code returns 哈 
# but not the entire string

There are 2 answers

Answer by Alex Hall (best answer)

The message:

error: cannot refer to an open group at position 7

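is telling you that, once the whole pattern is wrapped in parentheses, \1 refers to that outer group. Groups are numbered by the order of their opening parentheses, and the outer one opens first, so it is still open at the point where \1 appears. The group you want to backreference is now number 2, so this code works: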

import re

rawtext = 'abc 哈哈笑 def'

patternAAB = re.compile(r'\b((\w)\2\w)\b')
match = patternAAB.findall(rawtext)
print(match)

Each item in match is a tuple holding both groups, the whole block first and the repeated character second:

[('哈哈笑', '哈')]
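
If only the whole strings are wanted, the first element of each tuple can be picked out of the findall result. The following is a minimal sketch of that idea; the sample text (including the extra word 看看书) and the variable names are my own assumptions, not part of the question.

```python
import re

# Sample text is an assumption for illustration; 看看书 is a second AAB word.
rawtext = 'abc 哈哈笑 def 看看书'

# Group 1 is the whole block, group 2 is the repeated character.
pattern_aab = re.compile(r'\b((\w)\2\w)\b')

# findall returns one (whole block, repeated character) tuple per match,
# so keep only the first element of each tuple.
whole_matches = [whole for whole, _repeated in pattern_aab.findall(rawtext)]
print(whole_matches)  # ['哈哈笑', '看看书']
```

Note that this pattern does not require B to differ from A, so a string such as 哈哈哈 would also be reported.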
Answer by Alex Hall

I also found this method, but didn't work for me:

You were close here as well. You can use match.group(0) to get the full match, not just a group in parentheses. So this code works:

import re

rawtext = 'abc 哈哈笑 def'

patternAAB = re.compile(r'\b(\w)\1\w\b')
match = patternAAB.search(rawtext)
if match:
    print(match.group(0))   # 哈哈笑
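
Since search stops at the first match, the same group(0) idea can be combined with finditer to collect every whole match, and extended to the other patterns mentioned in the question. This is only a sketch under my own assumptions: the ABB and ABAB regexes, the patterns table, the find_reduplications helper, and the sample text are hypothetical and do not come from the question.

```python
import re

# Hypothetical pattern table; names and regexes are illustrative assumptions.
patterns = {
    'AAB': re.compile(r'\b(\w)\1\w\b'),
    'ABB': re.compile(r'\b\w(\w)\1\b'),
    'ABAB': re.compile(r'\b(\w)(\w)\1\2\b'),
}

# Sample text is an assumption for illustration.
rawtext = '哈哈笑 笑哈哈 研究研究 abc'


def find_reduplications(text):
    """Return the whole matched strings for each pattern name."""
    results = {}
    for name, pattern in patterns.items():
        # m.group(0) is the entire match, regardless of capturing groups.
        results[name] = [m.group(0) for m in pattern.finditer(text)]
    return results


found = find_reduplications(rawtext)
print(found)
# {'AAB': ['哈哈笑'], 'ABB': ['笑哈哈'], 'ABAB': ['研究研究']}
```

Because finditer yields match objects rather than group tuples, no extra outer parentheses are needed and the backreference numbers stay as written. These patterns do not force A and B to be different characters, so a string like 哈哈哈 would appear under both AAB and ABB.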