Named backreference (?P=name) issue in Python re


I am learning the 're' module of Python, and the named backreference (?P=name) confused me.

When I use re.sub() to swap the digits and the characters, the pattern '(?P=name)' doesn't work, but the patterns '\N' and '\g<name>' work as expected. Code below:

[IN] print(re.sub(r'(?P<digit>\d{3})-(?P<char>\w{4})', r'(?P=char)-(?P=digit)', '123-abcd'))
[OUT] (?P=char)-(?P=digit)
[IN] print(re.sub(r'(?P<digit>\d{3})-(?P<char>\w{4})', r'\2-\1', '123-abcd'))
[OUT] abcd-123
[IN] print(re.sub(r'(?P<digit>\d{3})-(?P<char>\w{4})', r'\g<char>-\g<digit>', '123-abcd'))
[OUT] abcd-123

Why does the substitution fail when I use (?P=name)?
And how do I use it correctly?
I am using Python 3.5.


There are 2 answers

Wiktor Stribiżew On BEST ANSWER

The (?P=name) is an inline (in-pattern) backreference. You may use it inside a regular expression pattern to match the same content as is captured by the corresponding named capturing group, see the Python Regular Expression Syntax reference:

(?P=name)
A backreference to a named group; it matches whatever text was matched by the earlier group named name.

See this demo: (?P<digit>\d{3})-(?P<char>\w{4})&(?P=char)-(?P=digit) matches 123-abcd&abcd-123 because the "digit" group matches and captures 123, "char" group captures abcd and then the named inline backreferences match abcd and 123.
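
The demo above can be reproduced locally. This is a minimal sketch that uses only the pattern and the sample string from this answer; the variable names and the second, non-matching string are illustrative additions:

```python
import re

# Pattern from the answer: two named groups, then inline backreferences to them.
pattern = re.compile(r'(?P<digit>\d{3})-(?P<char>\w{4})&(?P=char)-(?P=digit)')

# The text after '&' repeats the captured values in swapped order, so it matches.
match = pattern.fullmatch('123-abcd&abcd-123')
print(match.group('digit'), match.group('char'))  # 123 abcd

# Illustrative counter-example: '999' differs from what "digit" captured,
# so the (?P=digit) backreference fails and there is no match.
mismatch = pattern.fullmatch('123-abcd&abcd-999')
print(mismatch)  # None
```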

To replace matches, use the \1, \g<1> or \g<char> syntax in the re.sub replacement pattern. Do not use (?P=name) for that purpose:

repl can be a string or a function... Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern...

In string-type repl arguments, in addition to the character escapes and backreferences described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number; \g<2> is therefore equivalent to \2, but isn’t ambiguous in a replacement such as \g<2>0. \20 would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character '0'. The backreference \g<0> substitutes in the entire substring matched by the RE.
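
The replacement forms described in the quoted documentation can be compared side by side. A small sketch, reusing the pattern and input from the question; the variable names are illustrative, and the last part shows the \g<2>0 versus \20 ambiguity on a pattern that has only two groups:

```python
import re

pattern = r'(?P<digit>\d{3})-(?P<char>\w{4})'
text = '123-abcd'

# Three equivalent ways to reference groups in the replacement string.
swapped_by_number = re.sub(pattern, r'\2-\1', text)
swapped_by_g_number = re.sub(pattern, r'\g<2>-\g<1>', text)
swapped_by_name = re.sub(pattern, r'\g<char>-\g<digit>', text)
print(swapped_by_number, swapped_by_g_number, swapped_by_name)

# \g<2>0 means "group 2, then a literal 0".
with_zero = re.sub(pattern, r'\g<2>0', text)
print(with_zero)  # abcd0

# \20 is read as a reference to group 20, which this pattern does not have,
# so re.sub raises re.error instead of producing 'abcd0'.
try:
    re.sub(pattern, r'\20', text)
    group_20_error = None
except re.error as exc:
    group_20_error = exc
print(type(group_20_error).__name__)
```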

madtyn On

You can check the details of named groups and how to backreference them at:

https://docs.python.org/3/library/re.html

Use CTRL+F in your browser to search for (?P...). There is a nice table there showing in which contexts you can use (?P=name).

In your example, the third re.sub() call is the right way to do it.

In any re.sub() call, you can only use the (?P=name) syntax in the first string parameter (the pattern). You don't need it in the second string parameter (the replacement) because you have the \g<name> syntax there.

In case you're wondering whether (?P=name) is useful at all: it is, but for matching, by backreferencing text already captured by a named group.

Example: you want to match potatoXXXpotato and replace it with YYXXXYY. You could write:

re.sub(r'(?P<myName>potato)(XXX)(?P=myName)', r'YY\2YY', 'potatoXXXpotato')

or

re.sub(r'(?P<myName>potato)(?P<triple>XXX)(?P=myName)', r'YY\g<triple>YY', 'potatoXXXpotato')