Match same number of repetitions of character as repetitions of captured group

208 views Asked by At

I would like to clean some input that was logged from my keyboard with python and regex. Especially when backspace was used to fix a mistake.

Example 1:

[in]:  'Helloo<BckSp> world'
[out]: 'Hello world'

This can be done with

re.sub(r'.<BckSp>', '', 'Helloo<BckSp> world')

Example 2:
However when I have several backspaces, I don't know how to delete exactly the same number of characters before:

[in]:  'Helllo<BckSp><BckSp>o world'
[out]: 'Hello world'

(Here I want to remove 'l' and 'o' before the two backspaces).

I could simply use re.sub(r'[^>]<BckSp>', '', line) several times until there is no <BckSp> left but I would like to find a more elegant / faster solution.

Does anyone know how to do this ?

5

There are 5 answers

1
Wiktor Stribiżew On

Since there is no support for recursion/subroutine calls, no atomic groups/possessive quantifiers in Python re, you may remove these chars followed with backspaces in a loop:

import re
s = "Helllo\b\bo world"
r = re.compile("^\b+|[^\b]\b")
while r.search(s): 
    s = r.sub("", s)
print(s)

See the Python demo

The "^\b+|[^\b]\b" pattern will find 1+ backspace chars at the string start (with ^\b+) and [^\b]\b will find all non-overlapping occurrences of any char other than a backspace followed with a backspace.

Same approach in case a backspace is expressed as some enitity/tag like a literal <BckSp>:

import re
s = "Helllo<BckSp><BckSp>o world"
r = re.compile("^(?:<BckSp>)+|.<BckSp>", flags=re.S)
while r.search(s): 
    s = r.sub("", s)
print(s)

See another Python demo

1
Fallenhero On

It looks like Python does not support recursive regex. If you can use another language, you could try this:

.(?R)?<BckSp>

See: https://regex101.com/r/OirPNn/1

1
Casimir et Hippolyte On

It isn't very efficient but you can do that with the re module:

(?:[^<](?=[^<]*((?=(\1?))\2<BckSp>)))+\1

demo

This way you don't have to count, the pattern only uses the repetition.

(?: 
    [^<] # a character to remove
    (?=  # lookahead to reach the corresponding <BckSp>
        [^<]* # skip characters until the first <BckSp>
        (  # capture group 1: contains the <BckSp>s
            (?=(\1?))\2 # emulate an atomic group in place of \1?+
                        # The idea is to add the <BcKSp>s already matched in the
                        # previous repetitions if any to be sure that the following
                        # <BckSp> isn't already associated with a character
            <BckSp> # corresponding <BckSp>
        )
    )
)+ # each time the group is repeated, the capture group 1 is growing with a new <BckSp>

\1 # matches all the consecutive <BckSp> and ensures that there's no more character
   # between the last character to remove and the first <BckSp>

You can do the same with the regex module, but this time you don't need to emulate the possessive quantifier:

(?:[^<](?=[^<]*(\1?+<BckSp>)))+\1

demo

But with the regex module, you can also use the recursion (as @Fallenhero noticed it):

[^<](?R)?<BckSp>

demo

2
niemmi On

In case the marker is single character you could just utilize stack which would give you the result in single pass:

s = "Helllo\b\bo world"
res = []

for c in s:
    if c == '\b':
        if res:
            del res[-1]
    else:
        res.append(c)

print(''.join(res)) # Hello world

In case the marker is literally '<BckSp>' or some other string with length greater than 1 you can use replace to substitute it to '\b' and use the solution above. This only works if you know that '\b' doesn't occur in the input. If you can't designate a substitute character you could use split and process the results:

s = 'Helllo<BckSp><BckSp>o world'
res = []

for part in s.split('<BckSp>'):
    if res:
        del res[-1]
    res.extend(part)

print(''.join(res)) # Hello world
0
anubhava On

Slightly verbose but you can use this lambda function to count # of <BckSp> occurrence and use substring routines to get your final output.

>>> bk = '<BckSp>'

>>> s = 'Helllo<BckSp><BckSp>o world'
>>> print re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2))/len(bk)], s)
Hello world

>>> s = 'Helloo<BckSp> world'
>>> print re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2))/len(bk)], s)
Hello world

>>> s = 'Helloo<BckSp> worl<BckSp>d'
>>> print re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2))/len(bk)], s)
Hello word

>>> s = 'Helllo<BckSp><BckSp>o world<BckSp><BckSp>k'
>>> print re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2))/len(bk)], s)
Hello work