Why doesn't (...)? regular expression capture the string?

Question

132 views Asked by Dmitry Kuzminov At 25 December 2022 at 08:20

I have code where a QString is modified using a regular expression:

QString str; // str is the string that shall be modified
QString pattern, after; // pattern and after are parameters provided as arguments

str.replace(QRegularExpression(pattern), after);

Whenever I need to append something to the end of the string I use the arguments:

QString pattern("$");
QString after("ending");

Now I have a case where the same pattern is applied twice, but it should append the string only once. I expected this to work (assuming the initial string doesn't end with "ending"):

QString pattern("(ending)?$");
QString after("ending");

But when applied twice, this pattern produces a doubled ending: "<initial string>endingending".

It looks like the ()? expression is lazy: it only captures the expression in parentheses if I force it with a sub-expression in front:

QString pattern("string(ending)?$");
QString after("ending");

QString str("Init string");
str.replace(QRegularExpression(pattern), after);
// str == "Init ending"

What's wrong with the "()?" construction (why is it lazy), and how can I achieve my goal?

I'm using Qt 5.14.0 (due to some dependencies I cannot use Qt6).

There are 3 answers

Daksh On 25 December 2022 at 08:28

The ? in your regex tells the regex engine that the string may optionally end with ending. Your question is a bit unclear, but if I understand it correctly, what you need instead is a negative lookbehind. Changing your pattern as follows should do the trick:

QString pattern(".*(?<!ending)$");

This makes sure that it only matches strings that don't already end with ending.

Marek R On 25 December 2022 at 09:36

OK, I have an explanation of why this happens (so the question in the title is answered).

Basically, QString::replace or std::regex_replace finds two matches and performs two replacements: one where the capture group matches ending, and a second one where the capture group does not participate and only the end of the string is matched.

This is because $ is just an assertion. It doesn't consume any character, so it can match again when the index-based search resumes at the end of the string (as happens when doing a replace-all).

Here is a demo in plain C++ that illustrates the issue:

#include <iostream>
#include <regex>
#include <string>

int main()
{
    std::string s;
    auto r = std::regex{"(ending)?$"};
    auto after = "ending";
    while (std::getline(std::cin, s)) {
        std::cout << "s: " << s << '\n';
        // regex_replace substitutes every match, including an empty one at the end
        std::cout << "replace: " << std::regex_replace(s, r, after) << '\n';
        // List each match and what the capture group captured (needs C++17 for CTAD)
        for (auto i = std::regex_iterator{s.begin(), s.end(), r};
            i != decltype(i){};
            ++i) {
            std::cout << "found: " << i->str() << " capture: " << i->str(1);
            std::cout << '\n';
        }
        std::cout << "------------\n";
    }

    return 0;
}

https://godbolt.org/z/e85ajb9aP

Now that you know the root cause, you can try to address the issue.

I came up with this: match the whole string using two capture groups, the first of which is non-greedy and is reused in after. Pattern: ^(.*?)(ending)?$ and after: $1ending

https://godbolt.org/z/EKz4fEaT7
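The whole-string rewrite described above could look roughly like this with std::regex. This is a minimal sketch rather than the code behind the godbolt link: the helper name append_once is hypothetical, the suffix is hard-coded, and single-line input is assumed.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: appends "ending" unless the string already ends with it.
// Assumes single-line input ('.' does not match line terminators in the
// default ECMAScript grammar).
std::string append_once(const std::string& s)
{
    // ^ ties the match to the start of the string, so the empty position at
    // the very end cannot produce a second match during replace-all.
    static const std::regex re{"^(.*?)(ending)?$"};
    // $1 is everything that precedes an optional trailing "ending".
    return std::regex_replace(s, re, "$1ending");
}
```

Calling append_once("Init string") and then calling it again on the result should give "Init stringending" both times. Note that with QString::replace and QRegularExpression the back-reference in after is written \1 (so "\\1ending" in C++ source) rather than $1.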

peppe On 27 December 2022 at 17:18 (accepted answer)

A pattern like (foo)?$ matches twice at the end of a string ending with foo. You can easily see this in action in Perl or at https://regex101.com/r/3Oqwo1/1 :

$ perl -E '$_ = "abcfoo"; while ($_ =~ /(foo)?$/g) { say "Matched |$&| from $-[0] to $+[0]"; }'

Matched |foo| from 3 to 6
Matched || from 6 to 6

Therefore you'll do two substitutions at the end, defeating your purpose.

A way to see this is that patterns match between characters:

            /-----------\
            v           v   first pattern matches here
| a | b | c | f | o | o |
                       ^ ^
                       \-/  second pattern matches here
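For readers working in C++ rather than Perl, the same offsets can be listed with std::sregex_iterator. This is only a sketch; the helper name match_spans is hypothetical and not part of any library.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns the [start, end) offsets of every match found
// by a global (iterator-based) search, including empty matches.
std::vector<std::pair<std::size_t, std::size_t>>
match_spans(const std::string& text, const std::string& pattern)
{
    const std::regex re{pattern};
    std::vector<std::pair<std::size_t, std::size_t>> spans;
    for (auto it = std::sregex_iterator{text.begin(), text.end(), re};
         it != std::sregex_iterator{}; ++it) {
        const auto start = static_cast<std::size_t>(it->position(0));
        const auto length = static_cast<std::size_t>(it->length(0));
        spans.emplace_back(start, start + length);
    }
    return spans;
}
```

For "abcfoo" and "(foo)?$" this should report the same two spans as the Perl output: 3 to 6, then the empty match from 6 to 6.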

If the "tail" is fixed-length, you can use a negative lookbehind, as already suggested: (?<!foo)$.

$ perl -E '$_ = "abcfoo"; while ($_ =~ /(?<!foo)$/g) { say "Matched |$&| from $-[0] to $+[0]"; }'
# no match
$ perl -E '$_ = "abcfie"; while ($_ =~ /(?<!foo)$/g) { say "Matched |$&| from $-[0] to $+[0]"; }'
Matched || from 6 to 6

Note that there is no .* before the negative lookbehind and no ? after it. If you add either, you'll break the matching again:

$ perl -E '$_ = "abcfie"; while ($_ =~ /.*(?<!foo)$/g) { say "Matched |$&| from $-[0] to $+[0]"; }'
Matched |abcfie| from 0 to 6
Matched || from 6 to 6

Global matching will happen twice in abcfie, once matching the entire string, and again matching the empty string at the end (look at the offsets). This will result in 2 replacements.
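The extra empty match can also be reproduced in C++, with one caveat: the default ECMAScript grammar of std::regex has no lookbehind, so the sketch below drops the (?<!foo) part and keeps only .*$, which is enough to trigger the second match. The helper name replace_all is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replaces every match of `pattern` in `text` with `after`.
std::string replace_all(const std::string& text, const std::string& pattern,
                        const std::string& after)
{
    // regex_replace substitutes all matches by default, so an empty match at
    // the end of the string is replaced too.
    return std::regex_replace(text, std::regex{pattern}, after);
}
```

With this helper, replace_all("abcfie", ".*$", "X") gives "XX" rather than "X": one replacement for the whole string and one for the empty match at offset 6. With the bare pattern "$" the result is "abcfieX", a single replacement.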

Likewise, if you make the lookbehind optional with ?:

$ perl -E '$_ = "abcfoo"; while ($_ =~ /(?<!foo)?$/g) { say "Matched |$&| from $-[0] to $+[0]"; }'
Matched || from 6 to 6

This matches at the very end of the string, resulting in a replacement that you don't want (the string already ends in foo).
