What is this regex substitution "$content =~ s/\n-- \n.*?$//s" actually doing?


I am working through some Perl code in Request Tracker 4.0 and have encountered an error where a ticket requestor's message is cut off. I am new to Perl. I have done some work with regular expressions, but I'm having trouble with this one even after reading quite a bit.

I have narrowed my problem down to this line of code:

$content =~ s/\n-- \n.*?$//s

I don't fully understand what it is doing and would like a better explanation.

I understand that s/ / is matching the pattern \n-- \n.*?$ and replacing it with nothing.

I don't understand what .*?$ does. Here is my basic understanding:

  • . is any character except \n
  • * is 0 or more times of the preceding character
  • ? is 0 or 1 times of the preceding character
  • $ is the end of the string

Then, from what I understand, the final s makes the . match newlines.

So, roughly, we're removing any text beginning with \n-- \n. This line of code is causing some questionable behavior that I'd love to get sorted out if someone can explain what's going on here.

Can someone explain what this line is doing? Is it just removing all text after the first \n-- \n or is there more to it?

Long winded part / real-life issue (you don't need to read this to answer the question)

My exact problem is that it is cutting the quoted content at the signature.

So if email A from a customer says:

What is going on with order ABCD?
-- Some Customer

The staff reply says (note the loss of the customer's signature)

It is shipping today

What is going on with order ABCD?

The customer replies

I did not get it, it did not ship!!!
-- Some Customer

It is shipping today

What is going on with order ABCD?

When we reply, their message will cut at the -- which kills all the context.

It shipped today, tracking number 12345

I did not get it, it did not ship!!!

And leads to more work explaining what order it is, etc.

There are 3 answers

1
Moritz Bunkus On BEST ANSWER

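You're almost correct: it removes everything from the first occurrence of "\n-- \n" to the end. The non-greediness operator ? does not change where the match starts (the regex engine always takes the leftmost match). It tells the engine to match the shortest possible form of the preceding pattern (.*), so the match stops at the first place $ can match. Without /m, that is either just before a trailing newline or at the very end of the string.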

What this does: In email communication the signature is usually separated from the message body by exactly this pattern: a line consisting of exactly two dashes and a single trailing space. Therefore what the regex does is remove everything beginning with the signature separator to the end.
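A minimal sketch of that behavior on a message with a conventional signature. The helper name strip_signature and the sample text are made up for illustration; only the substitution itself comes from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: drop everything from the signature separator
# ("-- " on a line of its own) up to, but not including, a final newline.
sub strip_signature {
    my ($content) = @_;
    $content =~ s/\n-- \n.*?$//s;
    return $content;
}

my $mail = "What is going on with order ABCD?\n-- \nSome Customer\n";
print strip_signature($mail);   # prints only the question line
```

A message without a separator line passes through unchanged, because the pattern never matches.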

Now what your customer does (either manually or via his email client) is add the quoted reply of the email after the signature separator. This is highly unusual: the quoted reply must be located before the signature separator. I don't know of a single email client that does this on purpose, but alas there are tons of programs out there that simply get email wrong (from charset issues through quoting to SMTP non-conformance, you can make an incredible number of mistakes), so I wouldn't be surprised to learn that there are indeed such clients.
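To see why quoted text placed after the separator disappears, here is a small sketch. The reply text is adapted from the question, with the separator assumed to be "-- " on a line of its own.

```perl
use strict;
use warnings;

# A reply where the quoted text comes AFTER the signature (assumed layout).
my $reply = join "\n",
    "I did not get it, it did not ship!!!",
    "-- ",
    "Some Customer",
    "",
    "It is shipping today",
    "";

# Work on a copy so $reply stays available for comparison.
(my $stripped = $reply) =~ s/\n-- \n.*?$//s;
print $stripped;   # only the first line survives; the quoted context is gone
```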

Another possibility is that this is an affectation of the client -- like signing his own name after --. However, I suspect this is not done manually as humans seldom insert a trailing space after two dashes followed by a line break.

0
ikegami On

When ? follows a quantifier (?, *, + or {m,n}), it modifies the greediness of that quantifier[1]. Normally, these quantifiers match as many characters as possible, but with ?, they match as few as possible.

say "Greedy:     ", "abc1234" =~ /\w(.*)\d/;
say "Non-greedy: ", "abc1234" =~ /\w(.*?)\d/;

Output:

Greedy:     bc123
Non-greedy: bc
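The same modifier works on the other quantifiers too. A short sketch on the same sample string (the variable names are only illustrative):

```perl
use strict;
use warnings;
use feature 'say';

my $s = "abc1234";

# A match in list context returns the captured groups.
my ($plus_greedy)  = $s =~ /(\d+)/;        # as many digits as possible
my ($plus_lazy)    = $s =~ /(\d+?)/;       # as few as possible, but at least one
my ($range_greedy) = $s =~ /(\d{2,3})/;    # prefers three digits
my ($range_lazy)   = $s =~ /(\d{2,3}?)/;   # prefers two digits

say "+      : $plus_greedy";    # 1234
say "+?     : $plus_lazy";      # 1
say "{2,3}  : $range_greedy";   # 123
say "{2,3}? : $range_lazy";     # 12
```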

Since there are two places $ can match (before a trailing newline or at the end of the string), this has the following effect:

$_ = "abc\n-- \ndef\n";
say "Greedy:     <<" . s/\n-- \n.*$//sr  . ">>";
say "Non-greedy: <<" . s/\n-- \n.*?$//sr . ">>";

Output:

Greedy:     <<abc>>
Non-greedy: <<abc
>>

It ensures the newline terminating the last line isn't removed. The following are more straightforward equivalents:

s/\n-- \n.*/\n/s

s/(?<=\n)-- \n.*//s   # Slow

s/\n\K-- \n.*//s      # Requires 5.10
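A quick way to compare these forms is to run each one on the same input. This is only a sketch. The labels in %forms are made up, and the second input (no trailing newline) is included because the rewrites agree with the original only when the string ends with a newline.

```perl
use strict;
use warnings;

# Each sub applies one form to a copy of its argument and returns the result.
my %forms = (
    original   => sub { my $s = shift; $s =~ s/\n-- \n.*?$//s;    $s },
    replace_nl => sub { my $s = shift; $s =~ s/\n-- \n.*/\n/s;    $s },
    lookbehind => sub { my $s = shift; $s =~ s/(?<=\n)-- \n.*//s; $s },
    keep       => sub { my $s = shift; $s =~ s/\n\K-- \n.*//s;    $s },
);

for my $input ("abc\n-- \ndef\n", "abc\n-- \ndef") {
    for my $name (sort keys %forms) {
        # Show newlines as a literal \n so the difference is visible.
        (my $shown = $forms{$name}->($input)) =~ s/\n/\\n/g;
        printf "%-10s %s\n", $name, $shown;
    }
}
```

For the second input, the original leaves "abc" while the three rewrites leave "abc" followed by a newline.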

Note that removal starts at the first \n-- \n, not the last.

$ perl -E'say "abc\n-- \ndef\n-- \nghi\n" =~ s/\n-- \n.*?$//sr'
abc

If you want to start removing from the last, you'll have to replace .* with something guaranteed not to match --.

$ perl -E'say "abc\n-- \ndef\n-- \nghi\n" =~ s/\n-- \n(?:(?!-- \n).)*?$//sr'
abc
-- 
def
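
If a regex is not required, cutting at the last separator can also be sketched with rindex. This is a different technique from the negative lookahead above, and strip_last_signature is a hypothetical name. It assumes the result should always end with a newline, even when the input did not.

```perl
use strict;
use warnings;

# Hypothetical helper: cut at the LAST "\n-- \n", located with rindex.
sub strip_last_signature {
    my ($content) = @_;
    my $pos = rindex($content, "\n-- \n");
    return $content if $pos < 0;               # no separator: leave untouched
    return substr($content, 0, $pos) . "\n";   # keep the newline ending the last kept line
}

print strip_last_signature("abc\n-- \ndef\n-- \nghi\n");
```

On this input it keeps abc, the first separator line and def, which matches the regex version above.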

Notes:

  1. It also has the same meaning if it follows another quantifier modifier (e.g. /.*+?/).
0
dms On

There is a nice CPAN module that can help you understand regular expressions in the future: YAPE::Regex::Explain

You can find an online version of it here: http://rick.measham.id.au/paste/explain.pl

Running your regex through the website returns the following:

NODE                     EXPLANATION
--------------------------------------------------------------------------------
  \n                       '\n' (newline)
--------------------------------------------------------------------------------
  --                       '-- '
--------------------------------------------------------------------------------
  \n                       '\n' (newline)
--------------------------------------------------------------------------------
  .*?                      any character except \n (0 or more times
                           (matching the least amount possible))
--------------------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string

According to the docs, "There is no support for regular expression syntax added after Perl version 5.6, particularly any constructs added in 5.10", but in practice you should still be able to use it to help understand most regexes you come across.
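
To run the module locally rather than through the website, something like the following should work. It is a sketch that assumes YAPE::Regex::Explain has been installed from CPAN. The wrapper explain_regex is a made-up name, and the pattern is passed as a plain string without the /s flag.

```perl
use strict;
use warnings;

# Returns the explanation text, or undef if the module is not installed.
sub explain_regex {
    my ($pattern) = @_;
    return unless eval { require YAPE::Regex::Explain; 1 };
    return YAPE::Regex::Explain->new($pattern)->explain;
}

my $text = explain_regex('\n-- \n.*?$');
print defined $text
    ? $text
    : "YAPE::Regex::Explain is not installed\n";
```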