Match single-line comments via regex in Notepad++


Why do these two regexes yield different results in Notepad++?

  1. //.*?\n|//.*$|\s+|. (2 matches)
  2. //.*?(?:\n|$)|\s+|. (3 matches)

Background

I'm writing a primitive lexer for Delphi in Perl. Its purpose is to extract words (identifiers and keywords), so it doesn't need to recognize every kind of token properly.

Its core is the following regex:

\{[^}]*\}|\(\*([^*]|\*[^\\])*?\*\)|[A-Za-z_]\w*|\d+|//.*?$|'([^']|'')*?'|\s+|.

What I found out by chance is that line endings were not consumed by line comments. So I was curious whether I could modify the regex so that two consecutive lines consisting entirely of line comments are counted as 2 "tokens".

// first line
// last line

I replaced //.*?$ with //.*?\n, but with this regex a line comment placed directly before the EOF (without a trailing newline) is not matched. Instead, it is broken into /, / and so on. So I searched for the right way to express the alternation. I found two regexes that behave differently in Notepad++ and grepWin but the same in Perl:

The actual difference was already shown in the introductory question:

  1. \{[^}]*\}|\(\*([^*]|\*[^\\])*?\*\)|[A-Za-z_]\w*|\d+|//.*?\n|//.*?$|'([^']|'')*?'|\s+|. (2 matches in the sample source above)

  2. \{[^}]*\}|\(\*([^*]|\*[^\\])*?\*\)|[A-Za-z_]\w*|\d+|//.*?(?:\n|$)|'([^']|'')*?'|\s+|. (3 matches in the sample source above)

It can be observed in Notepad++ (7.7.1 32-bit) and grepWin (1.9.2 64-bit). In Perl, where I place the regexes between m@( and )@mg, there are 2 matches with both.
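The end-of-file problem and the "2 matches with both" result can be reproduced outside Perl. The following is only a sketch in Python, whose re module treats \n, $ (with re.MULTILINE) and . much like Perl does on input that contains plain \n line breaks. The helper name tokens and the shortened patterns from the top of the question are used for illustration.

```python
import re

# Sample source with a plain "\n" line break and no newline at EOF.
SAMPLE = "// first line\n// last line"

def tokens(pattern, text):
    """Collect all matches left to right, similar to a /mg loop in Perl."""
    return [m.group(0) for m in re.finditer(pattern, text, re.MULTILINE)]

# Only "\n" may end a comment: the comment before EOF falls apart.
newline_only = tokens(r"//.*?\n|\s+|.", SAMPLE)

# The two alternations from the question.
two_branches = tokens(r"//.*?\n|//.*$|\s+|.", SAMPLE)
one_group = tokens(r"//.*?(?:\n|$)|\s+|.", SAMPLE)

print(newline_only[:3])                    # first comment, then single slashes
print(len(two_branches), len(one_group))   # both yield two tokens here
```

With \n-only input, both alternations consume the line break as part of the first comment, so each produces two tokens.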

There is 1 answer

Answer by Wolf

Windows Line Break Anatomy

The observed difference between Perl and the external tools is caused by the difference between \r\n and \n. If you read a text file in Perl in text mode on Windows, the newline sequence \r\n gets translated into \n, which is a single character, so the \n in the pattern matches the whole line break.

In Notepad++ and grepWin, this translation is not carried out. So //.*?(?:\n|$) never consumes the newline sequence. Instead, it stops at its beginning (right between e and \r), where the regex engine matches $. The \r remains in the input, and \s+ then matches the whole newline sequence (\r\n) as a third token.


//.*?\n, on the other hand, matches the \r with a . and after that the \n, so the comment token swallows the whole line break.
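The mechanism described above can be imitated in Python, with one loud assumption: Python's own $ does not match before \r\n, so the sketch swaps in a hypothetical lookahead named EOL to stand in for the $ of an engine that does. It is an emulation of the described behaviour, not a reproduction of the Notepad++ or grepWin engine.

```python
import re

# Sample source with a Windows line break and no newline at EOF.
CRLF_SAMPLE = "// first line\r\n// last line"

# Assumption: emulates a `$` that matches before "\r\n", before "\n",
# or at the very end of the input.
EOL = r"(?=\r\n|\n|\Z)"

def tokens(pattern, text):
    """Collect all matches left to right."""
    return [m.group(0) for m in re.finditer(pattern, text)]

# Regex 1: `.` eats the "\r", then "\n" matches, so the line break is consumed.
two_branches = tokens(r"//.*?\n|//.*" + EOL + r"|\s+|.", CRLF_SAMPLE)

# Regex 2: the lazy quantifier stops in front of "\r" because EOL succeeds there.
one_group = tokens(r"//.*?(?:\n|" + EOL + r")|\s+|.", CRLF_SAMPLE)

print(two_branches)  # two tokens
print(one_group)     # three tokens: comment, "\r\n", comment
```

The lazy .*? checks the end-of-line alternative at every position, so in the second pattern it succeeds one character earlier than the \n branch ever could.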

If you change the newline in the pattern into \r\n for the external tools, both alternatives give two matches:

  • //.*?\r\n|//.*$|\s+|.

  • //.*?(?:\r\n|$)|\s+|.
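A quick check of the two \r\n variants, again as a Python sketch under the same assumption as before: the hypothetical EOL lookahead stands in for a $ that also matches before \r\n, since Python's own $ does not.

```python
import re

CRLF_SAMPLE = "// first line\r\n// last line"
LF_SAMPLE = "// first line\n// last line"

# Assumption: emulates a `$` that matches before "\r\n", before "\n",
# or at the very end of the input.
EOL = r"(?=\r\n|\n|\Z)"

def tokens(pattern, text):
    """Collect all matches left to right."""
    return [m.group(0) for m in re.finditer(pattern, text)]

TWO_BRANCHES = r"//.*?\r\n|//.*" + EOL + r"|\s+|."
ONE_GROUP = r"//.*?(?:\r\n|" + EOL + r")|\s+|."

# With "\r\n" spelled out, both variants consume the Windows line break.
crlf_counts = (len(tokens(TWO_BRANCHES, CRLF_SAMPLE)),
               len(tokens(ONE_GROUP, CRLF_SAMPLE)))

# On "\n"-only input the same patterns leave the line break to \s+ again.
lf_counts = (len(tokens(TWO_BRANCHES, LF_SAMPLE)),
             len(tokens(ONE_GROUP, LF_SAMPLE)))

print(crlf_counts, lf_counts)
```

In this emulation the \r\n variants give two tokens on the Windows sample but three on a file with bare \n line breaks, so the pattern has to agree with the line endings the tool actually sees.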