Match single-line comments via regex in Notepad++

Question

Asked by Wolf on 14 August 2019 at 13:07 · 142 views

Why do these two regexes yield different results in Notepad++?

//.*?\n|//.*$|\s+|. (2 matches → screenshot)
//.*?(?:\n|$)|\s+|. (3 matches → screenshot)

Background

I'm writing a primitive lexer for Delphi in Perl. Its purpose is to extract words (identifiers and keywords), so it doesn't need to recognize every kind of token properly.

Its core is the following regex:

\{[^}]*\}|$\*([^*]|\*[^\\])*?\*$|[A-Za-z_]\w*|\d+|//.*?$|'([^']|'')*?'|\s+|.

What I found out by chance is that line endings were not consumed by line comments. So I was curious whether I could modify the regex so that two consecutive lines consisting entirely of line comments are counted as 2 "tokens".

// first line
// last line
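The unconsumed line ending can be reproduced outside Perl. The following is a minimal Python sketch, assuming LF line endings and a reduced pattern that keeps only the comment, whitespace and catch-all alternatives; the helper name `tokens` is made up for illustration.

```python
import re

SAMPLE = "// first line\n// last line"  # assumed LF line endings


def tokens(pattern, text):
    """Return every match of the alternation, in order (hypothetical helper)."""
    return [m.group(0) for m in re.finditer(pattern, text, re.MULTILINE)]


# '$' is zero-width: it matches right before the newline without consuming it,
# so the newline is left over and picked up by \s+ as a token of its own.
result = tokens(r"//.*?$|\s+|.", SAMPLE)
print(result)
```

With this pattern the two comment lines come out as three tokens: the first comment, the newline, and the second comment.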

I replaced //.*?$ by //.*?\n, but with this regex a line comment placed directly before the EOF (without a newline) is not matched; instead, it is broken into /, / and so on. So I searched for the right way to express the alternation. I found two regexes that behave differently in Notepad++ and grepWin but the same in Perl.

The actual difference was already shown in the introductory question:

\{[^}]*\}|$\*([^*]|\*[^\\])*?\*$|[A-Za-z_]\w*|\d+|//.*?\n|//.*?$|'([^']|'')*?'|\s+|. (2 matches in above sample source)
\{[^}]*\}|$\*([^*]|\*[^\\])*?\*$|[A-Za-z_]\w*|\d+|//.*?(?:\n|$)|'([^']|'')*?'|\s+|. (3 matches in above sample source)

The difference can be observed in Notepad++ (7.7.1 32-bit) and grepWin (1.9.2 64-bit). In Perl, where I place the regexes between m@( and )@mg, both give 2 matches.
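The Perl-side observation can be cross-checked with a small Python sketch. It assumes that Python's `re` with `re.MULTILINE` is close enough to Perl's `/m` for these patterns, uses LF-only input, and works with the reduced patterns from the top of the question rather than the full lexer regex. The helper `tokens` and the constant names are hypothetical.

```python
import re

SAMPLE = "// first line\n// last line"  # LF only


def tokens(pattern, text):
    """Return all matches of the alternation in order (hypothetical helper)."""
    return [m.group(0) for m in re.finditer(pattern, text, re.MULTILINE)]


SEPARATE = r"//.*?\n|//.*$|\s+|."   # first variant from the question
GROUPED = r"//.*?(?:\n|$)|\s+|."    # second variant from the question
NEWLINE_ONLY = r"//.*?\n|\s+|."     # the intermediate attempt without '$'

separate_tokens = tokens(SEPARATE, SAMPLE)
grouped_tokens = tokens(GROUPED, SAMPLE)
# A comment directly before EOF, with no trailing newline:
eof_tokens = tokens(NEWLINE_ONLY, "// last line")

print(len(separate_tokens), len(grouped_tokens))
print(eof_tokens[:3])
```

On LF input both variants yield the same two tokens, while the newline-only attempt falls back to the catch-all alternatives and splits the final comment into `/`, `/`, a space, and so on.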

There is 1 answer

**Wolf** · Answer 1 · 2019-08-19T10:03:40+00:00

Windows Line Break Anatomy

The observed difference between Perl and the external tools is caused by the difference between \r\n and \n. When you read a text file in Perl (in text mode on Windows), the newline sequence \r\n is translated into \n, which is a single character, so \n in the pattern matches the whole line break.
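Python's text layer offers a comparable translation, which makes the effect easy to inspect. The sketch below is an analogy only, not Perl's actual I/O layer: it decodes the same CRLF bytes once with newline translation (`newline=None`) and once without (`newline=""`). The helper `read_text` is hypothetical.

```python
import io

RAW = b"// first line\r\n// last line"  # bytes as stored in a Windows text file


def read_text(raw, newline):
    """Decode bytes through a text wrapper with the given newline mode."""
    wrapper = io.TextIOWrapper(io.BytesIO(raw), encoding="ascii", newline=newline)
    return wrapper.read()


translated = read_text(RAW, None)   # universal newlines: "\r\n" becomes "\n"
untranslated = read_text(RAW, "")   # line endings are passed through unchanged

print(repr(translated))
print(repr(untranslated))
```

A regex applied to `translated` only ever sees `\n`, whereas one applied to `untranslated` has to deal with the two-character sequence.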

In Notepad++ and grepWin, this translation is not carried out. So //.*?(?:\n|$) never consumes the newline sequence; instead, it stops at its beginning (right between e and \r), where the regex engine matches $. The \r remains in the input, and \s+ then matches the whole newline sequence (\r\n).

//.*?\n, on the other hand, matches the \r with a . and then the \n.
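The two behaviours can be imitated in Python to make the mechanics visible. This is a sketch under an assumption taken from the description above: `$` also matches right before `\r\n`, while `.` still matches `\r`. Python's own `$` does not behave that way, so the hypothetical constant `EOL` spells it out as a lookahead. Nothing here calls Notepad++ or grepWin.

```python
import re

CRLF_SAMPLE = "// first line\r\n// last line"

# Stand-in for '$' as described above: end of input, or just before "\n" / "\r\n".
EOL = r"(?=\r?\n|\Z)"


def tokens(pattern, text):
    """Return all matches of the alternation in order (hypothetical helper)."""
    return [m.group(0) for m in re.finditer(pattern, text)]


# Grouped form: the lazy quantifier stops as soon as EOL succeeds before "\r".
grouped = tokens(r"//.*?(?:\n|" + EOL + r")|\s+|.", CRLF_SAMPLE)
# Separate form: '.' runs over "\r", then "\n" is matched literally.
separate = tokens(r"//.*?\n|//.*" + EOL + r"|\s+|.", CRLF_SAMPLE)

print(grouped)
print(separate)
```

In this emulation the grouped form produces three tokens, with `\r\n` as a separate whitespace token, and the separate form produces two, matching the counts reported in the question.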

If you change the newline in the pattern into \r\n for the external tools, both alternatives give two matches:

//.*?\r\n|//.*$|\s+|.
//.*?(?:\r\n|$)|\s+|.
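These two patterns can be checked in a Python emulation as well. The sketch assumes, as the explanation above implies, that `$` in the external tools also matches right before `\r\n`; the hypothetical constant `EOL` expresses that as a lookahead because Python's own `$` differs. The third pattern, with `\r?\n`, is not part of this answer; it is included only as a possible variant for input whose line endings are not known in advance.

```python
import re

EOL = r"(?=\r?\n|\Z)"  # assumed stand-in for '$' in the external tools

CRLF = "// first line\r\n// last line"
LF = "// first line\n// last line"


def count(pattern, text):
    """Count the matches of the alternation (hypothetical helper)."""
    return sum(1 for _ in re.finditer(pattern, text))


SEPARATE = r"//.*?\r\n|//.*" + EOL + r"|\s+|."
GROUPED = r"//.*?(?:\r\n|" + EOL + r")|\s+|."
EITHER = r"//.*?(?:\r?\n|" + EOL + r")|\s+|."  # not from the answer

crlf_counts = [count(p, CRLF) for p in (SEPARATE, GROUPED, EITHER)]
lf_counts = [count(p, LF) for p in (GROUPED, EITHER)]

print(crlf_counts)
print(lf_counts)
```

On CRLF input all three patterns give two matches. On LF input the `\r\n` form goes back to three matches, while the `\r?\n` variant stays at two.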
