Language Syntax Highlight - Comment Line Starts With * may or may not have following words


I am creating a syntax highlight file for a language and I have everything mapped out and working with one exception.

I cannot come up with a regex that will match the following conditions for a specific line comment style.

If the first non-whitespace character on a line is an asterisk (*), the line is considered a comment.

I have created many patterns that work in regexr, but none of them ever match in VS Code.

For example, regexr is cool with this: ^(?:\s*)\*+(?:.*)?\n

So I convert it into the proper format for the tmlanguage.json file: ^(?:\\s*)\\*+(?:.*)?\\n
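
One way to rule out an escaping mistake is to decode the JSON string and compare it with the pattern typed into the tester. The sketch below uses Python's json module purely as a checker; the variable names are illustrative and not part of any grammar file.

```python
import json

# The pattern exactly as typed into a regex tester such as regexr.
tester_pattern = r"^(?:\s*)\*+(?:.*)?\n"

# The same pattern as it appears in the tmlanguage.json file,
# including the double quotes that delimit the JSON string.
json_literal = r'"^(?:\\s*)\\*+(?:.*)?\\n"'

# Decoding the JSON string gives the regex the engine actually receives.
decoded = json.loads(json_literal)
print(decoded)
print(decoded == tester_pattern)

# Going the other way, json.dumps produces the escaped form to paste in.
print(json.dumps(tester_pattern))
```

If the decoded string equals the tester pattern, the doubling of backslashes is correct and the problem lies elsewhere.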

But it is not capturing properly: if the first character of the line is an *, it does not match, but if the line starts with a whitespace character followed by an *, it does.

I suck at formatting on stackoverflow, so <tab> represents a chr(9) tab character and each leading blank is a space.

It should work in these cases:

*******************************
  *****************************
<tab>*************************
* comment
  * comment
<tab>* comment

But it shouldn't work in these cases:
string *******************************
  string ***************************** string
<tab>string *************************
x *= 3

I am guessing that either the anchor ^ isn't working in my regex or I am escaping something incorrectly.

Any advice?

Please see sample image attached: screenshot


There are 2 answers

Someone On

Apparently VS Code's syntax highlighter works on a single line at a time. No matter how often I tried regexes that span several lines, they never matched.

Second, if you're designing the language yourself, I suggest not using an arithmetic operator to mark comments.

Third, apparently you can match newlines in the begin and end attributes. You can try it there.

AudioBubble On

I don't know the regex engine you're using. I'm just going to give you some
general tips on how it should be done.

  • First off, if you're reading a string with more than one newline in it,
    the anchor ^, in an engine's default state, means Beginning of String (BOS).

What you want in this case is multi-line mode. This makes the anchor ^ match at the Beginning of Line (BOL) as well as at the BOS.

  • Second, you don't need those non capture groups (?:\s*) (?:.*), they encapsulate single constructs.

  • Third, it is redundant to make a group optional when its enclosed contents are optional (?:.*)?

  • Fourth, you don't need the newline \n construct at the end, since it should not be highlighted anyway, and it might not be present on the last line of text.
    If it is missing there, the regex will not match that line.
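
The tips above can be checked with a short sketch. It uses Python's re module as a stand-in engine, which is an assumption: VS Code itself uses a different engine, but these constructs behave the same way in Python. The sample text is made up for illustration.

```python
import re

# Three lines; the last one deliberately has no trailing newline.
text = "x = 1\n* first comment\n* last comment"

# Default mode: ^ matches only at the beginning of the string,
# so no comment line is found because the string starts with "x".
default_mode = re.findall(r"^\s*\*.*", text)

# Multi-line mode: ^ also matches at the beginning of every line.
multi_line = re.findall(r"(?m)^\s*\*.*", text)

# Requiring \n at the end loses the final line, which has no newline.
with_newline = re.findall(r"(?m)^\s*\*.*\n", text)

# The non-capturing groups and the optional wrapper change nothing.
verbose = re.findall(r"(?m)^(?:\s*)\*+(?:.*)?", text)

print(default_mode)
print(multi_line)
print(with_newline)
print(verbose == multi_line)
```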


So, putting it all together, the modified regex would be (?m)^\s*\*.*

Explained

 (?m)     # Inline modifier: multi-line mode
 ^        # Beginning of line
 \s*      # Zero or more whitespace characters
 \*       # One required asterisk
 .*       # Optional rest of the line (any non-newline characters)
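
As a quick check, the modified regex can be run against the sample lines from the question. This is a minimal sketch in Python's re module, assumed here as a convenient test bed rather than the engine VS Code uses; <tab> from the question is written as \t.

```python
import re

COMMENT = re.compile(r"(?m)^\s*\*.*")

# The first six lines should be comments, the last four should not.
text = "\n".join([
    "*******************************",
    "  *****************************",
    "\t*************************",
    "* comment",
    "  * comment",
    "\t* comment",
    "string *******************************",
    "  string ***************************** string",
    "\tstring *************************",
    "x *= 3",
])

# Collect the full text of every match, leading whitespace included.
matches = [m.group(0) for m in COMMENT.finditer(text)]
for line in matches:
    print(repr(line))
```

One thing to keep in mind: \s also matches newlines, so when the whole text is searched at once, a match can begin on a blank line that precedes a comment. Using [ \t]* in place of \s* would avoid that if it matters.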

Note that you could put a single capture group around the data
if you need to reference it in a replace: (?m)^(\s*\*.*)

Also, the language you're using should have a way to specify options when compiling the regex. If the engine doesn't accept the inline modifier (?m), take it out and specify that option when compiling the regex instead.
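
For illustration, this is how the two variants look in Python, where the option is re.MULTILINE. It is only a sketch of the idea; other languages spell the flag differently, and the sample text and bracket replacement are made up.

```python
import re

# Inline modifier inside the pattern.
inline = re.compile(r"(?m)^(\s*\*.*)")

# Same pattern with the option passed at compile time instead.
flagged = re.compile(r"^(\s*\*.*)", re.MULTILINE)

text = "x *= 3\n  * note\ny = 2"

# Reference capture group 1 in the replacement to wrap each comment line.
result = flagged.sub(r"[\1]", text)
print(result)

# Both ways of enabling multi-line mode give the same result.
print(inline.sub(r"[\1]", text) == result)
```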