I'm looking for an algorithm that will allow me to parse inline markdown elements (so strong emphasis between ** or __, regular emphasis between * or _, etc.) that avoids self-nesting these elements.
The CommonMark algorithm for this does not prevent self-nesting, so an input like **some **random** text** will get interpreted as <strong>some <strong>random</strong> text</strong>.
Since nesting <strong> or <em> tags does not make the text even more bold or italic, I don't think it makes sense to parse the input in this way and would prefer to leave the asterisks that do not contribute to any additional styling, so the output would look like <strong>some **random** text</strong>.
I know I could fix this in the code generation stage instead of the parsing stage, by letting the AST visitor check if we're already emitting into a <strong> tag and then based on that deciding to emit asterisks instead, but since underscores can be used as well, the delimiter character used in the source text now also has to be added to the token, and this overall just feels like the wrong thing to do anyway. The text shouldn't have been parsed this way from the beginning.
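For reference, the code-generation workaround described above could look roughly like this. It is only a sketch: the node shapes (type, delim, children, value) and the render function are hypothetical and not taken from any particular parser, and HTML escaping is left out.

```javascript
// Hypothetical AST node shapes (assumptions for illustration only):
//   { type: "text", value: "..." }
//   { type: "strong" | "em", delim: "**" | "__" | "*" | "_", children: [...] }

// Renders nodes to HTML; `open` holds the emphasis kinds we are already inside.
function render(nodes, open = new Set()) {
  let html = "";
  for (const node of nodes) {
    if (node.type === "text") {
      html += node.value; // no HTML escaping in this sketch
    } else if (open.has(node.type)) {
      // Already inside the same kind of emphasis: put the source delimiters back.
      html += node.delim + render(node.children, open) + node.delim;
    } else {
      // First level of this kind: emit a real tag and remember that it is open.
      const inner = render(node.children, new Set(open).add(node.type));
      html += `<${node.type}>${inner}</${node.type}>`;
    }
  }
  return html;
}
```

Note that every emphasis node has to carry its original delimiter (delim) just so the renderer can put it back, which is exactly the extra bookkeeping I would like to avoid.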
The issue with the CommonMark algorithm is that it works backwards from each closing delimiter to find a matching opening delimiter (instead of the other way around, which seems to be more common in the other markdown-variant parsers I have found). As a result, it will wrap nodes that were already marked as emphasis earlier into a new emphasis node.
One thought I had was to search for the opening delimiter furthest from the closing delimiter instead of the closest, but that would mean inputs such as **some** random **text** would get interpreted as <strong>some** random **text</strong>, effectively allowing only a single emphasis in the whole line of text.
So the question is:
Does an algorithm that does this already exist? Or is there a way to modify the CommonMark algorithm to prevent this from happening?
The idea
I think the correct way to approach this is to tokenise the text and populate a token array. This way you can also remove disallowed special characters.
Tokens can be words, special characters or emphasis tags (like **)
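As a rough illustration of the tokenising step, here is a minimal sketch. The function name tokenise and the token shape { type, value } are assumptions made for this example, and it simplifies things by splitting only into delimiter tokens and runs of other text rather than individual words and special characters.

```javascript
// Splits text into delimiter tokens (**, __, *, _) and runs of everything else.
// The two-character delimiters come first in the pattern so that "**" is not
// read as two separate "*" tokens.
function tokenise(text) {
  const tokens = [];
  for (const m of text.matchAll(/(\*\*|__|\*|_)|([^*_]+)/g)) {
    if (m[1] !== undefined) {
      tokens.push({ type: "delim", value: m[1] });
    } else {
      tokens.push({ type: "text", value: m[2] });
    }
  }
  return tokens;
}
```

For example, tokenise("**some** text") yields three tokens: a delimiter **, the text some, a delimiter **, followed by a fourth token holding the text " text". Joining the value fields of all tokens always gives back the original input.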
Description of what the program does
The idea behind the below code is:
parsedText to be returned)
Example implementation
And here is a JavaScript implementation of this:
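A minimal sketch along these lines is shown below. Apart from parsedText, every name in it (parseInline, KIND, canOpen, canClose, role) is hypothetical. It also assumes a much simplified version of the CommonMark flanking rules (a delimiter can open if it is followed by a non-space character and can close if it is preceded by one) and does no HTML escaping. Delimiters are first paired like brackets, and only then is it decided which pairs become tags, so a pair nested inside emphasis of the same kind is written back as literal text.

```javascript
const KIND = { "**": "strong", "__": "strong", "*": "em", "_": "em" };

function parseInline(text) {
  // 1. Tokenise into delimiters and plain-text runs.
  const tokens = [];
  for (const m of text.matchAll(/(\*\*|__|\*|_)|[^*_]+/g)) {
    const value = m[0];
    const isDelim = m[1] !== undefined;
    const end = m.index + value.length;
    tokens.push({
      value,
      kind: isDelim ? KIND[value] : null,
      // Simplified flanking: openers precede non-space, closers follow non-space.
      canOpen: isDelim && end < text.length && !/\s/.test(text[end]),
      canClose: isDelim && m.index > 0 && !/\s/.test(text[m.index - 1]),
      role: null, // becomes "open" or "close" once the token is paired
    });
  }

  // 2. Pair each closer with the nearest unmatched opener of the same string.
  const openers = []; // indices into tokens
  tokens.forEach((tok, i) => {
    if (tok.canClose) {
      for (let s = openers.length - 1; s >= 0; s--) {
        const opener = tokens[openers[s]];
        if (opener.value === tok.value) {
          opener.role = "open";
          tok.role = "close";
          openers.length = s; // unmatched openers in between stay literal
          return;
        }
      }
    }
    if (tok.canOpen) openers.push(i);
  });

  // 3. Emit. Only the outermost pair of each kind becomes a tag; pairs nested
  //    inside the same kind are written back as their original delimiters.
  let parsedText = "";
  const depth = { strong: 0, em: 0 };
  for (const tok of tokens) {
    if (tok.role === "open") {
      parsedText += depth[tok.kind]++ === 0 ? `<${tok.kind}>` : tok.value;
    } else if (tok.role === "close") {
      parsedText += --depth[tok.kind] === 0 ? `</${tok.kind}>` : tok.value;
    } else {
      parsedText += tok.value;
    }
  }
  return parsedText;
}

console.log(parseInline("**some **random** text**"));
// <strong>some **random** text</strong>
console.log(parseInline("**some** random **text**"));
// <strong>some</strong> random <strong>text</strong>
```

Because pairing happens before any tag is chosen, an inner pair is only demoted when the outer pair really closes: **some **random** text (no final closer) gives **some <strong>random</strong> text.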
If there are any problems or misunderstood requirements, let me know.