Mixing Parser Char (lexer?) vs. Parser String

I've written several compilers and am familiar with lexers, regexes/NFAs/DFAs, parsers, and semantic rules in flex/bison, JavaCC, JavaCup, ANTLR4, and so on.

Is there some sort of magical monadic operator that seamlessly grows/combines a token from a mix of Parser Char (i.e., Text.Megaparsec.Char) and Parser String parsers?

Is there a way, or are there best practices, to represent a clean separation between lexing tokens and nonterminal expectations?

There are 2 answers

Answer by K. A. Buhr (best answer)

Typically, one uses applicative operations to combine Parser Char and Parser String parsers directly, rather than "upgrading" the former. For example, a parser for alphanumeric identifiers that must start with a letter would probably look like:

ident :: Parser String
ident = (:) <$> letterChar <*> many alphaNumChar

If you were doing something more complicated, like parsing dollar amounts with optional cents, for example, you might write:

dollars :: Parser String
dollars = (:) <$> char '$' <*> some digitChar
          <**> pure (++)
          <*> option "" ((:) <$> char '.' <*> replicateM 2 digitChar)
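To try the two parsers above, something like the following harness might work. It is only a sketch: the `Parser` alias (no custom error type, `String` input) is an assumption, since the answer does not say how `Parser` is defined, and the import lists are what these snippets appear to need.

```haskell
import Control.Applicative ((<**>))
import Control.Monad (replicateM)
import Data.Void (Void)
import Text.Megaparsec (Parsec, many, option, parseMaybe, some)
import Text.Megaparsec.Char (alphaNumChar, char, digitChar, letterChar)

-- Assumed parser type: no custom error component, String input.
type Parser = Parsec Void String

-- A letter followed by zero or more alphanumeric characters.
ident :: Parser String
ident = (:) <$> letterChar <*> many alphaNumChar

-- '$', one or more digits, then optionally '.' and exactly two digits.
dollars :: Parser String
dollars = (:) <$> char '$' <*> some digitChar
          <**> pure (++)
          <*> option "" ((:) <$> char '.' <*> replicateM 2 digitChar)
```

In GHCi, `parseMaybe ident "x1"` and `parseMaybe dollars "$12.50"` should return the whole input wrapped in `Just`, while `parseMaybe ident "1x"` should return `Nothing` because `parseMaybe` requires the parser to consume all of the input.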

If you find yourself building a Parser String out of complicated sequences of Parser Char and Parser String parsers in many places, you could define a few helper operators like the ones below. (If the variety of operators is annoying, you could instead define only (<++>) plus a short name for charToStr, such as c :: Parser Char -> Parser String.)

(<.+>) :: Parser Char -> Parser String -> Parser String
p <.+> q = (:) <$> p <*> q
infixr 5 <.+>

(<++>) :: Parser String -> Parser String -> Parser String
p <++> q = (++) <$> p <*> q
infixr 5 <++>

(<..>) :: Parser Char -> Parser Char -> Parser String
p <..> q = p <.+> fmap (:[]) q
infixr 5 <..>

so you can write something like:

dollars' :: Parser String
dollars' = char '$' <.+> some digitChar 
           <++> option "" (char '.' <.+> digitChar <..> digitChar)

As @leftaroundabout says, there's nothing hackish about fmap (:[]). If you think it looks clearer, write fmap (\c -> [c]) instead.
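The single-operator alternative mentioned earlier in this answer (only `(<++>)` plus a short lifting function `c`) could be sketched as follows. The `Parser` alias and the name `dollars2` are assumptions made for illustration; they do not come from the answer.

```haskell
import Data.Void (Void)
import Text.Megaparsec (Parsec, option, parseMaybe, some)
import Text.Megaparsec.Char (char, digitChar)

-- Assumed parser type: no custom error component, String input.
type Parser = Parsec Void String

-- Lift a single-character parser to a parser of a one-element string.
c :: Parser Char -> Parser String
c = fmap (:[])

-- Run two string parsers in sequence and concatenate their results.
(<++>) :: Parser String -> Parser String -> Parser String
p <++> q = (++) <$> p <*> q
infixr 5 <++>

-- Hypothetical name: the dollar-amount parser using only c and (<++>).
dollars2 :: Parser String
dollars2 = c (char '$') <++> some digitChar
           <++> option "" (c (char '.') <++> c digitChar <++> c digitChar)
```

The trade-off is a little more noise at each character parser in exchange for having only one operator to remember.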

Answer by leftaroundabout

There's nothing nasty or hackish about fmap (: []) (or fmap pure or pure <$>) – it's the natural thing to do, performing a conversion that's concise, safe, expressive and transparent all at the same time.

An alternative that I wouldn't really recommend, though in some situations it might express the intent best, is sequence [charParser]. This makes it clear that you're executing “all” of the parsers in a list of character-parsers, and gathering the result“s” as a list of character“s”.
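As a rough illustration, the three spellings can be put side by side. The `Parser` alias and the names `viaSection`, `viaPure`, `viaSequence`, and `ab` are assumptions made for this sketch.

```haskell
import Data.Void (Void)
import Text.Megaparsec (Parsec, parseMaybe)
import Text.Megaparsec.Char (char, letterChar)

-- Assumed parser type: no custom error component, String input.
type Parser = Parsec Void String

-- Three ways to turn a Parser Char into a Parser String.
viaSection, viaPure, viaSequence :: Parser String
viaSection  = fmap (:[]) letterChar   -- wrap the result in a singleton list
viaPure     = fmap pure letterChar    -- pure at the list type does the same
viaSequence = sequence [letterChar]   -- run a one-element list of parsers

-- sequence extends naturally to several character parsers in a row.
ab :: Parser String
ab = sequence [char 'a', char 'b']
```

All three single-character versions should accept exactly one letter and return it as a one-character string; `ab` accepts the input "ab".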