I am stuck at the following parsing problem:
Parse some text string that may contain zero or more elements from a limited character set, up to but not including one of a set of termination characters. Content/no content should be indicated through Maybe. Termination characters may appear in the string in escaped form. Parsing should fail on any inadmissible character.
This is what I came up with (simplified):
import qualified Text.Megaparsec as MP
-- Predicate for admissible characters, not including the control characters.
isAdmissibleChar :: Char -> Bool
...
-- Predicate for control characters that need to be escaped.
isControlChar :: Char -> Bool
...
-- The escape character.
escChar :: Char
...
pComponent :: Parser (Maybe Text)
pComponent = do
t <- MP.many (escaped <|> regular)
if null t then return Nothing else return $ Just (T.pack t)
where
regular = MP.satisfy isAdmissibleChar <|> fail "Inadmissible character"
escaped = do
_ <- MC.char escChar
MP.satisfy isControlChar -- only control characters may be escaped
Say, admissible characters are uppercase ASCII, escape is '\', and control is ':'.
Then, the following parses correctly: ABC\:D:EF to yield ABC:D.
However, parsing ABC&D, where & is inadmissible, does yield ABC whereas I would expect an error message instead.
Two questions:
- Why does
failend parsing instead of failing the parser? - Is the above approach sensible to approach the problem, or is there a "proper", canonical way to parse such terminated strings that I am not aware of?
manyhas to allow its sub-parser to fail once without the whole parse failing - for examplemany (char 'A') *> char 'B', while parsing "AAAB", has to fail to parse the B to know it got to the end of the As.You might want
manyTillwhich allows you to recognise the terminator explicitly. Something like this:"ABC&D" would give an error here assuming '&' isn't accepted by isControlChar.
Or if you want to parse more than one component you might keep your existing definition of pComponent and use it with
sepByor similar, like:If you also check for end-of-file after this, like:
then "ABC&D" should give an error again, because the '&' will end the first component but will not be accepted as a separator.