attoparsec: "nested" parsers -- parse a subset of the input with a different parser

1.1k views Asked by At

Well in fact I'm pretty sure I'm using the wrong terminology. Here is the problem I want to solve: a parser for the markdown format, well a subset of it.

My problem is with the blockquote feature. Each line in a blockquote starts with >; otherwise everything is the normal structure in a markdown file.

You can't look at individual lines separately, because you need to separate paragraphs from normal lines, eg

> a
> b

is not the same as

> a
>
> b

and things like that (same if a list is blockquoted you don't want x lists but one list with x elements). A natural and trivial way is to "take off" the > signs, parse the blockquote on its own, ignoring anything around it, wrap it with a BlockQuote type constructor, put that in the outer AST and resume parsing of the original input. It's what pango does if I'm not wrong:

https://hackage.haskell.org/package/pandoc-1.14.0.4/docs/src/Text-Pandoc-Readers-Markdown.html#blockQuote

blockQuote :: MarkdownParser (F Blocks)
blockQuote = do
  raw <- emailBlockQuote
  -- parse the extracted block, which may contain various block elements:
  contents <- parseFromString parseBlocks $ (intercalate "\n" raw) ++ "\n\n"
  return $ B.blockQuote <$> contents

And then:

http://hackage.haskell.org/package/pandoc-1.5.1/docs/src/Text-Pandoc-Shared.html#parseFromString

-- | Parse contents of 'str' using 'parser' and return result.
parseFromString :: GenParser tok st a -> [tok] -> GenParser tok st a
parseFromString parser str = do
  oldPos <- getPosition
  oldInput <- getInput
  setInput str
  result <- parser
  setInput oldInput
  setPosition oldPos
  return result

Now parseFromString looks quite hacky to me and besides that it's also Parsec not attoparsec so I can't use it in my project. I'm not sure how I could take that Text from the blockquote and parse it and return the parsing result so that it "fits" within the current parsing. Seems impossible?

I've been googling on the issue and I think that pipes-parse and conduit can help on that area although I struggle to find examples and what I see appears considerably less nice to look at than "pure" parsec/attoparsec parsers.

Other options to parse blockquotes would be to rewrite the usual parsers but with the > catch... Complicating and duplicating a lot. Parsing blockquotes counting each line separately and writing some messy "merge" function. Or parsing to a first AST that would contain the blockquotes as Text inside a first BlockquoteText type constructor waiting for a transformation where they would be parsed separately, not very elegant but it has the benefit of simplicity, which does count for something.

I would probably go for the latter, but surely there's a better way?

1

There are 1 answers

0
trevor cook On

I have asked myself the same question. Why is there no standard combinator for nested parsers like how you describe? My default mode is to trust the package author, especially when that author also co-wrote "Real World Haskell". If such an obvious capability is missing, perhaps it is by design and I should look for a better way. However, I've managed to convince myself that such a convenient combinator is mostly harmless. Useful whenever an all-or-nothing type parser is appropriate for the inner parse.

Implementation

import Data.Attoparsec.Text
import qualified Data.Text as T
import Data.Text(Text)
import Control.Applicative

I've divided the required functionality into two parsers. The first, constP, performs an "in place" parse of some given text. It substitutes the constant parser's fail with empty (from Alternative), but otherwise has no other side effects.

constP :: Parser a -> Text -> Parser a
constP p t = case parseOnly p t of
  Left _ -> empty
  Right a -> return a

The second part comes from parseOf, which performs the constant, inner parse based on the result of the outer parse. The empty alternative here allows a failed parse to return without consuming any input.

parseOf :: Parser Text -> Parser a -> Parser a
parseOf ptxt pa = bothParse <|> empty
  where
    bothParse = ptxt >>= constP pa

The block quote markdown can be written in the desired fashion. This implementation requires the resulting block to be totally parsed.

blockQuoteMarkdown :: Parser [[Double]]
blockQuoteMarkdown = parseOf blockQuote ( markdownSurrogate <* 
                                          endOfInput
                                        )

Instead of the actual markdown parser, I just implemented a quick parser of space separated doubles. The complication of the parser comes from allowing the last, non-empty line, either end in a new line or not.

markdownSurrogate :: Parser [[Double]]
markdownSurrogate = do
  lns <- many (mdLine <* endOfLine)
  option lns ((lns ++) . pure <$> mdLine1)
  where
    mdLine = sepBy double (satisfy (==' '))
    mdLine1 = sepBy1 double (satisfy (==' '))

These two parsers are responsible for returning the text internal to block quotes.

blockQuote :: Parser Text
blockQuote = T.unlines <$> many blockLine

blockLine :: Parser Text
blockLine = char '>' *> takeTill isEndOfLine <* endOfLine

Finally, a test of the parser.

parseMain :: IO ()
parseMain = do

  putStrLn ""
  doParse "a" markdownSurrogate a
  doParse "_" markdownSurrogate ""
  doParse "b" markdownSurrogate b
  doParse "ab" markdownSurrogate ab
  doParse "a_b" markdownSurrogate a_b
  doParse "badMarkdown x" markdownSurrogate x
  doParse "badMarkdown axb" markdownSurrogate axb

  putStrLn ""
  doParse "BlockQuote ab" blockQuoteMarkdown $ toBlockQuote ab
  doParse "BlockQuote a_b" blockQuoteMarkdown $ toBlockQuote a_b
  doParse "BlockQuote axb" blockQuoteMarkdown $ toBlockQuote axb
  where
    a = "7 3 1"
    b = "4 4 4"
    x = "a b c"

    ab = T.unlines [a,b]
    a_b = T.unlines [a,"",b]
    axb = T.unlines [a,x,b]

    doParse desc p str = do
      print $ T.concat ["Parsing ",desc,": \"",str,"\""]
      let i = parse (p <* endOfInput ) str
      print $ feed i ""

    toBlockQuote = T.unlines
                 . map (T.cons '>')
                 . T.lines

*Main> parseMain

"Parsing a: \"7 3 1\""
Done "" [[7.0,3.0,1.0]]
"Parsing _: \"\""
Done "" []
"Parsing b: \"4 4 4\""
Done "" [[4.0,4.0,4.0]]
"Parsing ab: \"7 3 1\n4 4 4\n\""
Done "" [[7.0,3.0,1.0],[4.0,4.0,4.0]]
"Parsing a_b: \"7 3 1\n\n4 4 4\n\""
Done "" [[7.0,3.0,1.0],[],[4.0,4.0,4.0]]
"Parsing badMarkdown x: \"a b c\""
Fail "a b c" [] "endOfInput"
"Parsing badMarkdown axb: \"7 3 1\na b c\n4 4 4\n\""
Fail "a b c\n4 4 4\n" [] "endOfInput"

"Parsing BlockQuote ab: \">7 3 1\n>4 4 4\n\""
Done "" [[7.0,3.0,1.0],[4.0,4.0,4.0]]
"Parsing BlockQuote a_b: \">7 3 1\n>\n>4 4 4\n\""
Done "" [[7.0,3.0,1.0],[],[4.0,4.0,4.0]]
"Parsing BlockQuote axb: \">7 3 1\n>a b c\n>4 4 4\n\""
Fail ">7 3 1\n>a b c\n>4 4 4\n" [] "Failed reading: empty"

Discussion

The notable difference comes in the semantics of failure. For instance, when parsing axb and blockquoted axb, which are the following two strings, respectively

7 3 1
a b c
4 4 4

and

> 7 3 1
> a b c
> 4 4 4

the markdown parse results in

Fail "a b c\n4 4 4\n" [] "endOfInput"

whereas the quoted results in

Fail ">7 3 1\n>a b c\n>4 4 4\n" [] "Failed reading: empty"

The markdown consumes "7 3 1\n", but this is nowhere reported in the quoted failure. Instead, fail becomes all or nothing.

Likewise, there is no allowance for handling unparsed text in the case of partial success. But I don't see a need for this, given the use case. For example, if a parse looked something like the following

"{ <tok> unhandled }more to parse"

where {} denotes the recognized block quote context, and <tok> is parsed within that inner context. A partial success then would have to lift "unhandled" out of that block quote context and somehow combine it with "more to parse".

I see no general way to do this, but it is allowed through choice of inner parser return type. For instance, by some parser parseOf blockP innP :: Parser (<tok>,Maybe Text). However, if this need arises I would expect that there is a better way to handle the situation than with nested parsers.

There may also be concerns about the loss of attoparsec Partial parsing. That is, the implementation of constP uses parseOnly, which collapses the parse return Fail and Partial into a single Left failure state. In other words, we lose the ability to feed our inner parser with more text as it becomes available. However, note that text to parse is itself the result of an outer parse; it will only be available after enough text has been fed to the outer parse. So this shouldn't be an issue either.