In fact, I'm pretty sure I'm using the wrong terminology. Here is the problem I want to solve: a parser for the Markdown format, or rather a subset of it.
My problem is with the blockquote feature. Each line in a blockquote starts with >; otherwise everything has the normal structure of a Markdown file.
You can't look at individual lines separately, because you need to tell paragraph breaks apart from plain line breaks, e.g.
> a
> b
is not the same as
> a
>
> b
and things like that (likewise, if a list is blockquoted you don't want x lists but one list with x elements). A natural and trivial way is to "take off" the > signs, parse the blockquote on its own, ignoring anything around it, wrap it with a BlockQuote type constructor, put that in the outer AST, and resume parsing of the original input. It's what pandoc does, if I'm not wrong:
    blockQuote :: MarkdownParser (F Blocks)
    blockQuote = do
      raw <- emailBlockQuote
      -- parse the extracted block, which may contain various block elements:
      contents <- parseFromString parseBlocks $ (intercalate "\n" raw) ++ "\n\n"
      return $ B.blockQuote <$> contents
And then:
http://hackage.haskell.org/package/pandoc-1.5.1/docs/src/Text-Pandoc-Shared.html#parseFromString
    -- | Parse contents of 'str' using 'parser' and return result.
    parseFromString :: GenParser tok st a -> [tok] -> GenParser tok st a
    parseFromString parser str = do
      oldPos <- getPosition
      oldInput <- getInput
      setInput str
      result <- parser
      setInput oldInput
      setPosition oldPos
      return result
Now parseFromString looks quite hacky to me, and besides that it's Parsec, not attoparsec, so I can't use it in my project. I'm not sure how I could take that Text from the blockquote, parse it, and return the parsing result so that it "fits" within the current parse. It seems impossible?
I've been googling the issue and I think that pipes-parse and conduit can help in that area, although I struggle to find examples, and what I see looks considerably less nice than "pure" parsec/attoparsec parsers.
Other options to parse blockquotes would be to rewrite the usual parsers with the > catch built in, which complicates and duplicates a lot; to parse blockquotes line by line and write some messy "merge" function; or to parse to a first AST that contains the blockquotes as Text inside a first BlockquoteText type constructor, waiting for a transformation in which they are parsed separately. The last is not very elegant, but it has the benefit of simplicity, which does count for something.
I would probably go for the latter, but surely there's a better way?
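For concreteness, here is a minimal sketch of that last, two-pass option using attoparsec. All names in it (RawBlock, RawQuote, firstPass, resolve, parseBlocks) are hypothetical, and the "markdown" it understands is deliberately tiny: only paragraphs and blockquotes, with blocks separated by empty lines. The first pass keeps each quote as raw Text; the second pass re-runs the same parser on that Text, so nested quotes come for free.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative (many, optional, (<|>))
import Data.Attoparsec.Text
  (Parser, char, endOfInput, endOfLine, isEndOfLine, many1, parseOnly, satisfy, takeTill)
import Data.Text (Text)
import qualified Data.Text as T

-- First-pass AST: a quote only carries its raw, de-prefixed text.
data RawBlock = RawPara Text | RawQuote Text deriving (Show, Eq)

-- Final AST: a quote carries parsed blocks.
data Block = Para Text | BlockQuote [Block] deriving (Show, Eq)

-- End of a line: a newline, or the end of the input for the last line.
eol :: Parser ()
eol = endOfLine <|> endOfInput

-- A quoted line with the leading '>' and one optional space removed.
quoteLine :: Parser Text
quoteLine = char '>' *> optional (char ' ') *> takeTill isEndOfLine <* eol

-- A non-empty line that does not start with '>'.
paraLine :: Parser Text
paraLine = T.cons <$> satisfy (\c -> c /= '>' && not (isEndOfLine c))
                  <*> takeTill isEndOfLine <* eol

rawBlock :: Parser RawBlock
rawBlock = RawQuote . T.unlines <$> many1 quoteLine
       <|> RawPara . T.intercalate "\n" <$> many1 paraLine

-- Pass 1: split the input into raw blocks separated by empty lines.
firstPass :: Parser [RawBlock]
firstPass = many endOfLine *> many (rawBlock <* many endOfLine) <* endOfInput

-- Pass 2: parse the text of each quote separately (recursively).
resolve :: RawBlock -> Either String Block
resolve (RawPara t)  = Right (Para t)
resolve (RawQuote t) = BlockQuote <$> parseBlocks t

parseBlocks :: Text -> Either String [Block]
parseBlocks t = parseOnly firstPass t >>= traverse resolve
```

With this sketch, the two inputs from the beginning of the question end up as one quoted paragraph and two quoted paragraphs, respectively, because the lone > line becomes an empty line once the prefix is stripped.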
I have asked myself the same question. Why is there no standard combinator for nested parsers like the one you describe? My default mode is to trust the package author, especially when that author also co-wrote "Real World Haskell". If such an obvious capability is missing, perhaps it is by design and I should look for a better way. However, I've managed to convince myself that such a convenient combinator is mostly harmless. It is useful whenever an all-or-nothing type of parser is appropriate for the inner parse.
Implementation
I've divided the required functionality into two parsers. The first, constP, performs an "in place" parse of some given text. It substitutes the constant parser's fail with empty (from Alternative), but otherwise has no other side effects.
The second part comes from parseOf, which performs the constant, inner parse based on the result of the outer parse. The empty alternative here allows a failed parse to return without consuming any input.
The block quote markdown can be written in the desired fashion. This implementation requires the resulting block to be totally parsed.
Instead of the actual markdown parser, I just implemented a quick parser of space-separated doubles. The complication of the parser comes from allowing the last non-empty line to either end in a newline or not.
These two parsers are responsible for returning the text internal to block quotes.
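The pieces described above could be sketched roughly as follows. This is a reconstruction under assumptions, not necessarily the original code: constP and parseOf follow the description in the text, while eol, quoteLine, quoteText, row, rows and quotedRows are hypothetical names chosen for this illustration, and placing endOfInput in the inner parser is one way to require the block to be totally parsed.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative (empty, optional, (<|>))
import Data.Attoparsec.Text
  (Parser, char, double, endOfInput, endOfLine, isEndOfLine, many1, parseOnly, sepBy1, takeTill)
import Data.Text (Text)
import qualified Data.Text as T

-- Run p on the given text "in place"; a failed inner parse becomes 'empty'.
constP :: Parser a -> Text -> Parser a
constP p t = either (const empty) return (parseOnly p t)

-- Use the text produced by the outer parser as the whole input of the inner one.
-- attoparsec backtracks on failure, so a failed inner parse consumes nothing.
parseOf :: Parser Text -> Parser a -> Parser a
parseOf outer inner = outer >>= constP inner

-- End of a line: a newline, or the end of the input for the last line.
eol :: Parser ()
eol = endOfLine <|> endOfInput

-- One quoted line with the leading '>' and one optional space removed.
quoteLine :: Parser Text
quoteLine = char '>' *> optional (char ' ') *> takeTill isEndOfLine <* eol

-- The text internal to a block quote, lines joined by newlines.
quoteText :: Parser Text
quoteText = T.intercalate "\n" <$> many1 quoteLine

-- Stand-in for the markdown parser: rows of space-separated doubles.
row :: Parser [Double]
row = double `sepBy1` char ' '

-- The last row may or may not end in a newline; the text must be fully consumed.
rows :: Parser [[Double]]
rows = row `sepBy1` endOfLine <* optional endOfLine <* endOfInput

quotedRows :: Parser [[Double]]
quotedRows = parseOf quoteText rows
```

In this sketch, a quote whose inner text is not entirely a grid of numbers makes quotedRows fail as a whole, without reporting how far the inner parser got.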
Finally, a test of the parser.
Discussion
The notable difference comes in the semantics of failure. For instance, when parsing axb and blockquotedaxb, which are the following two strings, respectively
and
the markdown parse results in
whereas the quoted results in
The markdown consumes "7 3 1\n", but this is nowhere reported in the quoted failure. Instead, fail becomes all or nothing.
Likewise, there is no allowance for handling unparsed text in the case of partial success. But I don't see a need for this, given the use case. For example, if a parse looked something like the following
where {} denotes the recognized block quote context, and <tok> is parsed within that inner context. A partial success then would have to lift "unhandled" out of that block quote context and somehow combine it with "more to parse".
I see no general way to do this, but it is allowed through choice of inner parser return type. For instance, by some parser parseOf blockP innP :: Parser (<tok>,Maybe Text). However, if this need arises I would expect that there is a better way to handle the situation than with nested parsers.
There may also be concerns about the loss of attoparsec Partial parsing. That is, the implementation of constP uses parseOnly, which collapses the parse returns Fail and Partial into a single Left failure state. In other words, we lose the ability to feed our inner parser with more text as it becomes available. However, note that the text to parse is itself the result of an outer parse; it will only be available after enough text has been fed to the outer parse. So this shouldn't be an issue either.
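To make the difference between incremental parsing and parseOnly concrete, here is a small sketch. The names describe and number are hypothetical; parse, feed, parseOnly, decimal and the IResult constructors are the ones exported by Data.Attoparsec.Text. With parse, a number at the end of a chunk yields Partial, because more digits might still arrive; parseOnly treats the given text as the complete input and never hands back a continuation.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Text (IResult (..), Parser, Result, decimal, feed, parse, parseOnly)

-- Summarise the three possible outcomes of an incremental parse.
describe :: Show a => Result a -> String
describe (Fail _ _ msg) = "Fail: " ++ msg
describe (Partial _)    = "Partial"
describe (Done rest a)  = "Done " ++ show a ++ ", leftover " ++ show rest

number :: Parser Int
number = decimal

-- Incremental: the parser waits for more input after "12".
incremental :: String
incremental = describe (parse number "12")

-- Feeding an empty chunk signals the end of the input.
finished :: String
finished = describe (feed (parse number "12") "")

-- parseOnly: the text is taken to be complete.
oneShot :: Either String Int
oneShot = parseOnly number "12"
```

Since constP only ever sees text that the outer parser has already fully extracted, the one-shot behaviour is the one that is wanted for the inner parse.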