Performance of uu-parsinglib compared to "try" in Parsec

1.3k views Asked by At

Question

I know Parsec and uu-parsinglib and I've written parsers in both of them. Recently I discovered, that there is a problem in uu-parsinglib, which could significantly affect its performance and I do not see a way to solve it.

Lets consider following Parsec parsers:

pa = char 'a'
pb = char 'b'
pTest = many $ try(pa <* pb)

What would be the equivalent in uu-parsinglib? It would not be the following:

pa = pSym 'a'
pb = pSym 'b'
pTest = pList_ng (pa <* pb)

The difference is, that in Parsec, many would eat (pa <* pb) (pairs of "ab") greedy until it is not matched anymore, while in uu-parsinglib, pList_ng is not greedy, so it will keep in memory possible backtrack ways after parsing each (pa <* pb).

Is there any way to write something like pList(try(pa <* pb)) in uu-parsinglib?

Example

A good example would be

pExample = pTest <* (pa <* pb)

and a sample input of "ababab".

With Parsec, we would get error (because pTest is greedy parsing pairs of "ab"), but in uu-parsinglib, the string would be parsed with no problems.

Edit

We cannot switch from pList_ng to pList, because it would be not equivalent to Parsec version. For example:

pExample2 = pTest <* pa

and a sample input of "ababa" would success in Parsec, but fail in uu-parsinglib, when using greedy pList.

Of course uu-parsinglib will succeed if we use pList_ng here, but it could be a lot slower for some inputs and rules. For example, considering the input "ababab", Parsec would simply fail, because pTest would consume whole string and pa would not be matched. uu-parsinglib will fail also, but checking a more steps - it will match whole string and fail, then throw away last "ab" pair and fail again, etc. If we have some nested rules and funny input text, it will make a huge difference.

A little benchmark

To prove, that the problem exists in real, consider following grammar (in a pseudocode - but I think it is very intuitive):

pattern = many("ab") + "a"
input   = many(pattern)

So as input to our program we get a string containing patterns, for example "abababaaba" contains 2 patterns "abababa" and "aba".

Lets make parsers in both libraries!

uu-parsinglib:

import Data.Char
import qualified Text.ParserCombinators.UU      as UU
import Text.ParserCombinators.UU                hiding(parse)
import Text.ParserCombinators.UU.BasicInstances hiding (Parser)

import System.TimeIt (timeIt)

pa = pSym 'a'
pb = pSym 'b'
pTest = pList $ pList_ng ((\x y -> [x,y]) <$> pa <*> pb) <* pa

main :: IO ()
main = do
    timeIt maininner
    return ()

maininner = do
    let (x,e) = UU.parse ((,) <$> pTest <*> pEnd) (createStr (LineColPos 0 0 0) (concat $ replicate 1000 (concat (replicate 1000 "ab") ++ "a")))
    print $ length x

Parsec:

import           Control.Applicative
import           Text.Parsec          hiding (many, optional, parse, (<|>))
import qualified Text.Parsec          as Parsec

import System.TimeIt (timeIt)

pa = char 'a'
pb = char 'b'
pTest = many $ many (try ((\x y -> [x,y]) <$> pa <*> pb)) <* pa

main :: IO ()
main = do
    timeIt maininner2
    return ()

maininner2 = do
    let Right x = Parsec.runParser pTest (0::Int) "test" $ (concat $ replicate 1000 (concat (replicate 1000 "ab") ++ "a"))
    print $ length x

Result? uu-parsinglib is > 300% slower:

uu-parsinglib - 3.19s
Parsec        - 1.04s

(compiled with ghc -O3 flag)

2

There are 2 answers

0
Doaitse Swierstra On BEST ANSWER

To understand the subtleties it is important to understand the differences between the try construct in Haskell and the non-greedy parsing strategy used in uu-parsinglib. Effectively the latter is a try which just looks ahead one symbol. In that respect it is less powerful than the try construct from Parsec, in which you specify that a specific construct has to be present completely. And then there is the underlying different overall strategy. Parsec uses a back-tracking strategy with explicit tries to commit, whereas uu-parsinglib uses a breadth-first strategy with an occasional single symbol look-ahead.

So it does not come as a surprise that there is a time difference between the two. In the Parsec case it is decided that the tried alternative does apply after having seen the complete construct (two symbols), whereas the greedy uu-parsinglib decides that this must be the right alternative after having successfully seen the first symbol. And this conclusion may be unjustified.

If one moves to the breadth-first strategy uu-parsinglib uses one has to keep track of several alternatives at the same time, and this take time. Two alternative, twice the time, etc.

The advantage of Parsec is that you can tune a back-tracking parser by liberal use of try constructs and by placing alternatives in the right order, but you are also more likely to ask questions on mailing list about why your parsers do not work as expected. You are not so much writing a grammar as writing parser. The uu-parsinglib starts from the other end of the spectrum: we try to accept quite a large collection of grammars (and the parsers implied by them).

My feeling is also that in the presence of try constructs having excellent error-repairing parsers is much more complicated. Once a try construct fails it is not possible to go back there and decide that, with a small correction, it is much better than the alternatives that come after it.

4
raymonad On

The behavior you're describing (using pList_ng) does apply to other parsers (such as the simple list-of-successes method described in, for example, Jeroen Fokker's Functional Parsers) combinators), but not to uu-parsinglib. The library uses a breadth-first strategy to avoid space leaks (as a result of hanging on to the entire input, as would be the case when using a depth-first strategy). That is why I asked whether you had either created a test case or looked at the internals at all.

For a more technical description, see the paper in Text.ParserCombinators.UU.README (and maybe the source code after that). Here I'll use pExample2 to sketch the parsing process. Branching happens in pList_ng (in pTest), which uses <|> to recognize either the empty string or another element. Because pTest is followed by pa, instead of the empty string, the alternative to parsing another element is actually recognizing a single 'a'.

When we see the first 'a' in the input, both alternatives can successfully parse this character. Next, we see a 'b'. Now the alternative that parses only a single 'a' cannot make any further progress, so we drop that one. There is one alternative left: the one that recognizes (a list of) 'a' followed by 'b' (pTest). Next up is another 'a', and there are again two alternatives to consider. But then we see a 'b' and, again, we can immediately drop the second alternative. Then there is one last characer, an 'a', which once more means two alternatives. But now we get to the end of the string and only the alternative obtained by letting pa recognize the final a leads to a succesful parse.

Considering the alternative input "ababab", we see that the pa alternative fails again when we get to the final 'b', so only the pTest alternative remains. That one finishes because we get to the end and then pa (following pTest in pExample2) fails, so we get an error.

At any point, uu-parsinglib only needs to keep alternatives in memory that have not yet failed, and the breadth-first strategy ensures that all alternatives are evaluated in lockstep, so there is no need to hang on to the first 'a' and 'b' until the end of the string is reached and it does not match the whole string first before backtracking.

Edit:

From what I gather about Parsec, this is indeed less efficient, because Parsec does not consider pa until pTest finishes. Section 5.1 in the paper says a few words about something like try for uu-parsinglib, including some objections. Raw speed is not a primary goal and I once saw a presentation of some benchmarks where uu-parsinglib also did not come out on top, but overall it performed reasonably well. If speed is so important for this compiler you mentioned and if you don't need any extra features such as online results or error correction, maybe you should just stick to Parsec? (Or look for more comprehensive benchmarks first.)

There are clearly significant differences between the two libraries, so I'm not surprised to see that Parsec is faster in some cases, although the difference in this case is indeed pretty big. Maybe there is a way to tweak the uu-parsinglib version further (without changing internals as hinted at by that section about greedy parsing in the paper), but it's not obvious (to me anyway).

Well, one thing you could do is rewrite the grammar:

pTest' = pList $ pa *> pList ((\x y -> [x,y]) <$$> pb <*> pa)

For Parsec that would the following I think (but this seems to make it slower):

pTest' = many $ pa *> many (flip (\x y -> [x,y]) <$> pb <*> pa)

This helps, but not enough to beat the Parsec version. Using your benchmark, I get the following results:

 uu-parsinglib (pTest)  - 1.98s
 uu-parsinglib (pTest') - 1.11s
 Parsec (pTest)         - 0.59s
 Parsec (pTest')        - 0.67s