Question
I know Parsec
and uu-parsinglib
and I've written parsers in both of them. Recently I discovered, that there is a problem in uu-parsinglib
, which could significantly affect its performance and I do not see a way to solve it.
Lets consider following Parsec parsers:
pa = char 'a'
pb = char 'b'
pTest = many $ try(pa <* pb)
What would be the equivalent in uu-parsinglib
? It would not be the following:
pa = pSym 'a'
pb = pSym 'b'
pTest = pList_ng (pa <* pb)
The difference is, that in Parsec
, many
would eat (pa <* pb)
(pairs of "ab"
) greedy until it is not matched anymore, while in uu-parsinglib
, pList_ng
is not greedy, so it will keep in memory possible backtrack ways after parsing each (pa <* pb)
.
Is there any way to write something like pList(try(pa <* pb))
in uu-parsinglib
?
Example
A good example would be
pExample = pTest <* (pa <* pb)
and a sample input of "ababab"
.
With Parsec
, we would get error (because pTest
is greedy parsing pairs of "ab"
), but in uu-parsinglib
, the string would be parsed with no problems.
Edit
We cannot switch from pList_ng
to pList
, because it would be not equivalent to Parsec
version. For example:
pExample2 = pTest <* pa
and a sample input of "ababa"
would success in Parsec
, but fail in uu-parsinglib
, when using greedy pList
.
Of course uu-parsinglib
will succeed if we use pList_ng
here, but it could be a lot slower for some inputs and rules. For example, considering the input "ababab"
, Parsec
would simply fail, because pTest
would consume whole string and pa
would not be matched. uu-parsinglib
will fail also, but checking a more steps - it will match whole string and fail, then throw away last "ab"
pair and fail again, etc. If we have some nested rules and funny input text, it will make a huge difference.
A little benchmark
To prove, that the problem exists in real, consider following grammar (in a pseudocode - but I think it is very intuitive):
pattern = many("ab") + "a"
input = many(pattern)
So as input to our program we get a string containing patterns, for example "abababaaba" contains 2 patterns "abababa" and "aba".
Lets make parsers in both libraries!
uu-parsinglib
:
import Data.Char
import qualified Text.ParserCombinators.UU as UU
import Text.ParserCombinators.UU hiding(parse)
import Text.ParserCombinators.UU.BasicInstances hiding (Parser)
import System.TimeIt (timeIt)
pa = pSym 'a'
pb = pSym 'b'
pTest = pList $ pList_ng ((\x y -> [x,y]) <$> pa <*> pb) <* pa
main :: IO ()
main = do
timeIt maininner
return ()
maininner = do
let (x,e) = UU.parse ((,) <$> pTest <*> pEnd) (createStr (LineColPos 0 0 0) (concat $ replicate 1000 (concat (replicate 1000 "ab") ++ "a")))
print $ length x
Parsec
:
import Control.Applicative
import Text.Parsec hiding (many, optional, parse, (<|>))
import qualified Text.Parsec as Parsec
import System.TimeIt (timeIt)
pa = char 'a'
pb = char 'b'
pTest = many $ many (try ((\x y -> [x,y]) <$> pa <*> pb)) <* pa
main :: IO ()
main = do
timeIt maininner2
return ()
maininner2 = do
let Right x = Parsec.runParser pTest (0::Int) "test" $ (concat $ replicate 1000 (concat (replicate 1000 "ab") ++ "a"))
print $ length x
Result? uu-parsinglib
is > 300% slower:
uu-parsinglib - 3.19s
Parsec - 1.04s
(compiled with ghc -O3
flag)
To understand the subtleties it is important to understand the differences between the try construct in Haskell and the non-greedy parsing strategy used in uu-parsinglib. Effectively the latter is a try which just looks ahead one symbol. In that respect it is less powerful than the try construct from Parsec, in which you specify that a specific construct has to be present completely. And then there is the underlying different overall strategy. Parsec uses a back-tracking strategy with explicit tries to commit, whereas uu-parsinglib uses a breadth-first strategy with an occasional single symbol look-ahead.
So it does not come as a surprise that there is a time difference between the two. In the Parsec case it is decided that the tried alternative does apply after having seen the complete construct (two symbols), whereas the greedy uu-parsinglib decides that this must be the right alternative after having successfully seen the first symbol. And this conclusion may be unjustified.
If one moves to the breadth-first strategy uu-parsinglib uses one has to keep track of several alternatives at the same time, and this take time. Two alternative, twice the time, etc.
The advantage of Parsec is that you can tune a back-tracking parser by liberal use of try constructs and by placing alternatives in the right order, but you are also more likely to ask questions on mailing list about why your parsers do not work as expected. You are not so much writing a grammar as writing parser. The uu-parsinglib starts from the other end of the spectrum: we try to accept quite a large collection of grammars (and the parsers implied by them).
My feeling is also that in the presence of try constructs having excellent error-repairing parsers is much more complicated. Once a try construct fails it is not possible to go back there and decide that, with a small correction, it is much better than the alternatives that come after it.