Take strings as tokens parsing with Parsec

Question

Take strings as tokens parsing with Parsec

148 views Asked by 00M At 05 March 2023 at 19:45

I am using Parsec library to parse a string. The problem I have is that can't differentiate some tokens because they are words with the same prefix. Simplifying the whole grammar (in my case it's not a regular one), say we have the following:

T0 := <empty> | tag0 T1 T0
T1 := tag1 | tag1 T1

So I may have strings like "tag0tag1tag1" or like "tag0tag1tag0tag1" etc., basically we have a "tag0" string followed by an arbitrary (not zero) number of "tag1" string, and all this could also be repeated any number of times.

So what I tried was something like:

wrongParser :: Parser String
wrongParser = do
    string "tag0"
    many $ string "tag1"
    return "Ok"

And tested with

ghci> parse wrongParser "Error" "tag0tag1tag1tag0tag1"
Left "Error" (line 1, column 13):
unexpected "0"
expecting "tag1"

So what seems to happen here is that the parser read "tag" from "tag0", but it is expecting "tag1" instead (because is still reading many of "tag1").

Is there a way to make the parser to take the tag string as a whole so that instead of failing it just assumes that all the many tag1 are already read and stop with no error (maybe another function than string)? Or what is the correct way to handle this case?

Original Q&A

There are 1 answers

**lsmor** · Accepted Answer · 2023-03-06T11:38:07+00:00

This is a common misconception with Parsec:

A parser consumes char by char, by default. (these are called tokens in Parsec)
A failure in Parsec is a parser which comsumes less output than expected
Parsers do not backtrack unless you specify so.

Therefore, your parser works like this

string to consume: "tag0tag1tag1tag0tag1"
string consumed: ""

-- *********************
-- Parser: string "tag0"
-- *********************

-- 1st token
string to consume: "ag0tag1tag1tag0tag1"
string consumed: "t"

-- 2nd token
string to consume: "g0tag1tag1tag0tag1"
string consumed: "ta"

-- 3th token
string to consume: "0tag1tag1tag0tag1"
string consumed: "tag"

-- 4th token
string to consume: "tag1tag1tag0tag1"
string consumed: "tag0"

-- ***********************
-- done with string "tag0"
-- ***********************

-- ****************************
-- Parser: many $ string "tag1" 
-- ****************************

-- convince yourself you'll reach this point
string to consume: "tag0tag1"
string consumed: "tag0tag1tag1"

-- keep in mind we are executing (string "tag1")
-- because you put it inside many.
-- So you end up with 
string to consume: "0tag1"
string consumed: "tag0tag1tag1tag"

-- after this, you consume a 0 and fail.
string to consume: "tag1"
string consumed: "tag0tag1tag1tag" 0 ???

Probably you want to say, "if you fail reading tag1, just pretend you haven't read it at all". This is called backtracking and you use the try combinator to make it happen

rightParser :: Parser String
rightParser = do
  string "tag0"
  many1 $ try (string "tag1") -- many1 because tag0 must be followed by at least one tag1. try because you want to backtrack tag1 if failure.
  return "Ok"

Now if you want to parse the whole string you use many rightParser. For the sake of example, let's say you want to return all "tag1"s. Then

rightParser :: Parser [String]
rightParser = do
  string "tag0"
  many1 $ try (string "tag1") -- same as before, by returns a list of tag1 instead of "Ok"

wholeParser = many rightParser

main = print $ parse wholeParser "Error" "tag0tag1tag1tag0tag1"

> Right [["tag1","tag1"],["tag1"]]

Notice, that other libraries (ex: Attoparsec) do always backtrack on failure. This is a design decision made by each library. Also, backtracking can be an expensive operation, so you may want to write your parser differently (example: always parse "tag" and backtrack only on "0" or "1")

TechQA.

Take strings as tokens parsing with Parsec

There are 1 answers

Related Questions in HASKELL

Related Questions in PARSEC

Popular Questions

Trending Questions