I am trying to parse (using parsec) a string that represents some data type that I defined. Thus the string needs to be parsed to my data type. An example of the string would be,
[(1,[(<,0),(%,4)]), (2,[(>=, 4)])]
This would parse to the following,
[(Reg 1, [Cmp (Jlt, Intv (0, 0)), Op (Mod, Intv (-4,4))]), (Reg 2, [Cmp (Jge, (4,4))])]
Now this makes use of a few custom data types,
newtype Reg = Reg Int deriving (Eq, Show, Ord)
data LF = Op (BinAlu, Interval) | Cmp (Jcmp, Interval) | Invalid
deriving (Eq, Show, Ord)
data BinAlu
= Add
| Sub
| Mul
| Div
| Or
| And
| Lsh
| Rsh
| Mod
| Xor
| Mov
| Arsh
deriving (Eq, Show, Ord, Enum)
data Jcmp = Jeq | Jgt | Jge | Jlt | Jle | Jset | Jne | Jsgt | Jsge | Jslt | Jsle
deriving (Eq, Show, Ord, Enum)
data Interval = Bot | Intv (Int, Int)
deriving (Eq, Show, Ord)
Thus I want to parse the string to be the following type [(Reg, [LF])]
Now I am quite lost on how to actually do this. I think I have an idea, but I find it difficult implementing that idea.
My idea is to first use, between (symbol "[") (symbol "]"), which hopyfully would give me the contents between [ and ]. Then I need to do something similar for the paranthesis but repeat it. And then of course parsing the contents within the paranthesis.
I am basically looking for any advice on how to setup this parser. And how in general to structure such a parser.
Any help is greatly appreciated!
The following should get you started. We'll need some imports:
In order to properly process whitespace, you should begin by writing some combinators to handle "lexemes", parsers that expect to start at a non-whitespace character, parse something, and discard trailing whitespace. While Parsec has some lexeme support in
Text.Parsec.Token, it's over-designed and hard to use. Here's a simplified alternative, based on the Megaparsec approach:The following are pretty standard lexemes for parsing numbers:
And here are some other pretty standard lexemes/combinators:
Here's a helper for lists, since you use it in a couple places.
Now, we should define the lowest-level "atoms" of your grammar:
For
binaluandjcmp, a simple first attempt might look like this:and this is sufficient to parse your example input. However, there's a problem here when you flesh these out with all your desired operators. The parser
symbol "<"for example, will happily parse the first character of"<=", leaving the"="to cause an error when you try to parse the comma next. If you order the alternatives to try"<="first:this still isn't sufficient, because
symbol "<="will happily start parsing a"<"not followed by a"="and then "fail after consuming input", which prevents any later alternatives from being tried. You can use thetrycombinator to continue anyway:but this is tedious to get right. The usual solution is to define a list of "operator characters":
and define an
operatorcombinator (note: Parsec calls this combinatorreservedOp) that parses an operator followed by something other than an operator character:Now, you can list operators in any order, and they'll work fine:
Finally, we can define the grammar for your higher-level structures. Note that the topmost parser should ignore leading whitespace, as all the lexeme parsers expect to start with non-whitespace, and check for end-of-input.
Here's a test on your proposed input:
which should produce the output:
Full code: