Parsec separator / terminator


Apparently I'm too dumb to figure this out...

Consider the following string:

foobar(123, 456, 789)

I'm trying to work out how to parse this. In particular,

call = do
  cs <- many1 letter
  char '('
  as <- many argument
  return (cs, as)

argument = manyTill anyChar (char ',' <|> char ')')

This works perfectly, until I add stuff to the end of the input string, at which point it tries to parse that stuff as the next argument, and gets upset when it doesn't end with a comma or bracket.

Fundamentally, the trouble is that a comma is a separator, while a bracket is a terminator. Parsec doesn't appear to provide a combinator for that.

Just to make things more interesting, the input string can also be

foobar(123, 456, ...

which indicates that the message is incomplete. There appears to be no way of parsing a sequence with two possible terminators and knowing which one was found. (I actually want to know whether the argument list was complete or incomplete.)

Can anyone figure out how I climb out of this?

Answer by kosmikus (best answer)

You should exclude your separator/terminator characters from the allowed characters for a function argument. Also, you can use between and sepBy to make the difference between separators and terminators clearer:

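import Text.Parsec
import Text.Parsec.String (Parser)

call :: Parser (String, [String])
call = do
  cs <- many1 letter                            -- function name
  as <- between (char '(') (char ')')           -- ')' terminates the list
      $ sepBy (many1 (noneOf ",)")) (char ',')  -- ',' separates the arguments
  return (cs, as)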

However, this is probably still not what you want, because it doesn't handle whitespace properly. You should look at Text.Parsec.Token for a more robust way to do this.
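
As a rough illustration of the Text.Parsec.Token route, here is a minimal sketch. It assumes the arguments are natural numbers, as in the example input, and it uses the default emptyDef language definition; the names lexer and callTok are made up for this example and are not part of Parsec.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)
import Text.Parsec.Language (emptyDef)
import qualified Text.Parsec.Token as Tok

-- A token parser built from the default language definition.
-- Every parser taken from it skips trailing whitespace.
lexer :: Tok.TokenParser ()
lexer = Tok.makeTokenParser emptyDef

-- Hypothetical whitespace-tolerant variant of the call parser.
-- Assumption: arguments are natural numbers.
callTok :: Parser (String, [Integer])
callTok = do
  name <- Tok.identifier lexer
  as   <- Tok.parens lexer (Tok.commaSep lexer (Tok.natural lexer))
  return (name, as)
```

With these definitions, parse callTok "" "foobar(123, 456, 789)" should give Right ("foobar",[123,456,789]). Leading whitespace is not skipped; put Tok.whiteSpace lexer in front if the input may start with blanks.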

Edit

With the "..." addition, things indeed become a bit awkward. I don't think it fits nicely into any of the predefined combinators, so we'll have to write it ourselves.

Let's define a type for our results:

data Args = String :. Args | Nil | Dots
  deriving Show

infixr 5 :.

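That's like a list, but it has two different kinds of "empty list": Nil for an argument list closed by a bracket, and Dots for one cut off by "...". Of course, you can also use ([String], Bool) as a result type, but I'll leave that as an exercise. The following assumes we have this import (on GHC 7.10 and later these operators are already exported by the Prelude, so it is optional):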

import Control.Applicative ((<$>), (<*>), (<$), (*>))

The parsers become:

call = do
  cs <- many1 letter
  char '('
  as <- args
  return (cs, as)

args =
      (:.) <$> arg <*> argcont
  <|> Dots <$ string "..."

arg = many1 (noneOf ".,)")  -- '.' is excluded so that "..." is never consumed as an argument

argcont =
      Nil <$ char ')'
  <|> char ',' *> args

This handles everything fine except whitespace, for which my original recommendation to look at token parsers remains.

Let's test:

GHCi> parseTest call "foobar(foo,bar,baz)"
("foobar","foo" :. ("bar" :. ("baz" :. Nil)))
GHCi> parseTest call "foobar(1,2,..."
("foobar","1" :. ("2" :. Dots))
GHCi> parseTest ((,) <$> call <*> call) "foo(1)bar(2,...)"
(("foo","1" :. Nil),("bar","2" :. Dots))
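
For the ([String], Bool) result type mentioned above as an exercise, one possible sketch follows. The names callFlag and argsFlag are made up for this example. The Bool is True when the list was closed by a bracket and False when it was cut off by "...". It assumes a GHC recent enough that <$ and *> come from the Prelude.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical variant returning (arguments, complete?).
callFlag :: Parser (String, ([String], Bool))
callFlag = do
  name <- many1 letter
  _    <- char '('
  as   <- argsFlag
  return (name, as)

-- Either the list is cut off by "...", or we read one argument and
-- then look at what follows it: ')' ends the list, ',' continues it.
argsFlag :: Parser ([String], Bool)
argsFlag =
      ([], False) <$ string "..."
  <|> do a <- many1 (noneOf ".,)")
         (rest, complete) <- ([], True) <$ char ')' <|> (char ',' *> argsFlag)
         return (a : rest, complete)
```

Like the Args version, this does not handle whitespace or an empty argument list.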