Learning Haskell `parsec`: trying to rewrite `words` function as a basic exercise


This is an extremely basic question and I honestly feel a bit silly writing it.

TL;DR: How can I write a function that uses the parsec library to mimic the behavior of the words function from Data.List? An example of the intended behavior:

wordsReplica "I love lamp" = ["I","love","lamp"]

I just read the first couple pages of the Parsec chapter from Real World Haskell and it would be incredibly helpful to understand what constitutes a bare-minimum parsing function (one that does more than return the argument or return nothing). (RWH's introductory example shows how to parse a multi-line CSV file...)

As such, I thought it'd be a useful, basic exercise to rewrite words using parsec... It's turning out to be not so basic (for me)...

The following is my attempt; unfortunately it generates an "unexpected end of input" error (at runtime) no matter what I give it. I've tried reading the descriptions/definitions of the simple functions in the parsec library on haskell.org, but they aren't that illustrative, at least for someone who's never done parsing of any kind before, including in other languages.

testParser :: String -> Either ParseError [[String]]
testParser input = parse wordsReplica "(unknown)" input
  where
    wordsReplica = endBy 
                    (sepBy 
                      (many (noneOf " "))
                      (char ' '))
                    (char ' ')

(Please pardon the lisp-y, non-pointfree presentation - when I'm learning about a new function, it helps me if I make the notation/structure super explicit.)

Update:
Here's something that's a step in the right direction (but still not quite there as it doesn't do numbers):

λ: let wordsReplica = sepBy (many letter) (char ' ')
λ: parse wordsReplica "" "i love lamp 867 5309"
Right ["i","love","lamp",""]

Update 2:

Seems like this function gets the job done, though I'm not sure how idiomatic it is:

λ: let wordsReplica = sepBy (many (satisfy (not . isSpace))) (char ' ')
wordsReplica :: Stream s m Char => ParsecT s u m [[Char]]

λ: parse wordsReplica "" "867 5309 i love lamp %all% !(nonblanks are $$captured$$"

Right ["867","5309","i","love","lamp","%all%","!(nonblanks","are","$$captured$$"]
it :: Either ParseError [[Char]]

There is 1 answer

Zeta On BEST ANSWER

Update 2:

Seems like this function gets the job done, though I'm not sure how idiomatic it is.

It's fine, but it doesn't work as you intend:

> words "Hello      world"
["Hello","world"]

> parse wordsReplica "" "Hello      world"
Right ["Hello","","","","","","world"]

Not quite what you want. After all, a word should consist of at least one character. But if you change many to many1, you will notice another error:

> parse wordsReplicaMany1 "" "Hello      world"
Left (line 1, column 7):
unexpected " "

That's because your separating parser isn't greedy enough. Instead of parsing a single space, parse as many as you can:

nonSpace      = satisfy $ not . isSpace
wordsReplica' = many1 nonSpace `sepBy` spaces
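
For completeness, here is a sketch of how the definitions above could look in a source file rather than in GHCi. The type signatures are included because top-level definitions without them can run into the monomorphism restriction. One caveat, as far as I can tell: wordsReplica' still differs from words on leading or trailing whitespace (" Hello" gives an empty list, and "Hello " fails, because spaces has already consumed input when the next word is missing). The names wordsTrimmed and parseWords are hypothetical and only illustrate one possible way to handle that; they are not part of parsec.

```haskell
import Data.Char (isSpace)
import Text.Parsec (ParseError, eof, many, many1, parse, satisfy, sepBy, spaces)
import Text.Parsec.String (Parser)

-- A single character that is not whitespace.
nonSpace :: Parser Char
nonSpace = satisfy (not . isSpace)

-- The parser from above: non-empty words separated by runs of whitespace.
wordsReplica' :: Parser [String]
wordsReplica' = many1 nonSpace `sepBy` spaces

-- Hypothetical variant (an assumption, not from the original answer):
-- skip leading whitespace, let every word consume the whitespace that
-- follows it, and require that the whole input has been consumed.
wordsTrimmed :: Parser [String]
wordsTrimmed = spaces *> many (many1 nonSpace <* spaces) <* eof

-- Hypothetical helper that runs wordsTrimmed on a plain String.
parseWords :: String -> Either ParseError [String]
parseWords = parse wordsTrimmed ""
```

With these definitions, parseWords "  Hello      world  " should give Right ["Hello","world"], and parseWords "" should give Right [], matching words on the same inputs.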