parsec running out of memory

329 views Asked by At

I wrote a parser for a large csv file which works on a smaller subset but runs out of memory for ~1.5m lines (the actual file). After initially parsing all elements into a list(using manyTill), i instead used the parser state to store them in a single binary search tree - this worked for the large file.

i have since split the "element type" in three separate types and want to store them in their own tree, resulting in three trees of different type. This version, though, only works for the small test file while running out of memory for the larger one.

import qualified Data.Tree.AVL as AVL
import qualified Text.ParserCombinators.Parsec as Parsec
----
data ENW = ENW (AVL.AVL Extent) (AVL.AVL Node) (AVL.AVL Way)
---- used to be Element = Extent | Node | Way  in a (Tree Element) - this worked
csvParser :: Parsec String ENW ENW
csvParser = do (Parsec.manyTill (parseL) Parsec.eof) >> Parsec.getState
    where parseL = parseLine >> ((Parsec.newline >> return ()) <|> Parsec.eof)

parseLine :: Parsec String ENW ()
parseLine = parseNode <|> parseWay <|> parseExtents

parseNode :: Parsec String ENW ()
parseNode = Parsec.string "node" *> (flip addNode <$> (Node <$> identifier <*> float <*> float)) >>= Parsec.updateState
    where identifier = Parsec.tab *> (read <$> Parsec.many1 Parsec.digit)
          float      = Parsec.tab *> (read <$> parseFloat)

addNode :: ENW -> Node -> ENW
addNode (ENW e n w) node  = (ENW e (AVL.push (sndCC node) node n) w)

parseWay and parseExtent follow the same pattern and the whole thing is started with

Parsec.runParser csvParser (ENW AVL.empty AVL.empty AVL.empty) "" input

i dont understand how using three smaller trees instead of a single large one can cause memory issues.

1

There are 1 answers

2
Axman6 On

Do you have a good reason to not use Cassava? It can be used to stream CSV data and is probably more robust than an ad hoc CSV parser. My own experience with it has shown it has excellent performance and can be easily extended to parse your own types.

Edit: It also looks like you're working with tab separated value data, not comma separated data, but Cassava lets you specify what delimiter to split columns by.It also appears that the data you have is potentially different on each line so you may need to use Cassava's 'raw' format which returns a Vector ByteString for each line, which you can then parse based on the first element.

I've never seen anyone use the AVL tree package before, is there a good reason you aren't using more standard structures? That package is quite old (Last updated in 2008) and more recent packages are likely to perform better.