parsec running out of memory

Question

parsec running out of memory

320 views Asked by CryptoCamel At 14 June 2015 at 03:25

I wrote a parser for a large csv file which works on a smaller subset but runs out of memory for ~1.5m lines (the actual file). After initially parsing all elements into a list(using manyTill), i instead used the parser state to store them in a single binary search tree - this worked for the large file.

i have since split the "element type" in three separate types and want to store them in their own tree, resulting in three trees of different type. This version, though, only works for the small test file while running out of memory for the larger one.

import qualified Data.Tree.AVL as AVL
import qualified Text.ParserCombinators.Parsec as Parsec
----
data ENW = ENW (AVL.AVL Extent) (AVL.AVL Node) (AVL.AVL Way)
---- used to be Element = Extent | Node | Way  in a (Tree Element) - this worked
csvParser :: Parsec String ENW ENW
csvParser = do (Parsec.manyTill (parseL) Parsec.eof) >> Parsec.getState
    where parseL = parseLine >> ((Parsec.newline >> return ()) <|> Parsec.eof)

parseLine :: Parsec String ENW ()
parseLine = parseNode <|> parseWay <|> parseExtents

parseNode :: Parsec String ENW ()
parseNode = Parsec.string "node" *> (flip addNode <$> (Node <$> identifier <*> float <*> float)) >>= Parsec.updateState
    where identifier = Parsec.tab *> (read <$> Parsec.many1 Parsec.digit)
          float      = Parsec.tab *> (read <$> parseFloat)

addNode :: ENW -> Node -> ENW
addNode (ENW e n w) node  = (ENW e (AVL.push (sndCC node) node n) w)

parseWay and parseExtent follow the same pattern and the whole thing is started with

Parsec.runParser csvParser (ENW AVL.empty AVL.empty AVL.empty) "" input

i dont understand how using three smaller trees instead of a single large one can cause memory issues.

Original Q&A

There are 1 answers

**Axman6** · Answer 1 · 2015-06-14T04:10:51+00:00

Do you have a good reason to not use Cassava? It can be used to stream CSV data and is probably more robust than an ad hoc CSV parser. My own experience with it has shown it has excellent performance and can be easily extended to parse your own types.

Edit: It also looks like you're working with tab separated value data, not comma separated data, but Cassava lets you specify what delimiter to split columns by.It also appears that the data you have is potentially different on each line so you may need to use Cassava's 'raw' format which returns a Vector ByteString for each line, which you can then parse based on the first element.

I've never seen anyone use the AVL tree package before, is there a good reason you aren't using more standard structures? That package is quite old (Last updated in 2008) and more recent packages are likely to perform better.

TechQA.

parsec running out of memory

There are 1 answers

Related Questions in HASKELL

Related Questions in OUT-OF-MEMORY

Related Questions in PARSEC

Popular Questions

Popular Tags

Trending Questions