Using parboiled2 to parse multiple lines instead of a String


I would like to use parboiled2 to parse multiple CSV lines instead of a single CSV String. The result would be something like:

val parser = new CSVRecordParser(fieldSeparator)
io.Source.fromFile("my-file").getLines().map(line => parser.record.run(line))

where CSVRecordParser is my parboiled2 parser for CSV records. The problem is that, as far as I can tell, this is not possible, because parboiled2 parsers take their input in the constructor, not in the run method. So I can either create a new parser for each line, which seems wasteful, or find a way to feed a new input to the same parser for every line. I tried to hack around this by making the input a mutable field and wrapping the parser in a singleton object:

import org.parboiled2._
import scala.util.Try

object CSVRecordParser {

  private object CSVRecordParserWrapper extends Parser with StringBuilding {

    val textBase = CharPredicate.Printable -- '"'
    val qTextData = textBase ++ "\r\n"

    // Mutable state, reassigned before every run -- this is the hack.
    var input: ParserInput = _
    var fieldDelimiter: Char = _

    // zeroOrMore of a Rule1[String] already collects a Seq[String],
    // so no explicit action is needed here.
    def record = rule { zeroOrMore(field).separatedBy(fieldDelimiter) }
    def field = rule { quotedField | unquotedField }
    def quotedField = rule {
      '"' ~ clearSB() ~ zeroOrMore((qTextData | '"' ~ '"') ~ appendSB()) ~ '"' ~ ows ~ push(sb.toString)
    }
    def unquotedField = rule { capture(zeroOrMore(textData)) }
    def textData = textBase -- fieldDelimiter

    def ows = rule { zeroOrMore(' ') }
  }

  def parse(input: ParserInput, fieldDelimiter: Char): Try[Seq[String]] = {
    CSVRecordParserWrapper.input = input
    CSVRecordParserWrapper.fieldDelimiter = fieldDelimiter
    CSVRecordParserWrapper.record.run()
  }
}

and then just call CSVRecordParser.parse(input, separator) whenever I want to parse a line. Besides being horrible, it doesn't work: I often get strange errors related to previous runs of the parser. I know this is not how a parser should be written with parboiled2, and I was wondering what the best way is to achieve what I want with this library.


2 Answers

David (best answer)

I've done this for CSV files of over 1 million records, in a project that requires high speed and low resource usage, and I find it works well to instantiate a new parser for each line.

I tried this approach after noticing that the parboiled2 readme mentions that parsers are extremely lightweight.

I have not even needed to increase the JVM memory or heap limits from their defaults. Instantiating a parser for each line works very well.
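
For illustration, here is a minimal sketch of that per-line approach. The LineParser class, its simplified quote-free rules, and the comma delimiter are my own assumptions, not code from the question:

import org.parboiled2._
import scala.util.{Failure, Success}

// Assumed minimal parser: parboiled2 parsers receive their input via the constructor.
class LineParser(val input: ParserInput, fieldDelimiter: Char) extends Parser {
  // Printable excludes control characters such as \r and \n.
  val textData = CharPredicate.Printable -- fieldDelimiter
  def record = rule {
    zeroOrMore(capture(zeroOrMore(textData))).separatedBy(fieldDelimiter) ~ EOI
  }
}

// One lightweight parser instance per line -- no shared mutable state.
io.Source.fromFile("my-file").getLines().foreach { line =>
  new LineParser(line, ',').record.run() match {
    case Success(fields) => println(fields)
    case Failure(error)  => println(s"could not parse '$line': $error")
  }
}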

dberry

Why not add an end-of-record rule to the parser?

def EOR = rule { "\r\n" | "\n" }

def record = rule { zeroOrMore(field).separatedBy(fieldDelimiter) ~ EOR }

Then you can pass in as many lines as you want.
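
For illustration, a minimal sketch of that idea, assuming a constructor-based parser rather than the mutable wrapper from the question (the CSVFileParser name and its simplified quote-free rules are my assumptions): a top-level file rule repeats record separated by EOR, so the entire file content can be parsed in a single run:

import org.parboiled2._

class CSVFileParser(val input: ParserInput, fieldDelimiter: Char) extends Parser {
  // Printable excludes control characters, so fields stop at line breaks.
  val textData = CharPredicate.Printable -- fieldDelimiter
  def EOR = rule { "\r\n" | "\n" }
  def record = rule { zeroOrMore(capture(zeroOrMore(textData))).separatedBy(fieldDelimiter) }
  // Records separated by line breaks, with an optional trailing newline.
  def file = rule { zeroOrMore(record).separatedBy(EOR) ~ optional(EOR) ~ EOI }
}

// Pass the whole file in as a single input; the result is one Seq per record.
val content = io.Source.fromFile("my-file").mkString
new CSVFileParser(content, ',').file.run() // Try[Seq[Seq[String]]]

The trade-off is that the whole file must be read into memory as one input, whereas the per-line approach in the accepted answer streams it.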