Boost tokenizer fails to parse csv file having field with double quote


I am parsing a .csv file that has two columns. I am using Boost's tokenizer to parse a row in which one of the fields is in double quotes (e.g. 1,"test"). After tokenizing, I get the field without the double quotes in tok (1,test).

#include <boost/tokenizer.hpp>
using namespace std;
using namespace boost;

typedef tokenizer<escaped_list_separator<char>> Tokenizer;
if (getline(inputFile, line))
{
    Tokenizer tok(line);
    vector<string> vec(tok.begin(), tok.end());

    // Here vec[1] is the string test, without the double quotes
}

Is there any way to get this second field with double quote?

Answer by sehe:

The quotes are a presentation thing. Once you parse/tokenize the data, you want the unescaped data back.

The quoted/escaped representation is to protect special characters in your data in transit only (to prevent them from interfering with your protocol¹).

Once you read it back, it is no longer in transit, and to "keep" the escapes or quotes (or whatever other artefacts come with your protocol¹) would be an error. In fact, doing so is a frequent source of bugs and, not infrequently, of security vulnerabilities².

Samples

  • CSV a or "a" corresponds to a value of a
  • likewise "\"" corresponds to "
  • "\\\"" corresponds to \"
  • "\" is incomplete (the quoted construct is not closed)

The important thing is that your values roundtrip without loss of information. If you parsed "a" as the value "a" (quotes included), converting it back to the quoted-escaped format would yield "\"a\"", which is an entirely different thing!
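
If the quoted form is needed again, for example when writing a row back out, one option is to re-apply the quoting on output rather than keep it in the parsed value. The following is a minimal sketch, assuming the backslash-escape convention of Boost's escaped_list_separator defaults. The names quote_field and check_quoting are hypothetical helpers, not Boost functions.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: turn an unescaped value into its quoted-escaped form.
std::string quote_field(const std::string& value) {
    std::string out = "\"";
    for (char c : value) {
        if (c == '\n') {
            out += "\\n";  // a newline is written as the two characters \n
        } else {
            if (c == '"' || c == '\\') out += '\\';  // escape quote and backslash
            out += c;
        }
    }
    out += '"';
    return out;
}

// Usage: the plain value and the value that contains quotes stay distinct.
void check_quoting() {
    assert(quote_field("test") == R"("test")");
    assert(quote_field(R"("a")") == R"("\"a\"")");
}
```

Applied to the value test, this gives "test". Applied to a value that itself contains the quotes, it gives "\"a\"", so the two cases remain distinguishable after a roundtrip.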


¹ presentation format or transport protocol

² most commonly, code injection:

Code injection vulnerabilities (injection flaws) occur when an application sends untrusted data to an interpreter. Injection flaws are most often found in SQL, LDAP, XPath, or NoSQL queries; OS commands; XML parsers, SMTP headers, program arguments, etc. Injection flaws tend to be easier to discover when examining source code than via testing.[1] Scanners and fuzzers can help find injection flaws.[2]