Need help in identifying what this piece of code does EXACTLY (I have a general idea)

52 views Asked by At
let $removeLastCR:=fn:replace($output.output_FileContent , '(\r?\n|\r)$', '')
let $lines := tokenize($removeLastCR, '\n')
return

for $line at $counter in $lines
let $x := 
    for $i in fn:tokenize($line,'"')
    return 
        if(fn:starts-with($i,',') and fn:ends-with($i,','))
            then fn:substring($i,2,fn:string-length($i)-2)
        else 
            if(fn:starts-with($i,','))
                then fn:substring-after($i,',')
            else 
                if(fn:ends-with($i,','))
                    then fn:substring($i,1,fn:string-length($i)-1)
                else $i
let $fields :=
    for $j at $k in $x
    return 
    if(fn:starts-with($line,'"'))
        then 
            if($k mod 2 = 0)
                then fn:tokenize($j,',')
            else $j
    else 
        if($k mod 2 = 0)
            then $j
        else fn:tokenize($j,',')
return

The real issue is that I am trying to understand why the parsing fails for the below data record, but works for the rest of the data in the file (File is a .CSV file):

[email protected],XYZ LastName,Merchant,15/08/2022,199.98,USD,199.98,USD,61001,,,[email protected] | R1111111,"Qty 10- 4"" X 4"" X 5.7"" - Color: Custom Box 1",,XYZ,CC 1 August,R1111111,P&E \: PS mama,,policyid,CCP,https://www.example.com/report?reportID=R1111111,cdf,1234XXXXXX5678,https://example.com,

For the above record, the code should have parsed each comma separated value into it's own field,(Field1: [email protected], Field2: XYZ LastName etc) but I think it falls apart on the field value "Qty 10- 4"" X 4"" X 5.7"" - Color: Custom Box 1". It SHOULD parse this whole value into 1 field, but it only gets "Qty 10- 4" into Field#13. And all the fields after this are also all not parsed properly.

So I was trying to better understand this code (someone else wrote it) so i can make the appropriate changes to handle this scenario.

1

There are 1 answers

0
Michael Kay On

There are many variants of CSV syntax, and it looks as if this data file is using a convention of escaping " within a quoted field by doubling the quotation marks. But the query code that parses the CSV is making no attempt to handle such escaped quotation marks.

It's easy to tell you what each line of code does, but I suspect that's not your problem. What's harder is to understand the overall logic.

The first part creates a variable $x by tokenizing on " and removing leading and trailing commas from every token. That makes no sense to me. The second part then takes the tokens that weren't in quotes and splits them on "," separators.

I think this code is already fairly broken, and it certainly can't be extended to handle quotes that are escaped by doubling. It needs to be rewritten. I don't think it can be done with simple tokenisation.

A bit of googling shows a variety of attempts at CSV to XML converters. Unfortunately few of them are very explicit about exactly what flavour of CSV they handle, and many don't even attempt to handle commas within quoted fields. Not very satisfactory, but I'm afraid that writing a better one isn't possible within the 10 minutes I allow myself for answering StackOverflow questions.