Loading File into DuckDB using Python Fails Die to "/N" Values Used to Represent Nulls

230 views Asked by At

I'm trying to load a csv into Python, but the file keeps failing because one of the fields has a '\N' to represent null values in a field that is Integer. I can't figure out how to deal with this - I'd like to convert it on the way in.

It would be great if I could ignore error and insert the rest of the record into the table, but that doesn't seem to be a thing.

Any help would be much appreciated

So the following code

con.sql("INSERT INTO getNBBOtimes SELECT * FROM read_csv_auto('G:/temp/timeexport.csv')")

results in the following error

InvalidInputException                     Traceback (most recent call last)
<timed eval> in <module>

InvalidInputException: Invalid Input Error: Could not convert string '\N' to INT64 in column "column3", at line 856438.

Parser options:
  file=G:/temp/timeexport.csv
  delimiter=',' (auto detected)
  quote='"' (auto detected)
  escape='"' (auto detected)
  header=0 (auto detected)
  sample_size=20480
  ignore_errors=0
  all_varchar=0.

Consider either increasing the sample size (SAMPLE_SIZE=X [X rows] or SAMPLE_SIZE=-1 [all rows]), or skipping column conversion (ALL_VARCHAR=1)

I figured I would try to handle the error on the way in, but nothing seems to work

con.sql("CREATE TABLE test1 as seLECT NULLIF(column1,'\\N') , NULLIF(column2,'\\N'),NULLIF(column3,'\\N'),NULLIF(column4,'\\N'),NULLIF(column2,'\\N') FROM read_csv_auto('G:/temp/timeexport.csv')")

returns the following error:

SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 46-47: malformed \N character escape

I tried this

con.sql("CREATE TABLE test1 as seLECT NULLIF(column1,repr('\\N')) , NULLIF(column2,repr('\\N')),NULLIF(column3,repr('\\N')),NULLIF(column4,(repr'\\N')),NULLIF(column2,repr('\\N')) FROM read_csv_auto('G:/temp/timeexport.csv')")

and got this error

CatalogException: Catalog Error: Scalar Function with name repr does not exist!
Did you mean "exp"?
1

There are 1 answers

0
larsks On

You haven't provided any sample data, so let's assume you're starting with:

id,hours_worked
1,8
2,\N
3,10
4,\N

We start by creating our target table:

>>> con = duckdb.connect()
>>> con.sql('create table getnbbotimes (id int, hours_worked int64)')

We can use a SQL IF statement to read in the file:

>>> con.sql("INSERT INTO getNBBOtimes SELECT id,if(hours_worked == '\\N',NULL,hours_worked) FROM read_csv_auto('timeexport.csv')")

Which gets us:

>>> con.sql('select * from getnbbotimes')
┌───────┬──────────────┐
│  id   │ hours_worked │
│ int32 │    int64     │
├───────┼──────────────┤
│     1 │            8 │
│     2 │         NULL │
│     3 │           10 │
│     4 │         NULL │
└───────┴──────────────┘

...which is what I think you were after.


You can make your solution using NULLIF work if you're willing to treat all columns as VARCHAR:

>>> con.sql("CREATE TABLE test1 as select NULLIF(id,'\\N') 
... as id, NULLIF(hours_worked,'\\N') as hours_worked
... FROM read_csv_auto('timeexport.csv', all_varchar=1)")

Which gets us:

>>> con.sql('select * from test1')
┌─────────┬──────────────┐
│   id    │ hours_worked │
│ varchar │   varchar    │
├─────────┼──────────────┤
│ 1       │ 8            │
│ 2       │ NULL         │
│ 3       │ 10           │
│ 4       │ NULL         │
└─────────┴──────────────┘

You could then use a second select to convert those varchar values to int64.