I'm trying to run a SELECT query over a JSON file using Apache Drill. I'm getting various errors for different files. All errors are JSON Parsing errors:
Error: DATA_READ ERROR: Error parsing JSON - Invalid UTF-8 middle byte 0x3f
Error: DATA_READ ERROR: Error parsing JSON - Illegal unquoted character ((CTRL-CHAR, code 13)): has to be escaped using backslash to be included in string value
Error: DATA_READ ERROR: Error parsing JSON - Illegal character ((CTRL-CHAR, code 0)): only regular white space (\r, \n, \t)
For the first error, which is caused by characters such as '趨勢科技å', I have already tried the following:
- Using Convert_To and Convert_From on the field that contains invalid UTF-8 chars (Didn't work. Don't think these functions are meant for this purpose.)
- Add -Dsaffron.default.charset=UTF-16LE to DRILL_JAVA_OPTS in conf/drill-env.sh (Didn't work as it turns out that this option is to be used if your query, not your data, contains invalid UTF-8 characters)
- Changed file encoding to UTF-8 using Notepad++ (Didn't work. Was expecting this to work though)
- Tried changing file encoding to UTF-8 without BOM using Notepad++ (Notepad++ was unable to convert it. After saving, when opened again it was ANSI)
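To see which bytes trigger errors like the ones above, the file can be scanned before it is handed to Drill. The following is a minimal Python sketch, not part of Drill; the function name find_problem_bytes is hypothetical. It reports only invalid UTF-8 bytes and NUL bytes (code 0). It does not detect a raw carriage return (code 13) inside a string value, since that would require actually parsing the JSON.

```python
def find_problem_bytes(data: bytes):
    """Return a sorted list of (byte_offset, description) for suspicious bytes.

    Hypothetical helper for illustration: flags bytes that are not valid
    UTF-8 and raw NUL bytes, two of the causes named in the errors above.
    """
    problems = []

    # Repeatedly try to decode; each failure pinpoints one invalid byte.
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the remainder is valid UTF-8
        except UnicodeDecodeError as exc:
            bad = pos + exc.start
            problems.append((bad, f"invalid UTF-8 byte 0x{data[bad]:02x}"))
            pos = bad + 1  # resume right after the offending byte

    # NUL is valid UTF-8 but is rejected by the JSON parser, so scan for it.
    idx = data.find(b"\x00")
    while idx != -1:
        problems.append((idx, "NUL byte (code 0)"))
        idx = data.find(b"\x00", idx + 1)

    return sorted(problems)


if __name__ == "__main__":
    # A lead byte (0xe8) followed by '?' (0x3f), plus a NUL byte.
    sample = b'{"a": "x\xe8?y\x00"}'
    for offset, description in find_problem_bytes(sample):
        print(offset, description)
```

Slicing the remaining data on every failure is simple but slow for files with many bad bytes; for multi-gigabyte inputs it would make sense to run this on one chunk at a time.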
Change the encoding to 'UTF-8 with BOM' using either iConv or Notepad++, and you will be able to query the file using Apache Drill.
I used iConv to change the file encoding to 'UTF-8'. When I opened the converted file in Notepad++, the encoding it displayed was 'UTF-8 with BOM'. So I changed the encoding of the original file to 'UTF-8 with BOM' using Notepad++ itself, and that worked too.
Both files, the one converted using iConv and the one converted using Notepad++ (basically, any file converted to 'UTF-8 with BOM'), were parsable using Apache Drill.
To convert:
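As an alternative to the two tools, the same conversion can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the exact procedure used above: the function name convert_to_utf8_bom is hypothetical, and the default source encoding cp1252 is a guess based on Notepad++ reporting the file as ANSI; substitute whatever encoding the file really uses. Python's "utf-8-sig" codec writes the byte order mark once at the start of the output.

```python
def convert_to_utf8_bom(src_path, dst_path, src_encoding="cp1252",
                        chunk_chars=1 << 20):
    """Re-encode src_path into dst_path as UTF-8 with a leading BOM.

    src_encoding is an assumption; set it to the file's real encoding.
    The file is copied in chunks, so it is never fully loaded into memory.
    """
    # newline="" keeps the original line endings untouched on both sides.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8-sig", newline="") as dst:
        while True:
            chunk = src.read(chunk_chars)
            if not chunk:
                break
            dst.write(chunk)


# Example call (hypothetical file names):
# convert_to_utf8_bom("input.json", "input_utf8_bom.json")
```

If the source encoding is wrong, decoding may raise UnicodeDecodeError or silently produce the wrong characters, so it is worth checking a few known non-ASCII values in the output.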
Note: For large files, you might need to split the file before conversion; in my case, Notepad++ could not open a 2GB file and iConv could not convert it either.