How to make Apache Drill parse a JSON file with invalid UTF-8 characters

I'm trying to run a SELECT query over a JSON file using Apache Drill. Different files fail with different errors, all of them JSON parsing errors:

  • Error: DATA_READ ERROR: Error parsing JSON - Invalid UTF-8 middle byte 0x3f

  • Error: DATA_READ ERROR: Error parsing JSON - Illegal unquoted character ((CTRL-CHAR, code 13)): has to be escaped using backslash to be included in string value

  • Error: DATA_READ ERROR: Error parsing JSON - Illegal character ((CTRL-CHAR, code 0)): only regular white space (\r, \n, \t)
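
To see which of these conditions a given file actually triggers, a quick byte-level check outside Drill can help. Below is a minimal Python sketch; the function name `scan_bytes` and the report keys are hypothetical, chosen only for illustration. It reports the offset of the first byte sequence that is not valid UTF-8 and counts NUL (code 0) and carriage return (code 13) bytes.

```python
def scan_bytes(data: bytes) -> dict:
    """Summarize bytes that commonly trigger JSON parsing errors.

    Note: the CR count includes ordinary CRLF line endings between JSON
    tokens, which are legal; only a raw CR inside a string value is not.
    """
    report = {
        "nul_bytes": data.count(b"\x00"),   # CTRL-CHAR, code 0
        "cr_bytes": data.count(b"\r"),      # CTRL-CHAR, code 13
        "first_invalid_utf8_offset": None,  # stays None if the data is valid UTF-8
    }
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte that could not be decoded
        report["first_invalid_utf8_offset"] = exc.start
    return report
```

Reading a sample of the file in binary mode and passing the bytes to `scan_bytes` is enough to tell whether the file is really UTF-8. The sketch loads its input into memory, so it is meant for a sample or a moderately sized file rather than a multi-gigabyte one.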

The first error is caused by characters such as '趨勢科技å' in the data. For that error, I have already tried the following:

  • Using Convert_To and Convert_From on the field that contains the invalid UTF-8 characters. (Didn't work; I don't think these functions are meant for this purpose.)
  • Adding -Dsaffron.default.charset=UTF-16LE to DRILL_JAVA_OPTS in conf/drill-env.sh. (Didn't work; it turns out this option applies when the query, not the data, contains such characters.)
  • Changing the file encoding to UTF-8 using Notepad++. (Didn't work, although I expected it to.)
  • Changing the file encoding to UTF-8 without BOM using Notepad++. (Notepad++ was unable to convert it: after saving and reopening the file, it was shown as ANSI again.)

Sid_M On BEST ANSWER

Change the encoding to 'UTF-8 with BOM' using either:

  • Notepad++
  • iConv (a shell utility)

You will then be able to query the file using Apache Drill.

I used iConv to change the file encoding to 'UTF-8'. When I opened the converted file in Notepad++, the encoding it displayed was 'UTF-8 with BOM'. So I changed the encoding of the original file to 'UTF-8 with BOM' using Notepad++ itself, and that worked too.

Both files, the one converted using iConv and the one converted using Notepad++ (basically, any file converted to 'UTF-8 with BOM'), were parsable by Apache Drill.

To convert:

  • Using Notepad++: from the menu bar, select Encoding, change it to 'UTF-8 with BOM', and save the file. If this encoding is not shown in the encoding list, there may be a plugin (or some other way) to make it available in Notepad++.
  • Using iConv: download the utility and run it with this command, where new-encoding is UTF-8 in this case: iconv -f old-encoding -t new-encoding file.txt > newfile.txt

Note: Large files may need to be split before conversion. In my case, Notepad++ could not open a 2 GB file, and iConv could not convert it either.
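
If neither tool copes with the file size, the same re-encoding can be done as a stream. This is a different technique from splitting the file, and I have not tested it against Drill myself. It is a minimal Python sketch: the function name `convert_to_utf8_bom` is hypothetical, and the source encoding is something you must supply, since it depends on how the file was produced (cp1252 in the usage note is only a guess for a file that Notepad++ shows as ANSI).

```python
def convert_to_utf8_bom(src_path, dst_path, src_encoding, chunk_chars=1 << 20):
    """Re-encode a text file to UTF-8 with BOM, one chunk at a time.

    src_encoding is an assumption the caller must provide, playing the
    role of "old-encoding" in the iconv command.
    """
    # newline="" disables newline translation, so line endings are copied as-is.
    # "utf-8-sig" makes Python write the BOM once, before the first chunk.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8-sig", newline="") as dst:
        while True:
            chunk = src.read(chunk_chars)  # at most chunk_chars characters
            if not chunk:
                break
            dst.write(chunk)
```

For example, `convert_to_utf8_bom("file.json", "newfile.json", "cp1252")` keeps only about one chunk in memory at a time, so the file size should not matter. If the guessed source encoding is wrong, the read either raises UnicodeDecodeError or silently produces the wrong characters, so it is worth checking a few known values in the output.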