I'm trying to perform a read in Spark from Oracle:
// drivingUrl is defined elsewhere; it is the Oracle JDBC URL,
// e.g. "jdbc:oracle:thin:@//host:1521/service".
String drivingQuery = "(SELECT ID, MESSAGE FROM ALL_MESSAGES WHERE ID IN (100,101,102,103)) TBL";
String driver = "oracle.jdbc.driver.OracleDriver";

Dataset<Row> resultDf = spark
        .read()
        .format("jdbc")
        .option("url", drivingUrl)
        .option("dbtable", drivingQuery)
        .option("driver", driver)
        .load();

resultDf.show(200, false);
However, this gives garbled output, with replacement characters in the MESSAGE column:
+--------------+-------+
|ID            |MESSAGE|
+--------------+-------+
|100.0000000000|����   |
|101.0000000000|����   |
|102.0000000000|����   |
|103.0000000000|����   |
+--------------+-------+
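(Side note: the ID column printing as 100.0000000000 is presumably Spark's default mapping of an unconstrained Oracle NUMBER to DECIMAL(38,10). If that matters, the JDBC reader's customSchema option can override the mapping; a rough sketch, reusing the variables above:)

// Sketch: override Spark's default NUMBER -> DECIMAL(38,10) mapping
// for Oracle via the JDBC source's customSchema option.
Dataset<Row> typedDf = spark
        .read()
        .format("jdbc")
        .option("url", drivingUrl)
        .option("dbtable", drivingQuery)
        .option("driver", driver)
        .option("customSchema", "ID DECIMAL(10,0), MESSAGE STRING")
        .load();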
I tried various approaches, such as the encoding option on the Spark read, but none of them worked either. The only workaround I have found is to Base64-encode the column in the Oracle query itself:
String query = "SELECT ID, utl_encode.base64_encode(utl_raw.cast_to_raw(MESSAGE)) AS ENCODED_MESSAGE FROM ALL_MESSAGES WHERE ID IN (100,101,102,103)";
This gives the expected output:

+-----------+---------------+
|TO_CHAR(ID)|ENCODED_MESSAGE|
+-----------+---------------+
|100        |               |
|101        |               |
|102        |               |
|103        |               |
+-----------+---------------+
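For completeness, the Spark-side decode of that workaround might look like the sketch below. It assumes the encoded column comes back as a Base64 text column named ENCODED_MESSAGE (if the RAW is read as binary instead, an extra utl_raw.cast_to_varchar2 would be needed on the Oracle side):

import static org.apache.spark.sql.functions.*;

// Sketch: undo the Base64 workaround on the Spark side.
// unbase64 yields BINARY; decode interprets those bytes as UTF-8 text.
Dataset<Row> decodedDf = resultDf
        .withColumn("MESSAGE", decode(unbase64(col("ENCODED_MESSAGE")), "UTF-8"))
        .drop("ENCODED_MESSAGE");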
I have the following dependencies:
<dependency>
    <groupId>com.oracle.ojdbc</groupId>
    <artifactId>ojdbc8</artifactId>
    <version>19.3.0.0</version>
</dependency>
<dependency>
    <groupId>com.oracle.ojdbc</groupId>
    <artifactId>orai18n</artifactId>
    <version>19.3.0.0</version>
</dependency>
and the database has NLS_NCHAR_CHARACTERSET = AL16UTF16 and NLS_CHARACTERSET = UTF8.

I want to know whether there is any alternative way to read the data without the encode-decode step, as we have this code in multiple places and replacing everything would take a long time.
In the case of Oracle, the client receives data in the database's "native" encoding, which is most likely AL32UTF8 (in your case, UTF8), and it is up to the client to convert it into Java's native encoding, UTF-16.
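Conceptually, that conversion is just decoding bytes in the database character set into a Java String; a minimal illustration (not driver code, just the idea):

import java.nio.charset.StandardCharsets;

// Illustration only: bytes stored in the database character set (UTF-8 here)
// must be decoded into Java's internal UTF-16 String representation.
byte[] dbBytes = {(byte) 0xC3, (byte) 0xA9}; // UTF-8 encoding of 'é'
String javaString = new String(dbBytes, StandardCharsets.UTF_8);
System.out.println(javaString); // prints é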
There is also the Oracle built-in function dump(), which returns the internal representation of a string (VARCHAR2) inside the database. If the JDBC driver fails to convert UTF8 into UTF-16 properly, it is possible that you have found a bug in the JDBC driver.
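To see where the corruption happens, one option is to compare the server-side bytes against what Spark receives; a sketch (dump format code 1016 prints the character set name plus the bytes in hex; the MSG_DUMP alias is illustrative):

// Sketch: inspect Oracle's internal bytes for MESSAGE next to the value
// Spark receives. Format code 1016 = hex dump plus character set name.
String dumpQuery = "(SELECT ID, MESSAGE, dump(MESSAGE, 1016) AS MSG_DUMP "
        + "FROM ALL_MESSAGES WHERE ID IN (100,101,102,103)) TBL";

spark.read()
        .format("jdbc")
        .option("url", drivingUrl)
        .option("dbtable", dumpQuery)
        .option("driver", driver)
        .load()
        .show(200, false);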