Hive JDBC connection using PySpark returns column names as row values


I am using PySpark to connect to Hive and fetch some data. The issue is that every row it returns contains the column names as values. The column names themselves are correct; only the row values are wrong.

Here is my code:

hive_jar_path="C:/Users/shakir/Downloads/ClouderaHiveJDBC-2.6.11.1014/ClouderaHiveJDBC-2.6.11.1014/ClouderaHiveJDBC42-2.6.11.1014/HiveJDBC42.jar"
print(hive_jar_path)
print("")

import os
os.environ["HADOOP_HOME"] = "c:/users/shakir/downloads/spark/spark/spark"
os.environ["SPARK_HOME"] = "c:/users/shakir/downloads/spark/spark/spark"

import findspark
findspark.init()



from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("Python Spark SQL Hive integration example")
    .config("spark.driver.extraClassPath", hive_jar_path)
    .config("spark.sql.hive.llap", "true")
    .config("spark.sql.warehouse.dir", "hdfs://...../user/hive/warehouse/..../....")
    .enableHiveSupport()
    .getOrCreate()
)

import databricks.koalas as ks


print("Reading Data from Hive . . .")
options = {
    "fetchsize": 1000,
    "inferSchema": True,
    "fileFormat": "orc",
    "inputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
    "outputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat",
    "driver": "org.apache.hive.jdbc.HiveDriver",
}
# connection_string is a Hive JDBC URL, defined elsewhere (not shown here)
df = ks.read_sql("SELECT * FROM PERSONS LIMIT 3", connection_string, options=options)
print("Done")
print(df)

Output of the code:

+------+-----+---------+
| Name | Age | Address |
+------+-----+---------+
| Name | Age | Address |
| Name | Age | Address |
| Name | Age | Address |
+------+-----+---------+

1 Answer

Answered by Jose R

My situation was not exactly the same as yours, since I was connecting to Spark through JDBC; however, I was getting the same result you did: column names returned as literal values in every row.

I fixed it by overriding the JdbcDialect. This answer has a nice explanation of why, plus some useful links.

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

private case object HiveDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")
  // Quote identifiers with backticks so Hive does not treat "Name"
  // as a string literal and echo it back in every row
  override def quoteIdentifier(colName: String): String = {
    colName.split('.').map(part => s"`$part`").mkString(".")
  }
}

// Register the dialect before reading over JDBC
JdbcDialects.registerDialect(HiveDialect)
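
Since the question uses PySpark, here is a rough sketch of how the compiled dialect might be registered from the Python side through the py4j gateway. Everything here is illustrative, not part of the original answer: it assumes the HiveDialect object above was compiled into a jar (called hive-dialect.jar below, under a hypothetical package com.example) that is available to the driver, and it relies on spark._jvm, which is an internal API.

# Hypothetical sketch: registering the Scala HiveDialect from PySpark.
# Assumes the object above was compiled into hive-dialect.jar under an
# illustrative package com.example; adjust names to your build.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars", "hive-dialect.jar")  # illustrative jar name
    .getOrCreate()
)

jvm = spark._jvm  # py4j gateway into the driver JVM (internal API)
# A Scala `object` compiles to a class named HiveDialect$ whose
# singleton instance lives in the static MODULE$ field.
dialect = getattr(jvm.com.example, "HiveDialect$").MODULE$
jvm.org.apache.spark.sql.jdbc.JdbcDialects.registerDialect(dialect)

After this call, JDBC reads against a jdbc:hive2 URL in the same session should quote column names with backticks instead of double quotes.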

Just pay attention to the quoting characters if you copy from the original post: Medium has this weird typography that turns straight quotes into curly ones.

Hope it helps after all this time. I'm leaving it here for when I forget about it and google it again :p