I am trying to load data from an RDBMS into a Hive table on HDFS. I am reading the RDBMS table as below:
val mydata = spark.read
  .format("jdbc")
  .option("url", connection)
  .option("dbtable", "(select * from dev.userlocations) t")
  .option("user", usrname)
  .option("password", pwd)
  .option("numPartitions", 20)
  .load()
I can see in the executor logs that the option("numPartitions", 20) has no effect and the entire dataset is read into a single executor.
Now there are options to provide the partition column, lower bound and upper bound, as below:
val mydata = spark.read
  .format("jdbc")
  .option("url", connection)
  .option("dbtable", "(select * from dev.userlocations) t")
  .option("user", usrname)
  .option("password", pwd)
  .option("partitionColumn", "columnName")
  .option("lowerBound", "x")
  .option("upperBound", "y")
  .option("numPartitions", 20)
  .load()
The above only works if the partition column is of a numeric datatype. The table I am reading is partitioned on a column named location; it is about 5 GB overall, with 20 partitions corresponding to the 20 distinct locations in the table. Is there any way I can read the table in partitions based on the table's partition column, location?
Could anyone let me know whether this can be done?
You can use the predicates parameter of spark.read.jdbc for this. It takes an array of strings, and each item in the array is a WHERE condition that defines one partition of the source table, so the total number of partitions equals the number of conditions you pass.
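A minimal sketch of what that could look like for your table, assuming the partition column is literally named location and reusing the connection, usrname and pwd variables from your question (the location values themselves are placeholders):

import java.util.Properties

val connectionProperties = new Properties()
connectionProperties.put("user", usrname)
connectionProperties.put("password", pwd)

// One condition per distinct location: each entry becomes the WHERE clause
// of its own query, and therefore its own partition/task.
val locations = Seq("location1", "location2", "location3") // list all 20 values here
val predicates = locations.map(loc => s"location = '$loc'").toArray

val mydata = spark.read.jdbc(
  connection,          // JDBC URL
  "dev.userlocations", // source table
  predicates,          // 20 conditions => 20 partitions
  connectionProperties
)

If you would rather not hard-code the 20 values, you can fetch them first and build the array dynamically:

val predicates = spark.read
  .format("jdbc")
  .option("url", connection)
  .option("dbtable", "(select distinct location from dev.userlocations) t")
  .option("user", usrname)
  .option("password", pwd)
  .load()
  .collect()
  .map(row => s"location = '${row.getString(0)}'")

With 20 predicates you get 20 partitions read in parallel, and unlike partitionColumn/lowerBound/upperBound this works for non-numeric columns.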