I am trying to load data from an RDBMS into a Hive table on HDFS. I am reading the RDBMS table as follows:

val mydata = spark.read
  .format("jdbc")
  .option("url", connection)
  .option("dbtable", "select * from dev.userlocations")
  .option("user", usrname)
  .option("password", pwd)
  .option("numPartitions",20)
  .load()

I can see in the executor logs that option("numPartitions", 20) is not honored and the entire dataset is dumped into a single executor. This is expected: on a read, Spark ignores numPartitions unless a partition column and its bounds are also supplied.

There are options to provide the partition column, lower bound, and upper bound, as below:

val mydata = spark.read
  .format("jdbc")
  .option("url", connection)
  .option("dbtable", "select * from dev.userlocations")
  .option("user", usrname)
  .option("password", pwd)
  .option("partitionColumn","columnName")
  .option("lowerbound","x")
  .option("upperbound","y")
  .option("numPartitions",20).load()

The above only works if the partition column is of a numeric datatype. The table I am reading is partitioned on a column named location; it is about 5 GB overall and has 20 partitions, one per distinct location value. Is there any way I can read the table in parallel, partitioned on that column, location?

Could anyone let me know if this can be implemented at all?


1 Answer

sgungormus:

You can use the predicates option for this. It takes an array of strings, and each item in the array becomes a WHERE condition that defines one partition when reading the source table. The total number of partitions is determined by the number of conditions in the array.

val preds = Array[String]("location = 'LOC1'", "location = 'LOC2' OR location = 'LOC3'")

import java.util.Properties

val properties = new Properties()
properties.setProperty("user", usrname)
properties.setProperty("password", pwd)

val df = spark.read.jdbc(
  url = databaseUrl,
  table = tableName,
  predicates = preds,                // one Spark partition per predicate
  connectionProperties = properties
)
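
Rows that match none of the predicates are not read at all, and rows that match more than one are read multiple times, so the array should cover each of the 20 locations exactly once. If the distinct values are not known up front, you can build the predicate array with a small preliminary query. Below is a minimal sketch reusing the names above; the helper query and the assumption that location values contain no single quotes are illustrative, not part of the original answer:

// Single-partition helper read that only lists the distinct locations;
// the subquery needs an alias ("t") to be valid as a JDBC table expression.
val locations = spark.read
  .jdbc(databaseUrl, "(select distinct location from dev.userlocations) t", properties)
  .collect()
  .map(_.getString(0))

// One predicate per location => one Spark partition per location (20 total).
val locationPreds = locations.map(loc => s"location = '$loc'")

val partitioned = spark.read.jdbc(databaseUrl, "dev.userlocations", locationPreds, properties)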