I am reading 30M records from an Oracle table that has no primary key column. The Spark JDBC read hangs and never fetches any data, whereas the same query returns results in Oracle SQL Developer within a few seconds.
oracleDf = hiveContext.read().format("jdbc")
        .option("url", url)
        .option("dbtable", queryToExecute)
        .option("numPartitions", "5")   // was "numPartitions " with a trailing space; also has no effect without partitionColumn and bounds
        .option("fetchSize", "1000000")
        .option("user", use)
        .option("password", pwd)
        .option("driver", driver)
        .load()
        .repartition(5);
I cannot use a partition column because I do not have a primary key column. Can anyone advise how to improve performance?
Thanks
There are a number of things you can do to optimize the DataFrame creation. You might want to drop the repartition call, since shuffling 30M rows after a single-threaded read only adds cost, and instead use predicates to parallelize the data retrieval itself across several JDBC connections. If you cannot filter on a primary key or an indexed column, partitioning on ROWID is a possibility.
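For example, here is a minimal sketch of a ROWID-based predicate read. The connection details, credentials, and table name are placeholders, not from the question. Each predicate becomes one Spark partition, and Oracle's ORA_HASH spreads rows roughly evenly across buckets without needing a primary key. The read().jdbc(url, table, predicates, props) overload has existed since Spark 1.4, so the same call works on your hiveContext.

import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class OracleRowidRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("oracle-rowid-read")
                .getOrCreate();

        // Placeholder connection details -- substitute your own.
        String url = "jdbc:oracle:thin:@//dbhost:1521/service";
        Properties props = new Properties();
        props.setProperty("user", "scott");
        props.setProperty("password", "tiger");
        props.setProperty("driver", "oracle.jdbc.OracleDriver");
        // A few thousand rows per round trip is usually plenty;
        // 1,000,000 mostly burns driver-side memory.
        props.setProperty("fetchsize", "10000");

        // One predicate per partition: ORA_HASH(ROWID) buckets the rows so
        // five connections each pull a disjoint ~1/5 of the table in parallel.
        int buckets = 5;
        String[] predicates = new String[buckets];
        for (int i = 0; i < buckets; i++) {
            predicates[i] = "MOD(ORA_HASH(ROWID), " + buckets + ") = " + i;
        }

        Dataset<Row> oracleDf = spark.read()
                .jdbc(url, "MY_SCHEMA.MY_TABLE", predicates, props);

        oracleDf.show(10);
        spark.stop();
    }
}

With this overload you do not need numPartitions, partitionColumn, or bounds at all: the number of predicates determines the read parallelism, and no repartition is required afterwards.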