I am trying to load the data from all partitions in S3 using Spark, but I am not able to do so. The data is stored in S3 with statusts as the partition column.
I am using the Java code below to load the data.
Dataset<Row> archivalDS = session.read().format("avro")
        .option("basePath", folder_path)
        .load(folder_path);
where the folder_path variable value is s3a://bucket-name/ArchiveTables/log/60a89657-47df-4df4-8aa3-4c53f43782b9/. The application throws an error saying the statusts column is not available in the schema. After going through a few threads on Stack Overflow, I added the statusts column to the dataset so that the column would be present in the stored Avro files, but I see that the column is still missing from the Avro files.
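For reference, my write attempt looked roughly like the sketch below. I am assuming the files were written with partitionBy (the lit("2024-03-01") value is just a placeholder for however the real value was derived per row); my understanding is that partitionBy moves the column value into the directory name (statusts=2024-03-01/), which may be why the column is absent from the Avro files themselves.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import static org.apache.spark.sql.functions.lit;

// Roughly what the "add statusts to the dataset" attempt looked like;
// lit("2024-03-01") is a placeholder for the real per-row value.
Dataset<Row> withStatusTs = archivalDS.withColumn("statusts", lit("2024-03-01"));

// Writing partitioned by statusts encodes the value in the directory name,
// so it does not appear inside the Avro data files.
withStatusTs.write().mode(SaveMode.Append).format("avro")
        .partitionBy("statusts")
        .save(folder_path);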
If I change the folder_path variable to a specific partition, s3a://bucket-name/ArchiveTables/log/60a89657-47df-4df4-8aa3-4c53f43782b9/statusts=2024-03-01/, it works, but then I can only load the data from one partition at a time.
Is there a way to load all partitions in a single load, rather than having to specify the partition value in the path as well? My current fallback is sketched below.
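This is a sketch of the per-partition fallback I have now, not something I want to keep: one load() per partition directory, re-attaching the partition value (which lives in the path, not the files) and unioning the results. The partitionDates list is hypothetical, standing in for the statusts values I know exist in S3.

import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.lit;

// Hypothetical list of known partition values.
List<String> partitionDates = List.of("2024-03-01", "2024-03-02");

Dataset<Row> allPartitions = null;
for (String date : partitionDates) {
    Dataset<Row> part = session.read().format("avro")
            .load(folder_path + "statusts=" + date + "/")
            // The partition value is encoded in the path, so re-attach it as a column.
            .withColumn("statusts", lit(date));
    allPartitions = (allPartitions == null) ? part : allPartitions.unionByName(part);
}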
Other question: I am using the code below to load the data into the database after loading it from Avro. I set the lower bound to the earliest date for which we have data stored, and the upper bound to an arbitrary far-future date, but I also have rows where statusts is null. In that case, how can I choose the lower or upper bound so that I do not miss that data while loading?
transactionIdNullDS.repartition(6).write().mode(SaveMode.Append).format("jdbc")
        .option("stringtype", "unspecified")
        .option("url", url)
        .option("driver", driver)
        .option("dbtable", tableName)
        .option("user", "username")
        .option("password", "password")
        .option("partitionColumn", "statusts")
        .option("lowerBound", "2022-01-01")
        .option("upperBound", "2999-12-31")
        .option("numPartitions", Integer.parseInt(System.getProperty("numPartitions.number", "8")))
        .save();
