I'm using Spark 2.4.0 with "org.apache.bahir - spark-sql-cloudant - 2.4.0". I need to download all JSON documents from CouchDB to HDFS.
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Read all documents from the "demo" database via the Bahir Cloudant connector
val df = spark
  .read
  .format("org.apache.bahir.cloudant")
  .load("demo")

// Cache so the data is not re-fetched from CouchDB when writing
df.persist(StorageLevel.MEMORY_AND_DISK)

df
  .write
  .partitionBy("year", "month", "day")
  .mode("append")
  .parquet("...")
The total data size is 160 GB (more than 13 million documents). After about 5 minutes of running, the Spark job fails with:
Caused by: com.cloudant.client.org.lightcouch.CouchDbException: Error retrieving server response
Increasing the timeout does not help; the job still fails, just later. What are the ways out of this situation?
Using another endpoint for the queries helped me: use _changes instead of _all_docs.
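A minimal sketch of what that looks like in the read above, assuming the connector exposes the endpoint choice through the "cloudant.endpoint" option (check the option name against the Bahir spark-sql-cloudant docs for your version):

// Switch the load endpoint from the default _all_docs to _changes.
// "cloudant.endpoint" is assumed here from the Bahir connector docs;
// the database name "demo" matches the original question.
val df = spark
  .read
  .format("org.apache.bahir.cloudant")
  .option("cloudant.endpoint", "_changes")
  .load("demo")

The _changes feed streams documents continuously instead of paging through the whole database in bulk, which tends to be gentler on the server for very large databases and avoids the long-running _all_docs requests that were timing out.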