Large data with Spark and CouchDB


I am using Spark 2.4.0 with "org.apache.bahir - spark-sql-cloudant - 2.4.0". I need to download all JSON documents from CouchDB into HDFS.

 import org.apache.spark.storage.StorageLevel

 val df = spark
  .read
  .format("org.apache.bahir.cloudant")
  .load("demo")
 df.persist(StorageLevel.MEMORY_AND_DISK)

 df
  .write
  .partitionBy("year", "month", "day")
  .mode("append")
  .parquet("...")

The total data size is 160 GB (more than 13 million documents). After about 5 minutes of running, the Spark job fails with:

Caused by: com.cloudant.client.org.lightcouch.CouchDbException: Error retrieving server response

Increasing the timeout does not help; the job still fails, just later. What are the ways out of this situation?
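For reference, "increasing the timeout" here means passing a larger request timeout to the connector when loading, roughly like this (a sketch assuming the `jsonstore.rdd.requestTimeout` option name from the spark-sql-cloudant documentation; the value is in milliseconds):

```scala
// Sketch: raise the per-request timeout and reduce partition size
// so each HTTP request to CouchDB fetches less data.
// Option names are assumptions based on the Bahir connector docs.
val df = spark
  .read
  .format("org.apache.bahir.cloudant")
  .option("jsonstore.rdd.requestTimeout", "1800000") // 30 min per request (assumed option)
  .load("demo")
```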

1 Answer

Answered by Oleg:

Use another endpoint for the queries: switching from _all_docs to _changes helped me.
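For anyone hitting the same error, the switch can be expressed as a load option. This is a sketch assuming the `cloudant.endpoint` option documented for the Bahir connector: `_changes` streams documents through CouchDB's changes feed instead of paging through `_all_docs`, which avoids the long-running `_all_docs` requests that time out:

```scala
// Sketch (not verified against 2.4.0): read via the _changes feed
// instead of the default _all_docs endpoint.
val df = spark
  .read
  .format("org.apache.bahir.cloudant")
  .option("cloudant.endpoint", "_changes") // assumed option name per Bahir docs
  .load("demo")
```

Note that the two endpoints have different trade-offs: `_changes` delivers every document in change order through a feed, while `_all_docs` issues paged bulk requests, so behavior under load can differ; check the connector documentation for your version.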