java.lang.OutOfMemoryError on Data Ingestion and Storage Pipeline


I have around 25 GB of data in Azure Storage and I am ingesting it with Auto Loader in Databricks. These are the steps I am performing (a rough code sketch follows the list):

  1. Setting enableChangeDataFeed to true.
  2. Reading all of the raw data using readStream.
  3. Writing it to Azure Blob Storage as a Delta table using writeStream.
  4. Reading the change feed of this Delta table using spark.read.format("delta").option("readChangeFeed", "true")...
  5. Transforming the change feed table with withColumn, including operations on the content column, which may be computationally expensive.
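Roughly, the pipeline looks like this (the paths, the cloudFiles format, the column names, and the example transformation are simplified placeholders, not my exact code; `spark` is the session that Databricks notebooks provide):

```python
from pyspark.sql import functions as F

# Placeholders for my actual storage locations.
raw_path = "abfss://<container>@<account>.dfs.core.windows.net/raw/"
bronze_path = "abfss://<container>@<account>.dfs.core.windows.net/bronze/"
checkpoint_path = "abfss://<container>@<account>.dfs.core.windows.net/_checkpoints/bronze/"

# Step 1: enable the change data feed for newly created Delta tables.
spark.conf.set("spark.databricks.delta.properties.defaults.enableChangeDataFeed", "true")

# Step 2: read the raw data with Auto Loader.
raw_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")  # placeholder for my source format
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .load(raw_path)
)

# Step 3: write it to Azure Blob Storage as a Delta table.
(
    raw_stream.writeStream.format("delta")
    .option("checkpointLocation", checkpoint_path)
    .outputMode("append")
    .trigger(availableNow=True)
    .start(bronze_path)
    .awaitTermination()
)

# Step 4: read the change feed of the Delta table.
cdf_df = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .load(bronze_path)
)

# Step 5: transformations with withColumn, including the heavy content column
# (the expression below is only an example, not my real logic).
computed_df = cdf_df.withColumn("content_length", F.length(F.col("content")))
```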

Now, when I try to save this computed PySpark DataFrame to my catalog, I get the error java.lang.OutOfMemoryError. My Databricks cluster has one driver with 16 GB of memory and 4 cores, and autoscales up to 10 workers with 16 GB of memory and 4 cores each.
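The save step itself is roughly this (the catalog table name is a placeholder):

```python
# Placeholder target; the real table lives in my catalog.
(
    computed_df.write.format("delta")
    .mode("overwrite")
    .saveAsTable("my_catalog.my_schema.computed_table")
)
```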

Do I need to add more resources to the cluster, or is there a way to optimize or restructure the current pipeline?
