I'm using the delta-rs package to store files in a Google Cloud Storage dual-region bucket. I use the following code to store the data:
from typing import Any, Generator

import pyarrow as pa
from deltalake import write_deltalake

def save_data(self, df: Generator[pa.RecordBatch, Any, None]):
    # df_schema and self.max_rows_per_file are defined elsewhere in the class.
    write_deltalake(
        f"gs://<my-bucket-name>",
        df,
        schema=df_schema,
        partition_by="my_id",
        mode="append",
        max_rows_per_file=self.max_rows_per_file,
        max_rows_per_group=self.max_rows_per_file,
        min_rows_per_group=int(self.max_rows_per_file / 2),
    )
The input data is a generator because I'm reading the data from a Postgres database in batches. I'm saving similar data into two different tables, and I also write a SUCCESS file for each uploaded partition.
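The question doesn't show how the batches are produced, so here is a minimal sketch of what such a generator might look like, assuming psycopg2 with a server-side cursor and hypothetical table/column names (my_table, value); only my_id and df_schema come from the code above:

import psycopg2
import pyarrow as pa

# Hypothetical schema -- the real df_schema is defined in the original code.
df_schema = pa.schema([
    ("my_id", pa.int64()),
    ("value", pa.string()),
])

def read_batches(dsn: str, batch_size: int = 100_000):
    """Yield pa.RecordBatch objects from Postgres using a server-side cursor."""
    conn = psycopg2.connect(dsn)
    try:
        # A named cursor keeps the result set on the server so rows can be
        # streamed in batches instead of loaded into memory at once.
        with conn.cursor(name="stream_cur") as cur:
            cur.execute("SELECT my_id, value FROM my_table")
            while True:
                rows = cur.fetchmany(batch_size)
                if not rows:
                    break
                # Transpose row tuples into columns matching the schema order.
                columns = list(zip(*rows))
                arrays = [pa.array(col, type=field.type)
                          for col, field in zip(columns, df_schema)]
                yield pa.RecordBatch.from_arrays(arrays, schema=df_schema)
    finally:
        conn.close()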
I have around 25,000 partitions and most of them only have a single parquet file in them. The total number of rows that I've inserted is around 700,000,000. This incurred the following costs:
- Class A operations: 127,000.
- Class B operations: 109,856,507.
- Download Worldwide Destinations: 300 GiB.
The number of Class A operations makes sense to me when accounting for two writes per partition (one per table) plus an additional SUCCESS file -- these are all inserts. Some partitions probably have more than one file, so the number is a bit higher than 25,000 (the number of partitions) × 3.
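For reference, the expected lower bound from those figures works out like this (plain arithmetic based only on the numbers above):

partitions = 25_000
ops_per_partition = 3  # one write per table (2 tables) + one SUCCESS file
expected_class_a = partitions * ops_per_partition  # 75,000
observed_class_a = 127_000
# ~52,000 extra Class A operations, i.e. partitions that produced more than one file
extra = observed_class_a - expected_class_a
print(expected_class_a, extra)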
I can't figure out where so many Class B operations and so much Download Worldwide Destinations traffic come from. I assume it comes from the implementation of delta-rs.
Can you provide any insights into why the costs are so high and how I would need to change the code to decrease them?
Posting as a community wiki, as per the comment of @JohnHanley:
It has also been mentioned by @gregorp that there's a significant decrease in Class B operations after creating a parquet file.
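If that refers to the Delta log checkpoint (which is itself a Parquet file under _delta_log) -- an assumption on my part, not something stated above -- then writing a checkpoint periodically should cut the number of log objects every subsequent append has to list and download. A minimal sketch with the deltalake Python package, reusing the placeholder bucket URI from the question:

from deltalake import DeltaTable

# Placeholder URI, mirroring the question.
dt = DeltaTable("gs://<my-bucket-name>")

# Write a checkpoint: a Parquet summary of the _delta_log, so readers no longer
# have to fetch every individual JSON commit file (each fetch is a Class B operation).
dt.create_checkpoint()

# Optionally compact the many small per-partition Parquet files into larger ones,
# which also reduces the number of objects later reads have to download.
metrics = dt.optimize.compact()
print(metrics)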