delta-rs package incurs high costs on GCS


I'm using the delta-rs package to store files in a Google Cloud Storage dual-region bucket. I use the following code to store the data:

    def save_data(self, df: Generator[pa.RecordBatch, Any, None]):
        # df_schema and self.max_rows_per_file are defined elsewhere in the class
        write_deltalake(
            "gs://<my-bucket-name>",              # dual-region GCS bucket
            df,
            schema=df_schema,
            partition_by="my_id",                 # ~25,000 distinct partition values
            mode="append",
            max_rows_per_file=self.max_rows_per_file,
            max_rows_per_group=self.max_rows_per_file,
            min_rows_per_group=int(self.max_rows_per_file / 2),
        )

The input data is a generator because I read the data from a Postgres database in batches. I am saving similar data into two different tables, and I also save a SUCCESS file for each uploaded partition.
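For context, the batches can be produced with a server-side cursor, roughly like this (a minimal sketch; the connection string, query, column names, and batch size are assumptions, using psycopg2 and pyarrow):

    import psycopg2  # assumed driver; any cursor that supports batched fetching works
    import pyarrow as pa

    def read_batches(conn_str: str, batch_size: int = 100_000):
        """Yield pyarrow RecordBatches from Postgres in chunks (illustrative query/columns)."""
        conn = psycopg2.connect(conn_str)
        try:
            # A named (server-side) cursor streams rows instead of loading them all at once.
            with conn.cursor(name="batched_read") as cur:
                cur.itersize = batch_size
                cur.execute("SELECT my_id, payload FROM my_table")
                while True:
                    rows = cur.fetchmany(batch_size)
                    if not rows:
                        break
                    ids, payloads = zip(*rows)
                    yield pa.RecordBatch.from_arrays(
                        [pa.array(list(ids)), pa.array(list(payloads))],
                        names=["my_id", "payload"],
                    )
        finally:
            conn.close()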

I have around 25,000 partitions and most of them only have a single parquet file in them. The total number of rows that I've inserted is around 700,000,000. This incurred the following costs:

  • Class A operations: 127,000.
  • Class B operations: 109,856,507.
  • Download Worldwide Destinations: 300 GiB.

The number of Class A operations makes sense to me when accounting for two table writes per partition plus an additional SUCCESS file -- these are inserts. Some partitions probably have more than one file, so the total is a bit higher than 25,000 (the number of partitions) × 3.
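As a rough sanity check on that estimate (the per-partition breakdown follows the reasoning above):

    partitions = 25_000
    # two table writes + one SUCCESS marker per partition, all Class A inserts
    expected_class_a = partitions * 3   # 75,000
    observed_class_a = 127_000          # higher, consistent with some partitions holding several files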

I can't figure out where so many Class B operations and so much Download Worldwide Destinations traffic come from. I assume it comes from the implementation of delta-rs.

Can you provide any insights into why the costs are so high and how I would need to change the code to decrease them?


1 Answer

Robert G (best answer)

Posting as community wiki, per the comment from @JohnHanley:

I would repost your question at the project's issue tracker (github.com/delta-io/delta-rs/issues). We can help you with software coding problems, but not with the design of third-party products. Ask the people who wrote it why it behaves the way it does. My guess is that their software is constantly fetching details on the bucket, objects, and metadata without caching.

@gregorp also mentioned seeing a significant decrease in Class B operations after creating the parquet file locally first.

I changed the implementation to create the parquet file locally and upload it to GCS, and the high number of Class B operations and the download traffic decreased substantially.
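For reference, the reworked upload step looks roughly like this (a minimal sketch; the bucket name, object paths, and helper name are assumptions, using pyarrow and the google-cloud-storage client):

    import pyarrow as pa
    import pyarrow.parquet as pq
    from google.cloud import storage

    def write_partition_locally_then_upload(batches, partition_id: str, bucket_name: str):
        """Write one partition to a local parquet file, then upload it in a single request."""
        table = pa.Table.from_batches(list(batches))
        local_path = f"/tmp/{partition_id}.parquet"
        pq.write_table(table, local_path)

        bucket = storage.Client().bucket(bucket_name)
        blob = bucket.blob(f"my_table/my_id={partition_id}/{partition_id}.parquet")
        blob.upload_from_filename(local_path)  # one Class A operation per file

        # SUCCESS marker per uploaded partition, as in the original workflow
        bucket.blob(f"my_table/my_id={partition_id}/_SUCCESS").upload_from_string("")

Note that uploading raw parquet files this way bypasses the Delta transaction log, so it only fits readers that don't rely on Delta metadata.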