How can I achieve predicate pushdown when using PyArrow + Parquet + Google Cloud Storage?

Question

Asked by user5406764 on 21 April 2021 at 17:28 (777 views)

What I'm really trying to do is this (in Python):

import pyarrow.parquet as pq

# Note the 'columns' predicate...
table = pq.read_table('gs://my_bucket/my_blob.parquet', columns=['a', 'b', 'c'])

First, I don't think gs:// URIs are supported in PyArrow as of v3.0.0, so I have to modify the code to use the fsspec interface (https://arrow.apache.org/docs/python/filesystems.html):

import pyarrow.parquet as pq
import gcsfs

fs = gcsfs.GCSFileSystem(project='my-google-project')
with fs.open('my_bucket/my_blob.parquet', 'rb') as file:
    table = pq.read_table(file.read(), columns=['a', 'b', 'c'])

Does this achieve predicate pushdown? I doubt it, because I'm already reading the whole file with file.read(). Is there a better way to get there?

There is 1 answer

**Pace** · Accepted Answer · 2021-04-21T17:46:57+00:00

Does this work?

import pyarrow.parquet as pq
import gcsfs

fs = gcsfs.GCSFileSystem(project='my-google-project')
table = pq.read_table('gs://my_bucket/my_blob.parquet', columns=['a', 'b', 'c'], filesystem=fs)
