Reading index based range from Parquet File using Python


I'm trying to read a range of rows (say rows 1000 to 5000) from a Parquet file. I've tried pandas with the fastparquet engine and even pyarrow, but can't find any option to do so.

Is there any way to achieve this?


There is 1 answer

taras On

I don't think the current pyarrow version (2.0) supports it.

The closest you can get to slicing the file by row position is the filters argument of read_table. From the documentation:

filters (List[Tuple] or List[List[Tuple]] or None (default)) – Rows which do not match the filter predicate will be removed from scanned data.

Predicates are expressed in disjunctive normal form (DNF), like [[('x', '=', 0), ...], ...]. DNF allows arbitrary boolean logical combinations of single column predicates.
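To make the DNF wording concrete, here is a small pure-Python sketch of how such a filter is interpreted: the outer list is an OR of groups, and each inner list is an AND of (column, op, value) predicates. The matches helper and the OPS table are hypothetical names used only for illustration. This is not how pyarrow evaluates filters internally, and it covers only a subset of the operators pyarrow accepts.

```python
import operator

# Comparison operators handled by this sketch (a subset, for illustration only).
OPS = {
    "=": operator.eq, "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


def matches(row, dnf):
    """Return True if `row` (a dict) satisfies a DNF filter.

    `dnf` is a list of AND-groups; the row matches if ANY group has
    ALL of its (column, op, value) predicates satisfied.
    """
    return any(
        all(OPS[op](row[col], value) for col, op, value in group)
        for group in dnf
    )


# (id >= 1000 AND id <= 5000) OR (foo = 0)
dnf = [[("id", ">=", 1000), ("id", "<=", 5000)], [("foo", "=", 0)]]
```

A flat list of tuples, as in the read_table calls in this answer, is treated as a single AND-group.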

If your dataset has a column foo that identifies the rows you need, use something like this:

import pyarrow.parquet as pq

table = pq.read_table(filename, filters=[('foo', '>', 0)])

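If you happen to have a column id that corresponds to the row index, you can filter on it directly. The tuples in a flat list are ANDed together, and the bounds below are inclusive so that rows 1000 and 5000 are both kept:

table = pq.read_table(filename, filters=[('id', '>=', 1000), ('id', '<=', 5000)])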
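If there is no such id column, one possible workaround (a different technique from filters, and only a sketch) is to use the row-group metadata: read just the row groups that overlap the requested range, then slice the result. The helper names plan_row_groups and read_row_range are hypothetical. The pyarrow calls used (ParquetFile, metadata.row_group(i).num_rows, read_row_groups, Table.slice) exist in recent pyarrow releases, but I have not checked them against 2.0 specifically.

```python
def plan_row_groups(row_counts, start, stop):
    """Return (group_indices, offset, length) covering rows [start, stop).

    `row_counts` lists the number of rows in each row group, in file order.
    `offset` is where `start` falls inside the first selected group.
    Assumes 0 <= start <= stop.
    """
    groups = []
    offset = 0
    first_row = 0  # global index of the first row in the current group
    for i, n in enumerate(row_counts):
        last_row = first_row + n  # exclusive end of the current group
        if first_row < stop and last_row > start:
            if not groups:
                offset = start - first_row
            groups.append(i)
        first_row = last_row
    total = first_row
    length = max(0, min(stop, total) - start)
    return groups, offset, length


def read_row_range(path, start, stop):
    """Read rows [start, stop) from a Parquet file via its row groups."""
    # Imported here so the planning helper above stays dependency-free.
    import pyarrow.parquet as pq

    pf = pq.ParquetFile(path)
    meta = pf.metadata
    counts = [meta.row_group(i).num_rows for i in range(meta.num_row_groups)]
    groups, offset, length = plan_row_groups(counts, start, stop)
    if not groups:
        # Requested range lies entirely outside the file.
        return pf.schema_arrow.empty_table()
    # Only the overlapping row groups are read; the slice trims the edges.
    return pf.read_row_groups(groups).slice(offset, length)
```

With a half-open range, rows 1000 to 5000 inclusive would be read_row_range(filename, 1000, 5001). How much I/O this saves depends on how the file was written: a file with a single large row group is still read in full.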