I'm trying to read a range of data (say rows 1000 to 5000) from a parquet file. I've tried pandas with the fastparquet engine and even pyarrow, but can't seem to find any option to do so.
Is there any way to achieve this?
I don't think the current pyarrow version (2.0) supports it. The closest you can get to slicing the file is the `filters` argument of `read_table`. If your dataset has a column `foo` that lets you select the rows you need, use something like the sketch below.
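A minimal sketch, assuming a file `data.parquet` and a placeholder filter value `'bar'` (neither comes from the question):

```python
import pyarrow.parquet as pq

# Keep only the rows where foo == 'bar'.
# use_legacy_dataset=False makes pyarrow filter individual rows
# instead of only pruning whole row groups.
table = pq.read_table(
    'data.parquet',
    filters=[('foo', '=', 'bar')],
    use_legacy_dataset=False,
)
df = table.to_pandas()
```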
If you happen to have a column `id` corresponding to the row index, you can filter on it directly to read the 1000 to 5000 range.
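Again a sketch with an assumed `data.parquet`; tuples in a single list are combined with AND:

```python
import pyarrow.parquet as pq

# Read only the rows whose id falls in the requested range.
table = pq.read_table(
    'data.parquet',
    filters=[('id', '>=', 1000), ('id', '<=', 5000)],
    use_legacy_dataset=False,
)
df = table.to_pandas()
```

Note this is column-based filtering, not positional slicing, so it only works if such an index-like column was written into the file.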