DataFusion (Apache Arrow): How to lazily read batches of results?


I have a DataFusion query. Instead of waiting for all batches to be processed, I would like to run some code as soon as the first batch is ready.

Here is the current code, which awaits the full result and only then processes it:

let dataframe = ExecutionContext::new().read_parquet(filename)?;
let batches = dataframe.collect().await?;

for batch in batches {
    // Do something with the record batch
    println!("{:?}", batch.schema());
}

I would like something that returns not a promise of an array of RecordBatch, but rather an array of promises of RecordBatch. Does DataFusion provide a way to retrieve only the first batch without having to wait for the whole Parquet file to be processed?

I currently have a loading time of more than 5 minutes at startup, which is just not practical. Using Arrow and Parquet directly would let me access the first batch right away (with a trade-off in API and features).

Edit: A minimal example can be found in the DataFusion git repository


There is 1 answer

Andy Grove

There have been some recent changes in the master branch since the 2.0.0 release to better support async and streaming, so it would be worth checking the latest code. However, the DataFrame collect method loads all results into memory before returning, so it might not be the best approach here.
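To make the difference between collecting and streaming concrete, here is a minimal sketch that uses only the Rust standard library, so it does not depend on any particular DataFusion version. `Batch`, `make_batch`, `collect_all` and `stream_batches` are hypothetical stand-ins invented for illustration and are not DataFusion APIs. The eager function mirrors what `collect()` does, while the streaming one hands each batch to the consumer as soon as it has been produced.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for an Arrow RecordBatch (not a DataFusion type).
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub id: usize,
    pub rows: Vec<i64>,
}

/// Placeholder for the expensive work of decoding one batch.
fn make_batch(id: usize) -> Batch {
    Batch { id, rows: vec![id as i64; 3] }
}

/// Eager, like `collect()`: nothing is returned until every batch exists.
pub fn collect_all(num_batches: usize) -> Vec<Batch> {
    (0..num_batches).map(make_batch).collect()
}

/// Lazy: batches are produced on a worker thread and handed over one at a time.
pub fn stream_batches(num_batches: usize) -> mpsc::Receiver<Batch> {
    // The channel buffers at most one batch, so the producer cannot run far
    // ahead of the consumer and memory use stays bounded.
    let (tx, rx) = mpsc::sync_channel(1);
    thread::spawn(move || {
        for id in 0..num_batches {
            // Stop producing if the consumer has dropped the receiver.
            if tx.send(make_batch(id)).is_err() {
                break;
            }
        }
    });
    rx
}

fn main() {
    let rx = stream_batches(4);

    // The first batch can be used as soon as it has been produced,
    // without waiting for the remaining ones.
    let first = rx.recv().expect("producer sent no batches");
    println!("first batch ready: {:?}", first);

    // The remaining batches arrive one by one as they are produced.
    for batch in rx {
        println!("next batch: {:?}", batch);
    }
}
```

With an async engine the same shape would typically be a stream of record batches polled in a loop rather than a channel and a thread. Whether and how a given DataFusion release exposes such a stream should be checked against its documentation.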

It would probably be a good idea to ask about this on the Arrow mailing list as well.