Is there a way to read files from a remote server using Arrow in C++?


Reading CSV or Parquet files from the local file system is very easy, but it seems that Arrow does not support reading files from a remote server given its IP address. Is there a way to achieve this, e.g. reading a subset of the columns of a Parquet file from a remote server (with a path like "ip://path/to/remote/file")? Thanks.


There are 2 answers

3
li.davidm On

There is an open issue for this if you would like to contribute or follow development: https://issues.apache.org/jira/browse/ARROW-7594

(By 'remote server' I assume you mean over HTTP(s) or similar. If you're looking for a custom client-server protocol, check out Arrow Flight.)

0
Daniel Darabos On

pyarrow.dataset.dataset() has a filesystem argument through which it supports many remote file systems.

See the Arrow documentation on file systems. You can also pass in an fsspec file system, and fsspec has implementations for many storage backends.

For example, if your Parquet file is sitting on a web server, you could use the fsspec HTTP file system:

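import pyarrow.dataset as ds
import fsspec.implementations.http

# fsspec file system that reads over HTTP(S)
http = fsspec.implementations.http.HTTPFileSystem()
# Open the remote Parquet file as a dataset through that file system
d = ds.dataset('http://localhost:8000/test.parquet', filesystem=http)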
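
Since the question asks about reading only a subset of columns, here is a minimal sketch of how that might look on top of the dataset above. The helper name read_remote_columns and the column names "a" and "b" are hypothetical, made up for illustration. The sketch assumes the file is Parquet; fetching only the requested columns over the network also depends on the web server supporting HTTP range requests.

```python
def read_remote_columns(url, columns, filesystem=None):
    """Read only `columns` from the Parquet file at `url` into a pyarrow.Table.

    Hypothetical helper for illustration; not part of pyarrow or fsspec.
    """
    # Imported here so the sketch can be loaded without the optional packages.
    import pyarrow.dataset as ds

    if filesystem is None:
        # Default to the fsspec HTTP file system used in the answer above.
        import fsspec.implementations.http
        filesystem = fsspec.implementations.http.HTTPFileSystem()

    dataset = ds.dataset(url, format="parquet", filesystem=filesystem)
    # Column projection: only the named columns end up in the returned table.
    return dataset.to_table(columns=columns)


# Example call (URL and column names are placeholders):
# table = read_remote_columns("http://localhost:8000/test.parquet", ["a", "b"])
# print(table.schema)
```

Passing a different file system object (for example one of the pyarrow.fs file systems) through the filesystem parameter should work the same way, since ds.dataset() accepts both kinds.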