I'm using Python with the pyarrow library and I'd like to write a pandas DataFrame to HDFS. Here is the code I have:
import pandas as pd
import pyarrow as pa
fs = pa.hdfs.connect(namenode, port, username, kerb_ticket)
df = pd.DataFrame(...)
table = pa.Table.from_pandas(df)
According to the documentation, I should use the following code to write a pyarrow.Table to HDFS:
import pyarrow.parquet as pq
pq.write_table(table, 'filename.parquet')
What I don't understand is where I should use my connection (fs): if I don't pass it to write_table, how does it know where HDFS is?
Based on the documentation (https://arrow.apache.org/docs/python/api/formats.html#parquet-files), you can use either the write_table or the write_to_dataset function:
write_table
write_table takes multiple parameters; the ones relevant here are:
table - the pyarrow.Table to write
where - the destination, either a path string or an open file object
filesystem - the filesystem to write through (default is None); this is where your fs connection goes
Example
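A minimal sketch, reusing the fs connection from your question (the HDFS path is a placeholder):

import pyarrow.parquet as pq
# open the destination file on HDFS through the connection and write the table into it
with fs.open('/user/me/filename.parquet', 'wb') as f:
    pq.write_table(table, f)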
or
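equivalently, pass the connection through the filesystem parameter and give write_table a plain path (again a placeholder):

import pyarrow.parquet as pq
pq.write_table(table, '/user/me/filename.parquet', filesystem=fs)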
write_to_dataset
You can use write_to_dataset when you want to partition the data based on a certain column in the table. Example:
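A sketch, assuming a hypothetical partition column named 'year' and a placeholder root path:

import pyarrow.parquet as pq
# creates one sub-directory per distinct value of the partition column under the root path
pq.write_to_dataset(table, '/user/me/dataset', partition_cols=['year'], filesystem=fs)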