I'm using Python with the pyarrow library and I'd like to write a pandas DataFrame to HDFS. Here is the code I have:
import pandas as pd
import pyarrow as pa
fs = pa.hdfs.connect(namenode, port, username, kerb_ticket)
df = pd.DataFrame(...)
table = pa.Table.from_pandas(df)
According to the documentation, I should use the following code to write a pyarrow.Table to HDFS:
import pyarrow.parquet as pq
pq.write_table(table, 'filename.parquet')
What I don't understand is where I should use my connection (fs): if I don't pass it to write_table, how does it know where HDFS is?
Based on the documentation (https://arrow.apache.org/docs/python/api/formats.html#parquet-files), you can use either the write_table or the write_to_dataset function:
write_table
write_table takes multiple parameters; the ones relevant here are:
table - the pyarrow.Table to write
where - the destination, either a path string or an open file object
filesystem - the filesystem to write through (default is None); this is where your fs connection goes
Example
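A minimal sketch, reusing the fs connection from your question (the HDFS path is a placeholder):

import pyarrow.parquet as pq
# open the destination file on HDFS through the connection and write the table into it
with fs.open('/user/me/filename.parquet', 'wb') as f:
    pq.write_table(table, f)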
or
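equivalently, pass the connection through the filesystem parameter and give write_table a plain path (again a placeholder):

import pyarrow.parquet as pq
pq.write_table(table, '/user/me/filename.parquet', filesystem=fs)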
write_to_dataset
You can use write_to_dataset when you want to partition the data based on a certain column in the table. Example:
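A sketch, assuming a hypothetical partition column named 'year' and a placeholder root path:

import pyarrow.parquet as pq
# creates one sub-directory per distinct value of the partition column under the root path
pq.write_to_dataset(table, '/user/me/dataset', partition_cols=['year'], filesystem=fs)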