Connect to a Delta table in HDFS using Python without PySpark


I have a Delta table stored in HDFS as a Hive table. I need to connect to the table and load its latest version. I was able to connect to HDFS using the pyarrow library, but it loads the data for every version of the table. Here is my code:

import pyarrow as pa

# connect using pyarrow's (legacy) HDFS API
hdfs = pa.hdfs.connect(host=ip, port=port)

# this reads every parquet file under the table directory
dt = hdfs.read_parquet('/path/to/file/in/hdfs')
df = dt.to_pandas()

But this returns the entire historical data in the table, because pyarrow simply reads every parquet file under the directory, including files that later versions have removed. Is there an option to tell pyarrow that it is loading a Delta table?
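One workaround I considered is replaying the table's _delta_log commit files myself, since they record which data files belong to the latest version. Below is a minimal sketch of that idea, reusing the hdfs connection from above; it assumes the commit JSONs are small enough to read into memory, and it ignores parquet checkpoint files, so it is an illustration rather than a complete Delta reader:

import json
import pyarrow as pa

hdfs = pa.hdfs.connect(host=ip, port=port)

table_path = '/path/to/file/in/hdfs'
log_dir = table_path + '/_delta_log'

# replay every commit JSON in order to find the files still live
# in the latest version (checkpoint parquet files are skipped)
active_files = set()
for log_file in sorted(f for f in hdfs.ls(log_dir) if f.endswith('.json')):
    with hdfs.open(log_file, 'rb') as fh:
        for line in fh.read().decode('utf-8').splitlines():
            action = json.loads(line)
            if 'add' in action:
                # paths in the log are relative to the table root
                active_files.add(action['add']['path'])
            elif 'remove' in action:
                active_files.discard(action['remove']['path'])

# read only the parquet files belonging to the latest version;
# concat_tables assumes all files share one schema
tables = [hdfs.read_parquet(table_path + '/' + p) for p in active_files]
df = pa.concat_tables(tables).to_pandas()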

Another approach I tried is the deltalake library, but with it I was not able to connect to an HDFS location. Please check the code below:

from deltalake import DeltaTable
table_path_hdfs ="hdfs://ip:port/path/to/file/in/hdfs"
dt = DeltaTable(table_path_hdfs)

While running this code I get the error: deltalake.PyDeltaTableError: Delta-rs must be build with feature 'hdfs' to support loading from: hdfs://

Is there a way to build delta-rs with HDFS support?
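For reference, if a build with the hdfs feature were available, my understanding from the deltalake Python docs is that the existing API would already do what I need; I have only been able to verify the calls below against local paths, not HDFS:

from deltalake import DeltaTable

table_path_hdfs = "hdfs://ip:port/path/to/file/in/hdfs"

# the latest snapshot is loaded by default
dt = DeltaTable(table_path_hdfs)
print(dt.version())   # current version number
print(dt.files())     # only the data files live in this version

df = dt.to_pyarrow_table().to_pandas()

# a specific historical version can be requested instead
dt_v0 = DeltaTable(table_path_hdfs, version=0)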

Can anybody suggest any other libraries for this?
