I have a Delta table in HDFS, stored as a Hive table. I need to connect to the table and load only its latest version. I was able to connect to HDFS using the pyarrow library, but it loads every version of the data, not just the latest one. Here is my code:
import pyarrow as pa

# pa.hdfs is pyarrow's (legacy) HDFS interface
hdfs = pa.hdfs.connect(host=ip, port=port)

# Reads every parquet file under the path, including files from older versions
dt = hdfs.read_parquet('/path/to/file/in/hdfs')
dt.to_pandas()
But here I get the entire historical data in the table, because every parquet file under the path is read, including files that belong to older versions. Is there an option to tell pyarrow that I am loading a Delta table, so that only the latest snapshot is read?
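In case it helps, this is the manual workaround I have been sketching: replay the commit files under _delta_log myself to find which parquet files belong to the latest snapshot, and read only those with pyarrow. It is untested and assumes the log has no checkpoint files (i.e. all the JSON commits are still present); ip, port and the table path are placeholders for my setup.

import json
import posixpath
import pyarrow as pa

hdfs = pa.hdfs.connect(host=ip, port=port)
table_root = '/path/to/file/in/hdfs'
log_dir = posixpath.join(table_root, '_delta_log')

# Replay the commits in order; each line of a commit file is one JSON action
active = set()
for log_file in sorted(f for f in hdfs.ls(log_dir) if f.endswith('.json')):
    with hdfs.open(log_file) as fh:
        for line in fh.read().decode('utf-8').splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if 'add' in action:        # file added in this commit
                active.add(action['add']['path'])
            elif 'remove' in action:   # file logically deleted in this commit
                active.discard(action['remove']['path'])

# 'add' paths are relative to the table root (and may be URL-encoded)
tables = [hdfs.read_parquet(posixpath.join(table_root, p)) for p in active]
dt = pa.concat_tables(tables)

I am not sure how robust this is (checkpoints, URL-encoded partition paths, schema evolution), which is why I would prefer a library that actually understands the Delta protocol.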
Another approach I tried is the deltalake library, but there I was not able to connect to an HDFS location. Please check the code below:
from deltalake import DeltaTable

table_path_hdfs = "hdfs://ip:port/path/to/file/in/hdfs"
dt = DeltaTable(table_path_hdfs)
While running this code I get the following error:

deltalake.PyDeltaTableError: Delta-rs must be build with feature 'hdfs' to support loading from: hdfs://
Is there a way to build delta-rs with HDFS support enabled?
Or can anybody suggest another library for this?
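For context, I know this works from Spark with the Delta Lake package, along the lines of the sketch below (the delta-core coordinates are an assumption and have to match your Spark/Scala build), but I am hoping for something lighter-weight than a full Spark session.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName('read-delta')
    # assumption: pick the delta-core artifact matching your Spark/Scala version
    .config('spark.jars.packages', 'io.delta:delta-core_2.12:2.4.0')
    .config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension')
    .config('spark.sql.catalog.spark_catalog',
            'org.apache.spark.sql.delta.catalog.DeltaCatalog')
    .getOrCreate()
)

# Spark resolves the Delta log itself and reads only the latest snapshot
df = spark.read.format('delta').load('hdfs://ip:port/path/to/file/in/hdfs')
df.toPandas()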