How to read from Azure Blob Storage with Python delta-rs


I'd like to use the Python bindings to delta-rs to read from my blob storage.

Currently I am kind of lost, since I cannot figure out how to configure the filesystem on my local machine. Where do I have to put my credentials?

Can I use adlfs for this?

from adlfs import AzureBlobFileSystem

fs = AzureBlobFileSystem(
    account_name="...",
    account_key="..."
)

and then use the fs object?


There are 4 answers

Answered by rtyler (accepted)

Unfortunately we don't have great documentation around this at the moment. You should be able to set the AZURE_STORAGE_ACCOUNT and AZURE_STORAGE_SAS environment variables, as in this integration test.

That will ensure the Python bindings can access the table metadata, but fetching the data for a query is typically done through Pandas, and I'm not sure whether Pandas will honour these variables as well (not an ADLSv2 user myself).
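
A minimal sketch of that approach: the account name, SAS token, container and table path below are hypothetical placeholders, and the URI format is assumed to follow the abfss style shown in a later answer.

import os
from deltalake import DeltaTable

# Assumption: delta-rs reads these variables when resolving the Azure storage backend
os.environ["AZURE_STORAGE_ACCOUNT"] = "mystorageaccount"   # hypothetical account name
os.environ["AZURE_STORAGE_SAS"] = "sv=...&sig=..."         # hypothetical SAS token

# Hypothetical table URI; adjust container, account and path to your layout
table_uri = "abfss://my-container@mystorageaccount.dfs.core.windows.net/path/to/table"

dt = DeltaTable(table_uri)
print(dt.files())                         # list the parquet files that make up the table
df = dt.to_pyarrow_table().to_pandas()    # materialise the data via pyarrow/pandas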

Answered by Atharva Khutale

I don't know about delta-rs, but you can use this object directly with pandas.

import pandas as pd
from adlfs import AzureBlobFileSystem

abfs = AzureBlobFileSystem(account_name="account_name", account_key="access_key", container_name="name_of_container")
df = pd.read_parquet("path/of/file/with/container_name/included", filesystem=abfs)

Answered by Will W

You can also use storage_options, e.g.:

import deltalake as dl

delta_url = f"{protocol}://{container_name}@{storage_account_name}.dfs.core.windows.net/{delta_path}"

# Pass the access key as a storage option (it can also be set via an environment variable)
storage_options = {"ACCESS_KEY": f"{access_key}"}

# Read the Delta table from the storage account
df = dl.DeltaTable(delta_url, storage_options=storage_options).to_pyarrow_table()

The available options are described here.

Answered by Simen Holmestad

One possible workaround is to download the Delta Lake files to a temporary directory and read them with python-delta-rs, with something like this:

import os
import tempfile

from azure.storage.blob import BlobServiceClient
from deltalake import DeltaTable

def get_blobs_for_folder(container_client, blob_storage_folder_path):
    blob_iter = container_client.list_blobs(name_starts_with=blob_storage_folder_path)
    blob_names = []
    for blob in blob_iter:
        if "." in blob.name:
            # To just get files and not directories, there might be a better way to do this
            blob_names.append(blob.name)

    return blob_names


def download_blob_files(container_client, blob_names, local_folder):
    for blob_name in blob_names:
        local_filename = os.path.join(local_folder, blob_name)
        local_file_dir = os.path.dirname(local_filename)
        if not os.path.exists(local_file_dir):
            os.makedirs(local_file_dir)

        with open(local_filename, 'wb') as f:
            f.write(container_client.download_blob(blob_name).readall())


def read_delta_lake_file_to_df(blob_storage_path, access_key):
    blob_storage_url = "https://your-blob-storage"
    blob_service_client = BlobServiceClient(blob_storage_url, credential=access_key)
    container_client = blob_service_client.get_container_client("your-container-name")

    blob_names = get_blobs_for_folder(container_client, blob_storage_path)
    with tempfile.TemporaryDirectory() as tmp_dirpath:
        download_blob_files(container_client, blob_names, tmp_dirpath)
        local_filename = os.path.join(tmp_dirpath, blob_storage_path)
        dt = DeltaTable(local_filename)
        df = dt.to_pyarrow_table().to_pandas()
    return df
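
A hypothetical call, assuming the table sits under a "my-delta-table" folder in the container and access_key holds a valid storage account key:

df = read_delta_lake_file_to_df("my-delta-table", access_key)
print(df.head())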