Register dataset in AzureML Python SDK v2 from a dataframe


I am looking to upgrade to the new Azure ML Python SDK v2.

However, I can't replicate registering a dataset (now called a data asset) directly from a dataframe. It seems that the only option now is to create a data asset from saved files (either local or in ADLS).

In SDK v1, I was just using Dataset.Tabular.register_pandas_dataframe:

from azureml.core import Dataset

dataset = Dataset.Tabular.register_pandas_dataframe(
    dataframe=dataframe,
    target=target,
    name=name,
)

The documentation I found on the topic: "Create a data asset: Table type" and "Create a tabular dataset/data asset".

Is there any way to register the dataset without having to store the parquet files locally?

I was thinking about writing the parquet files to ADLS and creating a dataset from those files. However, I can't find how to do so in the documentation.

I have tried to upload the parquet files as:

from azure.storage.blob import BlobServiceClient

adls_url = f"https://{account_name}.blob.core.windows.net"

# Connect to the Azure blob service
blob_service_client = BlobServiceClient(
    account_url=adls_url,
    credential=credential,
)
container_client = blob_service_client.get_container_client(
    container_name
)

# Prep data for saving: with no path given, to_parquet() returns bytes
data = df.to_parquet()

# Upload the dataframe to ADLS
container_client.upload_blob(
    output_path,
    data,
    overwrite=True,
    encoding='utf-8'
)

This works if I upload a CSV. With parquet I get: ResourceExistsError: The requested operation is not allowed in the current state of the entity.

Either way, that would feel like a workaround.

As a last resort, I guess I could still use azureml.core.Dataset. Using v1 and v2 together is not recommended, but there is backwards compatibility.

Answer by JayashankarGS (best answer)

In Python SDK v2, these are the locations you can register a data asset from:

Location                             Examples
A path on your local computer        ./home/username/data/my_data
A path on a Datastore                azureml://datastores/<data_store_name>/paths/
A path on a public http(s) server    https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv
A path on Azure Storage              (Blob) wasbs://@.blob.core.windows.net/<path_to_data>/
                                     (ADLS gen2) abfss://<file_system>@<account_name>.dfs.core.windows.net/
                                     (ADLS gen1) adl://.azuredatalakestore.net/<path_to_data>/

Each asset is registered with one of the asset types below.

Type                                 API
File (reference a single file)       uri_file
Folder (reference a folder)          uri_folder
Table (reference a data table)       mltable

So you need to save the dataframe somewhere first and then register it from one of the locations above.
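As a rough illustration of the two tables above, the helpers below build the datastore and ADLS gen2 path strings and guess an asset type from the shape of the path. The function names (datastore_uri, abfss_uri, suggest_asset_type) are hypothetical and not part of the SDK, and the trailing-slash rule is only a convention assumed for this sketch.

```python
def datastore_uri(datastore_name: str, relative_path: str = "") -> str:
    """Build an azureml:// URI for a path inside a registered datastore."""
    return f"azureml://datastores/{datastore_name}/paths/{relative_path.lstrip('/')}"


def abfss_uri(file_system: str, account_name: str, relative_path: str = "") -> str:
    """Build an ADLS gen2 URI following the abfss:// pattern from the table."""
    return (
        f"abfss://{file_system}@{account_name}.dfs.core.windows.net/"
        f"{relative_path.lstrip('/')}"
    )


def suggest_asset_type(path: str) -> str:
    """Assumed convention: a trailing slash means a folder, otherwise a single file."""
    return "uri_folder" if path.endswith("/") else "uri_file"


folder = datastore_uri("data_store_name", "mldata/")
print(folder, suggest_asset_type(folder))
```

The returned strings can be passed as the path and type of a data asset; the mltable type is left out here because it needs an MLTable definition rather than just a path.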

My idea here is to save the pandas dataframe to a parquet file in the datastore, which is a storage account associated with your ML workspace, and use that Azure datastore path to register.

First, create a datastore in your ML workspace mounted to the azureml container.

Next, get the credentials of the storage account associated with your ML workspace and use the code below to save the parquet file in the above container.

jgsai4079545193 is my storage account.

df.to_parquet("abfs://azureml@jgsai4079545193.dfs.core.windows.net/mldata/mydata.parquet", storage_options={"connection_string": conn_str})

If you have different credentials, refer to this documentation and use appropriate storage_options.

This will create a parquet file inside the mldata folder.

Now, register this data by giving this path and the datastore mounted to the azureml container.

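from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import Data

VERSION = "1"

# Folder on the datastore that holds the parquet file
path = "azureml://datastores/data_store_name/paths/mldata/"

my_data = Data(
    path=path,
    type=AssetTypes.URI_FOLDER,
    description="<ADD A DESCRIPTION HERE>",
    name="Parquet_data",
    version=VERSION,
)

# ml_client is an authenticated azure.ai.ml.MLClient for your workspace
dataref = ml_client.data.create_or_update(my_data)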

The data asset is now registered.
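If you want something closer to the one-call feel of register_pandas_dataframe from v1, the two steps above could be wrapped roughly as follows. This is only a sketch: register_dataframe and split_target are hypothetical names, the abfs:// URL layout and storage_options are assumed to match your storage account and credentials (and need the adlfs package installed), and the Azure SDK imports are done inside the function so the small path helper can be used on its own.

```python
import posixpath


def split_target(blob_path: str) -> tuple[str, str]:
    """Split 'mldata/mydata.parquet' into the folder 'mldata/' and the file name."""
    folder, file_name = posixpath.split(blob_path.strip("/"))
    return (folder + "/" if folder else ""), file_name


def register_dataframe(ml_client, df, *, name, version, datastore_name,
                       container, account_name, blob_path, storage_options):
    """Write df as parquet into the datastore's container, then register its folder."""
    # Imported lazily so split_target works without the Azure SDK installed.
    from azure.ai.ml.constants import AssetTypes
    from azure.ai.ml.entities import Data

    folder, _ = split_target(blob_path)

    # Step 1: write the dataframe straight to the storage account.
    # Assumption: the container is the one the datastore is mounted to.
    df.to_parquet(
        f"abfs://{container}@{account_name}.dfs.core.windows.net/{blob_path.strip('/')}",
        storage_options=storage_options,
    )

    # Step 2: register the folder that now contains the parquet file.
    asset = Data(
        path=f"azureml://datastores/{datastore_name}/paths/{folder}",
        type=AssetTypes.URI_FOLDER,
        name=name,
        version=version,
    )
    return ml_client.data.create_or_update(asset)
```

Because the asset points at the folder, any other parquet files already in that folder become part of the asset too, so a dedicated folder per asset is probably the safer choice.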

To consume, use the code below.

import mltable

data_asset = ml_client.data.get("Parquet_data", version="1")

# Point mltable at the registered folder
path = {"folder": data_asset.path}

tbl = mltable.from_parquet_files(paths=[path])
df = tbl.to_pandas_dataframe()
print(df)

Output:

      Name  Age           City  Salary
0    Alice   25       New York   60000
1      Bob   30  San Francisco   80000
2  Charlie   22    Los Angeles   55000
3    David   35        Seattle   90000
4      Eva   28        Chicago   70000