Register dataset in AzureML Python SDK v2 from a dataframe

652 views Asked by Vanaclocha At 13 November 2023 at 13:35

I am looking to upgrade to the new Azure ML Python SDK v2.

However, I can't replicate registering a dataset (now called a data asset) directly from a dataframe. It seems that the only option now is to create a data asset from saved files (either local or in ADLS).

In SDK v1, I was just using Dataset.Tabular.register_pandas_dataframe:

from azureml.core import Dataset

dataset = Dataset.Tabular.register_pandas_dataframe(
    dataframe=dataframe,
    target=target,
    name=name,
)

The documentation I found on the topic: "Create a data asset: Table type" and "Create a tabular dataset/data asset".

Is there any way to register the dataset without having to store the parquet files locally?

I was thinking about writing the parquet files to ADLS and creating a data asset from those files. However, I can't find how to do so in the documentation.

I have tried to upload the parquet files as follows:

adls_url = f"https://{account_name}.blob.core.windows.net"

# Connect to the Azure blob service
blob_service_client = BlobServiceClient(
    account_url=adls_url,
    credential=credential
)
container_client = blob_service_client.get_container_client(
    container_name
)

# Prep data for saving
data = pd.DataFrame.to_parquet(df)

# Upload the dataframe to ADLS
container_client.upload_blob(
    output_path,
    data,
    overwrite=True,
    encoding='utf-8'
)

This works if I upload a CSV. With parquet I get: ResourceExistsError: The requested operation is not allowed in the current state of the entity.

Either way, that would feel like a workaround.

As a last resort, I guess I could still use azureml.core.Dataset. Using v1 and v2 together is not recommended, but there is backwards compatibility.

**JayashankarGS** · Accepted Answer · 2023-11-14T09:50:43+00:00

In Python SDK v2, you can register a data asset from any of the following locations.

Location	Examples
A path on your local computer	./home/username/data/my_data
A path on a Datastore	azureml://datastores/<data_store_name>/paths/
A path on a public http(s) server	https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv
A path on Azure Storage (Blob)	wasbs://<containername>@<accountname>.blob.core.windows.net/<path_to_data>/
A path on Azure Storage (ADLS gen2)	abfss://<file_system>@<account_name>.dfs.core.windows.net/
A path on Azure Storage (ADLS gen1)	adl://<accountname>.azuredatalakestore.net/<path_to_data>/

The asset must be one of the following types.

Type	Description	API
File	Reference a single file	uri_file
Folder	Reference a folder	uri_folder
Table	Reference a data table	mltable

So you need to save the data somewhere first and then register it using one of the ways above.
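
The datastore row in the table above is the form used in the rest of this answer. As a minimal sketch, the URI can be assembled with a small helper. The name `datastore_uri` is hypothetical and not part of the SDK; it only formats a string.

```python
def datastore_uri(datastore_name: str, relative_path: str = "") -> str:
    """Build an azureml:// URI for a path inside a registered datastore.

    Hypothetical helper for illustration: it only formats a string and
    does not check that the datastore or the path actually exists.
    """
    if not datastore_name:
        raise ValueError("datastore_name must not be empty")
    # Strip a leading slash so "paths/" is never followed by "//".
    return f"azureml://datastores/{datastore_name}/paths/{relative_path.lstrip('/')}"


print(datastore_uri("data_store_name", "mldata/"))
# azureml://datastores/data_store_name/paths/mldata/
```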

My idea here is to save the pandas dataframe to a parquet file in the datastore, which is a storage account associated with your ML workspace, and use that Azure datastore path to register.

First, create a datastore in your ML workspace mounted to the azureml container.

Next, get the credentials of the storage account associated with your ML workspace and use the code below to save the parquet file in the above container.

jgsai4079545193 is my storage account.

df.to_parquet("abfs://[email protected]/mldata/mydata.parquet", storage_options={"connection_string":conn_str})

If you have different credentials, refer to this documentation and use appropriate storage_options.

This will create a parquet file inside the mldata folder.
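
The save step can be wrapped in a small function, sketched below under a few assumptions: the `adlfs` package is installed (pandas passes `abfs://` URLs to it through fsspec), a parquet engine such as `pyarrow` is available, and the short `abfs://<container>/<path>` URL form is used so that the storage account is taken from the connection string. The function name `save_dataframe_as_parquet` and its parameters are illustrative, not part of any SDK.

```python
def save_dataframe_as_parquet(df, container: str, blob_path: str, conn_str: str) -> str:
    """Write a pandas DataFrame to Azure storage as a parquet file.

    Illustrative helper, not an SDK function. Assumes `adlfs` and a
    parquet engine such as `pyarrow` are installed.
    """
    # Short fsspec URL form: the storage account comes from the connection string.
    url = f"abfs://{container}/{blob_path.lstrip('/')}"
    # pandas forwards storage_options to the fsspec filesystem (adlfs here),
    # so no local parquet file has to be created first.
    df.to_parquet(url, storage_options={"connection_string": conn_str})
    return url
```

For example, `save_dataframe_as_parquet(df, "azureml", "mldata/mydata.parquet", conn_str)` would target the `mldata` folder of the `azureml` container.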

Now, register this data by giving this path and the datastore mounted to the azureml container.

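from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import Data

# ml_client is assumed to be an authenticated azure.ai.ml.MLClient for your workspace.
VERSION = "1"

# Folder in the datastore that holds the parquet file(s).
path = "azureml://datastores/data_store_name/paths/mldata/"

my_data = Data(
    path=path,
    type=AssetTypes.URI_FOLDER,
    description="<ADD A DESCRIPTION HERE>",
    name="Parquet_data",
    version=VERSION,
)

dataref = ml_client.data.create_or_update(my_data)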

The data asset is now registered.

To consume, use the code below.

import mltable

data_asset = ml_client.data.get("Parquet_data", version="1")

# data_asset.path is the folder URI stored on the registered data asset.
path = {
  'folder': data_asset.path
}

tbl = mltable.from_parquet_files(paths=[path])
df = tbl.to_pandas_dataframe()
print(df)

Output:

	Name	Age	City	Salary
0	Alice	25	New York	60000
1	Bob	30	San Francisco	80000
2	Charlie	22	Los Angeles	55000
3	David	35	Seattle	90000
4	Eva	28	Chicago	70000
