I am looking to upgrade to the new Azure AML Python SDK v2.
However, I can't replicate registering a dataset (now called a data asset) directly from a dataframe. It seems that the only option now is to create a data asset from saved files (either local or in ADLS).
In SDK v1, I was just using Dataset.Tabular.register_pandas_dataframe:
from azureml.core import Dataset

dataset = Dataset.Tabular.register_pandas_dataframe(
    dataframe=dataframe,
    target=target,
    name=name,
)
The documentation I found on the topic: "Create a data asset: Table type" and "Create a tabular dataset/data asset".
Is there any way to register the dataset without having to store the parquet files locally?
I was thinking about writing the parquet files to ADLS and creating a dataset from those files. However, I can't find how to do so in the documentation.
I have tried to upload the parquet files as:
adls_url = f"https://{account_name}.blob.core.windows.net"

# Connect to the Azure blob service
blob_service_client = BlobServiceClient(
    account_url=adls_url,
    credential=credential
)
container_client = blob_service_client.get_container_client(
    container_name
)

# Prep data for saving
data = pd.DataFrame.to_parquet(df)

# Upload the dataframe to ADLS
container_client.upload_blob(
    output_path,
    data,
    overwrite=True,
    encoding='utf-8'
)
This works if I upload a CSV. With parquet I get: ResourceExistsError: The requested operation is not allowed in the current state of the entity.
Anyway, that would look like a workaround.
I guess as a last resort, I could still use azureml.core.Dataset. They do not recommend using v1 and v2 together, but there is backwards compatibility.
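In Python SDK v2, a data asset is registered from a path. The path can be a local file or folder, a path on a datastore (azureml://datastores/<datastore_name>/paths/<path>), a public http(s) URL, or a cloud storage URI (wasbs://, abfss://, adl://).
The asset type is one of uri_file, uri_folder, or mltable.
So you need to save the dataframe somewhere first and then register it using one of these path types.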
My idea here is to save the pandas dataframe to a parquet file in the datastore, which is a storage account associated with your ML workspace, and use that Azure datastore path to register.
First, create a datastore in your ML workspace mounted to the azureml container. Next, get the credentials of the storage account associated with your ML workspace and use the code below to save the parquet file in the above container.
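The two steps above could look roughly like the following. This is only a sketch under assumptions: it assumes the `azure-ai-ml`, `adlfs`, and `pyarrow` packages are installed, that you authenticate with the storage account key, and that `ml_client` is an already constructed `MLClient`. The function names and the file name `data.parquet` are hypothetical.

```python
def build_abfs_url(container: str, blob_path: str) -> str:
    """Return an adlfs-style URL such as abfs://azureml/mldata/data.parquet."""
    return f"abfs://{container}/{blob_path.lstrip('/')}"


def create_blob_datastore(ml_client, datastore_name, account_name, account_key,
                          container="azureml"):
    """Create (or update) a blob datastore pointing at the given container."""
    # Imported lazily so the other helpers work without the Azure ML SDK installed.
    from azure.ai.ml.entities import AccountKeyConfiguration, AzureBlobDatastore

    datastore = AzureBlobDatastore(
        name=datastore_name,  # hypothetical name, pick your own
        account_name=account_name,
        container_name=container,
        credentials=AccountKeyConfiguration(account_key=account_key),
    )
    return ml_client.datastores.create_or_update(datastore)


def save_dataframe_as_parquet(df, account_name, account_key,
                              container="azureml",
                              blob_path="mldata/data.parquet"):
    """Write the dataframe straight to blob storage, with no local file."""
    url = build_abfs_url(container, blob_path)
    # pandas passes abfs:// URLs to fsspec/adlfs; storage_options carries the credentials.
    df.to_parquet(
        url,
        storage_options={"account_name": account_name, "account_key": account_key},
    )
    return url
```

Calling `save_dataframe_as_parquet(df, account_name, account_key)` with the defaults writes to `mldata/data.parquet` inside the `azureml` container.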
jgsai4079545193 is my storage account. If you have different credentials, refer to this documentation and use appropriate storage_options. This will create a parquet file inside the mldata folder.
Now, register this data by giving this path and the datastore mounted to the azureml container. The data asset is now registered.
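For reference, the registration step could be sketched like this. It assumes `ml_client` is an `MLClient` for your workspace and that the parquet file already exists on the datastore. The helper names, the asset name, and the version are hypothetical.

```python
def datastore_uri(datastore_name: str, relative_path: str) -> str:
    """Build the short-form azureml:// URI for a path on a workspace datastore."""
    return f"azureml://datastores/{datastore_name}/paths/{relative_path.lstrip('/')}"


def register_parquet_asset(ml_client, datastore_name, relative_path,
                           asset_name, version="1"):
    """Register a single parquet file on a datastore as a uri_file data asset."""
    # Imported lazily so datastore_uri() is usable without the Azure ML SDK installed.
    from azure.ai.ml.constants import AssetTypes
    from azure.ai.ml.entities import Data

    data_asset = Data(
        name=asset_name,
        version=version,
        type=AssetTypes.URI_FILE,  # use URI_FOLDER for a folder of parquet files
        path=datastore_uri(datastore_name, relative_path),
        description="Parquet file written from a pandas dataframe",
    )
    return ml_client.data.create_or_update(data_asset)
```

For example, `register_parquet_asset(ml_client, "<your_datastore>", "mldata/data.parquet", "my_dataframe_asset")` registers the file under the hypothetical name `my_dataframe_asset`.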
To consume, use the code below.
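A minimal sketch of reading the asset back into pandas, assuming the `azureml-fsspec` package is installed so that pandas can resolve `azureml://` paths, and that `ml_client` is an `MLClient` for the workspace. The function name and the `reader` parameter are hypothetical; `reader` is only there so a different loader (for example `pd.read_csv`) can be swapped in.

```python
def load_registered_asset(ml_client, asset_name, version="1", reader=None):
    """Look up a registered data asset and load the file it points to."""
    if reader is None:
        # Default to parquet; imported lazily so pandas is only needed when used.
        import pandas as pd
        reader = pd.read_parquet

    data_asset = ml_client.data.get(name=asset_name, version=version)
    # data_asset.path is the azureml:// URI of the registered file.
    return reader(data_asset.path)
```

With the asset from the previous step, `df = load_registered_asset(ml_client, "my_dataframe_asset")` would return the dataframe.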
Output: