Task: create a model and deploy it using Azure ML Endpoints.
Please don't focus on what the model does; the goal here is just to deploy it.
Steps that I took in the Azure ML workspace
In the environments section, I created a custom environment that has the needed dependencies.
In the compute section, I created a custom compute instance with the following details:
Virtual machine type: CPU
Virtual machine size: Standard_D14 (16 cores, 112 GB RAM, 800 GB disk)
Auto shut down: Enabled
In the security section, I kept all the default values, so user assignment, assigned identity and SSH were all disabled. The virtual network and subnet were selected by default and pointed to the workspace's virtual network.
There was a warning though in the virtual network which said "Your workspace is linked to a virtual network using a private endpoint connection. In order to communicate properly with the workspace, your compute resource must be provisioned in the same virtual network."
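For reference, this is roughly how I understand the same compute instance could be pinned explicitly to the workspace's virtual network with the v2 SDK; the vnet and subnet names below are placeholders, not my real values:

from azure.ai.ml import MLClient
from azure.ai.ml.entities import ComputeInstance, NetworkSettings
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Placeholder vnet/subnet: these must match the vnet that holds the
# workspace's private endpoint, per the warning above
compute = ComputeInstance(
    name="custom-compute-name",
    size="Standard_D14",
    network_settings=NetworkSettings(
        vnet_name="<workspace-vnet>",
        subnet="<workspace-subnet>",
    ),
)
ml_client.compute.begin_create_or_update(compute).result()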
For the applications section, again the defaults, so no creation script or startup script was given.
I created a new notebook with the code for creating a model, registering it in Azure ML Models, creating an endpoint, and finally deploying the model to the endpoint. I was able to create the model and the endpoint, and to retrieve the model for deployment, but the deployment itself kept failing. The error that I got was:
HttpResponseError: (None) BadArgument: Startup task failed due to authorization error. Please see troubleshooting guide, available here: https://aka.ms/oe-tsg#error-badargument
Code: None
Message: BadArgument: Startup task failed due to authorization error. Please see troubleshooting guide, available here: https://aka.ms/oe-tsg#error-badargument
Exception Details: (None) BadArgument: Startup task failed due to authorization error. Please see troubleshooting guide, available here: https://aka.ms/oe-tsg#error-badargument
Code: None
Message: BadArgument: Startup task failed due to authorization error. Please see troubleshooting guide, available here: https://aka.ms/oe-tsg#error-badargument
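From the troubleshooting guide, this error seems to point at the identity the endpoint uses to pull the image and model during startup. A minimal sketch of how the endpoint's system-assigned identity can be inspected with the v2 SDK (the role names in the comment are my assumption based on the guide):

# Assumes ml_client and online_endpoint_name from the earlier tutorial steps
endpoint = ml_client.online_endpoints.get(name=online_endpoint_name)
print(endpoint.identity.type)          # e.g. "system_assigned"
print(endpoint.identity.principal_id)  # the object ID that roles would be granted to

# Assumption per the guide: this principal may need roles such as "AcrPull" on
# the workspace container registry and "Storage Blob Data Reader" on the
# workspace storage account (assigned in the portal or via CLI).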
For reference: I am following this article: Getting started with Azure Machine Learning
I follow the exact steps, except that my main script and job command are slightly different. They look like the following.
%%writefile {train_src_dir}/main.py
from azure.storage.blob import BlobServiceClient
from bertopic import BERTopic
import pandas as pd
import argparse
import mlflow
import os
import mlflow.sklearn
from io import BytesIO, TextIOWrapper
def main():
    """Main function of the script."""
    # input and output parameters
    parser = argparse.ArgumentParser()
    parser.add_argument("--registered_model_name", type=str, help="model name")
    args = parser.parse_args()

    # account parameters
    account_name = "xxx"
    account_key = "xxx"
    container_name = "xyz"
    connect_str = (
        "DefaultEndpointsProtocol=https;AccountName=" + account_name
        + ";AccountKey=" + account_key + ";EndpointSuffix=core.windows.net"
    )

    # Start Logging
    mlflow.start_run()

    # enable autologging
    mlflow.sklearn.autolog()

    print("Accessing Azure Blob Storage")

    # Create a client to interact with the blob storage
    blob_service_client = BlobServiceClient.from_connection_string(connect_str)

    # Use the client to connect to the container
    container_client = blob_service_client.get_container_client(container_name)

    # Limiting folders for a temporary solution
    accepted_dir = ['files/f01/p0003', 'files/f01/p0004', 'files/f01/p0005', 'files/f01/p0006']

    # Prepare data: keep only blobs under one of the accepted folders
    # (skip non-matching blobs rather than stopping at the first one)
    file_names = []
    for blob_i in container_client.list_blobs():
        if any(d in blob_i.name for d in accepted_dir):
            file_names.append(blob_i.name)

    print(f"Number of files fetched from storage account {len(file_names)}")
    mlflow.log_metric('Number of files fetched from storage account', len(file_names))

    # Download each blob's text content into memory
    file_contents = []
    for blob in file_names:
        blob_client = container_client.get_blob_client(blob=blob)
        downloader = blob_client.download_blob(max_concurrency=1, encoding='UTF-8')
        file_contents.append(downloader.readall())

    print(f"Number of files read {len(file_contents)}")
    mlflow.log_metric('Number of files read', len(file_contents))

    # Convert file names and contents to a pandas dataframe
    df = pd.DataFrame({'file_name': file_names, 'raw_text': file_contents})

    # Create BERTopic model
    model = BERTopic(language="english", calculate_probabilities=False, verbose=True)
    topics, probs = model.fit_transform(file_contents)

    # print topic and probability per file
    for i in range(len(file_contents)):
        print(f'File:[{df.loc[i].at["file_name"]}], Topic: [{topics[i]}], Probability: [{probs[i]}]')

    # print topic information
    print(model.get_topic_info())

    # Registering the model to the workspace
    print("Registering the model via MLFlow")
    mlflow.sklearn.log_model(
        sk_model=model,
        registered_model_name=args.registered_model_name,
        artifact_path=args.registered_model_name,
    )

    # Saving the model to a file
    mlflow.sklearn.save_model(
        sk_model=model,
        path=os.path.join(args.registered_model_name, "trained_model"),
    )

    # Stop Logging
    mlflow.end_run()

if __name__ == "__main__":
    main()
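As an aside, the account name and key above are hard-coded placeholders. A sketch of how the same blob client could be built without a key, assuming the identity running the job has the Storage Blob Data Reader role on the account:

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# Assumes the identity running the job has "Storage Blob Data Reader"
# on the storage account; the account name is a placeholder
account_url = "https://xxx.blob.core.windows.net"
blob_service_client = BlobServiceClient(account_url, credential=DefaultAzureCredential())
container_client = blob_service_client.get_container_client("xyz")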
Job command
from azure.ai.ml import command
from azure.ai.ml import Input

registered_model_name = "topic_generation_model"

job = command(
    inputs=dict(
        registered_model_name=registered_model_name,
    ),
    code="./src/",  # location of source code
    command="python main.py --registered_model_name ${{inputs.registered_model_name}}",
    environment="custom_env_name@latest",
    display_name="topic_generation_job",
    compute="custom_compute_name",
    experiment_name="topic_generation_exp",
)
The job runs successfully, and the next step, creating the model, also succeeds. But deploying the model to the endpoint fails with the error I specified above.
Deployment code
from azure.ai.ml.entities import ManagedOnlineDeployment

# picking the model to deploy. Here we use the latest version of our registered model
# (ml_client, online_endpoint_name and latest_model_version come from the earlier tutorial steps)
model = ml_client.models.get(name=registered_model_name, version=latest_model_version)

# Expect this deployment to take approximately 6 to 8 minutes.
# create an online deployment.
blue_deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name=online_endpoint_name,
    model=model,
    instance_type="Standard_DS3_v2",
    instance_count=1,
)
blue_deployment = ml_client.begin_create_or_update(blue_deployment).result()
I have no idea how to fix this; any suggestions would be highly appreciated. Is this an identity issue or a network issue? I am not able to figure it out.
I also followed the link in the error message (Authorization error), but to whom, and what permissions, do I need to grant? Do I need to give myself some permissions?
Thanks
Update:
It turned out to be a network connectivity issue with the storage account. When I enabled public access, the model could be downloaded during deployment and that error was resolved. But now I have a new error:
ResourceOperationFailure: ResourceNotReady: User container has crashed or terminated: Liveness probe failed: HTTP probe failed with statuscode: 502. Please see troubleshooting guide, available here: https://aka.ms/oe-tsg#error-resourcenotready
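To dig into this new 502, my next step is to pull the container logs for the deployment; a minimal sketch using the v2 SDK, assuming the same ml_client and online_endpoint_name as above:

# Fetch recent logs from the scoring container of the "blue" deployment
logs = ml_client.online_deployments.get_logs(
    name="blue",
    endpoint_name=online_endpoint_name,
    lines=100,
)
print(logs)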