Unable to deploy a model with Azure endpoints in Azure ML studio

Task: To create a model and deploy it using Azure ML Endpoints

Please do not focus on what the model does; the goal here is simply to deploy it.

Steps that I took in the Azure ML workspace

  1. In the environments section, I created a custom environment that has the needed dependencies (a rough SDK equivalent is sketched after the error output below).

  2. In the compute section, I created a custom compute instance with the details below:
    Virtual machine type: CPU
    Virtual machine size: Standard_D14 (16 cores, 112 GB RAM, 800 GB disk)
    Auto shut down: Enabled
    In the security section I kept all the default values, so user assignment, assigned identity and SSH were all disabled. The virtual network and subnet were pre-selected and pointed to the ones used by the workspace.

    There was a warning though in the virtual network which said "Your workspace is linked to a virtual network using a private endpoint connection. In order to communicate properly with the workspace, your compute resource must be provisioned in the same virtual network."

    For the applications section I again kept the defaults, so no creation or startup script was provided.

  3. I created a new notebook with the code for training a model, registering it under Models, creating an endpoint, and finally deploying the model to that endpoint. I was able to create the model and the endpoint and retrieve the model for deployment, but the deployment itself kept failing with the error below:

HttpResponseError: (None) BadArgument: Startup task failed due to authorization error. Please see troubleshooting guide, available here: https://aka.ms/oe-tsg#error-badargument
Code: None
Message: BadArgument: Startup task failed due to authorization error. Please see troubleshooting guide, available here: https://aka.ms/oe-tsg#error-badargument
Exception Details:  (None) BadArgument: Startup task failed due to authorization error. Please see troubleshooting guide, available here: https://aka.ms/oe-tsg#error-badargument
    Code: None
    Message: BadArgument: Startup task failed due to authorization error. Please see troubleshooting guide, available here: https://aka.ms/oe-tsg#error-badargument
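
In case it is relevant: the environment in step 1 was built around a conda file with the model's dependencies. A rough SDK equivalent (the image, conda file path and description here are placeholders, not necessarily what I used) would be:

from azure.ai.ml.entities import Environment

# Placeholder image and conda file path; the conda file lists the
# dependencies (bertopic, pandas, mlflow, azure-storage-blob, ...).
# ml_client is the MLClient handle from the tutorial's setup steps.
custom_env = Environment(
    name="custom_env_name",
    description="Dependencies for the topic generation model",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    conda_file="./env/conda.yaml",
)
ml_client.environments.create_or_update(custom_env)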

For reference: I am following this article: Getting started with Azure Machine Learning

I follow the exact steps, except that my main script and job command are slightly different. They look like this:

%%writefile {train_src_dir}/main.py

from azure.storage.blob import BlobServiceClient
from bertopic import BERTopic
import pandas as pd
import argparse
import mlflow
import os
import mlflow.sklearn

def main():
    """Main function of the script."""

    # input and output parameters
    parser = argparse.ArgumentParser()
    parser.add_argument("--registered_model_name", type=str, help="model name")
    args = parser.parse_args()

    # account parameters
    account_name = "xxx"
    account_key = "xxx"
    container_name = "xyz"
    connect_str = f"DefaultEndpointsProtocol=https;AccountName={account_name};AccountKey={account_key};EndpointSuffix=core.windows.net"

    # Start Logging
    mlflow.start_run()

    # enable autologging
    mlflow.sklearn.autolog()

    print("Accessing Azure Blob Storage")

    # Create a client to interact with the blob storage
    blob_service_client = BlobServiceClient.from_connection_string(connect_str)

    # Use the client to connect to the container
    container_client = blob_service_client.get_container_client(container_name)

    # Limiting folders for a temporary solution
    accepted_dir = ['files/f01/p0003', 'files/f01/p0004', 'files/f01/p0005', 'files/f01/p0006']

    # Prepare data: keep every blob whose name falls under one of the accepted folders
    file_names = [
        blob_i.name
        for blob_i in container_client.list_blobs()
        if any(d in blob_i.name for d in accepted_dir)
    ]
    
    print(f"Number of files fetched from storage account {len(file_names)}")
    mlflow.log_metric('Number of files fetched from storage account', len(file_names))

    file_contents = []
    for blob_name in file_names:
        # Download each blob as UTF-8 text and keep its contents in memory
        blob_client = container_client.get_blob_client(blob=blob_name)
        downloader = blob_client.download_blob(max_concurrency=1, encoding='UTF-8')
        file_contents.append(downloader.readall())

    print(f"Number of files read {len(file_contents)}")
    mlflow.log_metric('Number of files read', len(file_contents))

    # Convert the file names and contents into a pandas dataframe
    df = pd.DataFrame({'file_name': file_names, 'raw_text': file_contents})

    # Create BERT model
    model = BERTopic(language="english", calculate_probabilities=False, verbose=True)
    topics, probs = model.fit_transform(file_contents)

    # print topic and probability
    for i in range(len(file_contents)):
        print(f'File:[{df.loc[i].at["file_name"]}], Topic: [{topics[i]}], Probability: [{probs[i]}]')
    
    # print topic information
    print(model.get_topic_info())

    # Registering the model to the workspace
    print("Registering the model via MLFlow")
    mlflow.sklearn.log_model(
        sk_model=model,
        registered_model_name=args.registered_model_name,
        artifact_path=args.registered_model_name,
    )

    # Saving the model to a file
    mlflow.sklearn.save_model(
        sk_model=model,
        path=os.path.join(args.registered_model_name, "trained_model"),
    )

    # Stop Logging
    mlflow.end_run()

if __name__ == "__main__":
    main()
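
Side note on the storage access in main.py: the account key is hard-coded only for brevity. A keyless variant using Azure AD would look roughly like this (my assumption: azure-identity is available in the environment and the executing identity holds a data-plane role such as Storage Blob Data Reader on the account):

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# account_name is a placeholder; the identity running this code must hold
# a blob data role (e.g. Storage Blob Data Reader) on the storage account.
blob_service_client = BlobServiceClient(
    account_url=f"https://{account_name}.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)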

Job command

from azure.ai.ml import command

registered_model_name = "topic_generation_model"

job = command(
    inputs=dict(
        registered_model_name=registered_model_name,
    ),
    code="./src/",  # location of source code
    command="python main.py --registered_model_name ${{inputs.registered_model_name}}",
    environment="custom_env_name@latest",
    display_name="topic_generation_job",
    compute="custom_compute_name",
    experiment_name="topic_generation_exp"
)
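
The job is then submitted as in the tutorial:

ml_client.create_or_update(job)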

The job runs successfully, and the next step, registering the model, also succeeds. Deploying the model to the endpoint, however, fails with the error shown above.

Deployment code

from azure.ai.ml.entities import ManagedOnlineDeployment

# picking the model to deploy. Here we use the latest version of our registered model
model = ml_client.models.get(name=registered_model_name, version=latest_model_version)

# Expect this deployment to take approximately 6 to 8 minutes.
# create an online deployment.
blue_deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name=online_endpoint_name,
    model=model,
    instance_type="Standard_DS3_v2",
    instance_count=1,
)

blue_deployment = ml_client.begin_create_or_update(blue_deployment).result()
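
For context, the endpoint itself was created in an earlier, successful step, along the lines of the tutorial (the description here is illustrative):

from azure.ai.ml.entities import ManagedOnlineEndpoint

# online_endpoint_name is defined earlier in the notebook;
# auth_mode="key" is the tutorial's default.
endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,
    description="Endpoint for the topic generation model",
    auth_mode="key",
)
ml_client.online_endpoints.begin_create_or_update(endpoint).result()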

I have no idea how to fix this; any suggestions would be highly appreciated. Is this an identity issue or a network issue? I cannot figure it out.

I also followed the link in the error message (Authorization error), but to whom and what permissions do I need to grant? Do I need to give myself some permissions?
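
In case it helps anyone answer: my (unverified) understanding of the troubleshooting guide is that the deployment runs under the endpoint's system-assigned identity, whose principal ID can be read as below, and that this principal may need a role such as Storage Blob Data Reader on the workspace storage account:

# Look up the system-assigned identity of the endpoint.
endpoint = ml_client.online_endpoints.get(name=online_endpoint_name)
print(endpoint.identity.principal_id)  # grant this principal the needed storage role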

Thanks


Update:

It seems this was a network connectivity issue with the storage account. When I enabled public access, the model could be downloaded during deployment and that error was resolved. But now I have a new error:

ResourceOperationFailure: ResourceNotReady: User container has crashed or terminated: Liveness probe failed: HTTP probe failed with statuscode: 502. Please see troubleshooting guide, available here: https://aka.ms/oe-tsg#error-resourcenotready 
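
To dig into this new error I am pulling the container logs with the standard call below ("blue" is the deployment name from the code above):

# Fetch recent logs from the deployment container to see why the
# liveness probe keeps failing.
logs = ml_client.online_deployments.get_logs(
    name="blue",
    endpoint_name=online_endpoint_name,
    lines=100,
)
print(logs)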