I have deployed around 23 models (amounting to 1.57 GB in total) in an Azure ML workspace using Azure Kubernetes Service. The AKS cluster has 3 D8sv3 nodes, with cluster autoscaling enabled up to 6 nodes. The AksWebservice is configured with 4.4 cores and 16 GB memory, and pod autoscaling is enabled for the web service with autoscale_max_replicas set to 40:
from azureml.core.webservice import AksWebservice

aks_config = AksWebservice.deploy_configuration(
    cpu_cores=4.4,
    memory_gb=16,
    autoscale_enabled=True,
    description='TEST - Configuration for Kubernetes Compute Target',
    enable_app_insights=True,
    max_request_wait_time=25000,        # milliseconds (25 s)
    autoscale_target_utilization=0.6,
    autoscale_max_replicas=40)
I ran load tests with 10 concurrent users (using JMeter) and monitored the cluster through Application Insights.
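For reference, here is a rough Python equivalent of that JMeter test; the scoring URI, key, and request payload below are placeholders, not my actual values:

# Rough equivalent of the JMeter load test: 10 concurrent users, each
# sending one request to the scoring endpoint. URI, key and payload
# are placeholders.
import json
import time
from concurrent.futures import ThreadPoolExecutor

import requests

SCORING_URI = "<scoring-uri>"        # placeholder
API_KEY = "<primary-key>"            # placeholder
PAYLOAD = json.dumps({"data": []})   # placeholder request body

def send_request(i):
    headers = {"Content-Type": "application/json",
               "Authorization": f"Bearer {API_KEY}"}
    start = time.time()
    resp = requests.post(SCORING_URI, data=PAYLOAD, headers=headers)
    return i, resp.status_code, time.time() - start

with ThreadPoolExecutor(max_workers=10) as pool:   # 10 concurrent users
    for i, status, elapsed in pool.map(send_request, range(10)):
        print(f"request {i}: status={status}, took {elapsed:.1f}s")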
I can see the nodes and pods scaling, but there is no spike in CPU or memory utilization. For 10 concurrent requests, only 5 to 6 succeed; the rest fail. When I send an individual request to the deployed endpoint, the response comes back in 7 to 9 seconds. In the load test logs, however, plenty of requests take more than 15 seconds, and requests that take more than 25 seconds fail with status code 503. I increased max_request_wait_time for this reason, but I don't understand why responses take this long given the amount of compute available, when the dashboard shows memory isn't even 30% utilized. Should I be changing the replica_max_concurrent_requests parameter? Or should I be increasing autoscale_max_replicas even more? The concurrent request load may sometimes reach 100 in production. Is there any solution to this?
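For context, this is roughly where those parameters would go if I redeploy; the values are only guesses to illustrate the question, not a tested configuration:

# Example only: the same deploy configuration with more replicas allowed
# and replica_max_concurrent_requests raised above its default of 1.
# The numbers are guesses, not a recommended setup.
from azureml.core.webservice import AksWebservice

aks_config = AksWebservice.deploy_configuration(
    cpu_cores=4.4,
    memory_gb=16,
    autoscale_enabled=True,
    description='TEST - Configuration for Kubernetes Compute Target',
    enable_app_insights=True,
    max_request_wait_time=25000,
    autoscale_target_utilization=0.6,
    autoscale_max_replicas=60,            # raised from 40 (guess)
    replica_max_concurrent_requests=2)    # default is 1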
I'd be grateful for any advice. Thanks.