I'm working on a university project that studies how Kubernetes scaling works, but I've run into an issue. I have a cluster with one master node (for the control plane) and one edge node, and I have enabled HPA on a deployment I created named stress-app. Below are the deployment and service YAML files, respectively:
stress-app-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stress-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: stress-app
  template:
    metadata:
      labels:
        app: stress-app
    spec:
      containers:
      - name: stress-app
        image: annis99/stress-app:v0.2
        imagePullPolicy: Always
        ports:
        - containerPort: 8081
        resources:
          requests:
            cpu: 0.5
            memory: 100Mi
stress-app-service.yaml:
apiVersion: v1
kind: Service
metadata:
  name: stress-app-service
spec:
  selector:
    app: stress-app
  ports:
  - protocol: TCP
    port: 8081
    targetPort: 8081
  type: LoadBalancer
The app is a simple API built with FastAPI that exposes one endpoint, '/workload/cpu', which when called creates a synthetic CPU load using the stress-ng CLI tool.
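For context, the endpoint is roughly the sketch below (the parameter name, default duration, and stress-ng flags here are illustrative placeholders, not my exact code); it shells out to stress-ng and blocks until the load finishes:

import subprocess

from fastapi import FastAPI

app = FastAPI()

@app.get("/workload/cpu")
def workload_cpu(seconds: int = 10):
    # Spawn stress-ng to load one CPU worker for the requested duration.
    # The request blocks until stress-ng exits.
    subprocess.run(
        ["stress-ng", "--cpu", "1", "--timeout", f"{seconds}s"],
        check=True,
    )
    return {"status": "done", "seconds": seconds}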
The app works fine. However, when I load-test the cluster to observe the HPA scaling, it sometimes scales up to 8 replicas, but only one replica handles most of the requests, causing high CPU spikes on the node hosting that replica. I have also checked that the other replicas are running and not stuck in a Pending state.
I have noticed this with the JMeter, Locust, and k6 load-testing tools, even with keep-alive set to False. However, when I try with a browser (Google Chrome) by opening several tabs and requesting the endpoint simultaneously, more replicas are involved in processing the requests, resulting in a fairer distribution of the workload (a sketch of my Locust test is included below for reference). Any ideas why this happens with dedicated load-testing tools?
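For reference, my Locust test looks roughly like the sketch below (the user counts, wait times, and host are placeholders, not my exact setup); keep-alive is disabled by sending a Connection: close header on every request, and I use the equivalent settings in JMeter and k6:

from locust import HttpUser, task, between

class StressUser(HttpUser):
    wait_time = between(1, 2)

    @task
    def cpu_workload(self):
        # Ask the server to close the connection after each response,
        # so every request opens a new TCP connection instead of reusing one.
        self.client.get(
            "/workload/cpu",
            headers={"Connection": "close"},
        )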