We manage our own ClearML server on an EC2 instance in the AWS cloud. Instance type: t3.xlarge (4 vCPUs, 16 GiB memory). Data disk: gp3 (size: 200 GB, IOPS: 3,000, throughput: 125 MiB/s).
We have 3 ClearML projects: one with 643,000 experiments, another with 151,000, and the smallest with 25,000. Total experiments across all projects: 819,000.
The ClearML web app is very slow. For example, it takes about 30 seconds just to load the main dashboard. Searching for a specific experiment by ID is also very slow.
What can we do to improve the performance?
We tried adding more memory, and it improved the performance, but only a little. It is still too slow.
Disclaimer: I'm a member of the ClearML team (formerly Trains)
I think your issue is simply caused by the number of serving processes in the server's apiserver component (probably 1 process at the moment).
Assuming you are using the docker-compose deployment of ClearML Server, you can increase the number of processes by adding the CLEARML_USE_GUNICORN=1 environment variable to the apiserver service. This runs the apiserver component with 8 processes by default. To specify a different number of processes, also add the CLEARML_GUNICORN_WORKERS=12 environment variable (for 12 processes, for example).
Please note that this mode (and, of course, more processes) requires more CPU and RAM resources. I believe your current setup should be enough for 8 processes, but I would recommend monitoring the machine's CPU and RAM usage and upgrading as required.
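As a rough sketch of where these two variables would go, the fragment below assumes the stock docker-compose.yml layout, in which the apiserver service already has an environment section written as a key/value mapping. Only the two variable names and values come from the text above; the surrounding keys are an assumption about your file and may differ in your deployment.

```yaml
# Fragment of docker-compose.yml (assumed layout; keep your existing entries as they are)
services:
  apiserver:
    environment:
      # ... existing apiserver variables stay here ...
      # Run the apiserver under gunicorn (8 worker processes by default)
      CLEARML_USE_GUNICORN: "1"
      # Optional: override the number of worker processes (12 is only an example)
      CLEARML_GUNICORN_WORKERS: "12"
```

The apiserver container has to be recreated before the change takes effect, for example by running docker-compose up -d against the same compose file.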