Best way to host multiple PyTorch model files for inference?


Context:

  • I'm working with an end-to-end deep learning TTS framework (you give it text input and it gives you a WAV object back)
  • I've created a FastAPI endpoint in a Docker container that uses the TTS framework to run inference (a simplified sketch of the endpoint is shown after this list)
  • My frontend client will hit this FastAPI endpoint to do inference on a GPU server
  • I'm going to run multiple Docker containers behind a load balancer (HAProxy), all running the same FastAPI endpoint image
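
For reference, each container runs something roughly like this (heavily simplified; `MODEL_PATH` and the `tts_to_wav_bytes` call are placeholders for my actual framework, not real APIs):

```python
import io

import torch
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

# Placeholder path; today the checkpoint is baked into the image.
MODEL_PATH = "/models/tts_model.pt"
model = torch.load(MODEL_PATH, map_location="cuda")  # loaded once at container startup
model.eval()


class SynthesisRequest(BaseModel):
    text: str


@app.post("/synthesize")
def synthesize(req: SynthesisRequest):
    with torch.no_grad():
        # Placeholder for the framework's actual inference call.
        wav_bytes = model.tts_to_wav_bytes(req.text)
    return StreamingResponse(io.BytesIO(wav_bytes), media_type="audio/wav")
```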

My questions:

  • Storage choice: What is the recommended approach for hosting model files when deploying multiple Docker containers? Should I use Docker volumes, or should I use a cloud object store such as S3 or DigitalOcean Spaces for centralized model storage?
  • Latency concerns: How can I minimize latency when fetching models from cloud storage? Are there specific techniques or optimizations (caching, partial downloads, etc.) that reduce the impact of latency, especially when switching between different models for inference? (A rough sketch of what I mean by caching is included after this list.)
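
By "caching" I mean something along these lines: check a local path first and only download from object storage on a miss. This uses boto3/S3 purely as an example; the bucket, key, and cache directory names are made up:

```python
import os

import boto3

CACHE_DIR = "/var/cache/models"   # hypothetical local cache directory
BUCKET = "my-tts-models"          # hypothetical bucket name


def get_model_path(model_name: str) -> str:
    """Return a local path to the model file, downloading it on first use."""
    local_path = os.path.join(CACHE_DIR, f"{model_name}.pt")
    if not os.path.exists(local_path):
        os.makedirs(CACHE_DIR, exist_ok=True)
        s3 = boto3.client("s3")
        # download_file streams the object to disk, so large checkpoints
        # don't have to fit in memory during the download.
        s3.download_file(BUCKET, f"{model_name}.pt", local_path)
    return local_path
```

Is something like this reasonable, or is there a more standard pattern for keeping models warm across containers?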

I'm still learning about MLOps, so I appreciate any help.
