I'm setting up a Kubernetes cluster with many different components for our application stack, and I'm trying to balance storage requirements while minimizing the number of components.
We have a web scraper that downloads tens of thousands of HTML files (and maybe PDFs) every day, and I want to store these somewhere along with some JSON metadata. I want the files stored in a redundant, scalable way, but having millions of small files seems like a bad fit for e.g. GlusterFS.
At the same time we have some very large binary files used by our system (several gigabytes each) and also probably many smaller binary files (tens of MBs). These do not seem like a good fit for any distributed NoSQL DB like MongoDB.
So I'm considering using MongoDB + GlusterFS to address these two needs separately, but I would rather reduce the number of moving pieces and just use one system. I have also read various warnings about running GlusterFS without e.g. Red Hat support (which we definitely will not have).
Can anyone recommend an alternative? I am looking for a distributed binary object store that is easy to set up and maintain and supports both small and large files. One advantage of our setup is that files will rarely, if ever, be updated or deleted (just written and then read), and we don't even need indexing (that will be handled separately by Elasticsearch) or high-speed read access.
Are you in a cloud? If you're in AWS, S3 would be a good fit; object storage sounds like what you want, although I'm not sure of all your requirements.
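For the scraper output that can be as simple as writing each page as an object plus a small JSON sidecar for the metadata. A minimal sketch with boto3; the bucket name and key layout are just examples:

```python
# Sketch: store scraped HTML plus a JSON metadata sidecar in S3.
# Bucket name and key layout are assumptions; credentials come from
# the environment or an IAM role as usual with boto3.
import json
import boto3

s3 = boto3.client("s3")

def store_page(bucket: str, key: str, html: bytes, metadata: dict) -> None:
    s3.put_object(Bucket=bucket, Key=key, Body=html, ContentType="text/html")
    s3.put_object(Bucket=bucket, Key=key + ".json",
                  Body=json.dumps(metadata).encode(), ContentType="application/json")

def fetch_page(bucket: str, key: str) -> bytes:
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()

# Multi-gigabyte binaries go through the same API; upload_file uses
# multipart uploads under the hood for large objects, e.g.:
# s3.upload_file("/data/big-model.bin", "my-scraper-bucket", "binaries/big-model.bin")
```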
If not in a cloud, you could run MinIO (https://www.minio.io/), which gives you the same type of object storage that S3 does.
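Because MinIO speaks the S3 API, the same client code works; you just point it at your MinIO service. The endpoint and credentials below are placeholders:

```python
# The boto3 sketch above only changes in how the client is constructed
# when targeting MinIO instead of AWS; endpoint and keys are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.storage.svc.cluster.local:9000",  # assumed in-cluster service
    aws_access_key_id="MINIO_ACCESS_KEY",
    aws_secret_access_key="MINIO_SECRET_KEY",
)
s3.upload_file("/data/big-model.bin", "binaries", "models/big-model.bin")
```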
I do something similar now: I store binary documents in MongoDB and back the nodes with EBS volumes.
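If you go that route, GridFS is MongoDB's standard mechanism for storing files, including ones above the 16 MB document limit. A sketch with pymongo; the connection string, database name, and metadata fields are just examples:

```python
# Sketch: store and read back a scraped file via GridFS (pymongo).
# Connection string, database name, and metadata fields are examples.
import gridfs
from pymongo import MongoClient

db = MongoClient("mongodb://mongo:27017")["scraper"]
fs = gridfs.GridFS(db)

# GridFS splits files into chunks (255 KB each by default), so it also
# copes with files larger than MongoDB's 16 MB single-document limit.
with open("/data/page.html", "rb") as f:
    file_id = fs.put(f, filename="page.html",
                     metadata={"url": "https://example.com/page"})

html = fs.get(file_id).read()
```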