How to manage very large Solr indexes


I'm planning a SolrCloud implementation, and based on index sizes from testing, my estimated physical index size for 1 billion documents is roughly 20 terabytes. So far, I've been unable to find a cloud host that supports a single volume of that size. I was hoping somebody could offer some guidance on managing an index this large. Is a 20TB index absurd? Is there something I'm missing about SolrCloud architecture? Most of the guidelines I've seen suggest that the entire index, regardless of shard count, should be replicated on every machine to guarantee redundancy, so every node would need a 20TB storage device. If anyone can shed some light, I would greatly appreciate it.

1 Answer

Answered by Persimmonium

Not sure where you read such guidelines?

It is totally normal for each shard to hold only a portion of the index (each shard having one leader and a number of replicas).
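For concreteness, here is a minimal SolrJ sketch (8.x-style API; the collection name "bigindex", the 40/2 shard and replica counts, and the ZooKeeper hosts are illustrative assumptions, not recommendations). With 40 shards at replication factor 2, each replica holds roughly 1/40th of a 20TB index, about 500GB, so no single node ever needs the full 20TB:

```java
import java.util.List;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateShardedCollection {
    public static void main(String[] args) throws Exception {
        // ZooKeeper ensemble coordinating the cluster (placeholder hosts).
        List<String> zkHosts = List.of("zk1:2181", "zk2:2181", "zk3:2181");

        try (CloudSolrClient client =
                new CloudSolrClient.Builder(zkHosts, Optional.empty()).build()) {
            // 40 shards x 2 replicas: each replica holds ~1/40th of the index
            // (~500GB of 20TB). Solr spreads the resulting 80 cores across
            // the available nodes; no node stores the whole index.
            CollectionAdminRequest
                    .createCollection("bigindex", "_default", 40, 2)
                    .process(client);
        }
    }
}
```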

You would need to study how to shard your index, either using the built-in hash-based routing or providing your own.
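As a sketch of the built-in hash routing (the IDs, field name, and collection here are made-up examples): with the default compositeId router, prefixing the document ID with a shard key and "!" makes Solr hash the prefix to choose the shard, so related documents land together:

```java
import java.util.List;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CompositeIdRouting {
    public static void main(String[] args) throws Exception {
        List<String> zkHosts = List.of("zk1:2181", "zk2:2181", "zk3:2181");

        try (CloudSolrClient client =
                new CloudSolrClient.Builder(zkHosts, Optional.empty()).build()) {
            SolrInputDocument doc = new SolrInputDocument();
            // The compositeId router hashes the part before "!" to pick the
            // shard, so every document sharing the "customer42" prefix lands
            // on the same shard.
            doc.addField("id", "customer42!doc1001");
            doc.addField("title_t", "example document");

            client.add("bigindex", doc);
            client.commit("bigindex");
        }
    }
}
```

At query time you can then pass _route_=customer42! as a request parameter to target just that shard rather than fanning out to all of them.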

Edit: so if I understand correctly, you are assuming that every node in the cluster must host either the leader or a replica of EVERY shard, correct? If so, the answer is no. To provide resilience, every shard needs a leader and replicas somewhere in the cluster, but a node N can hold nothing from shard S, as long as S has a leader and at least one replica on other nodes.
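You can verify the placement yourself with the Collections API's CLUSTERSTATUS action, which lists, for every shard, exactly which nodes host a replica; a node simply does not appear under shards it holds nothing for. A minimal sketch (the hostname and collection name are assumptions):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ClusterStatusCheck {
    public static void main(String[] args) throws Exception {
        // CLUSTERSTATUS returns a JSON map of shards -> replicas -> node,
        // which shows directly that any given node holds only some shards.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://solr1:8983/solr/admin/collections"
                        + "?action=CLUSTERSTATUS&collection=bigindex"))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```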