Ingesting data into Elasticsearch from HDFS: cluster setup and usage


I am setting up a Spark cluster. I have HDFS data nodes and Spark nodes on the same instances.

The current setup is: 1 master (Spark and HDFS), 6 Spark workers that also act as HDFS data nodes.

All instances are identical: 16 GB RAM, dual core (unfortunately).

I have 3 more machines, again with the same specs. Now I have the following options:

  1. Just deploy ES on these 3 machines. The cluster would look like: 1 master (Spark and HDFS), 6 Spark workers / HDFS data nodes, 3 Elasticsearch nodes.

  2. Deploy an ES master on 1 machine, and extend Spark, HDFS and ES across all the others. The cluster would look like: 1 master (Spark and HDFS), 1 Elasticsearch master, 8 Spark workers / HDFS data nodes / ES data nodes.

My application makes heavy use of Spark for joins, ML etc., but we are also looking for search capabilities. Search definitely does not need to be realtime; a refresh interval of up to 30 minutes is perfectly fine for us. I would set it up roughly as in the sketch below.
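For reference, this is roughly how I would relax the refresh interval on the target index before bulk ingestion. A minimal sketch from Scala using the JDK 11+ HTTP client; the endpoint http://es-node:9200 and the index name my-index are placeholders, not my actual setup:

    import java.net.URI
    import java.net.http.{HttpClient, HttpRequest, HttpResponse}

    object SetRefreshInterval {
      def main(args: Array[String]): Unit = {
        // Placeholder endpoint and index name; adjust to the actual cluster.
        val endpoint = "http://es-node:9200/my-index/_settings"

        // Relax the refresh interval to 30 minutes to cut indexing overhead.
        val body = """{ "index": { "refresh_interval": "30m" } }"""

        val client = HttpClient.newHttpClient()
        val request = HttpRequest.newBuilder(URI.create(endpoint))
          .header("Content-Type", "application/json")
          .PUT(HttpRequest.BodyPublishers.ofString(body))
          .build()

        val response = client.send(request, HttpResponse.BodyHandlers.ofString())
        println(s"${response.statusCode()} ${response.body()}")
      }
    }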

At the same time, the Spark cluster has other long-running tasks apart from ES indexing.

The solution does not need to be one of the above; I am open to experimentation if someone suggests something else. Once concluded, it would be handy for other devs too.

Also, I am trying the es-hadoop / es-spark project, but ingestion feels very slow with 3 dedicated ES nodes: around 0.6 million records/minute. My write job looks roughly like the sketch below.
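For context, a minimal sketch of how I am writing from Spark to ES with the es-spark connector (assuming the matching elasticsearch-spark artifact is on the classpath), including the bulk-size settings I am experimenting with. The node addresses, HDFS path and index name are placeholders:

    import org.apache.spark.sql.SparkSession

    object HdfsToEs {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("hdfs-to-es")
          .getOrCreate()

        // Placeholder HDFS path; in reality this is the output of the join/ML steps.
        val df = spark.read.parquet("hdfs:///data/events")

        df.write
          .format("org.elasticsearch.spark.sql")
          // Placeholder ES coordinates; point these at the dedicated ES nodes.
          .option("es.nodes", "es-node-1,es-node-2,es-node-3")
          .option("es.port", "9200")
          // Larger bulk requests per task; tune against ES heap and bulk rejections.
          .option("es.batch.size.entries", "5000")
          .option("es.batch.size.bytes", "5mb")
          // Do not force an index refresh after every bulk request.
          .option("es.batch.write.refresh", "false")
          .mode("append")
          .save("events-index")   // target ES index (placeholder name)

        spark.stop()
      }
    }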


1 Answer

Armin Braun:

In my opinion, the optimal approach here mostly depends on your network bandwidth and whether or not it is the bottleneck in your operation.

I would first check whether the network links are saturated, via e.g. iftop -i any or similar. If you see data rates close to the physical capacity of your network, then you could try running HDFS + Spark on the same machines that run ES to save the network round trip and speed things up.

If the network turns out not to be the bottleneck here, I would look into the way Spark and HDFS are deployed next. Are you using all the RAM available (Java Xmx set high enough? Spark memory limits? YARN memory limits, if Spark is deployed via YARN?)?
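To make that concrete, those limits usually end up in spark-defaults.conf, spark-submit flags, or the SparkSession builder. A minimal sketch; the values are placeholder assumptions for a 16 GB dual-core box, not recommendations:

    import org.apache.spark.sql.SparkSession

    // Where the limits mentioned above usually live. The numbers are placeholders
    // for a 16 GB dual-core machine and need tuning against the actual workload.
    val spark = SparkSession.builder()
      .appName("ingest")
      .config("spark.executor.memory", "10g")        // executor JVM heap (Xmx)
      .config("spark.executor.memoryOverhead", "1g") // off-heap headroom YARN accounts for
      .config("spark.driver.memory", "2g")           // driver JVM heap
      .getOrCreate()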

Also, you should check whether ES or Spark is the bottleneck here; in all likelihood it's ES. Maybe you could spawn additional ES instances: 3 ES nodes feeding 6 Spark workers seems very sub-optimal. If anything, I'd probably try to invert that ratio: fewer Spark executors and more ES capacity. ES is likely a lot slower at ingesting the data than HDFS is at providing it (though this really depends on the configuration of both ... just an educated guess here :)). It is highly likely that more ES nodes and fewer Spark workers will be the better approach here.

So in a nutshell:

  • Add more ES nodes and reduce Spark worker count
  • Check if your network links are saturated; if so, put both on the same machines (this could be detrimental with only 2 cores, but I'd still give it a shot ... you gotta try this out)
  • Adding more ES nodes is the better bet of the two things you can do :)