I have a fairly general question about the behaviour of distributed systems. Here is the initial situation:
I have developed my own big data system with Vert.x, using Java 8 map and reduce functions.
Up to a certain packet size (I split 4 GB of total data into packets of 4, 8, 16 or 32 MB, which are distributed among all hosts, in my case up to 8), the distribution worked fine and the packets were spread across all the machines.
But as soon as I increased the packet size, one host received most of the packets and ended up doing most of the computational work, while the others did very little. The optimal size was 4 MB, where the load was distributed evenly.
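To make the setup concrete, here is a simplified sketch of the kind of event-bus dispatch I mean. It is not my actual code; the verticle name, the address `map.worker` and the `loadData()` stub are only illustrative. As far as I understand, a point-to-point `send()` on a clustered event bus should round-robin between the consumers registered on that address, which is why I would expect an even spread regardless of packet size:

```java
import io.vertx.core.AbstractVerticle;
import io.vertx.core.buffer.Buffer;
import io.vertx.core.eventbus.EventBus;

// Illustrative sketch only: cut the input into fixed-size packets and push
// each one onto the clustered event bus. With point-to-point send(), Vert.x
// delivers each message to one of the consumers registered on the address.
public class PacketDispatcher extends AbstractVerticle {

    private static final int PACKET_SIZE = 4 * 1024 * 1024;        // e.g. 4 MB packets
    private static final String WORKER_ADDRESS = "map.worker";      // assumed address name

    @Override
    public void start() {
        EventBus bus = vertx.eventBus();
        byte[] data = loadData(); // placeholder for the real data generation

        // Send the data chunk by chunk; each chunk becomes one packet.
        for (int offset = 0; offset < data.length; offset += PACKET_SIZE) {
            int end = Math.min(offset + PACKET_SIZE, data.length);
            byte[] chunk = java.util.Arrays.copyOfRange(data, offset, end);
            bus.send(WORKER_ADDRESS, Buffer.buffer(chunk));
        }
    }

    private byte[] loadData() {
        return new byte[0]; // stub: replace with the real data source
    }
}
```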
My questions are:
- Could this happen because the map/reduce workers are too fast and the data generation is too slow, so the first worker keeps grabbing the next packet before the others get a chance?
- Could this issue be related to the lack of a load balancer (see the sketch after this list)?
- Or is this simply in the nature of a distributed system?
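Regarding the load balancer question, the following is roughly what I have in mind: an explicit least-loaded dispatcher that tracks how many packets are in flight per host and always picks the least busy one, instead of relying on the default delivery order. The class and method names are only illustrative, not part of my current system:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative least-loaded balancer: choose the host with the fewest
// unacknowledged packets rather than distributing them blindly.
public class LeastLoadedBalancer {

    private final Map<String, AtomicInteger> inFlight = new ConcurrentHashMap<>();

    public LeastLoadedBalancer(List<String> hosts) {
        hosts.forEach(h -> inFlight.put(h, new AtomicInteger(0)));
    }

    // Pick the host with the smallest number of packets currently in flight.
    public String nextHost() {
        return inFlight.entrySet().stream()
                .min(Comparator.comparingInt(e -> e.getValue().get()))
                .map(Map.Entry::getKey)
                .orElseThrow(IllegalStateException::new);
    }

    // Call when a packet is dispatched to a host.
    public void onDispatch(String host) {
        inFlight.get(host).incrementAndGet();
    }

    // Call when the host acknowledges that it finished a packet.
    public void onAck(String host) {
        inFlight.get(host).decrementAndGet();
    }
}
```

Would such an explicit balancer be the usual way to handle this, or should the packet distribution already even out on its own with larger packets?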