I am evaluating ZeroMQ as pub-sub (service-bus style) infrastructure for a medium-sized system. We have about 50 nodes, all of which should be both publishers and subscribers. The network is roughly a star topology, but the edge nodes also "talk" to each other. We require dynamic discovery (no hard-coding of the participants' network addresses) and no SPOF (single point of failure).
I have read http://zeromq.org/whitepapers:0mq-3-0-pubsub and, from what I understand, the suggested 0MQ way to do dynamic discovery involves a proxy node (XPUB/XSUB) that forwards subscriptions and publications. I considered using such a proxy as a central mediator in our system, but I have the following concerns with this architecture: (A) The proxy node is a SPOF: when it fails, the whole system stops functioning. (B) All traffic, including data, passes through the proxy node, which adds latency and makes it a performance bottleneck.
Assuming I understood the pub-sub whitepaper correctly, is there a relatively simple way to achieve pub-sub + dynamic discovery + no SPOF in ZeroMQ?
Additional point: I have ruled out a multicast (PGM) solution because most messages have only one or a few interested parties, and we do not want to flood the network.
Multiple subscribers with a single publisher requires no intermediary as subscribers can talk directly to the publisher. But many publishers and subscribers at the same time is not so easy; unless there's something in the middle, maintenance will be a nightmare as new subscribers have to be configured with all existing publishers.
You could deploy several XSUB/XPUB proxies, each on their own machine, then deploy a load-balancer (like F5) between the publishers and the proxies. This achieves load-balancing and fault tolerance on the upstream side.
The proxy code is simple:
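A minimal sketch in Python with pyzmq, assuming publishers connect to the XSUB side and subscribers to the XPUB side. The function name `run_proxy` and the ports 5559 and 5560 are placeholders chosen for illustration, not values from any particular deployment.

```python
def run_proxy(frontend_addr="tcp://*:5559", backend_addr="tcp://*:5560"):
    """Forward publications downstream and subscriptions upstream.

    Both default addresses are hypothetical placeholders.
    """
    # Imported here so that loading this module does not require pyzmq.
    import zmq

    ctx = zmq.Context.instance()

    # Publishers (or the load balancer in front of them) connect to this side.
    frontend = ctx.socket(zmq.XSUB)
    frontend.bind(frontend_addr)

    # Subscribers connect to this side.
    backend = ctx.socket(zmq.XPUB)
    backend.bind(backend_addr)

    try:
        # Blocks until the context is terminated or the call is interrupted.
        zmq.proxy(frontend, backend)
    finally:
        frontend.close(linger=0)
        backend.close(linger=0)
```

Calling `run_proxy()` on each proxy machine starts one forwarder. Because the proxy keeps no state beyond the subscriptions that peers resend on reconnect, every instance can run the same code.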
If a proxy node fails, just restart it; re-connections/subscriptions should be handled automatically by zmq.
For downstream subscribers, connect each subscriber directly to all available proxies:
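A minimal sketch in Python with pyzmq, assuming every proxy exposes its XPUB side on the same port. The helper names, the host names, and port 5560 are hypothetical; a single SUB socket may connect to several endpoints and receives messages from all of them.

```python
def proxy_endpoints(hosts, port=5560):
    """Build the XPUB endpoint of every proxy (port is a placeholder)."""
    return [f"tcp://{host}:{port}" for host in hosts]


def connect_subscriber(endpoints, topic=b""):
    """Return one SUB socket connected to all of the given proxy endpoints."""
    # Imported here so that the helper above can be used without pyzmq.
    import zmq

    sub = zmq.Context.instance().socket(zmq.SUB)
    # An empty prefix subscribes to everything; pass a prefix to filter.
    sub.setsockopt(zmq.SUBSCRIBE, topic)
    for endpoint in endpoints:
        # connect() returns immediately; zmq retries in the background
        # if a proxy is down or restarts later.
        sub.connect(endpoint)
    return sub
```

For example, `connect_subscriber(proxy_endpoints(["proxy1", "proxy2"]), b"orders")` would yield a socket whose `recv()` returns matching messages from whichever proxy forwarded them. As long as each publisher reaches only one proxy through the load balancer, a subscriber should see each message once.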
Publishers will come and go more often than proxies, so connecting subscribers directly to proxies results in less configuration maintenance since the number of proxies will, for the most part, be static.
If a proxy node fails, the upstream load balancers (LTMs) route traffic to the remaining proxy nodes; the subscribers won't be affected, since they consume from all available proxies.
Slow subscribers may be addressed with syncing; read up on this.
Check out subscription forwarding and minimizing network traffic here.