I'm trying to find the right tool for the job. I've explored a few different message queues such as Kafka and Kestrel, and I'm looking for something with PULL functionality.
I have a distributed API that pushes incoming messages into the queue. I'd then have workers (on separate machines) that pull messages from the queue. This ensures that the workers don't get flooded with more messages than they can handle.
I'm wondering whether Kafka or Kestrel supports this type of functionality?
Kafka does work on a push-pull basis (producers push, consumers pull) and is capable of handling large-scale real-time streams. Also, as mentioned in its documentation, Kafka's performance is effectively constant with respect to data size, so retaining lots of data will not be a problem.
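To make the pull side concrete, here is a minimal sketch of the idea in plain Python. It is a toy in-memory model, not Kafka's API: `ToyLog`, `fetch`, and `drain` are hypothetical names invented for illustration. It only shows that the worker decides how many messages to take per request, and that messages stay in the log after being read.

```python
class ToyLog:
    """Hypothetical in-memory stand-in for one retained, append-only log."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Producer side: push a message and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def fetch(self, offset, max_messages):
        """Consumer side: return up to max_messages starting at offset."""
        return self._messages[offset:offset + max_messages]


def drain(log, batch_size):
    """Pull everything currently in the log, batch_size messages at a time."""
    offset = 0
    batches = []
    while True:
        batch = log.fetch(offset, batch_size)
        if not batch:
            break  # nothing new; a real worker would wait and retry
        batches.append(batch)
        offset += len(batch)  # the worker, not the queue, advances its position
    return batches


log = ToyLog()
for i in range(5):
    log.append(f"msg-{i}")

batches = drain(log, batch_size=2)
print(batches)
```

Because the worker asks for at most `batch_size` messages per request, a slow worker simply falls behind in the log instead of being flooded.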
For processing the stream, check out Storm. It's a free, fault-tolerant, distributed real-time computation system that is very easy to scale. It does exactly what you've mentioned (running workers on separate machines), and it also supports transactional topologies. On top of that, it integrates very nicely with Apache Kafka.
For more on Storm, check here.
So typically you would retrieve messages from the Kafka queue using its consumer API and then feed them to a Storm cluster, which does the rest in a distributed manner. Kafka 0.8 provides two types of consumer API:
High-level (consumer group) API
Low-level (simple consumer) API
The former provides a high-level abstraction for consuming data and takes care of a lot of things, such as threading and error handling, while the latter allows much greater control over message handling, such as reading a message multiple times or handling messages transactionally.
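The difference between the two styles can be sketched with a toy model in plain Python. This is not the Kafka 0.8 client API: `ToyPartition` and `ToyGroupConsumer` are hypothetical names, and the sketch assumes a single partition with no threading or failure handling. It only illustrates who keeps track of the read position.

```python
class ToyPartition:
    """Hypothetical in-memory partition; messages are kept after reads."""

    def __init__(self, messages):
        self.messages = list(messages)

    def fetch(self, offset, max_messages=1):
        """Return up to max_messages starting at the given offset."""
        return self.messages[offset:offset + max_messages]


class ToyGroupConsumer:
    """High-level style: the consumer remembers its own position."""

    def __init__(self, partition):
        self._partition = partition
        self._offset = 0

    def poll(self, max_messages=1):
        batch = self._partition.fetch(self._offset, max_messages)
        self._offset += len(batch)  # position is advanced for the caller
        return batch


partition = ToyPartition(["a", "b", "c"])

# High-level style: just keep polling; the offset is tracked for you.
high_level = ToyGroupConsumer(partition)
first = high_level.poll(2)
second = high_level.poll(2)

# Low-level style: the caller owns the offset, so it can re-read at will.
replayed = partition.fetch(0, 2)

print(first, second, replayed)
```

The high-level style is less code for the common case, while owning the offset yourself is what makes replaying a message possible.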
High level consumer API example
Simple Consumer example