Stream processing architecture


I am in the process of designing a system where there's a main stream of objects and multiple workers, each of which produces some result from an object. Finally, there is a special/unique worker (sort of a "sink", in graph-theory terms) which takes all the results and processes them into a final object that is written to a DB.

It is possible for a worker to depend on the results of other workers (and hence to wait for them).

Now, I'm facing several problems:

  1. It could be that one worker is much slower than another. How do you deal with that? By adding more workers (i.e. scaling out) of the slower type, perhaps dynamically?
  2. Suppose W_B depends on W_A. If W_B is down for some reason, the flow stops and the whole system stops working. I'd like the system to be able to bypass this worker somehow.
  3. Moreover, how does the final worker decide when to operate on the set of results? Suppose it has the results of A and B but is missing the result of C. C may be down, or it may just be very slow at the moment. How can it make that decision?

It is worth mentioning that this is not a realtime application but rather an offline processing system (i.e. you may access the DB and alter a record), but at the same time it has to deal with a relatively large number of objects at a high pace.

Regarding technologies: I'm developing the system in Java, but I'm not bound to a specific technology.

I'd be glad if you could help me with the general design of the system.

Thanks a lot!


There are 2 answers

Answer from daniu:

As Peter said, it really depends on the use case. Some general remarks though:

  1. If a worker is slower than the others, maybe create more instances of that type; e.g. Kubernetes allows dynamic node creation, and Kafka lets you partition a topic so that more than one consumer instance can read from it and process it (a small consumer sketch follows after this list).

  2. If B depends on A and A is down, B can't work, and that's it. Maybe restart A? You could also run a regular health check on it.

  3. If the final worker needs the results of A, B and C, how would it process without C being available? If it can, it can store the results of A and B, start a timer, and if the timer goes off before C's result has arrived, continue without it (see the timeout sketch below).
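
For the first point, here is a minimal sketch of what scaling the slow worker type could look like with Kafka's consumer groups: if the topic feeding that worker type is partitioned, every additional instance started with the same `group.id` automatically picks up a share of the partitions. The topic name, group id and plain-String serialization here are assumptions for illustration, not part of the question.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class SlowWorkerInstance {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // All instances of this worker type share one group id, so Kafka
        // spreads the topic's partitions across however many instances are running.
        props.put("group.id", "slow-worker");                       // assumed group name
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("objects"));   // assumed topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record.value());   // the worker's actual (slow) computation
                }
            }
        }
    }

    private static void process(String object) {
        // placeholder for the real work
    }
}
```

Starting more instances of this process (for example by raising a Kubernetes replica count) is then the whole scaling story; no other worker needs to know about it.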
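And for the third point, one way to express "store the results of A and B, start a timer, and continue if C never shows up" is with `CompletableFuture` timeouts (Java 9+). The result types, the `combine` method and the 30-second budget are assumptions made to keep the sketch self-contained.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class FinalWorker {

    FinalResult combine(CompletableFuture<ResultA> resultA,
                        CompletableFuture<ResultB> resultB,
                        CompletableFuture<ResultC> resultC) {

        // A and B are required, so the sink simply waits for them.
        // C gets a deadline: if it has not arrived after 30 seconds (assumed budget),
        // the future completes with Optional.empty() and processing continues without it.
        CompletableFuture<Optional<ResultC>> cOrNothing =
                resultC.thenApply(Optional::of)
                       .completeOnTimeout(Optional.empty(), 30, TimeUnit.SECONDS);

        return CompletableFuture.allOf(resultA, resultB, cOrNothing)
                .thenApply(ignored -> buildFinalObject(
                        resultA.join(), resultB.join(), cOrNothing.join()))
                .join();
    }

    private FinalResult buildFinalObject(ResultA a, ResultB b, Optional<ResultC> c) {
        // combine the results and write the final record to the DB,
        // with or without C's contribution
        return new FinalResult();
    }

    // hypothetical result types, only here to keep the sketch compilable
    static class ResultA {}
    static class ResultB {}
    static class ResultC {}
    static class FinalResult {}
}
```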

Answer from David Anderson:

Some additional thoughts:

  1. If you mean that some subtasks of the overall application are quicker to execute than others, then it can be a good idea to slice up the application so that each worker does a bit of everything -- in other words, a share of the quick work and a share of the slow work. But if you mean that some machines are slower than others, then you could run fewer workers on the slow machines and more on the faster ones, so that each worker ends up with roughly the same resources.

  2. You might want to decouple your architecture with some sort of durable queueing between the workers (a producer sketch follows after this list).

  3. It's common to use heartbeats with timeouts and restarts (see the heartbeat sketch below).
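
On the second point, "durable queueing between the workers" can be as small as each worker writing its result to a persistent topic instead of calling the next worker directly. The topic name, key and String payload in this sketch are assumptions for illustration.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class WorkerAOutput {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // W_A publishes its result; W_B reads the topic whenever it is up.
            // If W_B is down, the result simply waits in the durable topic,
            // so a crash downstream no longer stalls the producer.
            producer.send(new ProducerRecord<>("worker-a-results",      // assumed topic
                                               "object-42",             // key: the object id
                                               "serialized result of W_A"));
        }
    }
}
```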
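For the third point, here is a minimal heartbeat-monitor sketch, assuming each worker reports a heartbeat every few seconds; the 10-second check period, the 30-second timeout and the restart hook are made-up details, not something prescribed by the answer.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HeartbeatMonitor {

    private final Map<String, Instant> lastHeartbeat = new ConcurrentHashMap<>();
    private final Duration timeout = Duration.ofSeconds(30);   // assumed timeout

    /** Workers call this (e.g. over HTTP or a queue) every few seconds. */
    public void heartbeat(String workerId) {
        lastHeartbeat.put(workerId, Instant.now());
    }

    /** Checks every 10 seconds (assumed period) and restarts anything that went silent. */
    public void start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            Instant deadline = Instant.now().minus(timeout);
            lastHeartbeat.forEach((workerId, seen) -> {
                if (seen.isBefore(deadline)) {
                    restart(workerId);
                }
            });
        }, 10, 10, TimeUnit.SECONDS);
    }

    private void restart(String workerId) {
        // hypothetical hook: tell the orchestrator (Kubernetes, a supervisor, ...) to restart it
        System.out.println("worker " + workerId + " missed its heartbeat, restarting");
    }
}
```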

Distributed stream processing quickly becomes very complex. Your life will be much easier if you build on top of a stream processing framework that provides high availability and exactly-once semantics out of the box.
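
To make that last point concrete, here is roughly what the pipeline's skeleton could look like in Apache Flink, picked here only as one example of such a framework (the choice of Flink, the operator names and the parallelism value are assumptions). The framework then takes care of checkpointing, restarts and exactly-once delivery instead of you wiring heartbeats and retries by hand.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PipelineSkeleton {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // the "main stream of objects"; a real job would read from Kafka or a DB instead
        DataStream<String> objects = env.fromElements("obj-1", "obj-2", "obj-3");

        // each worker becomes an operator; a slow step just gets a higher parallelism
        DataStream<String> resultsA = objects.map(new MapFunction<String, String>() {
            @Override
            public String map(String obj) {
                return "A(" + obj + ")";    // placeholder for W_A's computation
            }
        }).setParallelism(4);               // assumed value for the slow worker type

        // the "sink" worker writes the final object; print() stands in for the DB write
        resultsA.print();

        env.execute("offline-processing-pipeline");   // assumed job name
    }
}
```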