I am evaluating Apache Flink for my usecase. My question is about how to organize the code for a "complex" stream.

The usecase is an IoT process. Sensors produce events - this is the input of my stream. My stream application outputs alerts. The first step of my stream is to process some aggregated features on these data (average over window, min, max, etc). Second step of my stream is to run some "decision" process on input data and aggregated data. This second step is composed of 2 parallel processes:

  • First one is a set of user-defined rules (example: if temperature sensor average is >50°, but latest one is below 30°, then generate an alert)
  • Second one is to run some machine learning models

A graph of what I want to do:

                                             +-----------------+               
                  +----------------+         |   User rules    |------>  Alerts
                  |                |-------->|   (multiple)    |               
                  |   Aggregates   |         +-----------------+               
  Sensors ------->|                |                                           
                  |   (multiple)   |         +-----------------+               
                  |                |-------->|    ML rules     |-------> Alerts
                  +----------------+         |   (multiple)    |               
                                             +-----------------+               

How should I organize my Flink application?

I have in mind 3 ways to do it:

1) Put all my code in a single project

Pros:

  • This would put all the code in the same place, no need to switch to dozens of applications to understand how it works and what it does
  • I would not need to store intermediate results in any other topics - I would be able to use them directly.
  • Easy deployment

Cons:

  • The main file of the application could quickly become a mess (would it?).
  • I would have to redeploy everything every time I update something (a new rule, a new aggregate, etc)

2) Put the enrichment part in a project, put all the user defined rules in another one, put the machine learning part in another

Pros:

  • Code doing the same thing is in the same place
  • Looks easy to deploy. Only 3 applications to deploy

Cons:

  • I would have to use a broker so that producers and consumers can communicate (aggregates are written to a topic, and then user rules go read them to use them), and I would have to join streams

3) Every aggregate to process is a project, every rule is a project, every ML model is a project

Pros:

  • Easy update. Would scale well with the team.
  • Easy way for newcomers to write something and not break everything
  • Seems like it would scale well - time consuming user defined rule would not impact others

Cons:

  • A mess to keep track of what is deployed and their versions
  • I would have to use a broker so that producers and consumers can communicate (aggregates are written to a topic, and then user rules go read them to use them), and I would have to join streams
  • Lots of redundant code / maybe need to create libraries
  • Deployment could become a mess if I get to hundreds or thousands of aggregates and rules

I am missing the experience on Flink, and on Streaming in general, to know what would be the most fit way for my usecase. I am thinking about starting with the second solution, which seems to be the best compromise.

1 Answers

1
David Anderson On

One approach you may want to consider would be to stream in some of the slowly changing components, rather than compiling them in. The user rules, for example, or even the aggregate definitions and machine learning models. This will add complexity to the implementation, but allow for changes without having to redeploy.

RBEA from King and ING's work on streaming ML models are early examples of this pattern. With broadcast state it's now easier to build this kind of dynamic rules engine with Flink.