Event Sourcing in Apache Kafka


Using Kafka as an event store works fine; it's easy to just set message retention to unlimited.

But I've also seen reports of Kafka being used for event sourcing, and this is where I get confused about how that is possible. As an event store, I can just shove my messages in there and consume or replay them as needed.

But for event sourcing, you most likely want to read the events for a given entity/aggregate ID. You could of course use partitions, but that seems like abusing the concept, and it would be hard to add new entities since the partition count is more on the static side, even if you can change it. Is there any sane solution to this out there? The Apache Kafka docs themselves only mention event sourcing briefly.


There are 2 answers

miglanc

I think Apache Kafka is the best solution for storing events for Event Sourcing. Event Sourcing is closely related to, and usually works together with, CQRS, a concept/practice described by Greg Young, which I suggest you research.

The term "repository" in this answer means a repository in the Domain-Driven Design sense, as in Eric Evans' book.

I think I know the reason for your confusion.

But for event sourcing, you most likely want to read the events for a given entity/aggregate ID.

That part of your question is, in my opinion, true. But I think you wanted to express something different, something like this:

In event sourcing, when the repository is asked to retrieve an object from its data source, the repository must retrieve all the events that constitute that particular entity on every request. These events must then be replayed to build up the object.

Is that really what you wanted to express? Because the sentence above is, in my opinion, false.

You do not need to rebuild an object every time you retrieve it.

In other words, you don't need to replay all the events that constitute the object every time you retrieve it from the repository. You can apply events to the object as they arrive and store the current version of the object somewhere else, e.g. in a cache, or even better, both in a cache and in Kafka.

So let's walk through an example. Let's say we have trucks/lorries that are loaded and unloaded.

The main stream of events will be operations; this will be the first Kafka topic in our app. It will be our source of truth, as Jay Kreps usually puts it in his writing.

These are the events:

  • truck 1 is loaded with pigs
  • truck 2 is loaded with pigs
  • truck 2 is unloaded (pigs removed)
  • truck 2 is loaded with sand
  • truck 1 is unloaded (pigs removed)
  • truck 1 is loaded with flowers

The final result is that truck 1 is filled with flowers and truck 2 is filled with sand.
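As a concrete illustration, here is a minimal sketch of how those events could be published to the operations topic with the plain Java producer. The topic name, the broker address, and the simple string event format are assumptions made for this sketch, not something prescribed by the answer.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OperationsProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key by truck ID so all events for one truck land on one partition, in order.
            producer.send(new ProducerRecord<>("operations", "truck-1", "loaded:pigs"));
            producer.send(new ProducerRecord<>("operations", "truck-2", "loaded:pigs"));
            producer.send(new ProducerRecord<>("operations", "truck-2", "unloaded:pigs"));
            producer.send(new ProducerRecord<>("operations", "truck-2", "loaded:sand"));
            producer.send(new ProducerRecord<>("operations", "truck-1", "unloaded:pigs"));
            producer.send(new ProducerRecord<>("operations", "truck-1", "loaded:flowers"));
        }
    }
}
```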

What you do is read the events from this topic and populate your second topic: truckUpdated. The events you stream into the truckUpdated topic are the following:

  • truck 1: pigs
  • truck 2: pigs
  • truck 2: nothing
  • truck 2: sand
  • truck 1: nothing
  • truck 1: flowers

At the same time, with every message consumed, you update the current version of the truck in a cache, e.g. Memcached. So Memcached will be the direct source the repository uses to retrieve the truck object.
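Here is a rough sketch of that projection step: consume operations, derive each truck's current cargo, publish it to truckUpdated, and update a cache. A plain HashMap stands in for Memcached, and the "loaded:"/"unloaded:" event format is just the assumption carried over from the earlier sketch.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TruckProjection {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "truck-projection");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Map<String, String> cache = new HashMap<>(); // stand-in for Memcached

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(List.of("operations"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // "loaded:sand" -> current cargo is "sand"; any unload -> "nothing"
                    String cargo = record.value().startsWith("loaded:")
                            ? record.value().substring("loaded:".length())
                            : "nothing";
                    cache.put(record.key(), cargo);                                           // update the cache
                    producer.send(new ProducerRecord<>("truckUpdated", record.key(), cargo)); // and the 2nd topic
                }
            }
        }
    }
}
```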

What is more, you make the truckUpdated topic a compacted topic.

Read about compacted topics in the official Apache Kafka docs. There is a lot of interesting material about them on the Confluent blog and the LinkedIn Engineering blog (from before Confluent was started).
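For reference, compaction is just a per-topic configuration (cleanup.policy=compact). A minimal sketch of creating truckUpdated as a compacted topic with the Java AdminClient, with placeholder broker address, partition count, and replication factor:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // cleanup.policy=compact keeps only the latest value per key after compaction runs.
            NewTopic topic = new NewTopic("truckUpdated", 1, (short) 1)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                    TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```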

Because truckUpdated is compacted, Kafka makes it look like this after some time:

  • truck 2: sand
  • truck 1: flowers

Kafka will do this if you use the truck ID as the key for all messages; read in the docs what message "keys" are. So you end up with one message for every truck. If you find a bug in your app, you can replay the operations topic to populate your cache and the truckUpdated topic again. If your cache goes down, you can use the truckUpdated topic to repopulate it.
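As a sketch of what such a replay could look like: run the projection again with a fresh consumer group and auto.offset.reset=earliest, so every retained event in operations is processed from the beginning. The group id and broker address are placeholders.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReplayOperations {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "truck-projection-replay"); // fresh group: no committed offsets yet
        props.put("auto.offset.reset", "earliest");        // so consumption starts at the beginning
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("operations"));
            while (true) {
                // Re-apply the same projection logic as before to every record returned here,
                // rebuilding the cache and the truckUpdated topic from scratch.
                consumer.poll(Duration.ofMillis(500));
            }
        }
    }
}
```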

What do you think? Votes and comments are highly welcome.

Update:

(1) After some thought I have changed my mind about the quote from your question being true. I now find it false. So I do **not** think that, for event sourcing, you most likely want to read the events for a given entity/aggregate ID.

When you find a bug in your code, you want to replay all the events for all objects. It does not matter whether there are 2 entities, as in my simple example, or 10M entities.

Event sourcing is not about retrieving all the events for a particular entity. Event sourcing means that you have an audit log of all events and are able to replay them to rebuild your entities. You don't need to be able to rebuild a single particular entity.

(2) I strongly advise getting familiar with some posts from the Confluent and LinkedIn Engineering blogs. These were very interesting for me:

https://www.confluent.io/blog/making-sense-of-stream-processing/

https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

The official Kafka docs are a must too.

Dmitry Minkovsky

Regarding your comment on the other question:

Thanks for the effort here, but the answer is pretty off topic. "Is it what you want to express really?" No. The question is not about DDD nor CQRS; I'm well familiar with those. I'm asking how, or whether, I can use Kafka for event sourcing. Let's say I have 10 million entities; I might not want to have them all loaded in memory across servers at once. Can I load the data for a single aggregate using Kafka without replaying everything?

The answer is yes: you can use Kafka Streams to process the events. Your Streams logic builds aggregates and stores them in local state stores (backed by RocksDB), so the resulting aggregates don't all need to be in memory and can be accessed without replaying all the events. You can access these aggregates with the Interactive Queries API. It's quite nice! At this time, writing event-processing logic that's replayable is easier said than done, but not impossible by any means.
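To make that concrete, here is a minimal Kafka Streams sketch under the same assumptions as the earlier examples (an operations topic keyed by truck ID with a simple string event format): it folds each truck's events into its current cargo in a local state store, then looks up a single truck via the Interactive Queries API without replaying the whole log.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class TruckStateApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "truck-state-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("operations", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey()
               // Fold each truck's events into its current cargo; the result lives in a
               // local, RocksDB-backed state store (plus a changelog topic in Kafka).
               .aggregate(
                   () -> "nothing",
                   (truckId, event, current) ->
                       event.startsWith("loaded:") ? event.substring("loaded:".length()) : "nothing",
                   Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("truck-store"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Interactive query: read one truck's current state from the local store.
        // (In a real app you would wait for the RUNNING state and handle rebalances.)
        ReadOnlyKeyValueStore<String, String> store = streams.store(
            StoreQueryParameters.fromNameAndType("truck-store", QueryableStoreTypes.keyValueStore()));
        System.out.println("truck-1 carries: " + store.get("truck-1"));
    }
}
```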