Using Kafka as an event store works fine; it's easy to just set message retention to unlimited.
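For example (a minimal sketch; the topic name, partition count, and replication factor are just placeholders for illustration), retention can be disabled per topic when creating it with the AdminClient:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class CreateEventStoreTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // retention.ms=-1 and retention.bytes=-1 disable time- and size-based deletion,
            // so events are kept forever.
            NewTopic events = new NewTopic("events", 3, (short) 1)
                    .configs(Map.of("retention.ms", "-1", "retention.bytes", "-1"));
            admin.createTopics(Collections.singletonList(events)).all().get();
        }
    }
}
```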
But I've also seen reports of Kafka being used for event sourcing, and that is where I get confused about how it is possible. As an event store, I can just shove my messages in there and consume or replay them as needed.
But for event sourcing, you most likely want to read the events for a given entity/aggregate ID. You could of course use partitions, but that feels like abusing the concept, and it would be hard to add new entities since the partition count is fairly static, even though you can change it. Is there any sane solution to this out there? The Apache Kafka docs themselves only mention event sourcing briefly.
I think Apache Kafka is the best solution for storing events for event sourcing. Event sourcing is closely related to, and usually works together with, a concept/practice named CQRS by Greg Young, which I suggest you research.
The term repository, as I use it in this answer, means a repository in the Domain-Driven Design sense, as in Eric Evans's book.
I think I know the reason for your confusion.
Your statement that for event sourcing you most likely want to read the events for a given entity/aggregate ID is, in my opinion, true. But I think you wanted to express something different: that you need to replay all the events for a given entity/aggregate ID every time you retrieve it. Is that really what you meant? Because that sentence is, in my opinion, false. You do not need to rebuild an object every time you retrieve it.
In other words, you don't need to replay all the events that constitute the object every time you retrieve it from the repository. You can apply the events to the object once and store the current version of the object somewhere else, e.g. in a cache, or even better, both in a cache and in Kafka.
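A minimal sketch of that idea, assuming a simple string event format ("loaded:&lt;cargo&gt;" / "unloaded") and using an in-memory map as a stand-in for the cache; both are assumptions made for illustration:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// The aggregate is rebuilt by applying events, but the repository
// serves the cached current version whenever it has one.
class Truck {
    final String id;
    String cargo;          // null = empty

    Truck(String id) { this.id = id; }

    void apply(String event) {
        // "loaded:<cargo>" / "unloaded" is an assumed event format for this sketch
        if (event.startsWith("loaded:")) {
            cargo = event.substring("loaded:".length());
        } else if (event.equals("unloaded")) {
            cargo = null;
        }
    }
}

class TruckRepository {
    private final Map<String, Truck> cache = new ConcurrentHashMap<>(); // stand-in for memcached

    Truck get(String truckId, List<String> eventsForTruck) {
        // Served from the cache in the common case; replay only when the cache is cold.
        return cache.computeIfAbsent(truckId, id -> {
            Truck truck = new Truck(id);
            eventsForTruck.forEach(truck::apply);
            return truck;
        });
    }
}
```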
So let's walk through an example. Let's say we have trucks/lorries that get loaded and unloaded.
The main stream of events will be operations - this will be the first Kafka topic in our app. It will be our source of truth, as Jay Kreps often puts it in his writing.
The events in this topic describe load and unload operations on individual trucks. The final result is that truck 1 is filled with flowers and truck 2 is filled with sand.
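One possible sequence of operation events that produces that result, written as keyed Kafka messages; the keys, the string payloads, and the broker address are assumptions for illustration:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class OperationsProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keyed by truck ID so all operations for one truck stay ordered on one partition.
            producer.send(new ProducerRecord<>("operations", "truck-1", "loaded:flowers"));
            producer.send(new ProducerRecord<>("operations", "truck-2", "loaded:oranges"));
            producer.send(new ProducerRecord<>("operations", "truck-2", "unloaded"));
            producer.send(new ProducerRecord<>("operations", "truck-2", "loaded:sand"));
            producer.flush();
        }
    }
}
```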
What you do is read the events from this topic and populate a second topic: truckUpdated. Each event you stream into the truckUpdated topic carries the current state of one truck after the operation you just consumed.
At the same time, with every message consumed, you update the current version of the truck in a cache, e.g. memcached. Memcached then becomes the direct source from which the repository retrieves truck objects.
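A sketch of that projection step, under the same assumed event format as above; the consumer group name is made up, and a ConcurrentHashMap again stands in for memcached:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

public class TruckStateUpdater {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "truck-state-updater");
        consumerProps.put("auto.offset.reset", "earliest");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Map<String, String> cache = new ConcurrentHashMap<>(); // stand-in for memcached

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {

            consumer.subscribe(Collections.singletonList("operations"));

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    String truckId = record.key();
                    // Derive the truck's new current state from the operation event.
                    String currentState = record.value().startsWith("loaded:")
                            ? record.value().substring("loaded:".length())
                            : "empty";

                    // 1) Publish the current state, keyed by truck ID, to the compacted topic.
                    producer.send(new ProducerRecord<>("truckUpdated", truckId, currentState));

                    // 2) Update the cache that the repository reads from.
                    cache.put(truckId, currentState);
                }
            }
        }
    }
}
```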
What's more, you make the truckUpdated topic a compacted topic.
Read about compacted topics in the official Apache Kafka docs. There is a lot of interesting material about them on the Confluent blog and on the LinkedIn Engineering blog (from before Confluent was founded).
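Making the topic compacted is a per-topic configuration. A sketch of creating it that way (partition count and replication factor are placeholders):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // cleanup.policy=compact keeps only the latest message per key
            // instead of deleting old messages by age or size.
            NewTopic truckUpdated = new NewTopic("truckUpdated", 3, (short) 1)
                    .configs(Map.of("cleanup.policy", "compact"));
            admin.createTopics(Collections.singletonList(truckUpdated)).all().get();
        }
    }
}
```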
Because truckUpdated is compacted, Kafka will eventually retain only the latest message for each key, so after some time the topic holds just one current-state message per truck.
Kafka will do this if you use the truck ID as the key for all messages - read in the docs what message "keys" are. So you end up with one message per truck. If you find a bug in your app, you can replay the operations topic to populate your cache and the truckUpdated topic again. If your cache goes down, you can use the truckUpdated topic to repopulate it.
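Replaying is just running the same projection again from the start of the operations topic. A sketch, assuming you do it with a fresh consumer group that begins at the earliest retained offset:

```java
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ReplayOperations {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "truck-state-rebuild");   // fresh group, so no committed offsets apply
        props.put("auto.offset.reset", "earliest");     // start from the first retained event
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("operations"));
            while (true) {
                consumer.poll(Duration.ofMillis(500)).forEach(record -> {
                    // Re-run the same projection logic as the live updater:
                    // rebuild cache entries and re-publish current state to truckUpdated.
                });
            }
        }
    }
}
```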
What do you think? Votes and comments are very welcome.
Update:
(1) After some thought, I have changed my mind about your quoted statement being true. I now find it false. So I do **not** think that for event sourcing you most likely want to read the events for a given entity/aggregate ID.
When you find a bug in your code, you want to replay all the events for all objects. It does not matter whether there are 2 entities, as in my simple example, or 10M entities.
Event sourcing is not about retrieving all the events for a particular entity. Event sourcing means you have an audit log of all events and are able to replay them to rebuild your entities. You don't need to be able to rebuild a single particular entity in isolation.
(2) I strongly advise getting familiar with some posts from the Confluent and LinkedIn engineering blogs. These were very interesting to me:
https://www.confluent.io/blog/making-sense-of-stream-processing/
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
The official Kafka docs are a must too.