Let us say I have a partition (partition-0) with 4 segments that are committed and are eligible for compaction. So all these segments will not have any duplicate data since the compaction is done on all the 4 segments.
Now, there is an active segment which is still not closed. Meanwhile, if the consumer starts reading the data from the partition-0, does it also read the messages from active segment?
Note: My goal is to not provide duplicate data to the consumer for a particular key.
Your concerns are valid as the Consumer will also read the messages from the active segment. Log compaction does not guarantee that you have exactly one value for a particular key, but rather at least one.
Here is how Log Compaction is introduced in the documentation:
However, you can try to get the compaction running more frequently to have your active and non-compated segment as small as possible. This, however, comes at a cost as running the compaction log cleaner takes up ressources.
There are a lot of configurations at topic level that are related to the log compaction. Here are the most important and all details can be looked-up here:
However, I am quite convinced that you will not be able to guarantee that your consumer is never getting any duplicates with a log compacted topic.