Akka cluster sharding: recovering on journal corruption


This question may be a bit vague, but I'm unsure how to make it more precise.

When using the cluster sharding extension, you have to provide a persistence journal so that the extension can store its metadata (ShardRegionRegistered, ShardHomeAllocated, etc.).

This metadata is used when new actors are instantiated or moved across nodes, so that they can recover from their persisted state.

Suppose that for any reason your journal becomes corrupted (loses one entry, duplicates an entry, whatever). This leads to pretty bad exceptions at the actor's startup (Persistence recovery failure), possibly terminating the whole region if not correctly handled.
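To make the two failure modes concrete (a lost entry and a duplicated entry), here is a minimal sketch of a consistency check over the sequence numbers stored for one persistence ID. It uses plain Java and no Akka API: `JournalCheck` and `findProblems` are hypothetical names, and it assumes the sequence numbers have already been read from the journal store and that numbering starts at 1 with no deleted prefix.

```java
import java.util.ArrayList;
import java.util.List;

class JournalCheck {

    /**
     * Scans the sequence numbers of one persistence ID, in stored order,
     * and reports duplicates and gaps. Assumes numbering starts at 1.
     */
    static List<String> findProblems(long[] sequenceNrs) {
        List<String> problems = new ArrayList<>();
        long expected = 1;
        for (long nr : sequenceNrs) {
            if (nr < expected) {
                // Already seen this number (or it arrived out of order).
                problems.add("duplicate or out-of-order entry: " + nr);
            } else {
                if (nr > expected) {
                    // One or more entries were skipped.
                    problems.add("missing entries: " + expected + ".." + (nr - 1));
                }
                expected = nr + 1;
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Hypothetical data: entry 3 is stored twice and entry 5 is missing.
        long[] stored = {1, 2, 3, 3, 4, 6};
        findProblems(stored).forEach(System.out::println);
    }
}
```

A check like this only detects the corruption before recovery is attempted; it does not repair anything.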

What is the best way to manage this scenario? (I'm asking for ideas at any level of the stack, from the supervisor's policy to some sort of intervention directly on the journal). Thanks,

D.

1 answer

Diego Martinoia (best answer)

I checked with the Akka user group: better options may come in the future (they are still being researched), but for now it should be safe to stop the cluster, delete the cluster sharding metadata from the journal, and restart the cluster.

Unfortunately, there does not seem to be a way to do this without downtime.
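
As a rough illustration of the "delete the metadata" step, the sketch below removes only the sharding coordinator's entries and leaves the entities' own events alone. It is plain Java working on an in-memory map that stands in for the journal, not a real journal plugin. The ID layout `/sharding/<typeName>Coordinator` is an assumption that should be verified against the Akka version in use, and `ShardingMetadataCleanup`, `removeCoordinatorData` and `counter-42` are hypothetical names. Later Akka releases also document a `RemoveInternalClusterShardingData` utility for this purpose, so it is worth checking whether your version ships it before deleting anything by hand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ShardingMetadataCleanup {

    // Assumed ID layout for the coordinator's journal entries; verify it
    // against the Akka version in use before deleting anything.
    static String coordinatorPersistenceId(String typeName) {
        return "/sharding/" + typeName + "Coordinator";
    }

    /**
     * Removes the coordinator entries for the given entity type names from
     * an in-memory stand-in for the journal and returns the IDs removed.
     * Entries written by the entities themselves are left untouched.
     */
    static List<String> removeCoordinatorData(Map<String, List<String>> journal,
                                              List<String> typeNames) {
        List<String> removed = new ArrayList<>();
        for (String typeName : typeNames) {
            String id = coordinatorPersistenceId(typeName);
            if (journal.remove(id) != null) {
                removed.add(id);
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        // Run only while every node of the cluster is stopped.
        Map<String, List<String>> journal = new TreeMap<>();
        journal.put("/sharding/CounterCoordinator",
                Arrays.asList("ShardRegionRegistered", "ShardHomeAllocated"));
        journal.put("counter-42", Arrays.asList("Incremented")); // hypothetical entity data
        System.out.println("removed: " + removeCoordinatorData(journal, Arrays.asList("Counter")));
        System.out.println("kept: " + journal.keySet());
    }
}
```

After the restart, the coordinator starts from an empty state and shards are allocated again on demand, which is why the cluster has to be fully stopped first.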