I'm receiving events from an Event Hub using `EventProcessorHost` and an `IEventProcessor` class (call it `MyEventProcessor`). I scale this out to two servers by running my EPH on both servers, having them connect to the hub using the same `ConsumerGroup` but unique `hostName`s (the machine names).
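Presumably the setup on each server looks roughly like this (a sketch against the older `Microsoft.ServiceBus.Messaging` SDK that the stack trace below points to; the hub name, consumer group, and connection strings are placeholders):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.ServiceBus.Messaging;

class Program
{
    static async Task Main()
    {
        var eventHubConnectionString = "<event-hub-connection-string>"; // placeholder
        var storageConnectionString  = "<storage-connection-string>";   // placeholder

        // Same consumer group on both servers; hostName is unique per machine.
        var host = new EventProcessorHost(
            Environment.MachineName,    // hostName
            "myeventhub",               // placeholder event hub path
            "myconsumergroup",          // placeholder consumer group
            eventHubConnectionString,
            storageConnectionString);   // shared lease/checkpoint store

        // MyEventProcessor is the IEventProcessor implementation from the question.
        await host.RegisterEventProcessorAsync<MyEventProcessor>();
    }
}
```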
The problem is: at random hours of the day/night, the app logs this:
Exception information:

```
Exception type: ReceiverDisconnectedException
Exception message: New receiver with higher epoch of '186' is created hence current receiver with epoch '186' is getting disconnected. If you are recreating the receiver, make sure a higher epoch is used.
   at Microsoft.ServiceBus.Common.ExceptionDispatcher.Throw(Exception exception)
   at Microsoft.ServiceBus.Common.Parallel.TaskHelpers.EndAsyncResult(IAsyncResult asyncResult)
   at Microsoft.ServiceBus.Messaging.IteratorAsyncResult`1.StepCallback(IAsyncResult result)
```
This Exception occurs at the same time as a LeaseLostException, thrown from MyEventProcessor's CloseAsync method when it tries to checkpoint. (Presumably Close is being called because of the ReceiverDisconnectedException?)
I think this is occurring due to Event Hubs' automatic lease management when scaling out to multiple machines. But I'm wondering whether I need to do something differently to make it work more cleanly and avoid these exceptions. E.g., something with epochs?
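Not the actual code, but the shape of the problem: a `CloseAsync` that checkpoints unconditionally will throw when the close happens because the lease was lost, since another host already owns the lease blob by then. Guarding on the close reason avoids that:

```csharp
public async Task CloseAsync(PartitionContext context, CloseReason reason)
{
    // On CloseReason.LeaseLost another EPH instance already holds the lease,
    // so CheckpointAsync() would fail with a LeaseLostException.
    if (reason == CloseReason.Shutdown)
    {
        await context.CheckpointAsync();
    }
}
```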
TLDR: This behavior is absolutely normal.
Why can't Lease Management be smooth & exception-free: to give the developer more control over the situation.
The really long story - all the way from the basics
`EventProcessorHost` (hereby `EPH`; it is very similar to what the `__consumer_offsets` topic does for Kafka consumers: a partition-ownership and checkpoint store) is written by the Microsoft Azure EventHubs team themselves, to translate all of the EventHubs partition-receiver know-how into a simple `onReceive(Events)` callback. `EPH` is used to address 2 general, major, well-known problems while reading out of a high-throughput partitioned stream like EventHubs:

1. Fault-tolerant receive pipeline. For example, a simpler version of the problem: if the host running a `PartitionReceiver` dies and comes back, it needs to resume processing from where it left off. To remember the last successfully processed `EventData`, `EPH` uses the blob supplied to the `EPH` constructor to store the checkpoints, whenever the user invokes `context.CheckpointAsync()`. Eventually, when the host process dies (for example, abruptly reboots or hits a hardware fault and never comes back), any `EPH` instance can pick up this task and resume from that checkpoint (a minimal sketch of such a processor follows this list).

2. Balance/distribute partitions across `EPH` instances. Let's say there are 10 partitions and 2 `EPH` instances processing events from these 10 partitions; we need a way to divide the partitions across the instances (the `PartitionManager` component of the `EPH` library does this). We use the Azure Storage blob lease-management feature to implement this. As of version 2.2.10, to simplify the problem, `EPH` assumes that all partitions are loaded equally.
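To make the checkpointing in problem 1 concrete, here is a minimal sketch of a processor, assuming the older `Microsoft.ServiceBus.Messaging` SDK; checkpointing after every batch is an illustrative choice, not something `EPH` prescribes:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.ServiceBus.Messaging;

class CheckpointingProcessor : IEventProcessor
{
    public Task OpenAsync(PartitionContext context) => Task.CompletedTask;

    public async Task ProcessEventsAsync(PartitionContext context, IEnumerable<EventData> messages)
    {
        foreach (var eventData in messages)
        {
            // ... process eventData.GetBytes() ...
        }

        // Persist the offset of the last processed event to the lease blob.
        // If this host dies, any EPH instance can resume from this checkpoint.
        await context.CheckpointAsync();
    }

    public Task CloseAsync(PartitionContext context, CloseReason reason) => Task.CompletedTask;
}
```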
With this, let's try to see what's going on, in the above example of 10 event hub partitions and 2 `EPH` instances processing events out of them:

1. `EPH` instance `EPH1` started, at first, alone, and as part of start-up it created receivers to all 10 partitions and is processing events. During start-up, `EPH1` will announce that it owns all these 10 partitions by acquiring leases on 10 storage blobs representing these 10 event hub partitions (with a standard nomenclature, which `EPH` internally creates in the storage account, from the `StorageConnectionString` passed to the ctor). Leases are acquired for a set time, after which the `EPH` instance loses ownership of the partition.
2. `EPH1` continually announces, once in a while, that it still owns those partitions, by renewing the leases on the blobs. The frequency of renewal, along with other useful tuning, can be configured using `PartitionManagerOptions` (see the configuration sketch after this list).
3. `EPH2` starts up, and you supplied the same Azure storage account to the ctor of `EPH2` as for `EPH1`. Right now it has 0 partitions to process. So, to achieve balance of partitions across `EPH` instances, it will go ahead and download the list of all lease blobs, which holds the mapping of owner to partitionId. From this, it will STEAL leases for its fair share of partitions, which is 5 in our example, and will announce that information on those lease blobs. As part of this, `EPH2` reads the latest checkpoint of each `PartitionX` it wants to steal the lease for, and goes ahead and creates the corresponding `PartitionReceiver`s with the EPOCH same as the one in the checkpoint.
4. As a result, `EPH1` will lose ownership of these 5 partitions and will run into different errors based on the exact state it is in:
   - If `EPH1` is actually inside the `PartitionReceiver.Receive()` call while `EPH2` is creating the `PartitionReceiver` on the same partition, `EPH1` will experience a `ReceiverDisconnectedException`. This will eventually invoke `IEventProcessor.Close(CloseReason=LeaseLost)`. Note that the probability of hitting this specific exception is higher if the messages being received are larger or the `PrefetchCount` is smaller, as in both cases the receiver performs more aggressive I/O.
   - If `EPH1` is in the middle of checkpointing or renewing the lease while `EPH2` stole it, the `EventProcessorOptions.ExceptionReceived` event handler is signaled with a `LeaseLostException` (a 409 conflict error on the lease blob), which also eventually invokes `IEventProcessor.Close(LeaseLost)`.
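For reference, a sketch of where those knobs live, again assuming the older `Microsoft.ServiceBus.Messaging` SDK; the interval values and the names `myeventhub`/`myconsumergroup` are illustrative placeholders, not recommendations:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.ServiceBus.Messaging;

class Program
{
    static async Task Main()
    {
        var eventHubConnectionString = "<event-hub-connection-string>"; // placeholder
        var storageConnectionString  = "<storage-connection-string>";   // placeholder

        var host = new EventProcessorHost(
            Environment.MachineName, "myeventhub", "myconsumergroup",
            eventHubConnectionString, storageConnectionString);

        // How long a lease lasts before another instance may steal it, and how
        // often the owner renews it. (Illustrative values.)
        host.PartitionManagerOptions = new PartitionManagerOptions
        {
            LeaseInterval = TimeSpan.FromSeconds(30),
            RenewInterval = TimeSpan.FromSeconds(10)
        };

        var options = new EventProcessorOptions();
        options.ExceptionReceived += (sender, args) =>
        {
            // During rebalancing this fires with LeaseLostException /
            // ReceiverDisconnectedException; log it rather than treat it as fatal.
            Console.WriteLine($"EPH exception during '{args.Action}': {args.Exception.Message}");
        };

        await host.RegisterEventProcessorAsync<MyEventProcessor>(options);
    }
}
```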
Why can't Lease Management be smooth & exception-free:

To keep the consumer simple and error-free, lease-management-related exceptions could have been swallowed by `EPH` and not notified to the user code at all. However, we realized that throwing the `LeaseLostException` could empower customers to find interesting bugs, for which the symptom would be frequent partition moves:

- The network of one machine goes flaky for a while: `EPH1` fails to renew its leases and then comes back up. Now imagine the network of this machine stays flaky for a day: the `EPH` instances are going to play ping-pong with the partitions! This machine will continuously try to steal leases back from the other machine, which is legitimate from `EPH`'s point of view, but is a total disaster for the user of `EPH`, as it completely interferes with the processing pipeline. `EPH` would see exactly a `ReceiverDisconnectedException` when the network comes back up on this flaky machine. We think the best, and in fact the only, way is to enable the developer to smell this!
- `ProcessEvents` logic which throws unhandled exceptions that are fatal and bring down the whole process, e.g. a poison event. This partition is going to move around a lot (see the defensive sketch after this list).
- Any problem with the checkpoint store, e.g. the storage account `EPH` is also using being wiped by mistake (like an automated clean-up script), etc.
- An outage in the Azure data center where a specific EventHub partition is located, say a network incident. Partitions are going to move around across `EPH` instances.

Basically, in the majority of situations it would be tricky for us to detect the difference between these situations and a legitimate lease loss due to balancing, and we want to delegate control of these situations to the developer.
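On the poison-event bullet: one hedged way to keep a single bad event from crashing the process (and sending its partition into the ping-pong described above) is to contain per-event failures inside `ProcessEventsAsync`. `Handle` and `DeadLetter` below are hypothetical helpers standing in for your own logic:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.ServiceBus.Messaging;

class ResilientProcessor : IEventProcessor
{
    public Task OpenAsync(PartitionContext context) => Task.CompletedTask;
    public Task CloseAsync(PartitionContext context, CloseReason reason) => Task.CompletedTask;

    public async Task ProcessEventsAsync(PartitionContext context, IEnumerable<EventData> messages)
    {
        foreach (var eventData in messages)
        {
            try
            {
                Handle(eventData);
            }
            catch (Exception ex)
            {
                // Poison event: isolate it instead of letting it take down the
                // process; an unhandled throw here means this partition will
                // keep moving between EPH instances.
                DeadLetter(eventData, ex);
            }
        }
        await context.CheckpointAsync();
    }

    void Handle(EventData e) { /* your processing logic */ }
    void DeadLetter(EventData e, Exception ex) { /* hypothetical: stash for offline inspection */ }
}
```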
more on Event Hubs...