I'm running Eclipse Ditto v2.5.0 on EKS (Helm chart), and after a couple of days the service stops working: it no longer returns any results, and persisting new things fails as well. I've found the following in the logs:
2022-06-28T08:06:12+02:00 Caused by: akka.stream.RemoteStreamRefActorTerminatedException: [SourceRef-139] Remote partner [Actor[akka://[email protected]:2551/system/Materializers/StreamSupervisor-0/$$q2c-SinkRef-139#-1677314214]] has terminated unexpectedly and no clean completion/failure message was received (possible reasons: network partition or subscription timeout triggered termination of partner). Tearing down.
2022-06-28T08:06:12+02:00 at akka.stream.impl.streamref.SourceRefStageImpl$$anon$1.onTimer(SourceRefImpl.scala:374)
2022-06-28T08:06:12+02:00 at akka.stream.stage.TimerGraphStageLogic.onInternalTimer(GraphStage.scala:1665)
2022-06-28T08:06:12+02:00 at akka.stream.stage.TimerGraphStageLogic.$anonfun$getTimerAsyncCallback$1(GraphStage.scala:1654)
2022-06-28T08:06:12+02:00 at akka.stream.stage.TimerGraphStageLogic.$anonfun$getTimerAsyncCallback$1$adapted(GraphStage.scala:1654)
2022-06-28T08:06:12+02:00 at akka.stream.impl.fusing.GraphInterpreter.runAsyncInput(GraphInterpreter.scala:467)
2022-06-28T08:06:12+02:00 at akka.stream.impl.fusing.GraphInterpreterShell$AsyncInput.execute(ActorGraphInterpreter.scala:517)
2022-06-28T08:06:12+02:00 at akka.stream.impl.fusing.GraphInterpreterShell.processEvent(ActorGraphInterpreter.scala:625)
2022-06-28T08:06:12+02:00 at akka.stream.impl.fusing.ActorGraphInterpreter.akka$stream$impl$fusing$ActorGraphInterpreter$$processEvent(ActorGraphInterpreter.scala:800)
2022-06-28T08:06:12+02:00 at akka.stream.impl.fusing.ActorGraphInterpreter$$anonfun$receive$1.applyOrElse(ActorGraphInterpreter.scala:818)
2022-06-28T08:06:12+02:00 at akka.actor.Actor.aroundReceive(Actor.scala:537)
2022-06-28T08:06:12+02:00 at akka.actor.Actor.aroundReceive$(Actor.scala:535)
2022-06-28T08:06:12+02:00 at akka.stream.impl.fusing.ActorGraphInterpreter.aroundReceive(ActorGraphInterpreter.scala:716)
2022-06-28T08:06:12+02:00 ... 10 common frames omitted
2022-06-28T08:06:12+02:00 2022-06-28 08:06:12,408 ERROR [] o.e.d.i.u.a.ThingsAggregatorProxyActor akka://ditto-cluster/user/gatewayRoot/proxy/aggregatorProxy - [retrieve-thing-response] Upstream failed.
2022-06-28T08:06:12+02:00 akka.stream.RemoteStreamRefActorTerminatedException: [SourceRef-137] Remote partner [Actor[akka://[email protected]:2551/system/Materializers/StreamSupervisor-0/$$m2c-SinkRef-137#934810721]] has terminated unexpectedly and no clean completion/failure message was received (possible reasons: network partition or subscription timeout triggered termination of partner). Tearing down.
2022-06-28T08:06:12+02:00 at akka.stream.impl.streamref.SourceRefStageImpl$$anon$1.onTimer(SourceRefImpl.scala:374)
2022-06-28T08:06:12+02:00 at akka.stream.stage.TimerGraphStageLogic.onInternalTimer(GraphStage.scala:1665)
2022-06-28T08:06:12+02:00 at akka.stream.stage.TimerGraphStageLogic.$anonfun$getTimerAsyncCallback$1(GraphStage.scala:1654)
2022-06-28T08:06:12+02:00 at akka.stream.stage.TimerGraphStageLogic.$anonfun$getTimerAsyncCallback$1$adapted(GraphStage.scala:1654)
2022-06-28T08:06:12+02:00 at akka.stream.impl.fusing.GraphInterpreter.runAsyncInput(GraphInterpreter.scala:467)
2022-06-28T08:06:12+02:00 at akka.stream.impl.fusing.GraphInterpreterShell$AsyncInput.execute(ActorGraphInterpreter.scala:517)
2022-06-28T08:06:12+02:00 at akka.stream.impl.fusing.GraphInterpreterShell.processEvent(ActorGraphInterpreter.scala:625)
2022-06-28T08:06:12+02:00 at akka.stream.impl.fusing.ActorGraphInterpreter.akka$stream$impl$fusing$ActorGraphInterpreter$$processEvent(ActorGraphInterpreter.scala:800)
2022-06-28T08:06:12+02:00 at akka.stream.impl.fusing.ActorGraphInterpreter$$anonfun$receive$1.applyOrElse(ActorGraphInterpreter.scala:818)
2022-06-28T08:06:12+02:00 at akka.actor.Actor.aroundReceive(Actor.scala:537)
2022-06-28T08:06:12+02:00 at akka.actor.Actor.aroundReceive$(Actor.scala:535)
2022-06-28T08:06:12+02:00 at akka.stream.impl.fusing.ActorGraphInterpreter.aroundReceive(ActorGraphInterpreter.scala:716)
2022-06-28T08:06:12+02:00 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580)
2022-06-28T08:06:12+02:00 at akka.actor.ActorCell.invoke(ActorCell.scala:548)
2022-06-28T08:06:12+02:00 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
2022-06-28T08:06:12+02:00 at akka.dispatch.Mailbox.run(Mailbox.scala:231)
2022-06-28T08:06:12+02:00 at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
2022-06-28T08:06:12+02:00 at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
2022-06-28T08:06:12+02:00 at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
2022-06-28T08:06:12+02:00 at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
2022-06-28T08:06:12+02:00 at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
2022-06-28T08:06:12+02:00 at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
2022-06-28T08:06:12+02:00 2022-06-28 08:06:12,410 ERROR [78dae9eb-4515-4513-9930-3060f7ba9652] o.e.d.g.s.e.a.HttpRequestActor akka://ditto-cluster/user/$Xe - Got <Status.Failure> when a command response was expected: <akka.stream.RemoteStreamRefActorTerminatedException: [SourceRef-137] Remote partner [Actor[akka://[email protected]:2551/system/Materializers/StreamSupervisor-0/$$m2c-SinkRef-137#934810721]] has terminated unexpectedly and no clean completion/failure message was received (possible reasons: network partition or subscription timeout triggered termination of partner). Tearing down.>!
2022-06-28T08:06:12+02:00 java.util.concurrent.CompletionException: akka.stream.RemoteStreamRefActorTerminatedException: [SourceRef-137] Remote partner [Actor[akka://[email protected]:2551/system/Materializers/StreamSupervisor-0/$$m2c-SinkRef-137#934810721]] has terminated unexpectedly and no clean completion/failure message was received (possible reasons: network partition or subscription timeout triggered termination of partner). Tearing down.
2022-06-28T08:06:12+02:00 at org.eclipse.ditto.gateway.service.endpoints.actors.AbstractHttpRequestActor.lambda$getResponseAwaitingBehavior$21(AbstractHttpRequestActor.java:387)
2022-06-28T08:06:12+02:00 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
2022-06-28T08:06:12+02:00 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
2022-06-28T08:06:12+02:00 at scala.PartialFunction.applyOrElse(PartialFunction.scala:214)
and
2022-06-27T16:22:19+02:00 2022-06-27 16:22:19,305 ERROR [] a.m.c.b.i.HttpContactPointBootstrap akka://[email protected]:2551/system/bootstrapCoordinator/contactPointProbe-10-20-68-87.ditto.pod.cluster.local-8558 - Overdue of probing-failure-timeout, stop probing, signaling that it's failed
How can I debug this and determine what the root cause might be?
The logs indicate that you tried to retrieve several things via HTTP. The gateway service received this error, as the HttpRequestActor entry shows ("Got <Status.Failure> when a command response was expected").
The ThingsAggregatorProxyActor is used to fetch each thing you requested from the things service in your EKS cluster.
I would check the Ditto health endpoint first.
Assuming you use nginx in your EKS cluster, you should be able to call it as the devops user at localhost:30080/status/health.
If you aren't using nginx, call the gateway pod directly, for example: gateway:8080/status/health
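As a rough sketch of that health check, the script below calls the endpoint with basic auth and lists every entry whose status is not UP. The URL and the devops user come from the text above; the environment variable name DITTO_DEVOPS_PASSWORD and the assumed response shape (nested objects carrying "label", "status" and "children") are assumptions, so adjust them to what your deployment actually returns.

```python
import base64
import json
import os
import urllib.request


def fetch_health(url, user, password):
    """GET the health endpoint with HTTP basic auth and return the parsed JSON."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def unhealthy(node, path=""):
    """Recursively collect (path, status) for every entry whose status is not UP.

    Assumes each health entry is a dict with "label" and "status" keys and
    optional nested entries; other shapes are simply walked through.
    """
    found = []
    if isinstance(node, dict):
        label = node.get("label", "")
        here = f"{path}/{label}" if label else path
        status = node.get("status")
        if isinstance(status, str) and status != "UP":
            found.append((here or "/", status))
        for value in node.values():
            found.extend(unhealthy(value, here))
    elif isinstance(node, list):
        for item in node:
            found.extend(unhealthy(item, path))
    return found


if __name__ == "__main__":
    # Hypothetical env var holding the devops password; use gateway:8080 without nginx.
    health = fetch_health(
        "http://localhost:30080/status/health",
        "devops",
        os.environ["DITTO_DEVOPS_PASSWORD"],
    )
    for entry_path, entry_status in unhealthy(health):
        print(f"{entry_status}\t{entry_path}")
```

If the things service or its persistence shows up in that list, that is the place to continue looking.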
Also check the logs of the things pod, and whether that pod was restarted or had any other kind of issue.
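To see quickly whether any pod was restarted, a small sketch like the following sums the container restart counts reported by kubectl. It assumes kubectl is configured for your cluster and that Ditto runs in a namespace called "ditto", which is a placeholder.

```python
import json
import subprocess


def restart_counts(pod_list):
    """Map pod name -> total container restarts from `kubectl get pods -o json` output."""
    counts = {}
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        counts[pod["metadata"]["name"]] = sum(s.get("restartCount", 0) for s in statuses)
    return counts


def load_pods(namespace):
    """Run kubectl and return the parsed pod list for the given namespace."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    # "ditto" is a placeholder namespace; replace it with the one your Helm release uses.
    for name, restarts in sorted(restart_counts(load_pods("ditto")).items()):
        print(f"{restarts:3d}  {name}")
```

For a things pod with a non-zero count, kubectl logs <pod> --previous shows the output of the container instance that ran before the restart.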