Strange errors with stream management in ejabberd


I’m building an instant messenger app on iOS that uses ejabberd. I’m currently testing the stream management feature, in particular resumption, which seems to work in most cases. However, there is one case I don’t understand. I can reproduce it with the following steps, using the settings resume_timeout: 30 and resend_on_timeout: if_offline:

  • at the beginning client A and client B are connected, no other resources are connected
  • client B crashes or disconnects uncleanly
  • client A starts to send a bunch of messages (10+) very quickly
  • ejabberd sends an ack to A for each message sent to confirm that the messages reached the server
  • about 20 seconds after the crash, B reconnects. At that moment A receives an error for each message it sent earlier:
<message xmlns="jabber:client" from="clientB@mydomain" to="clientA@mydomain/resourceID" type="error" id="CFBF4583-209A-4453-2567-CCCC7894827E">
   <body>test</body>
   <active xmlns="http://jabber.org/protocol/chatstates" />
   <request xmlns="urn:xmpp:receipts" />
   <error code="503" type="cancel">
       <service-unavailable xmlns="urn:ietf:params:xml:ns:xmpp-stanzas" />
   </error>
</message>

I tried with ejabberd 16.01.

This happens 80% of the time; sometimes messages sent by A are correctly delivered to B on reconnection within the 30 seconds.

My questions are:

  • is this behavior correct? I would expect that no error is bounced to client A if an ack has already been received for a message.
  • since resend_on_timeout is set to if_offline and no other resource is connected, I would expect no errors at all. Am I correct?

There are 2 answers

xnyhps (best answer)
  • Stream Management acks only indicate that the message has been received by your server. They don't imply that the message has been processed or delivered to the specified address. Even if it were delivered to that address, the receiving device could still return an error for the stanza.
  • This is really just a stab in the dark, but after having a glance over the ejabberd code, this could be what happens:

    1. clientB@mydomain/ResourceB drops their connection, there is now a session awaiting resumption using ResourceB.
    2. Client B reconnects, doesn't resume (because it crashed and lost its state).
    3. Client B binds resource ResourceB again.
    4. Now the server has to terminate the sleeping session that was waiting for resumption because client B requested the same resource.
    5. The server checks whether there are other sessions because it is set to if_offline.
    6. The server sees there is a session (the new session) and therefore chooses to bounce instead of resend.

    So my theory is that if_offline only checks whether there are other sessions at the moment the queue of unacknowledged messages needs to be handled, not at the time the message was originally received.
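
    To make the theory concrete, here is a toy model in Python. It is only a sketch of the steps above, not ejabberd's actual code: the class ToyServer and its methods drop, deliver, bind and expire are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyServer:
    """Toy model of one account's sessions (hypothetical, not ejabberd internals)."""
    active: set = field(default_factory=set)     # resources with a live session
    pending: dict = field(default_factory=dict)  # resource -> unacked messages

    def drop(self, resource):
        # Unclean disconnect: the session goes to sleep, awaiting resumption.
        self.active.discard(resource)
        self.pending.setdefault(resource, [])

    def deliver(self, resource, msg):
        # A message sent to a sleeping session joins its unacknowledged queue.
        self.pending[resource].append(msg)

    def _handle_queue(self, queue):
        # if_offline is evaluated here, when the queue is handled,
        # not when the messages were originally received.
        action = "bounce" if self.active else "resend"
        return [(action, msg) for msg in queue]

    def bind(self, resource):
        # A fresh session binds the same resource (steps 2 and 3), so the
        # sleeping session is terminated while the new one already exists.
        self.active.add(resource)
        return self._handle_queue(self.pending.pop(resource, []))

    def expire(self, resource):
        # resume_timeout elapsed and nobody reconnected.
        return self._handle_queue(self.pending.pop(resource, []))


server = ToyServer(active={"ResourceB"})
server.drop("ResourceB")
server.deliver("ResourceB", "test")
print(server.bind("ResourceB"))  # the new session counts as "another session"
```

    In this model a plain timeout with no reconnect takes the resend branch, while a reconnect without resumption takes the bounce branch, which matches what the question describes.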

Holger Weiß

@xnyhps' response is correct, and I fixed this particular corner case for the next ejabberd release. However, @xnyhps is also correct that there are other corner cases, so if you want reliable message delivery, you should be using XEP-0313. The main feature of XEP-0198 is session resumption.
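
As a rough illustration of the XEP-0313 approach, a client that reconnects without resuming can ask the archive for everything after the last message it has seen. The Python sketch below only builds such a query stanza. The helper name build_catch_up_query is made up, and the urn:xmpp:mam:2 namespace is an assumption: check via service discovery which MAM version your server actually advertises, since older releases use an earlier one.

```python
import xml.etree.ElementTree as ET

# Assumption: the server advertises this MAM version; adjust if it does not.
MAM_NS = "urn:xmpp:mam:2"
RSM_NS = "http://jabber.org/protocol/rsm"  # Result Set Management (XEP-0059)


def build_catch_up_query(query_id, last_archive_id=None):
    """Build an <iq/> that requests archived messages (hypothetical helper).

    If last_archive_id is given, only messages stored after that archive
    entry are requested; otherwise the query is unrestricted.
    """
    iq = ET.Element("iq", {"type": "set", "id": query_id})
    query = ET.SubElement(iq, f"{{{MAM_NS}}}query", {"queryid": query_id})
    if last_archive_id is not None:
        rsm = ET.SubElement(query, f"{{{RSM_NS}}}set")
        after = ET.SubElement(rsm, f"{{{RSM_NS}}}after")
        after.text = last_archive_id
    return ET.tostring(iq, encoding="unicode")


# Example: resume from the last archive id the client stored before crashing.
print(build_catch_up_query("catchup1", "last-seen-archive-id"))
```

The client has to persist the last archive id it processed, so that a crash does not lose the catch-up point.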