We have been seeing the following behavior and aren't sure if this is a known bug or simply a misconfiguration or misuse of the library.
- Using the curator-framework 2.7.0 library from a Scala app, with zookeeper-3.4.5
- Run a Scala app that connects to a local ZK server at 127.0.0.1:2181. We have reproduced this with different retry policies, but to keep it simple let's assume our retry policy sleeps for 30 seconds and retries indefinitely (a minimal sketch of this setup follows the list below).
- Tail both the Scala app logs and the local ZK server logs.
- Run "sudo iptables -A OUTPUT -p tcp --dport 2181 -j DROP" and wait.
- Eventually we see SUSPENDED state-change logs appear in the Scala app log.
- Eventually "Session expiration" logs appear on zk server logs.. If we lift iptables now the scala app will register a LOST followed by a RECONNECTED. This is what we expect.
- If, instead of lifting iptables right after the server logs the session expiration, we continue waiting, we see retryPolicy events fire in the log and fail. Still expected, as far as I can tell.
- The problem arises if we lift iptables only after a "long time", during which several retries have occurred. What appears to happen here is a RECONNECTED with a new session id and no LOST state change. The end result is that we are connected but have lost all our ephemeral data and do not attempt to rebuild it, because that logic was tied to the LOST state change.
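
For reference, here is a minimal sketch of the setup we use to reproduce this. The timeouts and the use of RetryNTimes to approximate "retry forever, sleeping 30s" are illustrative, not our exact code:

```scala
import org.apache.curator.framework.{CuratorFramework, CuratorFrameworkFactory}
import org.apache.curator.framework.state.{ConnectionState, ConnectionStateListener}
import org.apache.curator.retry.RetryNTimes

object ZkClientSetup {
  def main(args: Array[String]): Unit = {
    // Approximates "retry indefinitely, sleep 30s between attempts"
    val retryPolicy = new RetryNTimes(Integer.MAX_VALUE, 30000)

    val client: CuratorFramework = CuratorFrameworkFactory.newClient(
      "127.0.0.1:2181", // local ZK server
      15000,            // session timeout (ms) - illustrative value
      5000,             // connection timeout (ms) - illustrative value
      retryPolicy
    )

    // Log every connection state change so SUSPENDED / LOST / RECONNECTED show up in the app log
    client.getConnectionStateListenable.addListener(new ConnectionStateListener {
      override def stateChanged(c: CuratorFramework, newState: ConnectionState): Unit =
        println(s"Connection state changed: $newState")
    })

    client.start()
  }
}
```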
It appears that this has something to do with the client session id "timing out" or being "cleared", such that on reconnect the server assumes the client already knows the session expired. Can anyone confirm this? Our current thought is to cache the session id before and after and simulate our own LOST state change (sketched below), but this feels like we are fighting the API.
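
To make the workaround concrete, here is roughly what we mean by caching the session id. rebuildEphemerals is a placeholder for our own recreation logic, not a Curator API:

```scala
import org.apache.curator.framework.CuratorFramework
import org.apache.curator.framework.state.{ConnectionState, ConnectionStateListener}
import java.util.concurrent.atomic.AtomicLong

// Hypothetical workaround: remember the session id and, on RECONNECTED, treat a changed
// session id as if a LOST had been delivered, so ephemeral nodes get rebuilt.
class SessionChangeListener(rebuildEphemerals: () => Unit) extends ConnectionStateListener {
  private val lastSessionId = new AtomicLong(-1L)

  override def stateChanged(client: CuratorFramework, newState: ConnectionState): Unit =
    newState match {
      case ConnectionState.CONNECTED | ConnectionState.RECONNECTED =>
        val sessionId = client.getZookeeperClient.getZooKeeper.getSessionId
        val previous  = lastSessionId.getAndSet(sessionId)
        if (previous != -1L && previous != sessionId) {
          // New session: all ephemerals created under the old session are gone
          rebuildEphemerals()
        }
      case _ => // SUSPENDED, LOST, etc. - no action here
    }
}
```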
Thanks
The Curator connection states are not directly related to the standard ZooKeeper events. So LOST does not mean ZK session loss. It means that Curator believes the connection has been lost based on your retry settings, etc. See the Notifications section here: http://curator.apache.org/errors.html
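
As a rough illustration of that distinction, here is one way to read the states in a listener. The comments are an interpretation only; see the Notifications section of the linked page for the authoritative definitions:

```scala
import org.apache.curator.framework.CuratorFramework
import org.apache.curator.framework.state.{ConnectionState, ConnectionStateListener}

class StateLogger extends ConnectionStateListener {
  override def stateChanged(client: CuratorFramework, newState: ConnectionState): Unit =
    newState match {
      case ConnectionState.SUSPENDED =>
        // Connection to the ensemble dropped; the ZK session may still be alive on the server
        println("SUSPENDED: connection lost, session possibly still valid")
      case ConnectionState.LOST =>
        // Curator has decided, per the retry policy, that the connection is gone;
        // this is Curator's judgement, not a 1:1 mapping of server-side session expiration
        println("LOST: Curator considers the connection lost")
      case ConnectionState.RECONNECTED =>
        println("RECONNECTED: connection re-established (the session id may have changed)")
      case other =>
        println(s"State: $other")
    }
}
```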