Checking a TcpClient is actually connected


I have, like many, been delving into the subject of testing whether TCP sessions are active/alive. It seems like an unnecessarily difficult problem with too many half-effective solutions. A connection doesn't know anything until it tests itself, and even then attempts to send may succeed despite the connection actually being lost. Polling seems to deliver false positives for connection. Some servers are configured to not respond to pings. The only real test seems to be trying to make a fresh connection and sensing whether the attempt was successful. This seems unnecessarily heavy-handed, and it seems a bit crazy that the protocol doesn't have a lightweight way of answering the question 'at this particular instant, is it possible to transfer data from client to server and verify that it was received?'

I am working with the .NET Framework and the TCP objects it exposes. Surely disconnecting the network cable would create an immediate signal to all consumers that the connection was lost. This isn't the case, however, and nothing I can sense about the connection is aware of this loss. Only trying to re-establish the connection discovers that the physical link has been broken.

What am I missing?

JimD. (best answer)

TCP doesn't really work the way you seem to think it does, although there are some things we can do to make it work better for you. But first let's understand a little better how it works and why you see the behavior you do.

When you open a TCP connection, TCP uses a 3-way handshake to set up the connection. The client sends a SYN, the server responds with SYN+ACK, and then the client sends back an ACK. If neither side tries to send anything, the connection will just sit there idle, and no packets are exchanged. You can unplug the cable from your machine. A tree can fall and take out your internet service. The internet provider can come repair your internet service, and you can plug the cable back into the Ethernet port. And then the client can write to the socket and the data should still be delivered to the server. (Firewalls unfortunately deliberately break standards, and your firewall may have decided to time out the connection while you were waiting for your ISP to fix your service.) However, if you tried to make another connection while the cable was unplugged, TCP would try to send a SYN, and most likely discover that there is "no route to host." So it can't set up a new connection.
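The fresh-connection probe mentioned in the question can be sketched in a few lines. This is a minimal illustration in Python rather than .NET; the function name `can_connect` and the default timeout are assumptions made for the example. Note what it does and does not prove: it shows that a new handshake can complete right now, not that an existing idle connection is still usable.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a brand-new TCP connection to (host, port) can be opened."""
    try:
        # create_connection only returns once the SYN / SYN+ACK / ACK
        # handshake has completed (or raises if it cannot).
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts or networks,
        # and handshake timeouts.
        return False
```

A caller might use `can_connect("example.com", 443)` as a coarse reachability check before deciding to tear down and rebuild a session.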

If you had tried to write to the socket while your internet service was out, TCP would try to send the data and wait for an ACK from the server. After a retransmission timeout, if it hasn't received an ACK, it will try again and exponentially back off on the timeout. After a fixed number of tries it will give up. On Linux the default is 15 retries, which takes at least about 15 minutes and can take longer depending on the measured round-trip time; other operating systems use different retry limits.
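To get a feel for how exponential backoff adds up, here is a small arithmetic sketch. The numbers are assumptions chosen to resemble Linux defaults (an initial retransmission timeout of 0.2 s, a cap of 120 s per wait, 15 retries); real stacks derive the initial timeout from the measured round-trip time, so actual figures vary by operating system and network.

```python
def total_retransmit_time(retries: int,
                          initial_rto: float = 0.2,
                          max_rto: float = 120.0) -> float:
    """Sum the waits for the original send plus `retries` retransmissions.

    Each wait doubles the previous one, capped at max_rto (all in seconds).
    """
    total = 0.0
    rto = initial_rto
    # One wait after the original transmission, then one after each retry.
    for _ in range(retries + 1):
        total += min(rto, max_rto)
        rto *= 2
    return total


seconds = total_retransmit_time(15)
print(f"{seconds:.1f} s, about {seconds / 60:.0f} minutes")
```

Under these assumptions the waits sum to 924.6 seconds, roughly a quarter of an hour, during which a blocked write gives the application no failure signal.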

As you can see, TCP is trying to be resilient in the face of failure, whereas you want to learn about failures very quickly. Systems that need to react quickly to connection failure (such as electronic stock exchanges which typically cancel open orders on connection failure) handle this as part of a higher level protocol by sending heartbeat messages periodically and taking action when a heartbeat is sufficiently overdue.
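The heartbeat idea can be sketched as a small bookkeeping class. This is a hypothetical illustration, not any particular exchange's protocol: the class name, the `max_missed` threshold, and the injectable clock are all assumptions made for the example. The application would call `record_heartbeat()` whenever a heartbeat message arrives from the peer and check `is_overdue()` on a timer.

```python
import time


class HeartbeatMonitor:
    """Tracks when the peer was last heard from (hypothetical helper)."""

    def __init__(self, interval: float, max_missed: int = 3,
                 clock=time.monotonic):
        self.interval = interval      # expected seconds between heartbeats
        self.max_missed = max_missed  # missed intervals tolerated
        self._clock = clock           # injectable so the logic can be tested
        self._last_seen = clock()

    def record_heartbeat(self) -> None:
        """Call whenever a heartbeat (or any message) arrives from the peer."""
        self._last_seen = self._clock()

    def is_overdue(self) -> bool:
        """True once more than max_missed intervals passed with no heartbeat."""
        elapsed = self._clock() - self._last_seen
        return elapsed > self.interval * self.max_missed
```

With `interval=1.0` and `max_missed=3`, the application learns about a dead peer after a little over three seconds, instead of waiting for TCP's own retransmission limit.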

But if you can't control the protocol, there are some socket options you can use to improve the situation. SO_KEEPALIVE causes TCP to periodically send keepalive packets and it will eventually time out depending on the settings of TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT. TCP_USER_TIMEOUT allows you to set a timeout for how long data written to a socket can remain unacknowledged.

How exactly these two options work and interact is implementation dependent, and you have to consider what is going to happen when there is no unacknowledged data, when there is unacknowledged data, and when there is a slow consumer resulting in a zero window. In general it is advisable to use them together, with TCP_USER_TIMEOUT set to (TCP_KEEPIDLE + TCP_KEEPINTVL*TCP_KEEPCNT) * 1000 to get a consistent result. The factor of 1000 is there because, on Linux, the keepalive settings are in seconds while TCP_USER_TIMEOUT is in milliseconds.
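As a rough sketch of applying these options together, the Python snippet below enables keepalive and derives the user timeout from the formula above. The helper names and the default values (10 s idle, 5 s interval, 3 probes) are assumptions for illustration. Which constants exist depends on the platform and Python build, so the code only sets the ones the `socket` module actually exposes; TCP_USER_TIMEOUT in particular is a Linux option.

```python
import socket


def user_timeout_ms(idle_s: int, interval_s: int, count: int) -> int:
    """(TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT) * 1000, in milliseconds."""
    return (idle_s + interval_s * count) * 1000


def enable_keepalive(sock: socket.socket, idle_s: int = 10,
                     interval_s: int = 5, count: int = 3) -> list:
    """Turn on keepalive and set the tuning options this platform exposes.

    Returns the names of the TCP-level options that were applied.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    wanted = [
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", count),         # unanswered probes before giving up
        ("TCP_USER_TIMEOUT", user_timeout_ms(idle_s, interval_s, count)),
    ]
    applied = []
    for name, value in wanted:
        option = getattr(socket, name, None)  # missing on some platforms
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
            applied.append(name)
    return applied
```

With the defaults above, a dead peer should be detected after roughly 25 seconds whether or not there is unacknowledged data in flight, though the exact behavior remains implementation dependent.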

Our friends at Cloudflare have a nice blog entry about how exactly these work together, but unfortunately it only covers Linux. I'm not aware of anything as comprehensive as this for Windows.