I realise that I'll get at least one answer along the lines of "(re)write the code so it doesn't hang" but let's assume we don't live in that shiny happy utopia just yet...
In our embedded system we have a big SDK including a web-server (Boa) which is the primary method of user interaction.
It's possible, during certain phases of the moon, that something can cause the web server to hang or become otherwise stuck in such a way that the process appears running normally (not crashed/dead/using 100% CPU) but does not serve any web pages.
So, the question is, how do we test/detect this situation?
To test whether the server is hung, create a TCP socket and connect to port 80 on IP address 127.0.0.1 (the loopback address). Then send a minimal HTTP GET request for / over the socket. Most servers will interpret that as a request for index.html. Alternatively, you could implement an undocumented URL for testing, which allows for a shorter, predetermined response.
You then need to read the response from the server. This involves using select with a reasonable timeout to determine whether any data came back from the server, and if so, using recv to read the data. The response from the server will consist of a header followed by content. The header consists of lines of text, with a blank line at the end of the header. Lines end with \r\n, so the end of the header is \r\n\r\n.
Getting the content involves calling select and recv until recv returns 0. This assumes that the server will send the response and then close the socket. Some sophisticated servers will leave a socket open to allow multiple requests over the same socket. A simple embedded server should not be doing that. (If your server is trying to use the same socket for multiple requests, then you need to figure out how to turn that feature off.)
That's all very well and good, but you really need to rewrite your code so it doesn't hang.
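The probe described above (connect, send a request, then select and recv until recv returns 0) could be sketched roughly as follows. This is only an illustration under some assumptions: the function name server_responds, the HTTP/1.0 request line, and the 5 second default timeout are my own choices, not anything taken from Boa or your SDK. It is written in Python for brevity; the select, recv and connect calls map one-to-one onto the C socket API.

```python
import select
import socket
import time


def server_responds(host="127.0.0.1", port=80, timeout=5.0):
    """Return True if a complete HTTP response header arrives before the deadline.

    Hypothetical helper: the name, request line and default timeout are assumptions.
    """
    # HTTP/1.0 is assumed so that the server closes the socket after replying.
    request = b"GET / HTTP/1.0\r\n\r\n"
    deadline = time.monotonic() + timeout
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return False  # connection refused or timed out
    response = b""
    with sock:
        try:
            sock.sendall(request)
            while True:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False  # server accepted the connection but went quiet
                readable, _, _ = select.select([sock], [], [], remaining)
                if not readable:
                    return False  # select timed out: treat the server as hung
                chunk = sock.recv(4096)
                if not chunk:
                    break  # recv returned 0 bytes: server closed the socket
                response += chunk
        except OSError:
            return False
    # A healthy reply starts with a status line and contains the blank line
    # that terminates the header.
    return response.startswith(b"HTTP/") and b"\r\n\r\n" in response
```

A watchdog could call server_responds() every so often and restart the web server after a few consecutive failures. The check treats "connected but never answered" the same as "could not connect", which is the hung-but-running case from the question.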
The most likely cause of the problem is that the server has a bunch of dangling sockets, i.e. connections from clients that were never properly cleaned up. Dangling sockets will eventually prevent the server from accepting more connections, either because the server has a limit on the number of open connections, or because the process that's running the server uses up all of its file descriptors.
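One way to see whether the server process is creeping towards its file descriptor limit is to compare the number of descriptors it has open against its soft limit. The sketch below assumes a Linux target with /proc mounted; the function name fd_usage is made up for illustration, and you need to run it as the same user as the server (or as root) to read another process's /proc entries.

```python
import os


def fd_usage(pid):
    """Return (open_fd_count, soft_limit) for a process, using Linux /proc.

    Hypothetical helper. soft_limit is None if the limit line cannot be parsed.
    When inspecting your own process, the count includes the descriptor that
    is temporarily opened to list the directory.
    """
    used = len(os.listdir(f"/proc/{pid}/fd"))
    soft_limit = None
    with open(f"/proc/{pid}/limits") as limits:
        for line in limits:
            # Line format: "Max open files  <soft>  <hard>  files"
            if line.startswith("Max open files"):
                fields = line.split()
                if fields[3].isdigit():
                    soft_limit = int(fields[3])
                break
    return used, soft_limit
```

If the first number keeps growing while the server sits idle, that points at sockets that are never being closed.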
The first thing to check is the TCP timeout value. One project that I worked on had a default timeout of 5 hours, which meant that dangling sockets stayed open for 5 hours. A reasonable timeout is 1 minute.
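Then you need to create a client that deliberately misbehaves. Clients can misbehave by leaving a connection open without ever using or closing it, by closing the socket gracefully when the server isn't expecting it, or by closing the socket abruptly.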
The first situation should be handled by the TCP timeout. The other two need to be properly handled by the server code. Graceful and abrupt socket closure is controlled via the SO_LINGER option of setsockopt and the shutdown function. After the client misbehaves, check the number of open file descriptors in the server process, to verify that the server has handled the situation correctly.
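A misbehaving test client along these lines might look like the sketch below. The function names are hypothetical and the partial request bytes are just an example. The abrupt close relies on setting SO_LINGER with a zero linger time, which on Linux makes close send a RST instead of the usual FIN; the struct layout used here (two ints) is the Linux one and is an assumption for other platforms.

```python
import socket
import struct


def open_idle(host="127.0.0.1", port=80, timeout=5.0):
    """Connect and send nothing; the caller just holds the socket open."""
    return socket.create_connection((host, port), timeout=timeout)


def send_partial_request(sock):
    """Send an incomplete request line so the server is left waiting for more."""
    sock.sendall(b"GET / HT")


def close_gracefully(sock):
    """Half-close first (sends FIN), then release the descriptor."""
    sock.shutdown(socket.SHUT_WR)
    sock.close()


def close_abruptly(sock):
    """Enable SO_LINGER with a zero timeout so close() resets the connection."""
    linger = struct.pack("ii", 1, 0)  # l_onoff=1, l_linger=0
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, linger)
    sock.close()
```

For example, open a few dozen connections with open_idle, call send_partial_request on each, then close half with close_gracefully and half with close_abruptly. Compare the server's open descriptor count before and after; it should return to its starting value once the server has noticed the closures.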