Windows TCP connection failures and retransmissions


I have intermittent TCP connection issues in a complex application that runs on Windows.

I'm trying to determine if the problem is with my code, or a bug in Windows itself.

The system consists of a client application, a server application, and a web application GUI. The GUI connects to the server on the API port, and the client application connects on a different port.

In my test setup, the client program connects through an SSH tunnel that redirects to the server running on the same system as the client. The server also listens on the API port on localhost, handled on a different thread.

The code runs on build 2004 of Windows 10 in VMware workstation.

At certain points in time the server temporarily stops responding to SYN packets. New connections take 2 or 3 seconds to establish, and existing connections lag due to retransmissions. Since, from the server's/Windows's perspective, all connections come from localhost and are handled on two different threads, I've exhausted every explanation in my own code that could account for the issue.

The issue appears roughly every 20 minutes, which also makes me suspect that something else is wrong, unrelated to my code.
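To correlate the stalls with the 20-minute cadence, a standalone probe can time connect() against the server in a loop and log outliers. A minimal sketch, assuming the server listens on 127.0.0.1:8080 (the port is a placeholder, not taken from the original setup):

#include <winsock2.h>
#include <ws2tcpip.h>
#include <chrono>
#include <cstdio>
#include <thread>
#pragma comment(lib, "ws2_32.lib")

int main()
{
    WSADATA WsaData = {};
    WSAStartup(MAKEWORD(2, 2), &WsaData);

    sockaddr_in Target = {};
    Target.sin_family = AF_INET;
    Target.sin_port   = htons(8080);  // placeholder: substitute the real API port
    inet_pton(AF_INET, "127.0.0.1", &Target.sin_addr);

    for (;;)
    {
        SOCKET Probe = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
        if (Probe == INVALID_SOCKET)
            break;

        auto Start  = std::chrono::steady_clock::now();
        int  Result = connect(Probe, (sockaddr*)&Target, sizeof(Target));
        auto Ms     = std::chrono::duration_cast<std::chrono::milliseconds>(
                          std::chrono::steady_clock::now() - Start).count();

        // A loopback handshake normally completes in well under a millisecond;
        // hundreds of milliseconds implies at least one retransmitted SYN.
        if (Result == SOCKET_ERROR || Ms > 100)
            printf("connect: result=%d, %lld ms\n", Result, (long long)Ms);

        closesocket(Probe);
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }

    WSACleanup();
}

Note that the probe itself burns one ephemeral port per second, so the interval should be kept modest to avoid muddying the measurement.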

I had the chance to obtain a packet dump from a connection attempt using curl, which looks like this:

[packet capture: curl connection attempt over localhost showing retransmitted SYNs and a ~3 second delay before the server responds]

As the capture shows, there's a good 3-second delay on localhost before the server responds! The server is a super simple polling design; I can't spot what the problem is. The server code responsible for accepting this connection looks like this:


cR<void> cWindowsTCPServer::HasConnectionsPending()
{
    fd_set ReadSet = {};

    FD_ZERO(&ReadSet);

    FD_SET((SOCKET)_GenericFD, &ReadSet);

    // Zero timeout: select() returns immediately, making this a non-blocking poll.
    timeval Timeout = {};
    // Passing 0 for nfds is not a bug on Windows: the argument is ignored there.
    int SelectResult = select(0, &ReadSet, NULL, NULL, &Timeout);

    if (SelectResult == SOCKET_ERROR)
        return cR<void>(false);

    return cR<void>(FD_ISSET((SOCKET)_GenericFD, &ReadSet) != 0);
}

cR<std::shared_ptr<iSocketBase>> cWindowsTCPServer::AcceptConnection()
{
    // SOCKET is pointer-sized on Win64; storing it in a uint32_t would truncate the handle.
    SOCKET TempFD = INVALID_SOCKET;

    sockaddr_in RemoteAddress = {};

    int AddrLength = sizeof(RemoteAddress);

    TempFD = accept((SOCKET)_GenericFD, (sockaddr*)&RemoteAddress, &AddrLength);

    if (TempFD == INVALID_SOCKET)
        return cR<std::shared_ptr<iSocketBase>>(false);

    // This would not work for IPv6, but IPv4 is hardcoded in all clients that connect here...
    std::string Ip   = inet_ntoa(RemoteAddress.sin_addr);
    uint16_t    Port = ntohs(RemoteAddress.sin_port);

    return cR<std::shared_ptr<iSocketBase>>(true, std::make_shared<cWindowsTCPSocket>(TempFD, Ip, Port));
}

void cAPIServer::handle_server()
{
    while ((bool)_server_socket->HasConnectionsPending())
    {
        auto accepted_client = _server_socket->AcceptConnection();

        // Guard against a failed accept before handing the socket to a worker thread.
        if (!(bool)accepted_client)
            break;

        std::thread(&cAPIServer::handle_client, this, accepted_client.Value()).detach();
    }
}

void cAPIServer::server_main()
{
    while (_is_running)
    {
        handle_server();

        // Poll for new connections every 5 ms rather than blocking in select().
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
    }
}
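One thing the excerpt doesn't show is how the listening socket is created. If the backlog passed to listen() is small, a stalled accept loop lets the pending-connection queue fill, at which point Windows stops answering SYNs, which would look exactly like this symptom. A minimal sketch of the kind of setup this code presumably sits on (the function name and loopback binding are assumptions, not taken from the original):

#include <winsock2.h>
#include <cstdint>
#pragma comment(lib, "ws2_32.lib")

// Hypothetical setup for the listening socket behind cWindowsTCPServer;
// the original post does not show this part.
SOCKET CreateListenSocket(uint16_t Port)
{
    SOCKET ListenFD = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    if (ListenFD == INVALID_SOCKET)
        return INVALID_SOCKET;

    sockaddr_in Local = {};
    Local.sin_family      = AF_INET;
    Local.sin_port        = htons(Port);
    Local.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    // SOMAXCONN lets the stack choose a full-sized pending-connection queue;
    // a small hardcoded backlog here would make dropped SYNs far more likely.
    if (bind(ListenFD, (sockaddr*)&Local, sizeof(Local)) == SOCKET_ERROR ||
        listen(ListenFD, SOMAXCONN) == SOCKET_ERROR)
    {
        closesocket(ListenFD);
        return INVALID_SOCKET;
    }

    return ListenFD;
}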

The client, server, and SSH tunnel together cycle through unused ports at a rate of about 6 per second. But from everything I've read, the Windows port exhaustion issue doesn't appear until a client uses about 33 connections per second. In perfmon and netstat there are never more than about 22 connections active at a time, and I see only about 60 connections in a TIME_WAIT state before they are reclaimed by the system. There are 64k ports available for connecting, so I don't think that's it.
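To double-check those numbers without eyeballing netstat, the TCP connection table can be read directly from code. A small sketch using GetExtendedTcpTable; counting logic only, with error handling kept minimal:

#include <winsock2.h>
#include <iphlpapi.h>
#include <vector>
#pragma comment(lib, "iphlpapi.lib")
#pragma comment(lib, "ws2_32.lib")

// Counts IPv4 sockets currently in TIME_WAIT, the same figure netstat shows.
int CountTimeWait()
{
    DWORD Size = 0;
    GetExtendedTcpTable(NULL, &Size, FALSE, AF_INET,
                        TCP_TABLE_OWNER_PID_ALL, 0);

    std::vector<char> Buffer(Size);
    if (GetExtendedTcpTable(Buffer.data(), &Size, FALSE, AF_INET,
                            TCP_TABLE_OWNER_PID_ALL, 0) != NO_ERROR)
        return -1;

    auto* Table = (MIB_TCPTABLE_OWNER_PID*)Buffer.data();
    int Count = 0;
    for (DWORD i = 0; i < Table->dwNumEntries; ++i)
        if (Table->table[i].dwState == MIB_TCP_STATE_TIME_WAIT)
            ++Count;
    return Count;
}

Sampling this once a second around a bad window would show whether TIME_WAIT counts actually spike before the SYN drops begin.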

The interval at which it shows up is always around 20 minutes. Port exhaustion would also only affect new connections, but the screenshot clearly shows that the first data-carrying packet from the client is also retransmitted twice, after the connection was established.

Have I made a mistake in my code? Or is there something I have missed?

Edit:

I've since run the following experiments:

  • Run the client on the VM and the server on my host Windows 10 machine. The results are the same.

  • Remove all other network adapters (such as OpenVPN), even though these were not active. Results are the same.

  • Reboot the system(s) involved. Results are the same.

  • Disable Windows Defender real-time scanning. Results are the same.

  • Open an ncat listener on another port when I notice the packet loss occurring, and connect to it. It seems laggier than normal, but I didn't take the time to measure this accurately, so I might be wrong (see the minimal listener sketch after this list).

  • Run a netsh trace session and open the events (nothing special, but there was an enormous number of events, so I could easily have missed something).

  • Disable mpp and profiles via netsh (Windows's TCP SYN-flood/memory-pressure protection), which had no effect.

  • Connect the client directly to the server instead of through the SSH tunnel; same results.
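To make the ncat observation reproducible, the same test can be done with a bare listener that shares no code with the application; if its accepts also stall during a bad window, the problem sits below the application layer. A sketch (the port is arbitrary):

#include <winsock2.h>
#include <cstdio>
#pragma comment(lib, "ws2_32.lib")

// Bare-bones accept loop that shares nothing with the application under test.
int main()
{
    WSADATA WsaData = {};
    WSAStartup(MAKEWORD(2, 2), &WsaData);

    SOCKET ListenFD = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

    sockaddr_in Local = {};
    Local.sin_family      = AF_INET;
    Local.sin_port        = htons(9999);  // arbitrary test port
    Local.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    if (bind(ListenFD, (sockaddr*)&Local, sizeof(Local)) == SOCKET_ERROR ||
        listen(ListenFD, SOMAXCONN) == SOCKET_ERROR)
        return 1;

    for (;;)
    {
        SOCKET Client = accept(ListenFD, NULL, NULL);
        if (Client != INVALID_SOCKET)
        {
            printf("accepted\n");  // timestamp this in practice
            closesocket(Client);
        }
    }
}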

Edit 2:

I've noticed a couple more things that deepen the mystery for me. If I terminate the server and client the moment the packet loss occurs, and then restart them, the issue is still there. It must be some kind of OS-level state, like a novel port exhaustion issue in Windows.

There's no persistence between restarts: no shared database, shared configuration, or anything similar. Neither the server nor the client re-uses state from previous runs, so I think this confirms that the issue is not in my code. Even if I instruct the code to use a different port, the packets keep dropping.
