Investigating TCP timeouts

As hinted in an earlier post, one of my latest work projects was building a WebRTC-based video streaming system. The system features a Websocket backend server handling client sessions and signaling, written in Python with Gevent. My co-worker was performing some testing and everything seemed to be working just fine. Except that at one point he encountered a situation where the server insisted another client was already connected, when it clearly wasn’t. Even netstat said the TCP connection was ‘established’.

Some Websocket client connections were arbitrarily staying open even after the browser was closed! I had just added some hooks to the server to better detect disconnected / timed-out clients, and a good guess was that I had messed something up. Nope. I debugged the thing for hours but couldn’t find a single bug.

That is, until I tried toggling network connections. Being a system targeted at mobile devices, one facet of testing is to check how well it works with different network connections. If the network connection was turned off while the client was connected, the server couldn’t detect that until about ten to fifteen minutes later, even though it was sending keep-alive packets every 20 seconds. Strange indeed.

But maybe it wasn’t; maybe the write call was blocking that specific greenlet? That’s an easy thing to test: just dump some stack traces. But nope again. How about running the server under strace and trying to spot the writes there? It took a bit of effort, but the strace output revealed that the write calls were performed just fine! This was starting to be most troubling…
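For reference, here is a minimal sketch of the stack-dump trick. This version walks ordinary threads via the standard library; with Gevent one would instead walk `gc.get_objects()` for greenlet instances and pass each one’s `gr_frame` to `traceback.print_stack` — same idea, and the helper name here is my own:

```python
import sys
import traceback
import io

def dump_all_stacks(out=sys.stderr):
    # Print a stack trace for every running thread, so a blocked call
    # (like a stuck write) would show up frozen at the same line on
    # every dump. For greenlets, iterate gc.get_objects() instead and
    # call traceback.print_stack(g.gr_frame) on each live greenlet.
    for thread_id, frame in sys._current_frames().items():
        print(f"--- thread {thread_id} ---", file=out)
        traceback.print_stack(frame, file=out)
```

Calling this a few times and diffing the output quickly shows whether anything is actually stuck.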

But then, a revelation: write returned EPIPE. After quite a bit of research I had finally found the reason for this behavior: TCP timeouts. Turning off the network connection really did just that — it cut the connection without giving the protocol stack time to say its goodbyes. The server then thought the client just had a really bad connection and tried resending the queued data with an exponential backoff, per the TCP spec. My math didn’t quite match, but in an old thread the total timeout was calculated to be 924.6 seconds with the default TCP parameters. That was quite close to the actual timeout observed with the server.
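The 924.6-second figure can be reproduced by summing the retransmission waits. Assuming the usual Linux defaults — a minimum RTO of 200 ms, a maximum of 120 s, and `tcp_retries2 = 15` — the kernel waits once after the initial send and once after each retry, doubling the RTO each time:

```python
def total_retransmission_time(rto=0.2, rto_max=120.0, retries=15):
    # Wait once for the initial transmission plus once per retry,
    # doubling the retransmission timeout and capping it at rto_max.
    total = 0.0
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, rto_max)
    return total

print(total_retransmission_time())  # roughly 924.6 seconds
```

Ten doublings from 0.2 s sum to 204.6 s, and the remaining six waits are capped at 120 s each, giving 924.6 s total — right in that ten-to-fifteen-minute window.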

* * *

I sighed and changed the keep-alive protocol so that timely replies were required instead of just relying on failed writes. Now it works beautifully.
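The reply-based scheme boils down to tracking when the last pong arrived rather than whether a write succeeded. A minimal sketch — the class, the interval, and the 30-second deadline are all my own illustration, not the actual server code:

```python
import time

PING_INTERVAL = 20.0   # matches the 20 s keep-alive interval above
PONG_TIMEOUT = 30.0    # hypothetical deadline for a timely reply

class KeepAlive:
    """Judge liveness by timely pong replies, not by write success."""

    def __init__(self, now=time.monotonic):
        self._now = now            # injectable clock, handy for testing
        self._last_pong = now()

    def on_pong(self):
        # Called whenever the client answers a ping.
        self._last_pong = self._now()

    def is_alive(self):
        # Writes may "succeed" into the kernel's send buffer for many
        # minutes; a missing reply exposes a dead peer within seconds.
        return self._now() - self._last_pong < PONG_TIMEOUT
```

A periodic task sends pings every `PING_INTERVAL` seconds and drops any session whose `is_alive()` turns false — no waiting on the kernel’s retransmission timer.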

Tl;dr: TCP was just trying really hard to resend data after it detected packet loss, only giving up after about fifteen whole minutes had passed.