Investigating slow startup of a gevent-based server application

Fast iteration time is critical when developing something new, and everything is fine as long as the server takes half a second to start. But when that server takes ten seconds to start, things get annoying. So annoying that I had no choice but to spend several hours digging for the reason behind the slowdown.

The server in question was a Python/gevent/pywsgi server, the kind I have been using for quite some time now. This was a new problem, one I had not encountered before, and of course I wanted to get to the bottom of it.

First I placed some strategic print statements here and there, but those didn't help. Next I fired up the debugger and suspended the process during startup. Nothing low-level was blocking socket creation, and gevent was happily running its event loop, so it couldn't be gevent's fault; it has always done a great job of not blocking anything. (This is where I made a 'mistake', see the last paragraph.)

I was convinced that it was something low-level, so it was time to bring out the big guns: API Monitor. It could log every low-level API call, and then it would just be a matter of digging through them all. And there was indeed some digging to be done before I finally found what I was looking for. The gevent event loop really did spin as it should, but the socket began processing data only after a gethostbyaddr call returned.

That strange character in the API Monitor log is actually FE 80 00, the rest of the address not rendering as a string.


This low-level function was taking a long time to execute. It was actually run in another thread, communicating its result back to the main thread via a socket. As its argument it was given the link-local IPv6 address of the Hyper-V bridge adapter (the first adapter listed by ipconfig). Maybe due to some kind of misconfiguration, or whatever the reason, that call took an excessive amount of time.
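
As a sanity check, the slowness can be reproduced outside the server by timing the reverse lookup directly from a plain Python shell. The address below is just a placeholder; substitute the link-local address that ipconfig reports for the affected adapter.

    import socket
    import time

    # Placeholder address: use the adapter's actual link-local IPv6 address.
    addr = "fe80::1"

    start = time.monotonic()
    try:
        print(socket.gethostbyaddr(addr))
    except OSError as exc:
        print("reverse lookup failed:", exc)
    print(f"gethostbyaddr took {time.monotonic() - start:.1f} s")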

Now that I knew what was happening, I wanted to know why. Intuition brought me to gevent's socket.py wrapper, where I inserted some tracing code into its implementation of gethostbyaddr. That in turn told me that as part of creating the socket the server's default WSGI environ was initialized. One of its entries was SERVER_NAME. If it was not already set, it was resolved via getfqdn, which called gethostname and that devilish gethostbyaddr. Only after the server name was resolved could the socket begin accepting connections.
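
I did the tracing by editing gevent's own socket.py, but roughly the same information can be gathered without touching the library by wrapping the function at runtime before the server is created. A minimal sketch of that idea (exactly which module's gethostbyaddr ends up being called depends on the gevent version and on monkey patching, so editing the library as I did is the sure-fire way):

    import time
    import traceback

    from gevent import socket as gevent_socket

    _real_gethostbyaddr = gevent_socket.gethostbyaddr

    def traced_gethostbyaddr(addr):
        # Show who is asking for the reverse lookup and how long it takes.
        traceback.print_stack()
        start = time.monotonic()
        try:
            return _real_gethostbyaddr(addr)
        finally:
            print(f"gethostbyaddr({addr!r}) took {time.monotonic() - start:.1f} s")

    gevent_socket.gethostbyaddr = traced_gethostbyaddr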

Now that I had the general reason, I didn't want to bother myself more than I had to. As the mechanism was already there, all it took was to pass environ={'SERVER_NAME': 'whatever'} as a keyword argument to WSGIServer.
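
Put together, the workaround looks roughly like this (the handler and port here are placeholders, not the actual application):

    from gevent.pywsgi import WSGIServer

    def app(environ, start_response):
        start_response('200 OK', [('Content-Type', 'text/plain')])
        return [b'hello\n']

    # With SERVER_NAME pre-set, pywsgi never needs to call getfqdn()
    # during startup, so the slow reverse lookup is skipped entirely.
    server = WSGIServer(('127.0.0.1', 8000), app,
                        environ={'SERVER_NAME': 'whatever'})
    server.serve_forever()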

* * *

Had I thought a bit more before letting go of the debugger and starting API Monitor, I would probably have thought of looking at the individual greenlet stack traces. Those would have clearly shown that the getfqdn call was blocking the main server greenlet.
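
For the record, dumping those stack traces is not hard. The classic recipe walks the garbage collector for live greenlets and prints the frame of each suspended one, roughly as below; run it from the debugger (or a signal handler) while the process appears stuck, and a blocked getfqdn call would show up in the trace of whichever greenlet is starting the server. Newer gevent versions also ship similar helpers in gevent.util.

    import gc
    import traceback

    import greenlet

    # Print the current stack of every suspended greenlet.
    for obj in gc.get_objects():
        if isinstance(obj, greenlet.greenlet) and obj.gr_frame is not None:
            print(f"--- {obj!r} ---")
            traceback.print_stack(obj.gr_frame)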
