Context
Right now I have a simple production setup of python-socketio served on top of eventlet, following the documentation.
This application interacts mainly with Redis via the official redis-py package, and the eventlet pool size is currently 2048 (I would bet it is just a legacy number that was magically chosen a long time ago). Each handler performs quite a few calls to Redis (2-3 on average).
It runs on Python 3.10 with redis==4.5.5, the latest python-socketio==5.10.0, and eventlet==0.33.0.
We can imagine a simplified version of the server app as ⬇️
# server.py
import eventlet
eventlet.monkey_patch()

import socketio
import redis

eventlet_pool = eventlet.GreenPool(2048)
redis_client = redis.Redis(
    host="localhost",
    port=6379,
    socket_timeout=0.1,
)
sio = socketio.Server(
    ping_timeout=60,
    ping_interval=60,
    logger=False,
    engineio_logger=False,
)
app = socketio.WSGIApp(sio, socketio_path="socket.io")

@sio.event
def connect(sid, *args, **kwargs):
    print("Connected")
    return True

@sio.on("message")
def handle_message(sid, key, **kwargs):
    redis_client.get(key)
    return True

if __name__ == "__main__":
    eventlet.wsgi.server(
        eventlet.listen(("", 7777)),
        app,
        custom_pool=eventlet_pool,
        debug=False,
    )
Under significant load from a single user, this application works fine. It can handle thousands of "message" events; the only effect is back pressure on latency, which is acceptable.
However, as soon as more users kick in, the bottleneck shows up immediately: requests to Redis start timing out.
redis.exceptions.TimeoutError: Timeout reading from socket
Traceback (most recent call last):
File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/redis/connection.py", line 210, in _read_from_socket
data = self._sock.recv(socket_read_size)
File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/eventlet/greenio/base.py", line 370, in recv
return self._recv_loop(self.fd.recv, b'', bufsize, flags)
File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/eventlet/greenio/base.py", line 364, in _recv_loop
self._read_trampoline()
File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/eventlet/greenio/base.py", line 332, in _read_trampoline
self._trampoline(
File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/eventlet/greenio/base.py", line 211, in _trampoline
return trampoline(fd, read=read, write=write, timeout=timeout,
File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/eventlet/hubs/__init__.py", line 159, in trampoline
return hub.switch()
File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/eventlet/hubs/hub.py", line 313, in switch
return self.greenlet.switch()
TimeoutError: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/eventlet/hubs/hub.py", line 476, in fire_timers
timer()
File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/eventlet/hubs/timer.py", line 59, in __call__
cb(*args, **kw)
File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/eventlet/hubs/__init__.py", line 151, in _timeout
current.throw(exc)
File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/eventlet/greenthread.py", line 221, in main
result = function(*args, **kwargs)
File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/socketio/server.py", line 584, in _handle_event_internal
r = server._trigger_event(data[0], namespace, sid, *data[1:])
File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/socketio/server.py", line 609, in _trigger_event
return self.handlers[namespace][event](*args)
File "projects/isolated-socketio--python-3-10-4/tst.py", line 33, in handle_message
redis_client.get(key)
File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/redis/commands/core.py", line 1801, in get
return self.execute_command("GET", name)
File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/redis/client.py", line 1269, in execute_command
return conn.retry.call_with_retry(
File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/redis/retry.py", line 49, in call_with_retry
fail(error)
File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/redis/client.py", line 1273, in <lambda>
lambda error: self._disconnect_raise(conn, error),
File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/redis/client.py", line 1259, in _disconnect_raise
raise error
File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/redis/retry.py", line 46, in call_with_retry
return do()
File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/redis/client.py", line 1270, in <lambda>
lambda: self._send_command_parse_response(
File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/redis/client.py", line 1246, in _send_command_parse_response
return self.parse_response(conn, command_name, **options)
File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/redis/client.py", line 1286, in parse_response
response = connection.read_response()
File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/redis/connection.py", line 874, in read_response
response = self._parser.read_response(disable_decoding=disable_decoding)
File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/redis/connection.py", line 347, in read_response
result = self._read_response(disable_decoding=disable_decoding)
File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/redis/connection.py", line 357, in _read_response
raw = self._buffer.readline()
File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/redis/connection.py", line 260, in readline
self._read_from_socket()
File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/redis/connection.py", line 223, in _read_from_socket
raise TimeoutError("Timeout reading from socket")
The production Redis server is far from its limits (<10% CPU and RAM) and went through sophisticated stress testing before I dug deeper into python-socketio itself.
To reproduce the issue, I created the following client ⬇️, which opens a single connection and sends one message per second.
# client.py
import random
import string
import time

import socketio

if __name__ == "__main__":
    with socketio.SimpleClient() as sio:
        sio.connect(url="ws://localhost:7777/", transports=['websocket'])
        print("Client connected.")
        for i in range(10):
            sio.emit("message", random.choice(string.ascii_letters))
            time.sleep(1)
    print("Finished.")
Questions
What are the recommendations for doing the math to work out the right number of green threads per single python-socketio process?
Any other recommendations on how to run it at scale? Could a different underlying server implementation (e.g. ASGI with uvicorn) allow me to squeeze out more performance and get better control over scalability?
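For reference, here is a rough sketch (untested) of what that alternative could look like: the same handlers on socketio.AsyncServer in ASGI mode, with redis.asyncio from the same redis-py package, served by uvicorn.
# server_asgi.py -- hypothetical sketch of the ASGI/uvicorn alternative
import redis.asyncio as aioredis
import socketio

redis_client = aioredis.Redis(
    host="localhost",
    port=6379,
    socket_timeout=0.1,
)
sio = socketio.AsyncServer(
    async_mode="asgi",
    ping_timeout=60,
    ping_interval=60,
)
app = socketio.ASGIApp(sio, socketio_path="socket.io")

@sio.event
async def connect(sid, *args, **kwargs):
    return True

@sio.on("message")
async def handle_message(sid, key, **kwargs):
    await redis_client.get(key)
    return True

# Run with: uvicorn server_asgi:app --host 0.0.0.0 --port 7777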
Results of debugging and my understanding of what is happening
After debugging, my conclusion is that the number of green threads is simply so high that switching between them leaves some of them without a chance to regain control for a long time. By the time such a thread gets control back, the Redis call has already timed out.
Example:
When the Redis client's get is called in thread A, eventlet switches control to another green thread B, then to thread C, and so on. By the time control comes back to A, it doesn't matter whether Redis has already returned a result or not - it just raises a timeout.
I think this behaviour is generally expected in this setup. Let me know if I have misunderstood any of the eventlet behaviours here.
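To sanity-check that theory outside the app, here is a minimal, self-contained sketch that measures how late a green thread gets rescheduled when the hub is saturated by siblings doing short slices of non-yielding work. The workload and numbers are artificial and machine-dependent, but they illustrate how the wake-up latency can dwarf a 0.1s socket_timeout:
# scheduling_probe.py -- hypothetical, illustrative only
import eventlet
eventlet.monkey_patch()

import time

N_NOISY = 2000        # sibling green threads competing for the hub
HANDLER_CPU = 0.001   # ~1 ms of non-yielding work per sibling

def spin(seconds):
    # Busy-wait without yielding, standing in for parsing/serialization
    # work inside a handler.
    end = time.monotonic() + seconds
    while time.monotonic() < end:
        pass

def noisy():
    # A yield sandwiched between two slices of CPU work: roughly the shape
    # of a handler that does some work, one I/O call, then more work.
    spin(HANDLER_CPU)
    eventlet.sleep(0)
    spin(HANDLER_CPU)

def probe():
    # Ask to be woken after 1 ms and measure how long it actually takes.
    start = time.monotonic()
    eventlet.sleep(0.001)
    return time.monotonic() - start

pool = eventlet.GreenPool(N_NOISY + 1)
probe_gt = pool.spawn(probe)
for _ in range(N_NOISY):
    pool.spawn(noisy)
pool.waitall()
print(f"asked to sleep 1 ms, rescheduled after {probe_gt.wait() * 1000:.0f} ms")
With ~2000 siblings each burning ~1 ms before yielding, the probe should only regain control after most of them have run, i.e. orders of magnitude later than the 1 ms it asked for - which is exactly what the Redis reads above run into.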
How I think it could be solved
Right now my "theory" is to have far fewer green threads per process and to scale the application horizontally instead.
As of today, I run on average 5 replicas of python-socketio, each with a pool of at most 2048 greenlets. I don't have a load of 10k users; at peak times there are around 1k actively connected users, each sending around 5 messages per second.
I think reducing the pool size to 50 per replica, removing the timeout on the Redis client, and adding 20 replicas should handle my load absolutely fine, with quite a gap for future growth.
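A rough back-of-envelope check of that plan (the ~1 ms per Redis round trip and the total of 20 replicas are assumptions; the real numbers should come from measurements):
# sizing.py -- back-of-envelope check, assumed numbers marked below
USERS = 1_000                 # actively connected users at peak (from above)
MSG_PER_USER_PER_S = 5        # messages per user per second (from above)
REDIS_CALLS_PER_HANDLER = 3   # upper end of the 2-3 calls per handler
REDIS_RTT_S = 0.001           # ASSUMED round trip per Redis call; measure it
REPLICAS = 20                 # ASSUMED total replica count in the new setup

events_per_s_total = USERS * MSG_PER_USER_PER_S             # 5000 events/s
events_per_s_per_replica = events_per_s_total / REPLICAS    # 250 events/s
handler_time_s = REDIS_CALLS_PER_HANDLER * REDIS_RTT_S      # ~3 ms per event

# Little's law: average handlers in flight = arrival rate * handler time
concurrent_handlers = events_per_s_per_replica * handler_time_s   # ~0.75
print(f"~{concurrent_handlers:.2f} handlers in flight per replica on average")
By that estimate, well under one handler is in flight per replica on average, so a pool of 50 greenlets mostly buys headroom for bursts while keeping the number of green threads competing for the hub small.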