The size of gevent/eventlet pool with python-socketio in production


Context

I have a simple production setup of python-socketio served on top of eventlet, following the documentation.

The application interacts mainly with Redis via the official redis-py package, and the eventlet pool size is currently 2048 (I would bet it is just a legacy number that was picked arbitrarily a long time ago). Each handler performs a few calls to Redis (2-3 on average). The stack is Python 3.10 with redis==4.5.5, python-socketio==5.10.0 and eventlet==0.33.0.

A simplified version of the server app looks like this ⬇️

# server.py
import eventlet

eventlet.monkey_patch()

import socketio
import redis

eventlet_pool = eventlet.GreenPool(2048)

redis_client = redis.Redis(
    host="localhost",
    port=6379,
    socket_timeout=0.1
)

sio = socketio.Server(
    ping_timeout=60,
    ping_interval=60,
    debug=False,
    logging=False
)
app = socketio.WSGIApp(sio, socketio_path="socket.io")


@sio.event
def connect(sid, *args, **kwargs):
    print("Connected")
    return True


@sio.on("message")
def handle_message(sid, key, **kwargs):
    redis_client.get(key)
    return True


if __name__ == "__main__":
    eventlet.wsgi.server(
        eventlet.listen(("", 7777)),
        app,
        custom_pool=eventlet_pool,
        debug=False,
    )

Under significant load from a single user, this application works fine. It can handle thousands of "message" events; the only effect is back pressure on latency, which is acceptable.

However, as soon as more users join, the bottleneck shows up immediately: the server starts raising errors saying that the request to Redis timed out.

redis.exceptions.TimeoutError: Timeout reading from socket
Traceback (most recent call last):
  File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/redis/connection.py", line 210, in _read_from_socket
    data = self._sock.recv(socket_read_size)
  File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/eventlet/greenio/base.py", line 370, in recv
    return self._recv_loop(self.fd.recv, b'', bufsize, flags)
  File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/eventlet/greenio/base.py", line 364, in _recv_loop
    self._read_trampoline()
  File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/eventlet/greenio/base.py", line 332, in _read_trampoline
    self._trampoline(
  File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/eventlet/greenio/base.py", line 211, in _trampoline
    return trampoline(fd, read=read, write=write, timeout=timeout,
  File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/eventlet/hubs/__init__.py", line 159, in trampoline
    return hub.switch()
  File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/eventlet/hubs/hub.py", line 313, in switch
    return self.greenlet.switch()
TimeoutError: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/eventlet/hubs/hub.py", line 476, in fire_timers
    timer()
  File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/eventlet/hubs/timer.py", line 59, in __call__
    cb(*args, **kw)
  File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/eventlet/hubs/__init__.py", line 151, in _timeout
    current.throw(exc)
  File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/eventlet/greenthread.py", line 221, in main
    result = function(*args, **kwargs)
  File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/socketio/server.py", line 584, in _handle_event_internal
    r = server._trigger_event(data[0], namespace, sid, *data[1:])
  File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/socketio/server.py", line 609, in _trigger_event
    return self.handlers[namespace][event](*args)
  File "projects/isolated-socketio--python-3-10-4/tst.py", line 33, in handle_message
    redis_client.get(key)
  File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/redis/commands/core.py", line 1801, in get
    return self.execute_command("GET", name)
  File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/redis/client.py", line 1269, in execute_command
    return conn.retry.call_with_retry(
  File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/redis/retry.py", line 49, in call_with_retry
    fail(error)
  File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/redis/client.py", line 1273, in <lambda>
    lambda error: self._disconnect_raise(conn, error),
  File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/redis/client.py", line 1259, in _disconnect_raise
    raise error
  File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/redis/retry.py", line 46, in call_with_retry
    return do()
  File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/redis/client.py", line 1270, in <lambda>
    lambda: self._send_command_parse_response(
  File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/redis/client.py", line 1246, in _send_command_parse_response
    return self.parse_response(conn, command_name, **options)
  File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/redis/client.py", line 1286, in parse_response
    response = connection.read_response()
  File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/redis/connection.py", line 874, in read_response
    response = self._parser.read_response(disable_decoding=disable_decoding)
  File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/redis/connection.py", line 347, in read_response
    result = self._read_response(disable_decoding=disable_decoding)
  File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/redis/connection.py", line 357, in _read_response
    raw = self._buffer.readline()
  File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/redis/connection.py", line 260, in readline
    self._read_from_socket()
  File ".pyenv/versions/3.10.4/envs/test/lib/python3.10/site-packages/redis/connection.py", line 223, in _read_from_socket
    raise TimeoutError("Timeout reading from socket")

The production Redis server is far from its limit (<10% CPU and RAM) and went through thorough stress testing before I dug deeper into python-socketio itself.

To reproduce the issue, I created the following client ⬇️ that opens a single connection and sends a message every second.

# client.py
import random
import string
import time

import socketio

if __name__ == "__main__":
    with socketio.SimpleClient() as sio:
        sio.connect(url="ws://localhost:7777/", transports=['websocket'])
        print("Client connected.")
        for i in range(10):
            sio.emit("message", random.choice(string.ascii_letters))
            time.sleep(1)
        print("Finished.")

Questions

  1. What are the recommendations for doing the math and figuring out the right number of green threads per python-socketio process?

  2. Any other recommendations on how to run this at scale? Could a different underlying server implementation (e.g. ASGI with uvicorn) let me squeeze out more performance and gain better control over scalability? (A rough sketch of that variant follows below.)
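
Regarding question 2, here is a rough, untested sketch of what the ASGI variant might look like (it assumes uvicorn as the server and the redis.asyncio client; none of it is taken from my current setup):

# async_server.py - rough, untested sketch of an ASGI variant
import socketio
import redis.asyncio as aioredis

redis_client = aioredis.Redis(host="localhost", port=6379, socket_timeout=0.1)

sio = socketio.AsyncServer(
    async_mode="asgi",
    ping_timeout=60,
    ping_interval=60,
)
app = socketio.ASGIApp(sio, socketio_path="socket.io")


@sio.event
async def connect(sid, environ):
    return True


@sio.on("message")
async def handle_message(sid, key):
    # awaiting here hands control back to the asyncio event loop,
    # so other handlers can run while the Redis round trip is in flight
    await redis_client.get(key)
    return True

# run with: uvicorn async_server:app --host 0.0.0.0 --port 7777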

Results of debugging and my understanding of what is happening

After debugging, my conclusion is that the number of green threads is simply so high that, while the hub switches between them, some of them are left without a chance to get control back for a long time. By the time such a thread regains control, its timeout has already expired.

Example:

When the Redis client's get is called in green thread A, eventlet switches control to green thread B, then thread C, and so on. By the time control comes back to A, it does not matter whether Redis has returned a result or not: more than socket_timeout has elapsed, so the client raises a timeout. I believe this behaviour is generally expected in this case; let me know if I have misunderstood any of eventlet's behaviour here.
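
To check this theory, here is a small standalone sketch (not part of the app; the 5 ms of non-yielding work per greenlet and the 100-greenlet count are arbitrary stand-ins for handler CPU time). One greenlet asks to be woken up after 10 ms while a hundred others hog the hub, so its wake-up should arrive hundreds of milliseconds late - well past a 100 ms socket_timeout:

# delay_probe.py - standalone sketch, not part of the application
import time

import eventlet

eventlet.monkey_patch()


def hog():
    # Simulate a handler doing ~5 ms of work without ever yielding to the hub.
    end = time.monotonic() + 0.005
    while time.monotonic() < end:
        pass


def probe():
    requested = 0.01                 # ask to be resumed after 10 ms
    start = time.monotonic()
    eventlet.sleep(requested)
    delay = time.monotonic() - start - requested
    # If this delay exceeds socket_timeout=0.1, redis-py would raise TimeoutError
    # even though Redis itself answered almost immediately.
    print(f"extra scheduling delay: {delay * 1000:.0f} ms")


pool = eventlet.GreenPool(2048)
pool.spawn(probe)
eventlet.sleep(0)                    # let the probe start and enter its sleep first
for _ in range(100):                 # queue up 100 * ~5 ms of non-yielding work
    pool.spawn(hog)
pool.waitall()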

How I think it could be solved

My current "theory" is to run far fewer green threads per process and instead scale the application horizontally.

As of today, I run on average 5 replicas of python-socketio, each with a maximum of 2048 greenlets. I am nowhere near 10k users: at peak times there are around 1k actively connected users, each sending around 5 messages per second.

I think reducing the pool size to 50 per replica, removing the timeout on the Redis client and scaling out to 20 replicas should handle my load comfortably, with plenty of headroom for future growth.
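
As a back-of-the-envelope check of that plan (the ~1 ms Redis round trip below is my assumption; the other numbers come from the load described above):

# sizing.py - rough capacity check; redis_latency is an assumed figure
users = 1000                # peak concurrently connected users
events_per_user = 5         # messages per second per user
redis_calls_per_event = 3   # upper bound of Redis calls per handler
redis_latency = 0.001       # assumed ~1 ms per Redis round trip
replicas = 20

events_per_replica = users * events_per_user / replicas  # = 250 events/s
handler_time = redis_calls_per_event * redis_latency     # ~3 ms per event

# Little's law: average in-flight handlers = arrival rate * service time
concurrency = events_per_replica * handler_time           # ~0.75
print(f"average concurrent handlers per replica: {concurrency:.2f}")

With well under one handler in flight on average, a pool of 50 greenlets per replica looks generous rather than tight, which supports the plan above.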
