Eventlet/general async I/O task granularity


I am working on a web backend / API provider that grabs realtime data from a 3rd party web API, puts it in a MySQL database and makes it available over an HTTP/JSON API.

I am providing the API with Flask and working with the DB using SQLAlchemy Core.

For the realtime data grabbing part, I have functions that wrap the 3rd party API by sending a request, parsing the returned XML into a Python dict and returning it. We'll call these API wrappers.

I then call these functions within other methods which take the respective data, do any processing if needed (like time zone conversions etc.) and put it in the DB. We'll call these processors.

I've been reading about asynchronous I/O and eventlet specifically and I'm very impressed.

I'm going to incorporate it in my data grabbing code, but I have some questions first:

  1. Is it safe for me to monkey patch everything? Considering I have Flask, SQLAlchemy and a bunch of other libs, are there any downsides to monkey patching (assuming there is no late binding)?

  2. At what granularity should I divide my tasks? I was thinking of creating a pool that periodically spawns processors. Then, once a processor reaches the part where it calls the API wrappers, the API wrappers will start a GreenPile for getting the actual HTTP data using eventlet.green.urllib2. Is this a good approach?

  3. Timeouts - I want to make sure no greenthreads ever hang. Is it a good approach to set the eventlet.Timeout to 10-15 seconds for every greenthread?

FYI, I have about 10 different sets of realtime data, and a processor is spawned every ~5-10 seconds.

Thanks!


There are 2 answers

kimjxie On

It's safe to patch a module that is written in pure Python and uses only the standard library.

  • There are a few pure-Python MySQL adapters:
  • PyMySQL has a SQLAlchemy test suite; you can run it against your own cases.
  • There is a module named pymysql_sa that provides a PyMySQL dialect for SQLAlchemy.
  • Flask is written in pure Python and is 100% WSGI 1.0 compliant. Use eventlet.wsgi to serve it.

Divide the work so that each task is a single fetch, using the green modules wherever you can. Put the jobs into a queue (eventlet provides one as well). Each worker takes a job from the queue and, once the fetch has finished, either saves the result to the DB or sends it to an event.Event object to wake up the code that is waiting for the task to finish. You can also do both.
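The queue-and-workers layout described above might look roughly like the sketch below. To keep it runnable without eventlet installed, it uses the standard library's `queue` and `threading` modules as stand-ins; with eventlet, `eventlet.queue.Queue` and green threads would play the same roles. `FEEDS`, `fetch_feed` and the in-memory `results` dict are hypothetical placeholders for the real data sets, the API wrappers and the DB write.

```python
import queue
import threading

# Hypothetical feed names standing in for the realtime data sets.
FEEDS = ["prices", "weather", "traffic"]


def fetch_feed(name):
    """Placeholder for an API wrapper: one HTTP request plus XML parsing."""
    return {"feed": name, "rows": len(name)}


def worker(jobs, results, lock):
    """Pull jobs until a None sentinel arrives; one job is one fetch."""
    while True:
        name = jobs.get()
        try:
            if name is None:
                return
            data = fetch_feed(name)
            with lock:
                results[name] = data  # stand-in for the DB insert
        finally:
            jobs.task_done()


def run(feeds, n_workers=2):
    """Queue one job per feed, start the workers, wait for all of them."""
    jobs = queue.Queue()
    results = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(jobs, results, lock))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for name in feeds:
        jobs.put(name)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker so each one exits
    jobs.join()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run(FEEDS))
```

Keeping each job down to a single fetch means a slow or failed request only delays that one job, not a whole processor run.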

UPDATED:

The official eventlet documentation strongly recommends applying the patch on the first line of the main module, and it's safe to call monkey_patch multiple times. Read more at http://eventlet.net/doc/patching.html

Some green modules are made to work with eventlet; all of them live in eventlet.green. There is a list on Bitbucket. Make sure you use the green modules in your own code, or apply the patch before importing third-party modules that use the standard library.

However, monkey_patch accepts only a few modules, so for anything else you have to import the green module manually.

def monkey_patch(**on):
    """Globally patches certain system modules to be greenthread-friendly.

    The keyword arguments afford some control over which modules are patched.
    If no keyword arguments are supplied, all possible modules are patched.
    If keywords are set to True, only the specified modules are patched.  E.g.,
    ``monkey_patch(socket=True, select=True)`` patches only the select and 
    socket modules.  Most arguments patch the single module of the same name 
    (os, time, select).  The exceptions are socket, which also patches the ssl 
    module if present; and thread, which patches thread, threading, and Queue.

    It's safe to call monkey_patch multiple times.
    """    
    accepted_args = set(('os', 'select', 'socket', 
                         'thread', 'time', 'psycopg', 'MySQLdb'))
    default_on = on.pop("all",None)
AudioBubble On

I don't think it's wise to mix Flask/SQLAlchemy with an asynchronous (or event-driven) programming model.

However, since you state that you are using an RDBMS (MySQL) as intermediary storage, why don't you just create asynchronous workers that store the results from your third-party web services in the RDBMS, and keep your frontend (Flask/SQLAlchemy) synchronous?

In that case you don't need to monkeypatch Flask or SQLAlchemy.
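As a rough sketch of that split, the worker side and the frontend side only need to agree on the table. The example below uses the standard library's `sqlite3` as a stand-in for MySQL so it runs anywhere; the `readings` table, its columns and the function names are all made up for illustration.

```python
import sqlite3


def init_db(conn):
    """Create the shared table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings ("
        "feed TEXT PRIMARY KEY, value REAL, fetched_at TEXT)"
    )


def store_reading(conn, feed, value, fetched_at):
    """Worker side: keep only the latest value per feed."""
    conn.execute(
        "INSERT OR REPLACE INTO readings (feed, value, fetched_at) "
        "VALUES (?, ?, ?)",
        (feed, value, fetched_at),
    )
    conn.commit()


def latest_reading(conn, feed):
    """Frontend side: a plain blocking read, as a Flask view might do."""
    row = conn.execute(
        "SELECT feed, value, fetched_at FROM readings WHERE feed = ?",
        (feed,),
    ).fetchone()
    if row is None:
        return None
    return {"feed": row[0], "value": row[1], "fetched_at": row[2]}
```

The workers can then be as asynchronous as you like, while the HTTP/JSON side just reads whatever the last successful fetch wrote.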

Regarding the granularity, you may want to use the mapreduce paradigm to perform the web API calls and processing. This pattern may give you some idea of how to logically separate the consecutive steps, and how to control the processes involved.
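To make the idea concrete, here is a minimal in-process sketch: the map step handles one feed at a time (fetch and normalise), and the reduce step merges the per-feed rows into one structure ready to be written to the DB. `RAW` is fake data standing in for the third-party API responses, and all names are hypothetical.

```python
from functools import reduce

# Fake responses standing in for the third-party API (hypothetical data).
RAW = {
    "prices": [("AAA", 1.5), ("BBB", 2.0)],
    "volumes": [("AAA", 100)],
}


def map_feed(feed_name):
    """Map step: fetch one feed and normalise it into (key, field, value) rows."""
    return [(symbol, feed_name, value) for symbol, value in RAW[feed_name]]


def reduce_rows(acc, rows):
    """Reduce step: merge one feed's rows into the combined record per key."""
    for symbol, field, value in rows:
        acc.setdefault(symbol, {})[field] = value
    return acc


def run(feed_names):
    """Map every feed independently, then fold the results together."""
    mapped = map(map_feed, feed_names)
    return reduce(reduce_rows, mapped, {})
```

Because each map call is independent, that is the part you can hand out to workers; the reduce step stays sequential.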

Personally, I wouldn't use an asynchronous framework for doing this though. It may be better to use either multiprocessing, Celery, or a real mapreduce kind of system like Hadoop.

Just a hint: start small, keep it simple and modular, and optimize later if you need better performance. This may also be heavily influenced by how realtime you want the information to be.