Manual post_save signal sending makes the application slow, Django


We have a Django application that uses django-river for workflow management. For performance we had to switch to bulk_create: we need to insert several rows into each of a couple of tables. Initially we were using the normal .save() method and the workflow worked as expected (the post_save signals were being sent properly). Once we moved to bulk_create, performance improved from minutes to seconds, but django-river stopped working because bulk_create does not send the default post_save signals. We had to send the signals ourselves, based on the available documentation.

class CustomManager(models.Manager):
    def bulk_create(self, objs, **kwargs):
        created = super().bulk_create(objs, **kwargs)
        for obj in created:
            ...  # code to send the post_save signal for each object
        return created

And

class Task(models.Model):
    objects = CustomManager()
    ...  # other fields omitted

This got the workflow working again, but sending the signals takes so long that it destroys all the performance improvement gained with bulk_create. So is there a way to speed up the signal sending?

More details

from django.db import models
from django.db.models.signals import post_save


def post_save_fn(obj):
    post_save.send(obj.__class__, instance=obj, created=True)


class CustomManager(models.Manager):
    def bulk_create(self, objs, **kwargs):
        data_obj = super().bulk_create(objs, **kwargs)
        for i in data_obj:
            # Also tried sending the signal from a thread:
            # t1 = threading.Thread(target=post_save_fn, args=(i,))
            # t1.start()
            post_save.send(i.__class__, instance=i, created=True)
        return data_obj
        
        
class Test(Base): 
    test_name = models.CharField(max_length=100)
    test_code = models.CharField(max_length=50)
    objects = CustomManager()
    class Meta:
        db_table = "test_db"
There are 2 answers

tim-mccurrach (accepted answer):

What is the problem?

As others have mentioned in the comments, the problem is that the functions that get called via post_save are taking a long time. (Remember that signals are not async! This is a common misconception; see the sketch below.)
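A minimal sketch of that point, with made-up model and receiver names: post_save.send calls every connected receiver inline and only returns once they have all finished, so a slow receiver blocks the caller.

import time

from django.db.models.signals import post_save
from django.dispatch import receiver

from myapp.models import Test  # hypothetical model


@receiver(post_save, sender=Test)
def slow_receiver(sender, instance, created, **kwargs):
    # Stand-in for the per-object database work django-river does.
    time.sleep(1)


def replay_signals(objs):
    # send() runs each connected receiver before returning, so this
    # loop blocks for roughly one second per object.
    for obj in objs:
        post_save.send(Test, instance=obj, created=True)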

I'm not familiar with django-river, but taking a quick look at the functions that will get called post-save (see here and here), we can see that they involve additional calls to the database.

Whilst you save a lot of individual db hits by using bulk_create, you are still calling the database multiple times for each post_save signal.

What can be done about it?

In short: not much! For the vast majority of Django requests, the slow part will be calling the database. This is why we try to minimise the number of calls to the db (using things like bulk_create).

Reading through the first few paragraphs of the django-river docs, the whole idea is to move things that would normally live in code into the database. The big advantage is that you don't need to re-write code and re-deploy so often. The disadvantage is that you inevitably have to refer to the database more, which is going to slow things down. This will be fine for some use-cases, but not all.

There are two things I can think of which might help:

  • Does all of this currently happen as part of the request/response cycle? And if it does, does it need to? If the answers to these two questions are 'yes' and 'no' respectively, then you could move this work to a separate task queue (see the sketch after this list). It will still be slow, but at least it won't slow down your site.
  • Depending on exactly what your workflows are and the nature of the data you are creating, it might be the case that you can do everything that the post_save signals are doing in your own function, and do it more efficiently. But this will definitely depend upon your data, and your app, and will move away from the philosophy of django-river.
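A sketch of the task-queue idea, assuming Celery is available; the module paths and task name are made up, and it relies on bulk_create returning primary keys (which PostgreSQL does):

# tasks.py
from celery import shared_task
from django.apps import apps
from django.db.models.signals import post_save


@shared_task
def send_post_save_signals(model_label, pks):
    # Look the model up lazily to avoid import cycles with models.py.
    model = apps.get_model(model_label)
    # Replay the signals outside the request/response cycle.
    for obj in model.objects.filter(pk__in=pks):
        post_save.send(model, instance=obj, created=True)


# managers.py
from django.db import models

from myapp.tasks import send_post_save_signals  # hypothetical path


class CustomManager(models.Manager):
    def bulk_create(self, objs, **kwargs):
        data_obj = super().bulk_create(objs, **kwargs)
        # Enqueue the signal work; the request returns as soon as the
        # rows are inserted.
        send_post_save_signals.delay(
            self.model._meta.label, [o.pk for o in data_obj]
        )
        return data_obj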
fanni:

Use a separate worker if the "signal" logic allows it to be executed after the bulk save.

You can create an additional queue table and put into it the metadata describing what your future worker has to do.

Create a separate worker (a Django module) with the needed logic, driven by the data in the queue table. You can implement it as a management command; that lets you run the worker in the main flow (you can invoke management commands from regular Django code) or run it from crontab on a schedule. A sketch of both pieces follows.
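This is a minimal sketch of the queue-table idea; the model, app, and command names are made up, and the signal replay mirrors the code from the question:

# models.py -- queue table recording pending signal work
from django.db import models


class SignalQueueEntry(models.Model):
    model_label = models.CharField(max_length=100)  # e.g. "myapp.Test"
    object_pk = models.IntegerField()
    processed = models.BooleanField(default=False)


# management/commands/process_signal_queue.py
from django.apps import apps
from django.core.management.base import BaseCommand
from django.db.models.signals import post_save

from myapp.models import SignalQueueEntry  # hypothetical path


class Command(BaseCommand):
    help = "Send post_save signals for objects queued by bulk_create"

    def handle(self, *args, **options):
        for entry in SignalQueueEntry.objects.filter(processed=False):
            model = apps.get_model(entry.model_label)
            instance = model.objects.get(pk=entry.object_pk)
            post_save.send(model, instance=instance, created=True)
            entry.processed = True
            entry.save(update_fields=["processed"])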

How to run such a worker?

If the work needs to happen as soon as possible after you've created the records, run the worker in a separate thread using the threading module, so your request-response lifecycle finishes right after the new thread is started. For example:
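A sketch of that, reusing the hypothetical process_signal_queue command from above:

import threading

from django.core.management import call_command

# Fire the queue worker in a background thread right after bulk_create,
# so the response is not blocked by the signal work.
worker = threading.Thread(
    target=call_command,
    args=("process_signal_queue",),
    daemon=True,
)
worker.start()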

Otherwise, if the work can be done later, make a schedule and run it from crontab using the management command framework, for example:
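A crontab entry for the hypothetical command, run every five minutes (paths are illustrative):

# m h dom mon dow  command
*/5 * * * * /path/to/venv/bin/python /path/to/project/manage.py process_signal_queue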