First, the setup:
I have a script that executes several tasks after a user hits the "upload" button, which sends the script the data it needs. Now, this part is currently mandatory; we don't have the option at this point to cut out the upload and draw from a live source.
This section is intentionally long-winded to make a point. Skip ahead if you hate that.
Right now the data is parsed from a really funky source using regex, then broken down into an array. The script then checks the DB for any data already in the uploaded data's date range. If no data for that date range exists in the DB, it inserts the data and outputs success to the user (there are also some security checks, data source validation, and basic upload validation). If the data does exist, the script gets the data already in the DB, finds the differences between the two sets, deletes the old data that doesn't match, adds the new data, and then sends an email to each person affected by these changes (one email per person with all relevant changes in said email, which is a whole other step). The email addresses are pulled by means of an LDAP search, as our DB has their work email but the LDAP has their personal email, which ensures they get the email before they come in the next day and get caught unaware. Finally, the data uploader is told "Changes have been made, emails have been sent," which is really all they care about.
Now I may be adding Google Calendar API integration that posts the data (when it's scheduling data) to the user's Google Calendar. I would do it via their work calendar, but I thought I'd get my feet wet with Google's API before dealing with setting up a WebDAV system for Exchange.
</backstory>
Now!
The practical question
At this point, pre-Google integration, the script takes at most a second and a half to run. It's pretty impressive, at least I think so (the server, not my coding). But the Google bit, in tests, is SLOOOOW. We can probably fix that, but it raises the bigger question...
What is the best way to off-load some of the work after the user has gotten confirmation that the DB has been updated? This is the part he's most concerned with and the part most critical. Email notifications and Google Calendar updates are only there for the benefit of those affected by the upload, and if there is a problem with these notifications, he'll hear about it (and then I'll hear about it) regardless of the script telling him first.
So is there a way, for example, to run a cronjob that's triggered by a script's last execution? Can PHP create cronjobs with its exec() ability? Is there some normalized way of handling post-execution work that needs getting done?
Any advice on this is really appreciated. I feel like the script's bloat reflects my stage of development and the need for me to finally learn how to do division of labor in web apps.
But I also worry that this is just not done, as users need to know when all tasks are completed, etc. So this brings up:
The best-practices/more-subjective question
Basically, is the idea that progress bars, real-time offloading, and other ways of keeping the user tethered to the script are (when combined with optimization of the code, of course) the better, more-preferred method than simply saying "We're done with your part; if you need us, we'll be notifying users," etc.?
Are there any BIG things to avoid (other than obviously not giving the user any feedback at all)?
Thanks for reading. The coding part is crucial, so don't feel obliged to cover the second part or forget to cover the coding part!
There are a number of ways to go about this. You could use exec(), as mentioned above, but you could potentially run into a DoS situation if there are too many submit clicks, since each click spawns another process. The pcntl extension is arguably better at managing processes like this. Check out this post to see a discussion (there are 3 parts).
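Here is a minimal sketch of the exec() route, assuming a Unix-like shell. The function names `buildBackgroundCommand` and `launchWorker`, the `/usr/bin/php` binary path, and the worker script path are all hypothetical; adjust them to your server. The trailing `&` plus the output redirect is what lets exec() return immediately instead of waiting for the worker.

```php
<?php
// Build a shell command that runs a PHP worker script in the background.
// Every piece is passed through escapeshellarg() so user-supplied values
// cannot inject shell syntax.
function buildBackgroundCommand(string $phpBinary, string $script, array $args = []): string
{
    $parts = [escapeshellarg($phpBinary), escapeshellarg($script)];
    foreach ($args as $arg) {
        $parts[] = escapeshellarg((string) $arg);
    }
    // Discard output and detach with "&" so the caller does not block.
    return implode(' ', $parts) . ' > /dev/null 2>&1 &';
}

// Fire-and-forget launch. Assumption: the PHP CLI lives at /usr/bin/php.
function launchWorker(string $script, array $args = [], string $phpBinary = '/usr/bin/php'): void
{
    exec(buildBackgroundCommand($phpBinary, $script, $args));
}

// Usage (hypothetical script path and argument):
// launchWorker('/var/www/app/notify_worker.php', [$uploadId]);
```

Note that this does nothing about the DoS concern: every call still starts a new process, so you would want some kind of limit (a lock file, a counter, or a queue) in front of it.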
You could use JavaScript to send a second, AJAX POST that runs the appropriate worker script afterwards. By using ignore_user_abort() and sending a Content-Length header, the browser can disconnect early, but your Apache process will continue to run and process your data. The upside is no fork-bomb potential; the downside is that it will tie up more Apache processes.
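A rough sketch of the ignore_user_abort() and Content-Length trick follows. `respondThenContinue` is a hypothetical helper name, and whether the browser really disconnects early depends on your server setup (output compression or proxy buffering can defeat it), so treat this as a starting point rather than a guarantee.

```php
<?php
// Send a complete response, then keep working after the client has gone.
function respondThenContinue(string $body, callable $slowWork): void
{
    ignore_user_abort(true); // keep running if the browser disconnects
    set_time_limit(0);       // assumption: the slow work may exceed max_execution_time

    if (!headers_sent()) {
        header('Connection: close');
        // Telling the browser the exact length lets it stop waiting
        // as soon as that many bytes have arrived.
        header('Content-Length: ' . strlen($body));
    }
    echo $body;

    // Flush PHP's output buffers, then ask the web server to flush too.
    while (ob_get_level() > 0) {
        ob_end_flush();
    }
    flush();

    // The client has its answer; do the slow part now.
    $slowWork();
}

// Usage (the callback body is a placeholder for your own code):
// respondThenContinue('Changes have been made.', function () {
//     // send emails, push calendar updates, ...
// });
```

If you happen to run PHP-FPM rather than mod_php, fastcgi_finish_request() serves the same purpose more directly.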
Yet another option is to use a cron job in the background that looks at a process-queue table for things to do 'later': you stick items into this table on the front end, and remove them on the back end while processing (see Zend_Queue).
Yet another is to use a more distributed job framework such as gearmand, which can process items on other machines.
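The queue-table idea can be sketched with plain PDO as below. The table name `job_queue`, its columns, and the function names are all assumptions made for illustration, and the CREATE TABLE statement uses SQLite syntax, so adjust it for your own DB. This is not how Zend_Queue does it internally, just the bare pattern.

```php
<?php
// Create the queue table (SQLite syntax; an assumption for this sketch).
function createQueueTable(PDO $db): void
{
    $db->exec('CREATE TABLE IF NOT EXISTS job_queue (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        task TEXT NOT NULL,
        payload TEXT NOT NULL
    )');
}

// Front end: record work to be done later and return right away.
function enqueueJob(PDO $db, string $task, array $payload): void
{
    $stmt = $db->prepare('INSERT INTO job_queue (task, payload) VALUES (?, ?)');
    $stmt->execute([$task, json_encode($payload)]);
}

// Back end (run from cron): handle each queued job in insertion order,
// deleting it once the handler returns. Returns the number processed.
function processQueue(PDO $db, callable $handler): int
{
    $rows = $db->query('SELECT id, task, payload FROM job_queue ORDER BY id')
               ->fetchAll(PDO::FETCH_ASSOC);
    $delete = $db->prepare('DELETE FROM job_queue WHERE id = ?');
    $done = 0;
    foreach ($rows as $row) {
        $handler($row['task'], json_decode($row['payload'], true));
        $delete->execute([$row['id']]);
        $done++;
    }
    return $done;
}
```

The upload script would call enqueueJob() for each email or calendar update, and a crontab entry along the lines of `* * * * * php /path/to/worker.php` (hypothetical path) would run a small script that calls processQueue(). If two workers could ever overlap, you would also need some locking so a job is not handled twice.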
It all depends on your overall capabilities and requirements.