I've run into a weird problem and hope someone can help me out. I've written a multi-cURL spider in PHP that scrapes keywords off websites, and I'm seeing a strange performance issue.
When I run the spider to scrape the first few levels of a single site, it takes about 2 minutes to complete, which is fine for my purposes. What's strange is that when I run one spider after another in the same script, the runtime balloons. For example, running it sequentially on 7 sites, I'd expect roughly 14 minutes (2 minutes per site), but instead it takes upwards of 45 minutes. I've tested each of the sites separately and they do average 2 minutes or less apiece, yet run in sequence the whole thing takes almost an hour.
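In case it matters, the outer loop is basically just this (simplified; `crawlSite()` is a stand-in for my actual spider, and the timing is only there to show how I'm getting the 2-minute-per-site figure):

```php
<?php
// Simplified driver: crawlSite() stands in for the real spider entry point.
$sites = array(
    'http://example-one.com',
    'http://example-two.com',
    // ... 5 more
);

foreach ($sites as $site) {
    $start = microtime(true);

    $keywords = crawlSite($site);   // runs the multi-cURL spider on one site
    unset($keywords);               // drop the result before the next site

    printf("%s finished in %.1f minutes\n", $site, (microtime(true) - $start) / 60);
}
```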
I thought it might be a memory issue, so I implemented APC caching to store the keyword data while the script is running. The thing is, when I look at Task Manager (I'm running XAMPP on Windows 7), the Apache process doesn't seem to go much higher than 46K / 23% CPU, and everything else on my computer runs just fine.
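The caching looks roughly like this (the key naming and TTL are just illustrative):

```php
<?php
// Store scraped keywords in APC under a per-site key instead of keeping
// them in one big array for the life of the script.
function cacheKeywords($siteUrl, array $keywords) {
    apc_store('keywords_' . md5($siteUrl), $keywords, 3600); // 1-hour TTL
}

function fetchKeywords($siteUrl) {
    $data = apc_fetch('keywords_' . md5($siteUrl), $success);
    return $success ? $data : array();
}
```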
I've taken a close look and made sure all the appropriate handles are closed as soon as possible and large variables are unset or cached, yet I'm still scratching my head as to why it takes 3 times longer than expected to run one site after another. I'm about to try forking the spiders into separate processes using pcntl (I'll try a thumb-drive install of Linux, since pcntl isn't available on Windows), but I was wondering if anyone has ideas about what might be giving my application this performance hit. Thanks!
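By "closed as soon as possible" I mean something like this per batch of requests (simplified from my actual code):

```php
<?php
// Run one batch on a multi handle, then detach and close every easy handle
// as soon as its content has been read; the multi handle is closed per site.
function runBatch($mh, array $handles) {
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh);            // avoid busy-waiting
        }
    } while ($running && $status == CURLM_OK);

    $pages = array();
    foreach ($handles as $url => $ch) {
        $pages[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);    // detach before closing
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $pages;
}
```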