Process curl_multi_exec results while in progress?


I am building a simple web spider using PHP's built-in cURL multi functions. It works great. Here is the basic implementation:

<?php
$remainingTargets = ...;
$concurrency = 30;

$multiHandle = curl_multi_init();
$targets = [];
while (count($targets) < $concurrency && count($remainingTargets) > 0) {
  $target = array_shift($remainingTargets);
  $alreadyChecked = ...;
  if ($alreadyChecked !== false) {
    continue;
  }
  $curl = curl_init($target);
  curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36');
  curl_setopt($curl, CURLOPT_FAILONERROR, true);
  curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 4);
  curl_setopt($curl, CURLOPT_TIMEOUT, 5);
  curl_multi_add_handle($multiHandle, $curl);
  $targets[$target] = $curl;
}

// Run loop for downloading
$running = null;
do {
  curl_multi_exec($multiHandle, $running);
} while ($running);

// Harvest results
foreach ($targets as $target => $curl) {
  $html = curl_multi_getcontent($curl);
  curl_multi_remove_handle($multiHandle, $curl);
  // Process this page
}
curl_multi_close($multiHandle);

// If done show results, or continue processing queue...

But what I want to know is: is it possible to do the harvesting inside the "run loop" itself? I imagine that would free up resources sooner and perform better. It seems like I want a C-style select(), but curl_multi_select() does not return a specific resource.

1 Answer

Answered by chris:

I know this is old, but I'm answering because I had the same question:

The solution seems to be to use curl_multi_info_read(), which returns an array describing a completed transfer (or false once there are none left to report).

$mh = curl_multi_init();

// Add CurlHandles to CurlMultiHandle
foreach ([
    'https://example.com',
    'https://example.net',
    'https://example.org',
] as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
}

do {
    // Run sub-connections
    curl_multi_exec($mh, $running);

    // Wait for activity on CurlMultiHandle
    curl_multi_select($mh);

    // Consume any completed transfers
    while ($curlMultiInfoRead = curl_multi_info_read($mh)) {
        // Check CurlHandle has not had an error
        if ($curlMultiInfoRead['result'] !== CURLE_OK) {
            throw new \RuntimeException(curl_error($curlMultiInfoRead['handle']));
        }

        // Get information on the request
        $curlGetInfo = curl_getinfo($curlMultiInfoRead['handle']);
        echo $curlGetInfo['http_code'].'<br>';
        echo $curlGetInfo['url'].'<br>';

        // Get contents of the request etc.
        $curlMultiGetContent = curl_multi_getcontent($curlMultiInfoRead['handle']);
        echo htmlentities(substr($curlMultiGetContent, 0, 100)).'<br>';

        // Remove this CurlHandle from the CurlMultiHandle, then close it
        curl_multi_remove_handle($mh, $curlMultiInfoRead['handle']);
        curl_close($curlMultiInfoRead['handle']);
    }
} while ($running > 0);

This can be particularly useful when combined with CURLMOPT_MAX_TOTAL_CONNECTIONS, which limits the number of simultaneously active connections, and with a Generator that yields each cURL response as it completes.
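
Below is a minimal sketch of combining those two ideas; it is not part of the original answer. It uses a hypothetical fetchAll() generator that caps concurrency with CURLMOPT_MAX_TOTAL_CONNECTIONS (available with reasonably recent libcurl) and yields each response as soon as it finishes. The function name, URLs, and connection limit are illustrative.

// Sketch: generator that yields [url => body] pairs in completion order,
// while CURLMOPT_MAX_TOTAL_CONNECTIONS caps how many transfers run at once.
function fetchAll(array $urls, int $maxConnections = 10): \Generator
{
    $mh = curl_multi_init();

    // Handles beyond this limit are queued by libcurl instead of all opening at once
    curl_multi_setopt($mh, CURLMOPT_MAX_TOTAL_CONNECTIONS, $maxConnections);

    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_PRIVATE, $url); // remember which URL this handle belongs to
        curl_multi_add_handle($mh, $ch);
    }

    do {
        // Run sub-connections, then wait for activity
        curl_multi_exec($mh, $running);
        curl_multi_select($mh);

        // Yield each finished transfer immediately; the caller processes it
        // while the remaining transfers keep downloading in the background
        while ($info = curl_multi_info_read($mh)) {
            $ch  = $info['handle'];
            $url = curl_getinfo($ch, CURLINFO_PRIVATE);

            yield $url => ($info['result'] === CURLE_OK)
                ? curl_multi_getcontent($ch)
                : null;

            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
        }
    } while ($running > 0);

    curl_multi_close($mh);
}

// Usage: results arrive in completion order, not submission order
foreach (fetchAll(['https://example.com', 'https://example.org'], 5) as $url => $body) {
    echo $url, ': ', $body === null ? 'failed' : strlen($body) . ' bytes', PHP_EOL;
}

Storing the URL on each handle with CURLOPT_PRIVATE avoids keeping a separate handle-to-URL map, which is handy because curl_multi_info_read() only hands back the handle itself.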