What is the most efficient way to test URLs for 404 errors?


I'd like to learn the leanest way to test URLs for server response codes such as 404. I am currently using something very similar to the example found in the comments of the PHP manual page for get_headers:

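<?php
// Returns the HTTP status code as an int, or 0 if the request failed.
function get_http_response_code($theURL) {
    $headers = get_headers($theURL);
    if ($headers === false) {
        return 0;
    }
    // The status line looks like "HTTP/1.1 404 Not Found"; the code starts at offset 9.
    return (int) substr($headers[0], 9, 3);
}

$code = get_http_response_code('filename.jpg');
if ($code > 0 && $code < 400) {
    // File exists, huzzah!
}
?>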

But scaling this to 50 or more URLs in a foreach loop typically causes my server to give up and return a 500 response (excuse the vagueness about the exact error). Is there a less resource-heavy method that can check URL response codes en masse?

There is 1 answer

Rangad On

You could execute several curl requests at the same time using curl_multi_* functions.

However, this still blocks execution until the slowest request returns (plus some additional time for parsing the responses).
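One way to bound that wait is to give every request its own time limits, so a single slow host cannot stall the whole batch indefinitely. The following is only a sketch: the helper name `buildCheckOptions` and the 5 s / 10 s / 5-redirect values are assumptions chosen for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper: builds cURL options with per-request time limits.
// The default limits below are assumptions; tune them for the hosts you check.
function buildCheckOptions(string $url, bool $useHead = true, int $connectTimeout = 5, int $totalTimeout = 10): array {
    return [
        CURLOPT_URL => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_NOBODY => $useHead,                 // HEAD request when true
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS => 5,                     // stop redirect loops
        CURLOPT_CONNECTTIMEOUT => $connectTimeout,  // seconds to establish the connection
        CURLOPT_TIMEOUT => $totalTimeout,           // seconds for the whole request
    ];
}
```

An array like this can be passed to curl_setopt_array() for each handle before it is added to the multi handle.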

Tasks like this should be executed in the background using cron jobs or similar alternatives.
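As a rough illustration of the background approach, a small command-line script could read the URLs from a file, run the check, and store the results for the web page to read later. Everything named here (`runUrlCheckJob`, the file layout, the crontab path and schedule) is hypothetical, and the checker is passed in as a callable so any status-checking function can be plugged in.

```php
<?php
// Hypothetical CLI job: reads one URL per line, runs a checker, writes JSON.
// $checker is any callable that maps a list of URLs to an array of results.
function runUrlCheckJob(string $inFile, string $outFile, callable $checker): int {
    $lines = file($inFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        throw new RuntimeException("Cannot read $inFile");
    }
    // Trim whitespace, drop blank entries, and remove duplicates.
    $urls = array_filter(array_map('trim', $lines), fn(string $u) => $u !== '');
    $urls = array_values(array_unique($urls));

    $results = $checker($urls);
    file_put_contents($outFile, json_encode($results, JSON_PRETTY_PRINT), LOCK_EX);
    return count($results);
}

// Example crontab entry (placeholder path), running every 15 minutes:
// */15 * * * * php /path/to/check_urls.php
```

The page that previously ran the checks inline would then only need to read the JSON file, which keeps slow remote hosts out of the request cycle.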

Additionally, there are multiple libraries on GitHub and similar sites that wrap the curl extension to provide a nicer API.

The concept boils down to this (the CPU-usage "fix" is credited to Ren in the PHP docs comments):

function getStatusCodes(array $urls, $useHead = true) {
    $handles = [];
    foreach($urls as $url) {
        $options = [
            CURLOPT_URL => $url,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_NOBODY => $useHead,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_HEADER => 0
        ];
        $handles[$url] = curl_init();
        curl_setopt_array($handles[$url], $options);
    }

    $mh = curl_multi_init();

    foreach($handles as $handle) {
        curl_multi_add_handle($mh, $handle);
    }

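    $running = null;
    do {
        $status = curl_multi_exec($mh, $running);
        // Wait for socket activity instead of spinning; if select fails (-1), back off briefly.
        if ($running > 0 && curl_multi_select($mh, 1.0) === -1) {
            usleep(100000);
        }
    } while ($running > 0 && $status === CURLM_OK);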

    $return = [];
    foreach($handles as $handle) {
        // Results are keyed by the effective URL, i.e. the final URL after any redirects.
        $eUrl = curl_getinfo($handle, CURLINFO_EFFECTIVE_URL);
        $return[$eUrl] = [
            'url' => $eUrl,
            'status' => curl_getinfo($handle, CURLINFO_HTTP_CODE)
        ];
        curl_multi_remove_handle($mh, $handle);
        curl_close($handle);
    }
    curl_multi_close($mh);

    return $return;
}

var_dump(getStatusCodes(['http://google.de', 'http://stackoverflow.com', 'http://google.de/noone/here']));
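For lists well beyond 50 URLs, it may help to cap how many requests run at once by feeding the checker fixed-size batches, and then to pull out only the failing URLs. This is a sketch under assumptions: `checkInBatches`, `brokenUrls`, and the batch size of 20 are illustrative names and values, and the checker is assumed to return entries shaped like those of getStatusCodes above (`'url'` and `'status'` keys).

```php
<?php
// Hypothetical wrapper: checks URLs in fixed-size batches to cap concurrency.
// $checker takes a list of URLs and returns [key => ['url' => ..., 'status' => int]].
function checkInBatches(array $urls, callable $checker, int $batchSize = 20): array {
    $results = [];
    $unique = array_values(array_unique($urls));
    foreach (array_chunk($unique, max(1, $batchSize)) as $batch) {
        foreach ($checker($batch) as $key => $info) {
            $results[$key] = $info; // a later entry replaces an earlier one with the same key
        }
    }
    return $results;
}

// Returns the keys of results whose status is 0 (no HTTP response) or >= 400.
function brokenUrls(array $results): array {
    $broken = array_filter(
        $results,
        fn(array $r) => $r['status'] === 0 || $r['status'] >= 400
    );
    return array_keys($broken);
}
```

With the function above, a call could look like `brokenUrls(checkInBatches($urls, 'getStatusCodes'))`. A status of 0 is treated as broken here because curl reports 0 when no HTTP response was received at all.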