PHP: Reduce load of function that gets <title> content from external resource


I created a function that checks whether the <title> tag of an external page contains specific words (among the other words in the title). If the check succeeds, it should echo the whole page <title>.

<?php

function file_get_contents_curl($url)
{
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

    $data = curl_exec($ch);
    curl_close($ch);

    return $data;
}

$html = file_get_contents_curl("http://www.lastfm.it/user/lorenzone92/now");

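$doc = new DOMDocument();
@$doc->loadHTML($html); // @ suppresses warnings caused by malformed real-world HTML
$nodes = $doc->getElementsByTagName('title');

// item(0) is null when the page has no <title>, so check before reading it
$title = $nodes->length > 0 ? $nodes->item(0)->nodeValue : '';

if (strpos($title, 'in ascolto') !== false) {
    echo $title . '<br>';
}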

?>

It works fine. My concern is memory consumption and server load. I cannot cache $html because the page content is live. Any ideas? Do I need to fetch the whole page just to read the <title>? Are there methods other than cURL and file_get_contents that would reduce server load? Or am I just over-concerned? :)

Note: Don't worry about the PHP version (no limits; I'm on my own VPS, which has PHP 5.5.7 installed :D).

There are 3 answers

2
Goikiu On

I don't know if this helps, but this other question (which seems related to yours) has a lot of answers. Here is the link:

Get title of website via link

0
iMx On

I guess you have to load the whole page. You don't know where the title tag sits or how long it is, so you can't just read, say, the first 1000 characters. I don't know how many pages you load at the same time, but you aren't downloading media such as images and CSS files, so the HTML you parse should not be too large.
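One possible way around the unknown position, as a rough sketch: let cURL stream the response through CURLOPT_WRITEFUNCTION and abort the transfer as soon as the closing </title> has arrived. The helper names make_title_collector and fetch_until_title and the 64 KB safety cap are made up for illustration and are not part of the original code.

```php
<?php

// Hypothetical helper: builds a cURL write callback that appends each chunk
// to $buffer and asks cURL to stop once </title> has been seen, or once
// $maxBytes have been collected (assumed cap for pages without a title).
function make_title_collector(&$buffer, $maxBytes = 65536)
{
    return function ($ch, $chunk) use (&$buffer, $maxBytes) {
        $buffer .= $chunk;

        if (stripos($buffer, '</title>') !== false || strlen($buffer) >= $maxBytes) {
            // Returning anything other than the chunk length aborts the transfer.
            return 0;
        }

        return strlen($chunk);
    };
}

// Hypothetical helper: downloads only as much of $url as the collector allows.
function fetch_until_title($url, $maxBytes = 65536)
{
    $buffer = '';

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_WRITEFUNCTION, make_title_collector($buffer, $maxBytes));

    // curl_exec() reports failure when the callback aborts; that is expected here.
    curl_exec($ch);
    curl_close($ch);

    return $buffer;
}
```

The returned buffer is a truncated document, so it is probably safer to pull the title out with string functions or a regex than to hand it to DOMDocument.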

1
CodeZombie On

A simple way to load only part of a page is the Range header:

Range: bytes=0-499

If the server supports the Range header, it returns only the first 500 bytes. Unfortunately, truncating the response breaks the markup of the page, which might cause errors when using DOMDocument. On the other hand, DOMDocument is probably not the best tool when you only need the content of one HTML element. I recommend using a simple regex or basic string functions instead.
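A minimal sketch of both ideas together, assuming the title fits inside the requested range: cURL's CURLOPT_RANGE option sends the Range header, and a small regex extracts the title from the (possibly truncated) markup. The function names fetch_range and extract_title are hypothetical. A server that ignores Range will simply send the full page.

```php
<?php

// Hypothetical helper: requests only the given byte range of $url.
// Servers that do not support Range answer with the complete body instead.
function fetch_range($url, $range = '0-499')
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RANGE, $range); // sent as "Range: bytes=0-499"

    $data = curl_exec($ch);
    curl_close($ch);

    return $data === false ? '' : $data;
}

// Hypothetical helper: returns the text of the first complete <title> element,
// or null if the (possibly truncated) markup does not contain one.
function extract_title($html)
{
    if (preg_match('~<title[^>]*>(.*?)</title>~is', $html, $m)) {
        return trim(html_entity_decode($m[1], ENT_QUOTES, 'UTF-8'));
    }

    return null;
}
```

With these, the check from the question could become: $title = extract_title(fetch_range($url)); followed by the same strpos() test, after first making sure $title is not null. If the title is often missing, the range is probably too small for that site.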