Scraping Links on Webpage Need to Determine if they contain Img elements

Question

Asked by Daniel Lee on 15 June 2015 at 15:54

I'm building a custom scraper for a project. I can currently scrape all of the links on a webpage, storing the href and anchor text in a database. However, I am stuck trying to determine whether an anchor element contains an image element.

Here is my code:

foreach($rows as $row) {
    $url = $row['url'];
    $dom = new DOMDocument;
    libxml_use_internal_errors(TRUE); //disable libxml errors
    $dom->loadHTML(file_get_contents($url));

    // Write source page, destination URL and anchor text to the database
    foreach($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        $anchor = $link->nodeValue;
        $img = $link->getElementsByTagName('img');
        $imgalt = $img->getAttribute('alt');

I then write the data to the database. This works fine without $img and $imgalt, but I really want to identify whether the anchor contains an image and whether that image has an alt attribute. I know the problem is how I am trying to select the image using getElementsByTagName. I have been Googling all day and trying lots of different suggestions, but nothing seems to work. Is this even possible?

I have followed the instructions mentioned here.

There is some progress. I can echo the HTML of images within the anchor elements (if I just echo DOMinnerHTML($link)), but I still can't get the alt attribute. I keep getting "Call to a member function getAttribute() on a non-object".

Here is my code now:

foreach($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        $anchor = $link->nodeValue;
        $imgdom = DOMinnerHTML($link);
        $imgalt = $imgdom->getAttribute('alt');
        if(isset($imgalt)){
            echo $imgalt;
        }

There is 1 answer

**felipsmartins** · Accepted Answer · 2015-06-15T17:23:48+00:00

Well, I can only suppose you want something like this:

<?php

$html_fragment = <<<HTML
<html>
<head>
    <title></title>
</head>
<body>
<div id="container">
    <a href="#a">there is no image here</a>
    <a href="#b"><img src="path/to/image-b" alt="b: alt content"></a>
    <a href="#c"><img src="path-to-image-c"></a>
    <a href="#d"><img src="path-to-image-d" alt="d: alt content"></a>
</div>
</body>
</html>
HTML;


$dom = new DOMDocument();
@$dom->loadHTML($html_fragment);
$links = $dom->getElementsByTagName('a');

foreach ($links as $link) {
    # link contains image child?
    $imgs    = $link->getElementsByTagName('img');
    $has_img = $imgs->length > 0;

    if ($has_img) {
        # img element has alt attribute?
        $has_alt = (bool) $imgs->item(0)->getAttribute("alt");
        if ($has_alt) {
            // do something...
        }
    } else {
        // do something...
    }
}

Remember that, as the PHP documentation says, DOMElement::getAttribute() returns the value of the attribute, or an empty string if no attribute with the given name is found. So to check whether a node has an attribute, just check whether the return value is an empty string.
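One caveat with the empty-string check: it cannot tell a missing alt attribute apart from one that is present but empty (alt=""). If that distinction matters, DOMElement::hasAttribute() can be used instead. The following is only a minimal sketch; describeLink() is a hypothetical helper name, and the inline HTML and file names are made up for illustration.

```php
<?php
// Hypothetical helper (not part of the original answer): reports whether a
// link wraps an <img> and whether that image declares an alt attribute.
function describeLink(DOMElement $link): array
{
    $imgs = $link->getElementsByTagName('img');

    // getElementsByTagName() returns a DOMNodeList, so check its length
    // before trying to read attributes from any element in it.
    if ($imgs->length === 0) {
        return ['has_img' => false, 'has_alt' => false, 'alt' => null];
    }

    $img    = $imgs->item(0);            // first image inside the link
    $hasAlt = $img->hasAttribute('alt'); // true even when alt=""

    return [
        'has_img' => true,
        'has_alt' => $hasAlt,
        'alt'     => $hasAlt ? $img->getAttribute('alt') : null,
    ];
}

// Made-up sample markup: no image, image with empty alt, image without alt.
$html = '<a href="#a">text</a>'
      . '<a href="#b"><img src="b.png" alt=""></a>'
      . '<a href="#c"><img src="c.png"></a>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$results = [];
foreach ($dom->getElementsByTagName('a') as $link) {
    $results[$link->getAttribute('href')] = describeLink($link);
}
```

Here $results['#b'] reports an image whose alt attribute exists but is empty, while $results['#c'] reports an image with no alt attribute at all, which the (bool) cast on getAttribute() would treat as the same case.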
