I'm building a custom scraper for a project. I can currently scrape all of the links on a webpage, storing the href and anchor text in a database. However, I am getting stuck when trying to determine whether the anchor element contains an image element.
Here is my code:
    foreach($rows as $row) {
        $url = $row['url'];
        $dom = new DOMDocument;
        libxml_use_internal_errors(TRUE); //disable libxml errors
        $dom->loadHTML(file_get_contents($url));
        // Write source page, destination URL and anchor text to the database
        foreach($dom->getElementsByTagName('a') as $link) {
            $href = $link->getAttribute('href');
            $anchor = $link->nodeValue;
            $img = $link->getElementsByTagName('img');
            $imgalt = $img->getAttribute('alt');
I then write the data to the database. This works fine without $img and $imgalt, but I really want to identify whether the anchor contains an image and also whether it has an alt attribute. I know the problem is how I am trying to select the image using getElementsByTagName. I have been Googling all day and trying lots of different suggestions, but nothing seems to work. Is this even possible?
I have followed the instructions mentioned here.
There is some progress. I can echo the HTML of images within the anchor elements (if I just echo DOMinnerHTML($link)), but I still can't get the alt attribute. I keep getting "Call to a member function getAttribute() on a non-object".
Here is my code now:
    foreach($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        $anchor = $link->nodeValue;
        $imgdom = DOMinnerHTML($link);
        $imgalt = $imgdom->getAttribute('alt');
        if(isset($imgalt)){
            echo $imgalt;
        }
Well, I can only suppose you want something like this:
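A minimal sketch, assuming the goal is to record for each link whether it wraps an image and, if so, that image's alt text. The key point is that getElementsByTagName() returns a DOMNodeList rather than a single element, so getAttribute() has to be called on an item of that list. The function name extractLinks and the sample markup are hypothetical and only serve as illustration.

```php
<?php
// Hypothetical helper: collects href, anchor text and image info for every <a>.
function extractLinks(string $html): array
{
    $dom = new DOMDocument;
    libxml_use_internal_errors(true); // suppress warnings from malformed HTML
    $dom->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        // getElementsByTagName() returns a DOMNodeList, so take its first item (if any).
        $images = $link->getElementsByTagName('img');
        $img = $images->length > 0 ? $images->item(0) : null;

        $links[] = [
            'href'     => $link->getAttribute('href'),
            'anchor'   => trim($link->nodeValue),
            'hasImage' => $img !== null,
            'alt'      => $img !== null ? $img->getAttribute('alt') : '',
        ];
    }
    return $links;
}

// Sample markup, made up for this example.
$html = '<html><body>'
      . '<a href="/a"><img src="a.png" alt="Logo"></a>'
      . '<a href="/b">Plain text</a>'
      . '</body></html>';

$links = extractLinks($html);
foreach ($links as $l) {
    echo $l['href'], ' ',
        $l['hasImage'] ? 'image, alt="' . $l['alt'] . '"' : 'no image',
        PHP_EOL;
}
```

In the scraper from the question, the same two lines (fetching the node list and taking item(0) after checking length) would replace the direct call to getAttribute() on the result of getElementsByTagName().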
Remember that, as the PHP documentation states, DOMElement::getAttribute() returns the value of the attribute, or an empty string if no attribute with the given name is found. So to check whether an attribute exists, just check whether the return value is an empty string.
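To illustrate the point above, here is a short sketch. It also shows one caveat: getAttribute() returns an empty string both when alt is missing and when it is present but empty, so DOMElement::hasAttribute() may be the better check if that distinction matters. The sample markup and the $report variable are made up for illustration.

```php
<?php
$dom = new DOMDocument;
libxml_use_internal_errors(true); // suppress warnings from malformed HTML
$dom->loadHTML('<html><body>'
    . '<img src="a.png" alt="Logo">'
    . '<img src="b.png" alt="">'
    . '<img src="c.png">'
    . '</body></html>');
libxml_clear_errors();

// Record, per image, the alt value and whether the attribute exists at all.
$report = [];
foreach ($dom->getElementsByTagName('img') as $img) {
    $report[$img->getAttribute('src')] = [
        'alt'    => $img->getAttribute('alt'),
        'hasAlt' => $img->hasAttribute('alt'),
    ];
}

foreach ($report as $src => $info) {
    echo $src, ': ', $info['hasAlt'] ? 'alt="' . $info['alt'] . '"' : 'no alt attribute', PHP_EOL;
}
```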