Extracting parts of HTML code


Let's say I have the following HTML code:

<p>Test text</p>
<p><img src="test.jpg" /></p>
<div id="test"><p>test</p></div>
<div class="block">
    <img src="test2.jpg">
</div>
<p>test</p>

Parameters:

  • There will exist a div block with class "block"
  • There can be any amount of HTML code above or below the div block with class "block"
  • There could even be two div blocks with class "block"

I am using PHP's DOM extension with XPath to inspect this HTML code. I want to be able to return two things:

  1. The div block with class "block"
  2. All the rest of the code without the div element with class "block" in it

Something like:

Block Code:

<div class="block">
    <img src="test2.jpg">
</div>

Original without block code:

<p>Test text</p>
<p><img src="test.jpg" /></p>
<div id="test"><p>test</p></div>
<p>test</p>

Answer by davidkonrad:

Using DOMDocument, you can do it like this:

$content = '<p>Test text</p>'.
        '<p><img src="test.jpg" /></p>'.
        '<div id="test"><p>test</p></div>'.
        '<div class="block">'.
        '<img src="test2.jpg">'.
        '</div>'.
        '<p>test</p>';

$blocks = array();
$doc = new DOMDocument();
$doc->loadHTML($content);

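// Copy the live node list into an array first: removing nodes while
// iterating the live list can cause later elements to be skipped.
$elements = iterator_to_array($doc->getElementsByTagName("*"));
foreach ($elements as $element) {
    // getAttribute() returns '' when the attribute is missing,
    // so no separate hasAttributes() check is needed
    if ($element->getAttribute('class') == 'block') {
        // add block HTML to the blocks array
        $blocks[] = $doc->saveHTML($element);
        // remove the block element from the document
        $element->parentNode->removeChild($element);
    }
}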

echo '<pre>';
echo $blocks[0]; //iterate or print_r if multiple blocks
echo $doc->saveHTML();
echo '</pre>';

This outputs the "block code":

<div class="block"><img src="test2.jpg"></div>

and the "original without block code" (shown here without the doctype, html and body wrappers that DOMDocument adds):

<p>Test text</p><p><img src="test.jpg"></p><div id="test"><p>test</p></div><p>test</p>
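Since the question mentions XPath, here is a minimal sketch of the same idea using DOMXPath. The function name extractBlocks is hypothetical (it is not part of PHP or of the code above), and the class test is the common XPath 1.0 idiom for matching one token in a class list, so class="block wide" would also match. It assumes ASCII or otherwise correctly declared input, since loadHTML() guesses the encoding.

```php
<?php
// Hypothetical helper: pull every div with class "block" out of $html.
// Returns the serialized blocks plus the DOMDocument with those divs removed.
function extractBlocks(string $html): array
{
    libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Matches "block" as a whole class token: "block wide" matches, "blockquote" does not.
    $query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' block ')]";

    $blocks = [];
    // The list returned by query() is not live, so removing nodes in the loop is safe.
    foreach ($xpath->query($query) as $div) {
        $blocks[] = $doc->saveHTML($div);
        $div->parentNode->removeChild($div);
    }

    return ['blocks' => $blocks, 'doc' => $doc];
}

$result = extractBlocks(
    '<p>Test text</p>' .
    '<div class="block"><img src="test2.jpg"></div>' .
    '<div class="block wide"><p>second</p></div>' .
    '<p>test</p>'
);

print_r($result['blocks']);          // both block divs
echo $result['doc']->saveHTML();     // remaining markup, still wrapped in doctype/html/body
```

Compared with looping over every element, the query only visits the matching divs, and it keeps working when there are two blocks next to each other.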

If you simply can't accept that DOMDocument "enriches" the HTML with doctype, html and body tags, which is annoying when you print the complete document rather than just some extracted elements, you can use this neat function (DOMinnerHTML, a user-defined helper, not a PHP built-in) and extract the innerHTML of the body with:

echo DOMinnerHTML($doc->getElementsByTagName('body')->item(0));
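The helper itself is not shown here, so the following is only an assumed reconstruction of what a DOMinnerHTML-style function usually looks like, not necessarily the one referred to above: it serializes each child of the given node and concatenates the results, which leaves out the node's own tag and everything outside it.

```php
<?php
// Assumed sketch of a DOMinnerHTML-style helper (user-defined, not a PHP built-in).
function DOMinnerHTML(DOMNode $element): string
{
    $innerHTML = '';
    foreach ($element->childNodes as $child) {
        // saveHTML($node) serializes just that node, without doctype/html/body
        $innerHTML .= $element->ownerDocument->saveHTML($child);
    }
    return $innerHTML;
}

$doc = new DOMDocument();
$doc->loadHTML('<p>Test text</p><div id="test"><p>test</p></div>');

$bodyHtml = DOMinnerHTML($doc->getElementsByTagName('body')->item(0));
echo $bodyHtml;
```

Because the function only relies on childNodes and ownerDocument, it should work for any element in the document, not just body.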