PHP DomCrawler - content without tag (nodeName)

I am parsing HTML content with the PHP (Symfony) DomCrawler library, like this:

use Symfony\Component\DomCrawler\Crawler;

$html = <<<'HTML'
<!DOCTYPE html>
<html>
    <body>
       <div class="content">
          <p class="message">Hello World!</p>
          !!!This content is not processed by DomCrawler as Children!!!
          <p>Hello Crawler!</p>
        </div>
    </body>
</html>
HTML;

$crawler = new Crawler($html);
$content = $crawler->filterXPath('descendant-or-self::body/div[@class="content"]');
foreach ($content->children() as $contentChild) {
  // Only 2 iterations: the middle text, which has no tag (nodeName), is skipped
}

The middle content "!!!This content is not processed by DomCrawler as Children!!!" never appears in the loop; only content wrapped in a valid tag is returned. Perhaps some minor configuration is needed to achieve this. Does anyone know how to fix this, so that I also get a DomNode for the text that has no HTML tag?

Looking forward to any hint or help. Thank you in advance.

There is 1 answer

Jim On

Looking at the code for DomCrawler, children() appears to keep only element nodes, which contradicts its description of returning child nodes (maybe a bug in the implementation or the documentation?). Text that is not wrapped in a tag is a text node rather than an element, so it is skipped. You can get around this by ending your XPath expression with /node(), which selects all child nodes instead:

use Symfony\Component\DomCrawler\Crawler;

$html = <<<'HTML'
<!DOCTYPE html>
<html>
    <body>
       <div class="content">
          <p class="message">Hello World!</p>
          !!!This content is not processed by DomCrawler as Children!!!
          <p>Hello Crawler!</p>
        </div>
    </body>
</html>
HTML;
$crawler = new Crawler($html);
// End with /node() to return all child nodes (including text).
$content = $crawler->filterXPath('descendant-or-self::body/div[@class="content"]/node()');
foreach ($content as $contentChild) {
  // There are 5 iterations: the two <p> elements,
  // the untagged text, and the whitespace-only text
  // nodes (line breaks / indentation) around them
}
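
If the whitespace-only text nodes get in the way, one option is to skip them while iterating. The following is only a minimal sketch: meaningfulNodes() is a hypothetical helper name, not part of DomCrawler, and the example uses PHP's built-in DOMDocument and DOMXPath (with a simplified //body/... expression) so that it runs without Composer.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of DomCrawler): keeps every node except
 * text nodes that consist only of whitespace.
 *
 * @param iterable<DOMNode> $nodes
 * @return DOMNode[]
 */
function meaningfulNodes(iterable $nodes): array
{
    $kept = [];
    foreach ($nodes as $node) {
        // Skip indentation and line breaks between the tags.
        if ($node instanceof DOMText && trim($node->data) === '') {
            continue;
        }
        $kept[] = $node;
    }

    return $kept;
}

$html = <<<'HTML'
<!DOCTYPE html>
<html>
    <body>
       <div class="content">
          <p class="message">Hello World!</p>
          !!!This content is not processed by DomCrawler as Children!!!
          <p>Hello Crawler!</p>
        </div>
    </body>
</html>
HTML;

// Built-in DOM classes are used here only to keep the sketch dependency-free.
$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);
$nodes = $xpath->query('//body/div[@class="content"]/node()');

foreach (meaningfulNodes($nodes) as $node) {
    // Text nodes report "#text" as their nodeName.
    echo $node->nodeName, ': ', trim($node->textContent), PHP_EOL;
}
```

Because a Crawler is iterable and yields DOMNode objects, passing the $content crawler from the answer to the same helper should work in the same way.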