PHP DOMDocument ignores first table's closing tag


I was writing a tool to convert HTML tables to CSV and noticed some bizarre behavior. Given this code:

$html = <<<HTML
<table>
<tr><td>A</td><td>Rose</td></tr>
</table>

<h1>Leave me behind</h1>

<table>
<tr><td>By</td><td>Any</td></tr>
</table>

<table>
<tr><td>Other</td><td>Name</td></tr>
</table>
HTML;

$dom = new \DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);

foreach ($dom->getElementsByTagName('table') as $table) {
    foreach ($table->getElementsByTagName('tr') as $row) {
        echo trim($row->nodeValue) . PHP_EOL;
    }
}

I would expect output like this:

ARose
ByAny
OtherName

But what I get is this:

ARose
ByAny
OtherName
ByAny
OtherName

I get the same result if I omit the first closing tag. It appears DOMDocument is nesting the second and third <table> inside the first.
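
One way to check the nesting hypothesis is to print where each table sits in the parsed tree. The following is a minimal sketch that assumes the same load flags as above and uses a shortened copy of the markup; the $paths variable is introduced here only for illustration. If the later tables really are nested, their paths should pass through the first table rather than sit beside it.

```php
<?php
$html = <<<HTML
<table>
<tr><td>A</td><td>Rose</td></tr>
</table>

<h1>Leave me behind</h1>

<table>
<tr><td>By</td><td>Any</td></tr>
</table>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);

// Record the XPath location of every <table> in the parsed document.
$paths = [];
foreach ($dom->getElementsByTagName('table') as $table) {
    $paths[] = $table->getNodePath();
}

// A path that contains more than one "table" step indicates nesting.
foreach ($paths as $path) {
    echo $path . PHP_EOL;
}
```

DOMNode::getNodePath() reports the position of a node as the parser built it, so it shows the actual tree rather than the structure the source markup suggests.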

Indeed, if I use XPath to select only the <tr> elements that are immediate children of each table, I get the correct output:

$xpath = new \DOMXPath($dom);

foreach ($dom->getElementsByTagName('table') as $table) {
    foreach ($xpath->query('./tr', $table) as $row) {
        echo trim($row->nodeValue) . PHP_EOL;
    }
}

1 answer

Ken Lee (best answer)

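Enclose your $html in <body> and </body>. With LIBXML_HTML_NOIMPLIED the parser does not add the implied <html> and <body> elements, so the first <table> is treated as the root element and the markup that follows it ends up inside it. Wrapping everything in <body> gives the document a single root and keeps the tables as siblings.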

Revised code (note: I commented out the $stream lines):

<?php
$html = <<<HTML
<body>
<table>
<tr><td>A</td><td>Rose</td></tr>
</table>

<h1>Leave me behind</h1>

<table>
<tr><td>By</td><td>Any</td></tr>
</table>

<table>
<tr><td>Other</td><td>Name</td></tr>
</table>
</body>
HTML;

$dom = new \DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);

$tables = $dom->getElementsByTagName('table');
// $stream = \fopen('php://output', 'w+');

for ($i = 0; $i < $tables->length; ++$i) {
    $rows = $tables->item($i)->getElementsByTagName('tr');

    for ($j = 0; $j < $rows->length; ++$j) {
        echo trim($rows->item($j)->nodeValue) . "<br><br>";
    }
}

// fclose($stream);
?>

Alternatively, change

$dom->loadHTML($html, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);

to

$dom->loadHTML($html, LIBXML_HTML_NODEFDTD);
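
To illustrate this alternative, here is a minimal sketch that loads the original markup without LIBXML_HTML_NOIMPLIED and groups the row text per table. The rowsPerTable() helper and the $result variable are hypothetical names introduced only for this example. Note that without the flag the parser adds <html> and <body> wrappers, which matters if you later serialize the document with saveHTML().

```php
<?php
// Hypothetical helper: returns the trimmed text of each row, grouped per table.
function rowsPerTable(string $html, int $flags): array
{
    // Collect parser warnings internally, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html, $flags);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = [];
    foreach ($dom->getElementsByTagName('table') as $table) {
        $rows = [];
        // getElementsByTagName() returns all descendants, so this only gives
        // per-table rows when the tables are not nested in each other.
        foreach ($table->getElementsByTagName('tr') as $row) {
            $rows[] = trim($row->nodeValue);
        }
        $result[] = $rows;
    }

    return $result;
}

$html = <<<HTML
<table>
<tr><td>A</td><td>Rose</td></tr>
</table>

<h1>Leave me behind</h1>

<table>
<tr><td>By</td><td>Any</td></tr>
</table>

<table>
<tr><td>Other</td><td>Name</td></tr>
</table>
HTML;

// Only LIBXML_HTML_NODEFDTD is passed; the implied <html>/<body> are added.
$result = rowsPerTable($html, LIBXML_HTML_NODEFDTD);
print_r($result);
```

Each inner array holds the rows of one table, so a table that contained another table's rows would show up as an inner array with more than one entry.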