How to parse PCDATA and child element separately with PHP DOM?

I'm trying to parse the XML of a DTBook, which contains levels (1, 2 and 3) that in turn contain p-tags. I'm doing this with PHP DOM.

Inside some of these p-tags there are noteref-tags. I can get hold of those, but the only results I've managed to produce have the noteref content appearing either before or after the p-tag's text. I need the noterefs to appear inside the p-tag; in other words, where they actually belong.

<p>Special education for the ..... <noteref class="endnote" idref="fn_5"
id="note5">5</noteref>. Interest ..... 19th century <noteref class="endnote"
idref="fn_6" id="note6">6</noteref>.</p>

This is the code I've got for the p-tag now. Before this, I'm looping through the DTBook to get to the p-tag. That works fine.

if($level1->tagName == "p") {
    echo "<p>".$level1->nodeValue;
    $noterefs = $level1->childNodes;
    foreach($noterefs as $noteref) {
        if($noteref->nodeType == XML_ELEMENT_NODE) {
            echo "<span><b>".$noteref->nodeValue."</b></span>";
        }
    }  
    echo "</p><br>";
}

These are the results I get:

Special education for the ..... 5. Interest ..... 19th century 6.56

56Special education for the ..... 5. Interest ..... 19th century 6.

I also want the p-tag's own text to not include what's inside the noteref-tag. That content should be output by the noteref-tag only.

So, does anybody know what could possibly be done to fix these things? It feels like I've both googled and tried almost everything.

There is 1 answer

Decent Dabbler (best answer)

DOMNode->nodeValue (which for a DOMElement in PHP is the same as DOMNode->textContent) contains the complete text content of the node itself and all its descendant nodes. Or, to put it more simply: it contains the complete content of the node, but with all tags removed.
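
To make that concrete, here is a small sketch of the difference between the element's nodeValue and the nodeValue of each direct child. The sample paragraph is a shortened, made-up version of the one in the question, so treat the exact text as an assumption.

```php
<?php
// Shortened stand-in for the question's <p> element (illustrative only).
$xml = '<p>Special education <noteref class="endnote" idref="fn_5" id="note5">5</noteref>. Interest</p>';

$doc = new DOMDocument();
$doc->loadXML($xml);
$p = $doc->documentElement;

// nodeValue of the element: its own text plus the text of every descendant,
// with the tags stripped. The "5" from <noteref> is already in here.
echo $p->nodeValue, "\n";

// Each direct child only carries its own piece of the content.
foreach ($p->childNodes as $child) {
    $kind = $child->nodeType === XML_TEXT_NODE ? 'text' : 'element';
    echo $kind, ': ', $child->nodeValue, "\n";
}
```

This is also why the question's output ends in "56": the note numbers are printed once as part of the paragraph's nodeValue and then a second time by the loop over the child elements.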

What you probably want to try is something like the following (untested):

if($level1->tagName == "p") {
    echo "<p>";
    // loop through all childNodes, not just noteref elements
    foreach($level1->childNodes as $childNode) {
      // you could also use if() statements here, of course
      switch($childNode->nodeType) {
        // if it's just text
        case XML_TEXT_NODE:
          echo $childNode->nodeValue;
        break;
        // if it's an element
        case XML_ELEMENT_NODE:
          echo "<span><b>".$childNode->nodeValue."</b></span>";
        break;
      }
    }  
    echo "</p><br>";
}

Be aware though that this is still rather flimsy. For instance: if any other elements, besides <noteref> elements, show up in the <p> elements, they will also be wrapped in <span><b> elements.
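
One possible way to make it less flimsy is sketched below. The helper name renderInline is hypothetical, and the decision to keep the text of any other inline element without its tags is my assumption, not something the question specifies. It also escapes the text before echoing it, which the snippet above does not do.

```php
<?php
// Hypothetical helper (not part of the snippet above): renders the children
// of $node as HTML and wraps only <noteref> elements in <span><b>.
function renderInline(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE || $child->nodeType === XML_CDATA_SECTION_NODE) {
            // Escape so characters such as & and < stay valid in the HTML output.
            $html .= htmlspecialchars($child->nodeValue);
        } elseif ($child->nodeType === XML_ELEMENT_NODE) {
            // localName ignores a namespace prefix, if the document uses one.
            if ($child->localName === 'noteref') {
                $html .= '<span><b>' . htmlspecialchars($child->nodeValue) . '</b></span>';
            } else {
                // Assumption: other elements contribute their content without their tags.
                $html .= renderInline($child);
            }
        }
        // Comments and processing instructions are skipped.
    }
    return $html;
}

// Made-up sample input with one extra inline element (<em>) and an entity.
$doc = new DOMDocument();
$doc->loadXML('<p>Special <em>education</em> <noteref idref="fn_5" id="note5">5</noteref>. A &amp; B</p>');
echo '<p>', renderInline($doc->documentElement), '</p>', "\n";
```

In the loop from the question you would then call renderInline($level1) between the opening and closing <p> instead of echoing $level1->nodeValue.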

Hopefully I've at least given you a clue as to why your result <p> elements showed the contents of the child elements as well.


As a side note: if what you want to achieve is to transform the contents of an XML document into HTML or perhaps some other XML structure, it might pay off to look into XSLT. Be aware, though, that the learning curve can be steep.
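
For a rough idea of what that could look like, here is a minimal sketch using PHP's XSLTProcessor. It assumes the optional XSL extension (ext-xsl) is installed, and it matches p and noteref without a namespace; a real DTBook file may declare a default namespace, in which case the match patterns would need a prefix. The function name transformWithXslt and the sample input are made up for illustration.

```php
<?php
// Sketch only: requires ext-xsl. Element names are matched without a
// namespace, which is an assumption about the input document.
function transformWithXslt(string $xml): string
{
    $xsl = <<<'XSL'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" indent="no"/>
  <xsl:template match="p">
    <p><xsl:apply-templates/></p>
  </xsl:template>
  <xsl:template match="noteref">
    <span><b><xsl:value-of select="."/></b></span>
  </xsl:template>
</xsl:stylesheet>
XSL;

    $style = new DOMDocument();
    $style->loadXML($xsl);

    $source = new DOMDocument();
    $source->loadXML($xml);

    $processor = new XSLTProcessor();
    $processor->importStylesheet($style);

    // transformToXml() returns the result as a string (false or null on failure).
    return (string) $processor->transformToXml($source);
}

// Only run the demo when the XSL extension is actually available.
if (class_exists('XSLTProcessor')) {
    echo transformWithXslt('<p>Special education <noteref idref="fn_5" id="note5">5</noteref>. Interest</p>');
}
```

The stylesheet does the same job as the loop over childNodes: text nodes are copied through by XSLT's built-in rules, and each noteref is wrapped where it occurs in the paragraph.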