Flattening HTML documents using Jsoup

HTML documents are hierarchical and can be parsed into DOM trees using Jsoup.

Is there a way to extract semantic "sections" from those documents using pattern matching, where each match denotes the start of a "section" and the end of the previous one, and where sections can contain subsections to any depth?

The main difficulty here is that the HTML text belonging to a "section" start is not necessarily valid HTML (e.g., in the case where a section start is nested inside other tags). A traversal which extracts all the HTML content of a "section" and its direct "children" (subsections) would be the desired output.

Note that the problem can be reduced to extracting the content between two HTML tags (the start of a section, inclusive, and the start of the next section, exclusive), since even if the pattern matches some arbitrary text in the document, the first HTML tag surrounding that text can be used.

Is there any way of doing this in Jsoup, i.e. given 2 Nodes to extract the HTML in between, irrespective of the hierarchical (nesting) level they belong to?
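
To make the intended flattening concrete, here is a minimal sketch under some loud assumptions. It deliberately uses the JDK's built-in `org.w3c.dom` parser instead of Jsoup, so it runs without dependencies but only accepts well-formed XHTML. It treats `h1` to `h6` as section starts and collects plain text rather than HTML. The names `SectionFlattener`, `Section`, and `flatten` are made up for illustration. A Jsoup version would walk its own `Node`s in the same document order.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: flattens a document into sections delimited by headings.
final class SectionFlattener {

    /** One flat section: heading level (1-6), heading text, and the text up to the next heading. */
    static final class Section {
        final int level;
        final String title;
        final StringBuilder body = new StringBuilder();

        Section(int level, String title) {
            this.level = level;
            this.title = title;
        }
    }

    /** Parses well-formed XHTML and returns its sections in document order. */
    static List<Section> flatten(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            List<Section> sections = new ArrayList<>();
            walk(doc.getDocumentElement(), sections);
            return sections;
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    // Depth-first, pre-order walk: nesting depth is ignored, only document order matters.
    private static void walk(Node node, List<Section> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE && node.getNodeName().matches("(?i)h[1-6]")) {
            int level = node.getNodeName().charAt(1) - '0';
            out.add(new Section(level, node.getTextContent().trim()));
            return; // the heading's own text is the title, so do not descend into it
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim().replaceAll("\\s+", " ");
            if (!text.isEmpty() && !out.isEmpty()) { // text before the first heading is dropped
                StringBuilder body = out.get(out.size() - 1).body;
                if (body.length() > 0) {
                    body.append(' ');
                }
                body.append(text);
            }
            return;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, out);
        }
    }
}
```

Because a section only ends at the next heading, text that follows a nested subsection (such as a trailing paragraph after an inner list) is attributed to the most recent heading, not to the enclosing one.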

The following example uses tag matching to delineate semantic "sections", limited to HTML headers (e.g. <h1>) for simplicity. The hierarchy of "sections" is:

{Flattening HTML Documents [
    {Introduction},
    {Methodology [
      {Recursion [
        {First Approach}, {Second Approach}]
      },
      {Tree Traversal [
        {Depth-First Search}, {Breadth-First Search}]
      }]
    },
    {Conclusion}]
}

Here is the raw HTML.

<html>
  <head><title>Flattening HTML Documents</title></head>
  <body>
<h1>Flattening HTML Documents</h1>
    The requirement is to read each document in memory and extract its "sections",<br/>
    in sequential order, keeping track of subsections, in a tree-like manner.
    <div>
      <h2>Introduction</h2>
      Flattening HTML documents using <em><u>predefined</u> tag</em> values<br/>
      to mark the start of a section, which is also the end of the previous section.
    </div>
    <div>
      <h2>Methodology</h2>
      <p>What would be the optimal way of doing this?</p>
      <ul>
        <li>
          <h3>Recursion</h3>
          One method is <strong>recursion</strong>. But how do we keep state (section limits)?
          <ul>
            <li><h4>First Approach</h4><p>Pass state via method arguments</p></li>
            <li><h4>Second Approach</h4><p>Pass state via method return values</p></li>
          </ul>
          <p>There are also <strong>tree-based</strong> methods.</p>
        </li>
        <li>
          <h3>Tree Traversal</h3>
          Another method is <strong>tree traversal</strong>. But how do we keep state (section limits)?
          <ol>
            <li><h4>Depth-First Search</h4><p>Options: <b>preorder</b>, <b>inorder</b>, <b>postorder</b></p></li>
            <li><h4>Breadth-First Search</h4><p>Just <b>BFS</b>.</p></li>
          </ol>
        </li>
      </ul>
    </div>
    <div>
      <h2>Conclusion</h2>
      <p>Flattening (shredding) an <strong>HTML</strong> document using predefined tags<br/>
      (e.g., HTML header tags like &lt;h1&gt;) is a fascinating problem.</p>
    </div>  
  </body>
</html>

There is 1 answer

Maciej Dobrowolski

I believe this could be handled, to some extent, with CSS selectors such as this one:

:has(:is(h1,h2,h3,h4,h5) ~ p)

This would return a flat list of elements containing a header and a sibling paragraph.

If you want to maintain a hierarchical layout in your app model, you would have to recursively iterate over the results of the above and execute the very same selector against each (to get their inner sections). Or simply traverse up the tree to see what is a parent of what.
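
As a rough illustration of the "what is a parent of what" step, the sketch below nests a flat, document-ordered list of headings using a stack of currently open sections. It is plain Java with no Jsoup calls. The names `SectionNester`, `SectionNode`, and `nest` are hypothetical, and it assumes you have already collected each heading's level (1 for h1, 2 for h2, and so on) and title.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical helper: turns a flat list of headings into a section tree.
final class SectionNester {

    static final class SectionNode {
        final int level;
        final String title;
        final List<SectionNode> children = new ArrayList<>();

        SectionNode(int level, String title) {
            this.level = level;
            this.title = title;
        }

        /** Renders the subtree in the brace notation used in the question. */
        String render() {
            if (children.isEmpty()) {
                return "{" + title + "}";
            }
            StringBuilder sb = new StringBuilder("{" + title + " [");
            for (int i = 0; i < children.size(); i++) {
                if (i > 0) {
                    sb.append(", ");
                }
                sb.append(children.get(i).render());
            }
            return sb.append("]}").toString();
        }
    }

    /** levels[i] is the heading level (>= 1) of titles[i], both in document order. */
    static SectionNode nest(int[] levels, String[] titles) {
        SectionNode root = new SectionNode(0, "root"); // synthetic parent of all top-level sections
        Deque<SectionNode> open = new ArrayDeque<>();
        open.push(root);
        for (int i = 0; i < levels.length; i++) {
            SectionNode section = new SectionNode(levels[i], titles[i]);
            // A new heading closes every open section at the same or a deeper level.
            while (open.peek().level >= section.level) {
                open.pop();
            }
            open.peek().children.add(section);
            open.push(section);
        }
        return root;
    }
}
```

The root has level 0, so it is never popped as long as every input level is at least 1. A heading that skips a level (an h3 directly under an h1, say) simply becomes a child of the nearest shallower section.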