Truncating HTML content at the end of text blocks (block elements)

When we shorten or truncate textual content, we usually just cut it at a specific character index. That's already complicated in HTML, but I want to truncate my HTML content (generated using a content-editable div) using different measures:

  1. I would define a character index N that serves as the truncation limit
  2. The algorithm checks whether the content is at least N characters long (text only, not counting tags); if it's not, it just returns the whole content
  3. It then scans character positions N-X to N+X (text only) for ends of block nodes; X is a predefined offset, likely about N/5 to N/4
  4. If several block nodes end within this range, the algorithm selects the one that ends closest to the limit index N
  5. If no block node ends within this range, it finds the word boundary within the same range that is closest to N and truncates at that position
  6. It returns the truncated content as valid HTML (all tags closed at the end)
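
The selection part of these steps (2 to 5) can be sketched independently of any HTML parser. The following is only a minimal sketch under assumptions: the class and method names are hypothetical, the text has already been extracted without tags, and the offsets at which candidate block nodes end have been collected beforehand.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CutPointChooser
{
    // Returns the text offset at which to cut, or -1 when the content is
    // shorter than the limit and should be returned whole (step 2).
    public static int ChooseCutOffset(
        string text, IEnumerable<int> blockEnds, int limit, int offset)
    {
        if (text.Length < limit)
            return -1;

        // The search window [N-X, N+X], clamped to the text.
        int min = Math.Max(0, limit - offset);
        int max = Math.Min(text.Length, limit + offset);

        // Steps 3-4: among block ends inside the window, the one closest to N wins.
        var candidates = blockEnds.Where(e => e >= min && e <= max).ToList();
        if (candidates.Count > 0)
            return candidates.OrderBy(e => Math.Abs(e - limit)).First();

        // Step 5: fall back to the closest word boundary in the same window.
        // A boundary here is the start of the text, the end of the text,
        // or the first whitespace character after a word.
        int best = -1;
        for (int i = min; i <= max; i++)
        {
            bool boundary = i == 0 || i == text.Length
                || (char.IsWhiteSpace(text[i]) && !char.IsWhiteSpace(text[i - 1]));
            if (boundary && (best < 0 || Math.Abs(i - limit) < Math.Abs(best - limit)))
                best = i;
        }

        // No boundary at all in the window (one very long word): cut at N itself.
        return best >= 0 ? best : limit;
    }
}
```

Step 6 (re-closing the open tags) is not covered here; it depends on how the returned text offset is mapped back onto the parsed tree.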

My content-editable generated content may consist of paragraphs (with line breaks), preformatted code blocks, block quotes, ordered and unordered lists, headers, bold and italic text (inline nodes, which shouldn't count in the truncation process) etc. The final implementation will of course define which elements specifically are possible truncation candidates. Headers, even though they are block HTML elements, will not count as truncation points, as we don't want widowed headers. Paragraphs, individual list items, whole ordered and unordered lists, block quotes, preformatted blocks, void elements etc. are good candidates. Headers and all inline elements aren't.

Example

Let's take this very Stack Overflow question as an example of HTML content that I would like to truncate. Let's set the truncation limit to 1000 with an offset of 250 characters (1/4).

This DotNetFiddle shows the text of this question with limit markers added inside it (|MIN| marks character 750, |LIMIT| character 1000 and |MAX| character 1250).

As can be seen from the example, the boundary between two block nodes closest to character 1000 is between </OL> and P (My content-editable generated...). This means that my HTML should be truncated right between these two tags, which would result in content a little under 1000 characters long text-wise, but would keep the truncated content meaningful because it wouldn't be cut somewhere in the middle of a text passage.

I hope this explains how things should be working related to this algorithm.

The problem

The first problem I'm seeing here is that I'm dealing with a nested structure like HTML. I also have to distinguish between elements (only block elements count, not inline ones). And last but not least, I have to count only certain characters in my string and ignore those that belong to tags.
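
To illustrate how these three concerns could be handled in one pass, here is a rough sketch that assumes the HtmlAgilityPack NuGet package. The class name, the method names and the set of candidate tags are hypothetical. It walks the tree depth-first, counts only decoded text characters, and records the running count each time a candidate block element closes.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class BlockEndScanner
{
    // Hypothetical candidate set: headers and inline elements are left out on purpose.
    static readonly HashSet<string> Candidates =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "p", "li", "ol", "ul", "blockquote", "pre" };

    // Returns the text offsets (tags not counted) at which candidate blocks end.
    public static List<int> FindBlockEnds(HtmlNode root)
    {
        var ends = new List<int>();
        int count = 0;
        Visit(root, ref count, ends);
        return ends;
    }

    static void Visit(HtmlNode node, ref int count, List<int> ends)
    {
        if (node.NodeType == HtmlNodeType.Text)
        {
            // Count decoded characters so that "&amp;" counts as one character.
            count += HtmlEntity.DeEntitize(node.InnerText).Length;
            return;
        }

        foreach (HtmlNode child in node.ChildNodes)
            Visit(child, ref count, ends);

        // All of this element's text has been counted: this is where it ends.
        if (node.NodeType == HtmlNodeType.Element && Candidates.Contains(node.Name))
            ends.Add(count);
    }
}
```

Nested candidates that close at the same position (a last list item and its list, for example) produce the same offset twice, which is harmless when the offsets are only used to pick the one closest to the limit. Whitespace between tags is counted as text in this sketch; a real implementation would probably normalize it first.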

Possible solutions

  1. I could parse my content manually by creating some object tree representing content nodes and their hierarchy
  2. I could convert HTML to something easier to manage like markdown and then simply search for closest new line to my provided index N and convert back to HTML
  3. Use something like HTML Agility Pack and replace my #1 parsing with it and then somehow use XPath to extract block nodes and truncate content

Second thoughts

  • I'm sure I could make it work by doing #1, but it feels like I'm reinventing the wheel.
  • I don't think there's any C# library for #2, so I would have to do the HTML to Markdown conversion manually as well, or run e.g. pandoc as an external process.
  • I could use HAP as it's great at manipulating HTML, but I'm not sure whether my truncation would be simple enough with it. I'm afraid the bulk of the processing would still be outside HAP, in my custom code.

How should one approach such a truncation algorithm? My head just seems to be too tired to come to a conclusion (or solution).

There are 3 answers

Eduardo Ramos On

I would walk the whole DOM tree and keep counting the number of text characters that appear. Once I hit the limit (N), I would erase the extra characters of that text node, and from there on I would just remove all text nodes.

I believe that is a safe way to keep all HTML+CSS structure while retaining only N characters.
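
A minimal sketch of that idea, assuming the HtmlAgilityPack NuGet package (the class and method names are hypothetical): it visits the text nodes in document order, crops the one that crosses the limit, and removes every text node after it while leaving the elements in place.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class TextNodeTrimmer
{
    // Keeps at most `limit` characters of text under `root`; elements are untouched.
    public static void TrimText(HtmlNode root, int limit)
    {
        int seen = 0;

        // ToList() snapshots the nodes so removals do not disturb the enumeration.
        foreach (var text in root.DescendantNodes().OfType<HtmlTextNode>().ToList())
        {
            if (seen >= limit)
            {
                // Past the limit: drop the text but keep the surrounding markup.
                text.Remove();
                continue;
            }

            // Crop the node that crosses the limit.
            int remaining = limit - seen;
            if (text.Text.Length > remaining)
                text.Text = text.Text.Substring(0, remaining);

            seen += text.Text.Length;
        }
    }
}
```

This counts raw markup characters, so an entity such as `&amp;` counts as five characters and could be cut in half; decoding the text before counting would avoid that. Elements after the cut stay in the tree as empty shells, which preserves the structure but may not be what a preview wants.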

Akash Kava On
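    // Requires: using System.Linq; using HtmlAgilityPack;

    private static void RemoveEmpty(HtmlNode node)
    {
        var parent = node.ParentNode;
        if (parent == null)
            return; // never try to remove the root itself

        node.Remove();

        // Remove the parent as well if it is now empty.
        if (!parent.DescendantNodes().Any())
            RemoveEmpty(parent);
    }

    private static void Truncate(HtmlNode root, int maxLimit)
    {
        var n = 0;

        // ToArray() takes a snapshot so nodes can be removed while iterating.
        foreach (var node in root.DescendantNodes().OfType<HtmlTextNode>().ToArray())
        {
            n += node.Text.Length;

            // Once the running total passes the limit, drop this text node
            // (whole, not cropped) and every text node after it.
            if (n > maxLimit)
                RemoveEmpty(node);
        }
    }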

You are left with only the nodes whose text adds up to at most your max limit of characters.

Simon Mourier On

Here is some sample code that can truncate the inner text. It uses the recursive capability of the InnerText property and CloneNode method.

    public static HtmlNode TruncateInnerText(HtmlNode node, int length)
    {
        if (node == null)
            throw new ArgumentNullException("node");

        // nothing to do? (the text already fits)
        if (node.InnerText.Length <= length)
            return node;

        HtmlNode clone = node.CloneNode(false);
        TruncateInnerText(node, clone, clone, length);
        return clone;
    }

    private static void TruncateInnerText(HtmlNode source, HtmlNode root, HtmlNode current, int length)
    {
        HtmlNode childClone;
        foreach (HtmlNode child in source.ChildNodes)
        {
            // is the expected size ok?
            int expectedSize = child.InnerText.Length + root.InnerText.Length;
            if (expectedSize <= length)
            {
                // yes, just clone the whole hierarchy
                childClone = child.CloneNode(true);
                current.ChildNodes.Add(childClone);
                continue;
            }

            // is it a text node? then crop it
            HtmlTextNode text = child as HtmlTextNode;
            if (text != null)
            {
                int remove = expectedSize - length;
                childClone = root.OwnerDocument.CreateTextNode(text.InnerText.Substring(0, text.InnerText.Length - remove));
                current.ChildNodes.Add(childClone);
                return;
            }

            // it's not a text node, shallow clone and dive in
            childClone = child.CloneNode(false);
            current.ChildNodes.Add(childClone);
            TruncateInnerText(child, root, childClone, length);
        }
    }

And a sample C# console app that will scrape this question as an example, and truncate it to 500 characters.

  class Program
  {
      static void Main(string[] args)
      {
          var web = new HtmlWeb();
          var doc = web.Load("http://stackoverflow.com/questions/30926684/truncating-html-content-at-the-end-of-text-blocks-block-elements");
          var post = doc.DocumentNode.SelectSingleNode("//td[@class='postcell']//div[@class='post-text']");
          var truncated = TruncateInnerText(post, 500);
          Console.WriteLine(truncated.OuterHtml);
          Console.WriteLine("Size: " + truncated.InnerText.Length);
      }
  }

When run, it should display this:

<div class="post-text" itemprop="text">

<p>Mainly when we shorten/truncate textual content we usually just truncate it at specific character index. That's already complicated in HTML anyway, but I want to truncate my HTML content (generated using content-editable <code>div</code>) using different measures:</p>

<ol>
<li>I would define character index <code>N</code> that will serve as truncating startpoint <em>limit</em></li>
<li>Algorithm will check whether content is at least <code>N</code> characters long (text only; not counting tags); if it's not it will just return the whole content</li>
<li>It would then</li></ol></div>
Size: 500

Note: I have not truncated at a word boundary, just at a character boundary, and no, it's not at all following the suggestions in my comment :-)