When we shorten (truncate) textual content, we usually just cut it at a specific character index. That's already complicated in HTML, but I want to truncate my HTML content (generated using a content-editable div) using different measures:
1. I would define a character index N that serves as the truncation limit.
2. The algorithm checks whether the content is at least N characters long (text only, not counting tags); if it isn't, it just returns the whole content.
3. It then checks the range from character position N-X to N+X (text only) and searches for ends of block nodes; X is a predefined offset value, likely about N/5 to N/4.
4. If several block nodes end within this range, the algorithm selects the one that ends closest to the limit index N.
5. If no block node ends within this range, it finds the closest word boundary within the same range, selects the index closest to N, and truncates at that position.
6. Return the truncated content as valid HTML (all tags closed at the end).
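The selection rules above (prefer the block end closest to N, otherwise the closest word boundary) can be sketched as a small pure function. This is only an illustration in Python, not a finished implementation: the name choose_cut, the block_ends argument (text-only offsets where eligible block nodes end), the treatment of a word boundary as the start of a whitespace run, and the hard cut at N when nothing is found are all assumptions made for the example.

```python
import re

def choose_cut(text, block_ends, n, x):
    """Return the text offset at which to truncate, or None to keep everything.

    text       -- content with all tags stripped
    block_ends -- text offsets at which an eligible block node ends
    n, x       -- limit index and offset; the search window is [n - x, n + x]
    """
    if len(text) < n:
        return None  # shorter than the limit: return the whole content

    lo, hi = max(0, n - x), min(len(text), n + x)

    # Prefer the block end closest to n inside the window.
    in_window = [e for e in block_ends if lo <= e <= hi]
    if in_window:
        return min(in_window, key=lambda e: abs(e - n))

    # Otherwise fall back to the word boundary closest to n.
    # Assumption: a word boundary is the start of a run of whitespace.
    boundaries = [m.start() for m in re.finditer(r"\s+", text)
                  if lo <= m.start() <= hi]
    if boundaries:
        return min(boundaries, key=lambda b: abs(b - n))

    return n  # assumption: no boundary in the window means a hard cut at n
```

With N = 1000 and X = 250, the window passed to this function would be characters 750 to 1250. When two candidates are equally far from N, min returns whichever appears first in the list, so the earlier one wins if the offsets are sorted.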
My content-editable generated content may consist of paragraphs (with line breaks), preformatted code blocks, block quotes, ordered and unordered lists, headers, bolds and italics (which are inline nodes and shouldn't count in the truncation process), etc. The final implementation will of course define which elements specifically are possible truncation candidates. Headers, even though they are block HTML elements, will not count as truncation points, as we don't want widowed headers. Paragraphs, individual list items, whole ordered and unordered lists, block quotes, preformatted blocks, void elements, etc. are good candidates. Headers and all inline elements aren't.
Example
Let's take this very Stack Overflow question as an example of HTML content that I would like to truncate. Let's set the truncation limit to 1000 with an offset of 250 characters (1/4).
This DotNetFiddle shows the text of this question with limit markers added inside it (|MIN| represents character 750, |LIMIT| represents character 1000 and |MAX| represents character 1250).
As can be seen from the example, the boundary between two block nodes that lies closest to character 1000 is between </OL> and P (My content-editable generated...). This means that my HTML should be truncated right between these two tags. The result would be a little less than 1000 characters long text-wise, but the truncated content would stay meaningful because it wouldn't be cut somewhere in the middle of a text passage.
I hope this explains how the algorithm should work.
The problem
The first problem I'm seeing here is that I'm dealing with a nested structure like HTML. I also have to distinguish between elements (only block elements count, not inline ones). And last but not least, I have to count only certain characters in my string and ignore those that belong to tags.
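To make the three concerns concrete (nesting, block vs. inline elements, and counting text characters only), here is a minimal sketch using Python's standard html.parser. It is illustrative rather than a recommendation: the CANDIDATES set, the TextOffsets class and the scan helper are hypothetical names, and the set of eligible elements is just a guess based on the examples in this question.

```python
from html.parser import HTMLParser

# Assumption: the final implementation decides which elements are truncation
# candidates; this set only mirrors the examples named in the question.
# Void elements such as <br> have no end tag and are left out of this sketch.
CANDIDATES = {"p", "li", "ol", "ul", "blockquote", "pre"}

class TextOffsets(HTMLParser):
    """Collect tag-free text and the text offsets where candidate blocks end."""

    def __init__(self):
        # convert_charrefs=True makes "&amp;" count as one text character.
        super().__init__(convert_charrefs=True)
        self.parts = []       # text fragments, tags excluded
        self.length = 0       # running count of text characters
        self.block_ends = []  # text offsets of closing candidate tags

    def handle_data(self, data):
        self.parts.append(data)
        self.length += len(data)

    def handle_endtag(self, tag):
        # Headers (h1..h6) and inline tags are not in CANDIDATES, so they
        # never produce a truncation point.
        if tag in CANDIDATES:
            self.block_ends.append(self.length)

def scan(html):
    """Return (plain_text, block_end_offsets) for an HTML fragment."""
    parser = TextOffsets()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts), parser.block_ends
```

Because the parser reports nested end tags in document order, an offset can appear more than once (for example when a list item and its list end at the same character). Whitespace between tags is counted as text here, which a real implementation may want to normalize first.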
Possible solutions
1. I could parse my content manually by creating some object tree representing content nodes and their hierarchy.
2. I could convert HTML to something easier to manage, like markdown, then simply search for the closest new line to my provided index N and convert back to HTML.
3. Use something like HTML Agility Pack to replace my #1 parsing, and then somehow use XPath to extract block nodes and truncate content.
Second thoughts
1. I'm sure I could make it work by doing #1, but it feels like I'm reinventing the wheel.
2. I don't think there's any C# library for #2, so I would have to do the HTML to Markdown conversion manually as well, or run e.g. pandoc as an external process.
3. I could use HAP as it's great at manipulating HTML, but I'm not sure whether my truncation would be simple enough with it. I'm afraid the bulk of the processing would still be outside HAP, in my custom code.
How should one approach such a truncation algorithm? My head just seems to be too tired to come to a conclusion (or solution).
I would walk the whole DOM tree and keep counting the text characters that appear. When I hit the limit (N), I would erase the extra characters of that text node, and from there on I would just remove all remaining text nodes.
I believe that is a safe way to keep all the HTML+CSS structure while retaining only N characters.
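A rough sketch of this idea, written in Python with the standard html.parser in place of a real DOM walk: every tag is re-emitted as-is, and text is copied only until N characters have been kept. The names KeepFirstN and truncate_text are made up for the example. It also assumes the input contains no script or style content, and it drops comments and doctype declarations.

```python
from html import escape
from html.parser import HTMLParser

class KeepFirstN(HTMLParser):
    """Re-emit the markup unchanged but keep only the first n text characters."""

    def __init__(self, n):
        # convert_charrefs=True: entities arrive decoded, so "&amp;" counts once.
        super().__init__(convert_charrefs=True)
        self.remaining = n
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # raw tag text, attributes intact

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. "<br/>", no extra end tag

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        kept = data[:self.remaining]  # empty once the limit has been reached
        self.remaining -= len(kept)
        self.out.append(escape(kept, quote=False))  # re-encode &, < and >

def truncate_text(html, n):
    """Return html with all tags kept and text cut after n characters."""
    parser = KeepFirstN(n)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

For example, truncate_text('<p>Hello <b>big</b> world</p><p>more</p>', 8) gives '<p>Hello <b>bi</b></p><p></p>'. Every tag stays balanced, but elements after the limit remain as empty shells, and the cut can land in the middle of a word, which is exactly what the block-boundary search in the question tries to avoid.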