Sluggish Performance Using iTextSharp XMLWorkerHelper When Parsing HTML


I've recently encountered a performance issue in which iTextSharp takes an extremely long time (often 30+ seconds) to render HTML content passed in from an HTML editor such as CKEditor or TinyMCE.

Previously, the HTMLWorker was used to parse the content, and it worked great: it was fast and fairly accurate. However, once more complex HTML (such as tables, ordered lists and unordered lists) began to be passed in, it started to falter:

//The HTMLWorker was quick, however its weaknesses began to show with more
//complex HTML
List<IElement> objects = HTMLWorker.ParseToList(sr, ss);

The complex markup is a requirement in this situation and rather than attempting to perform Regular Expression surgery and other nasty things to try and fix these issues, I elected to use the XMLWorker to handle parsing.

//This outputs everything perfectly and retains all of the proper styling that is
//needed. However, when things get complex it gets sluggish
XMLWorkerHelper.GetInstance().ParseXHtml(writer, document, stringReader);

The XMLWorker results were incredible and it output everything just as we needed, but its performance rendered it nearly unusable. As the complexity of the contents increased (through additional tables, styles and lists), so did the loading times.

The line above appears to be the performance bottleneck, and the alternatives I tried around it (such as creating a basic custom XmlHandler) didn't help at all.
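To back up the claim that the parse call is the bottleneck, it can help to time it in isolation over several runs. The following is only a minimal sketch using the standard Stopwatch class; the RenderTimer class and its Measure method are hypothetical names introduced here, not part of iTextSharp.

```csharp
using System;
using System.Diagnostics;

public static class RenderTimer
{
    // Runs the action `runs` times and returns the elapsed time of each run,
    // in order, so the first (cold) call can be compared with later ones.
    public static TimeSpan[] Measure(Action action, int runs)
    {
        if (action == null) throw new ArgumentNullException("action");
        if (runs < 1) throw new ArgumentOutOfRangeException("runs");

        var timings = new TimeSpan[runs];
        for (int i = 0; i < runs; i++)
        {
            var stopwatch = Stopwatch.StartNew();
            action();
            stopwatch.Stop();
            timings[i] = stopwatch.Elapsed;
        }
        return timings;
    }
}
```

The action passed in would wrap whatever builds a fresh document and writer and then calls ParseXHtml, for example RenderTimer.Measure(() => RenderPdf(html), 5), where RenderPdf is a placeholder for your own method. If the first timing is much larger than the rest, that would hint at a one-time setup cost rather than a per-document cost, though that is something to verify rather than assume.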

Possible Causes and Ideas

  • I tried going through and stripping out any extraneous and invalid markup from the contents that are being passed in, but that did little.

  • Could the issue be with iTextSharp itself and how the XMLWorkerHelper is working? I attempted to use the SAME input within the iText XML Helper Demo here and it was amazingly fast. I figured the performance would be at least comparable.

  • One option under consideration is to store the rendered PDFs and retrieve them on demand, as opposed to generating them dynamically each time. I would prefer to avoid this, but it's on the table.

  • The content is being pasted from Microsoft Word (cringe), and I have tried to clean it up as much as possible. I don't believe this is a major factor, since the iText demo mentioned above had no trouble with the same content.

  • Possible alternatives to using iTextSharp?
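The idea of storing rendered PDFs instead of regenerating them could be sketched roughly as follows. This is an in-memory illustration only, assuming the PDF depends solely on the HTML string; RenderedPdfCache and the render delegate are hypothetical names, and the delegate stands in for whatever code actually produces the PDF bytes.

```csharp
using System;
using System.Collections.Concurrent;
using System.Security.Cryptography;
using System.Text;

public sealed class RenderedPdfCache
{
    private readonly ConcurrentDictionary<string, byte[]> _cache =
        new ConcurrentDictionary<string, byte[]>();
    private readonly Func<string, byte[]> _render;

    // `render` is the slow step (HTML in, PDF bytes out) supplied by the caller.
    public RenderedPdfCache(Func<string, byte[]> render)
    {
        if (render == null) throw new ArgumentNullException("render");
        _render = render;
    }

    // Returns the cached PDF for this exact HTML, rendering it on a cache miss.
    // Under concurrent misses for the same key the delegate may run more than
    // once, but only one result is kept in the dictionary.
    public byte[] GetOrRender(string html)
    {
        if (html == null) throw new ArgumentNullException("html");
        return _cache.GetOrAdd(ComputeKey(html), _ => _render(html));
    }

    // Uses a SHA-256 hash of the HTML as the key so large documents
    // are not stored twice as dictionary keys.
    private static string ComputeKey(string html)
    {
        using (var sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(html));
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }
}
```

A real deployment would more likely persist the bytes to disk or a database and would need some eviction or invalidation policy; the sketch only shows the lookup-before-render shape.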

I would be glad to provide any additional details and code that I can.

Best answer, by Rion Williams

Although this issue is a few years old, I thought that I would let any future readers know that I eventually elected to use the wkhtmltopdf library via the TuesPechkin project.

The performance was a significant improvement over iTextSharp and it has great documentation with implementation examples for a variety of scenarios that may suit your existing project.