I'm trying to convert HTML to PDF using iTextSharp in an ASP.NET web application that uses both MVC and Web Forms. The <img> and <a> elements have absolute and relative URLs, and some of the <img> elements use base64 data URIs. Typical answers here at SO and Google search results use generic HTML to PDF code with XMLWorkerHelper that looks something like this:
using (var stringReader = new StringReader(xHtml))
{
    using (Document document = new Document())
    {
        PdfWriter writer = PdfWriter.GetInstance(document, stream);
        document.Open();
        XMLWorkerHelper.GetInstance().ParseXHtml(
            writer, document, stringReader
        );
    }
}
So with sample HTML like this:
<div>
<h3>HTML Works, but Broken in Converted PDF</h3>
<div>Relative local <img>: <img src='./../content/images/kuujinbo_320-30.gif' /></div>
<div>
Base64 <img>:
<img src='data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg==' />
</div>
<div><a href='/somePage.html'>Relative local hyperlink, broken in PDF</a></div>
</div>
The resulting PDF: (1) is missing all images, and (2) all hyperlinks with relative URLs are broken and use a file URI scheme (file:///XXX...) instead of pointing to the correct web site.
Some answers here at SO and others from Google search recommend replacing relative URLs with absolute URLs, which is perfectly acceptable for one-off cases. However, globally replacing all <img src> and <a href> attributes with a hard-coded string is unacceptable for this question, so please do not post an answer like that, because it will accordingly be downvoted.
I'm looking for a solution that works across many different web applications residing in test, development, and production environments.
Out of the box, `XMLWorker` only understands absolute URIs, so the described issues are expected behavior. The parser can't automagically deduce URI schemes or paths without some additional information.

Implementing an `ILinkProvider` fixes the broken hyperlink problem, and implementing an `IImageProvider` fixes the broken image problem. Since both implementations must perform URI resolution, that's the first step. The following helper class does that, and also tries to make web (ASP.NET) context calls (examples follow) as simple as possible:
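A sketch of such a helper. The exact fallback logic (empty string means "resolve from the current web request", otherwise an absolute URL or a file system directory) is an assumption about how base URIs should be chosen:

```csharp
using System;
using System.IO;
using System.Web;

public class UriHelper
{
    public Uri BaseUri { get; private set; }

    // true when the base URI points at the local file system
    public bool IsLocal { get { return BaseUri.IsFile; } }

    // baseUri: "" => use the current web request (if any); otherwise an
    // absolute http(s) URL or a file system directory path
    public UriHelper(string baseUri)
    {
        if (string.IsNullOrWhiteSpace(baseUri) && HttpContext.Current != null)
        {
            var request = HttpContext.Current.Request;
            // e.g. http://localhost:8080/app-root/
            BaseUri = new Uri(
                request.Url.GetLeftPart(UriPartial.Authority)
                + request.ApplicationPath.TrimEnd('/') + "/");
        }
        else if (Uri.IsWellFormedUriString(baseUri, UriKind.Absolute))
        {
            BaseUri = new Uri(baseUri);
        }
        else if (Directory.Exists(baseUri))
        {
            // trailing separator so relative resolution works correctly
            BaseUri = new Uri(baseUri.TrimEnd(Path.DirectorySeparatorChar)
                + Path.DirectorySeparatorChar);
        }
        else
        {
            throw new ArgumentException("Cannot resolve a base URI", "baseUri");
        }
    }

    // combine the base URI with a (possibly relative) src/href value
    public string Combine(string relativeUri)
    {
        Uri result;
        return Uri.TryCreate(BaseUri, relativeUri, out result)
            ? result.ToString() : relativeUri;
    }
}
```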
Implementing `ILinkProvider` is pretty simple now that `UriHelper` gives the base URI - we just need the correct URI scheme (`file` or `http[s]`). `IImageProvider` only requires implementing a single method, `Retrieve(string src)`, but `Store(string src, Image img)` is easy as well - note the inline comments there and for `GetImageRootPath()`:
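Sketches of both providers, assuming the `UriHelper` above; the image cache and the null-on-failure behavior in `Retrieve()` are my choices, not requirements of the interfaces:

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text;
using iTextSharp.tool.xml.pipeline.html;

public class LinkProvider : ILinkProvider
{
    private readonly string _baseUrl;

    public LinkProvider(UriHelper uriHelper)
    {
        _baseUrl = uriHelper.BaseUri.ToString();
    }

    // XMLWorker prepends this to every relative href it finds
    public string GetLinkRoot() { return _baseUrl; }
}

public class ImageProvider : IImageProvider
{
    private readonly UriHelper _uriHelper;
    // cache so repeated <img> tags aren't fetched/decoded twice
    private readonly Dictionary<string, Image> _cache =
        new Dictionary<string, Image>();

    public ImageProvider(UriHelper uriHelper) { _uriHelper = uriHelper; }

    public Image Retrieve(string src)
    {
        Image img;
        if (_cache.TryGetValue(src, out img)) return img;
        try
        {
            // base64 images already carry their own data
            if (src.StartsWith("data:image/", StringComparison.OrdinalIgnoreCase))
            {
                var base64 = src.Substring(src.IndexOf(',') + 1);
                img = Image.GetInstance(Convert.FromBase64String(base64));
            }
            else
            {
                // absolute URLs pass through; relative ones get the base URI
                img = Image.GetInstance(new Uri(_uriHelper.Combine(src)));
            }
            Store(src, img);
            return img;
        }
        // returning null lets XMLWorker fall back to GetImageRootPath()
        catch { return null; }
    }

    public string GetImageRootPath()
    {
        // file system parsing wants a local path; web parsing a URL root
        return _uriHelper.IsLocal
            ? _uriHelper.BaseUri.LocalPath
            : _uriHelper.BaseUri.ToString();
    }

    public void Store(string src, Image img) { _cache[src] = img; }

    public void Reset() { _cache.Clear(); }
}
```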
Based on the XML Worker documentation, it's pretty straightforward to hook the `ILinkProvider` and `IImageProvider` implementations above into a simple parser class:
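A sketch of that parser class, wired up with the documented `XMLWorker` pipeline (`CssResolverPipeline` → `HtmlPipeline` → `PdfWriterPipeline`); the `Regex` in `SimpleAjaxImgFix` is one way to self-close `<img>` tags:

```csharp
using System.IO;
using System.Text.RegularExpressions;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.tool.xml;
using iTextSharp.tool.xml.html;
using iTextSharp.tool.xml.parser;
using iTextSharp.tool.xml.pipeline.css;
using iTextSharp.tool.xml.pipeline.end;
using iTextSharp.tool.xml.pipeline.html;

public class SimpleParser
{
    public UriHelper UriHelper { get; private set; }

    // empty baseUri in a web app => resolve from the current request
    public SimpleParser(string baseUri = "")
    {
        UriHelper = new UriHelper(baseUri);
    }

    public byte[] Parse(string xHtml)
    {
        using (var stream = new MemoryStream())
        using (var document = new Document())
        {
            var writer = PdfWriter.GetInstance(document, stream);
            document.Open();

            var htmlContext = new HtmlPipelineContext(null);
            htmlContext.SetTagFactory(Tags.GetHtmlTagProcessorFactory());
            // hook up the custom providers
            htmlContext.SetLinkProvider(new LinkProvider(UriHelper));
            htmlContext.SetImageProvider(new ImageProvider(UriHelper));

            var cssResolver =
                XMLWorkerHelper.GetInstance().GetDefaultCssResolver(true);
            var pipeline = new CssResolverPipeline(
                cssResolver,
                new HtmlPipeline(htmlContext,
                    new PdfWriterPipeline(document, writer)));
            var worker = new XMLWorker(pipeline, true);

            using (var reader = new StringReader(SimpleAjaxImgFix(xHtml)))
            {
                new XMLParser(worker).Parse(reader);
            }
            document.Close();
            return stream.ToArray();
        }
    }

    // XHR may send unclosed <img> tags: valid HTML, but invalid XML
    // that breaks XMLWorker, so self-close them before parsing
    public string SimpleAjaxImgFix(string xHtml)
    {
        return Regex.Replace(
            xHtml, "(?<image><img[^>]+[^/])>",
            m => m.Groups["image"].Value + " />",
            RegexOptions.IgnoreCase);
    }
}
```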
As commented inline, `SimpleAjaxImgFix(string xHtml)` specifically handles XHR that may send unclosed `<img>` tags - valid HTML, but invalid XML that will break `XMLWorker`. A simple explanation & implementation of how to receive a PDF or other binary data with XHR and iTextSharp can be found here.
A `Regex` was used in `SimpleAjaxImgFix(string xHtml)` so that anyone using (copy/paste?) the code doesn't need to add another NuGet package, but an HTML parser like HtmlAgilityPack should be used instead, since it turns unclosed tags like `<img src='...'>` into valid XML (`<img src='...' />`) with only a few lines of code:
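For example, something along these lines - `OptionWriteEmptyNodes` tells HtmlAgilityPack to emit empty elements such as `<img>` in self-closed form (the wrapper class and method name are mine):

```csharp
using System.IO;
using HtmlAgilityPack;

public static class HtmlFixer
{
    public static string CloseEmptyTags(string html)
    {
        var doc = new HtmlDocument();
        // write empty elements (img, br, hr, ...) self-closed
        doc.OptionWriteEmptyNodes = true;
        doc.LoadHtml(html);
        using (var writer = new StringWriter())
        {
            doc.Save(writer);
            return writer.ToString();
        }
    }
}
```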
Also of note - use `SimpleParser.Parse()` above as a general blueprint to additionally implement a custom `ICSSResolver` or `ITagProcessorFactory`, both of which are explained in the documentation.

Now the issues described in the question should be taken care of. Called from an MVC action method:
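For example (the controller and action names are placeholders):

```csharp
using System.Web.Mvc;

public class PdfController : Controller
{
    [HttpPost]
    public ActionResult HtmlToPdf(string xHtml)
    {
        // empty base URI => SimpleParser resolves relative URLs
        // from the current web request
        var pdf = new SimpleParser().Parse(xHtml);
        return File(pdf, "application/pdf", "converted.pdf");
    }
}
```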
Or from a Web Form that gets HTML from a server control:
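For example (the `pdfContent` control and button handler are placeholders; depending on the control being rendered you may also need to override `VerifyRenderingInServerForm`):

```csharp
using System;
using System.IO;
using System.Web.UI;

public partial class ExportPage : Page
{
    protected void ExportButton_Click(object sender, EventArgs e)
    {
        // render a server control (here a Panel with ID 'pdfContent') to HTML
        var sb = new StringWriter();
        pdfContent.RenderControl(new HtmlTextWriter(sb));

        Response.ContentType = "application/pdf";
        Response.AddHeader("Content-Disposition",
            "attachment; filename=converted.pdf");
        Response.BinaryWrite(new SimpleParser().Parse(sb.ToString()));
        Response.End();
    }
}
```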
Or a simple HTML file with hyperlinks and images on the file system, or HTML from a remote web site:
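For example (paths and the URL are placeholders) - pass the HTML file's directory, or the site root, as the base URI:

```csharp
using System;
using System.IO;
using System.Net;

public static class Samples
{
    // HTML file on the file system: base URI = the file's directory
    public static void ConvertLocalFile(string htmlPath, string pdfPath)
    {
        var html = File.ReadAllText(htmlPath);
        var pdf = new SimpleParser(Path.GetDirectoryName(htmlPath)).Parse(html);
        File.WriteAllBytes(pdfPath, pdf);
    }

    // HTML from a remote web site: base URI = the site root
    public static void ConvertRemoteSite(string url, string pdfPath)
    {
        using (var client = new WebClient())
        {
            var html = client.DownloadString(url);
            var pdf = new SimpleParser(
                new Uri(url).GetLeftPart(UriPartial.Authority)).Parse(html);
            File.WriteAllBytes(pdfPath, pdf);
        }
    }
}
```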
The above two HTML snippets also run from a console app.

Quite a long answer, but taking a look at questions here at SO tagged html, pdf, and itextsharp, as of this writing (2016-02-23) there are 776 results against 4,063 total tagged itextsharp - that's about 19%.
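A minimal console sketch along those lines, showing there's no dependency on `HttpContext` - paths and the URL are placeholders:

```csharp
using System;
using System.IO;
using System.Net;

class Program
{
    static void Main()
    {
        // file system HTML: base URI = the containing directory
        var html = File.ReadAllText(@"C:\temp\local.html");
        File.WriteAllBytes(@"C:\temp\local.pdf",
            new SimpleParser(@"C:\temp").Parse(html));

        // remote HTML: base URI = the site root
        using (var client = new WebClient())
        {
            File.WriteAllBytes(@"C:\temp\remote.pdf",
                new SimpleParser("http://example.com").Parse(
                    client.DownloadString("http://example.com/")));
        }
        Console.WriteLine("done");
    }
}
```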