I'm trying to convert HTML to PDF using iTextSharp in an ASP.NET web application that uses both MVC and Web Forms. The <img> and <a> elements have absolute and relative URLs, and some of the <img> elements are base64-encoded. Typical answers here at SO and Google search results use generic HTML-to-PDF code with XMLWorkerHelper that looks something like this:
// namespaces used: System.IO, iTextSharp.text,
// iTextSharp.text.pdf, iTextSharp.tool.xml
using (var stringReader = new StringReader(xHtml))
{
    using (Document document = new Document())
    {
        PdfWriter writer = PdfWriter.GetInstance(document, stream);
        document.Open();
        XMLWorkerHelper.GetInstance().ParseXHtml(
            writer, document, stringReader
        );
    }
}
So with sample HTML like this:
<div>
<h3>HTML Works, but Broken in Converted PDF</h3>
<div>Relative local <img>: <img src='./../content/images/kuujinbo_320-30.gif' /></div>
<div>
Base64 <img>:
<img src='data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg==' />
</div>
<div><a href='/somePage.html'>Relative local hyperlink, broken in PDF</a></div>
</div>
The resulting PDF: (1) is missing all images, and (2) all hyperlinks with relative URLs are broken and use a file URI scheme (file:///XXX...) instead of pointing to the correct web site.
Some answers here at SO and others from Google search recommend replacing relative URLs with absolute URLs, which is perfectly acceptable for one-off cases. However, globally replacing all <img src> and <a href> attributes with a hard-coded string is unacceptable for this question, so please do not post an answer like that, because it will accordingly be downvoted.
I'm looking for a solution that works for many different web applications residing in test, development, and production environments.
Out of the box, XMLWorker only understands absolute URIs, so the described issues are expected behavior. The parser can't automagically deduce URI schemes or paths without some additional information. Implementing an ILinkProvider fixes the broken hyperlink problem, and implementing an IImageProvider fixes the broken image problem. Since both implementations must perform URI resolution, that's the first step. The following helper class does that, and also tries to make web (ASP.NET) context calls (examples follow) as simple as possible:
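The original helper isn't reproduced in this extract; the following is a minimal sketch of the idea. The class name UriHelper matches the references below, but the member names (BaseUri, Combine()) and the exact HttpContext logic are illustrative:

using System;
using System.Web;

public class UriHelper
{
    public Uri BaseUri { get; private set; }

    public UriHelper(string baseUri)
    {
        var context = HttpContext.Current;
        if (context != null)
        {
            // web application: derive the base URI from the current
            // request, so the same code works in test, development,
            // and production without any hard-coded strings
            var authority = context.Request.Url.GetLeftPart(UriPartial.Authority);
            BaseUri = new Uri(new Uri(authority), VirtualPathUtility.ToAbsolute("~/"));
        }
        else
        {
            // console app / file system: the caller must supply an
            // absolute base URI, e.g. file:///C:/temp/ or http://site/
            BaseUri = new Uri(baseUri);
        }
    }

    // resolve a (possibly relative) URI against BaseUri; file URIs
    // come back as local paths so they can be read straight from disk
    public string Combine(string relativeUri)
    {
        var absolute = new Uri(BaseUri, relativeUri);
        return absolute.IsFile ? absolute.LocalPath : absolute.AbsoluteUri;
    }
}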
Implementing ILinkProvider is pretty simple now that UriHelper gives the base URI. We just need the correct URI scheme (file or http[s]):
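For example (PdfLinkProvider is an illustrative name; GetLinkRoot() is the only member ILinkProvider declares):

using System;
using iTextSharp.tool.xml.pipeline.html;

public class PdfLinkProvider : UriHelper, ILinkProvider
{
    public PdfLinkProvider(string baseUri) : base(baseUri) { }

    // XMLWorker prepends this value to every relative <a href>,
    // so hand back the scheme + authority of the base URI
    public string GetLinkRoot()
    {
        return BaseUri.GetLeftPart(UriPartial.Authority);
    }
}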
IImageProvider only requires implementing a single method, Retrieve(string src), but Store(string src, Image img) is easy - note the inline comments there and for GetImageRootPath():
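A sketch of such an implementation (again, the class name is illustrative):

using System;
using System.Collections.Generic;
using iTextSharp.text;
using iTextSharp.tool.xml.pipeline.html;

public class PdfImageProvider : UriHelper, IImageProvider
{
    // cache, so identical <img src> values are only loaded once
    private readonly Dictionary<string, Image> _cache =
        new Dictionary<string, Image>();

    public PdfImageProvider(string baseUri) : base(baseUri) { }

    public Image Retrieve(string src)
    {
        Image image;
        if (_cache.TryGetValue(src, out image)) return image;
        try
        {
            if (src.StartsWith("data:image", StringComparison.OrdinalIgnoreCase))
            {
                // base64 <img>: "data:image/png;base64,<encoded bytes>"
                var base64 = src.Substring(src.IndexOf(',') + 1);
                image = Image.GetInstance(Convert.FromBase64String(base64));
            }
            else
            {
                // relative/absolute URL or file path
                image = Image.GetInstance(Combine(src));
            }
            Store(src, image);
            return image;
        }
        catch
        {
            // a single bad image shouldn't kill the whole conversion
            return null;
        }
    }

    public void Store(string src, Image img)
    {
        _cache[src] = img;
    }

    // Retrieve() above already fully resolves paths; returning a
    // value here would make XMLWorker prepend it to relative paths
    public string GetImageRootPath()
    {
        return null;
    }

    public void Reset()
    {
        _cache.Clear();
    }
}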
Based on the XML Worker documentation, it's pretty straightforward to hook the implementations of ILinkProvider and IImageProvider above into a simple parser class:
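A sketch of that class; the CssResolverPipeline → HtmlPipeline → PdfWriterPipeline wiring is standard XMLWorker boilerplate, and SimpleAjaxImgFix() is reconstructed from its description below:

using System.IO;
using System.Text.RegularExpressions;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.tool.xml;
using iTextSharp.tool.xml.html;
using iTextSharp.tool.xml.parser;
using iTextSharp.tool.xml.pipeline.css;
using iTextSharp.tool.xml.pipeline.end;
using iTextSharp.tool.xml.pipeline.html;

public class SimpleParser
{
    public ILinkProvider LinkProvider { get; set; }
    public IImageProvider ImageProvider { get; set; }

    // web applications: the base URI comes from HttpContext
    public SimpleParser() : this(null) { }

    // console apps / file system: an explicit base URI is required
    public SimpleParser(string baseUri)
    {
        LinkProvider = new PdfLinkProvider(baseUri);
        ImageProvider = new PdfImageProvider(baseUri);
    }

    public void Parse(Stream stream, string xHtml)
    {
        xHtml = SimpleAjaxImgFix(xHtml);
        using (var document = new Document())
        {
            var writer = PdfWriter.GetInstance(document, stream);
            document.Open();

            // hook the link and image providers into the HTML pipeline
            var htmlContext = new HtmlPipelineContext(null);
            htmlContext.SetTagFactory(Tags.GetHtmlTagProcessorFactory());
            htmlContext.SetLinkProvider(LinkProvider);
            htmlContext.SetImageProvider(ImageProvider);

            var pipeline = new CssResolverPipeline(
                XMLWorkerHelper.GetInstance().GetDefaultCssResolver(true),
                new HtmlPipeline(htmlContext,
                    new PdfWriterPipeline(document, writer))
            );
            var worker = new XMLWorker(pipeline, true);
            using (var stringReader = new StringReader(xHtml))
            {
                new XMLParser(worker).Parse(stringReader);
            }
        }
    }

    // XHR may send unclosed <img> tags: valid HTML, but invalid XML
    // that breaks XMLWorker, so self-close them before parsing
    public string SimpleAjaxImgFix(string xHtml)
    {
        return Regex.Replace(
            xHtml, "(?<image><img[^>]+[^/])>", "${image} />",
            RegexOptions.IgnoreCase
        );
    }
}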
As commented inline, SimpleAjaxImgFix(string xHtml) specifically handles XHR that may send unclosed <img> tags, which is valid HTML but invalid XML that will break XMLWorker. A simple explanation and implementation of how to receive a PDF or other binary data with XHR and iTextSharp can be found here.
A Regex was used in SimpleAjaxImgFix(string xHtml) so that anyone using (copy/paste?) the code doesn't need to add another NuGet package, but an HTML parser like HtmlAgilityPack should be used, since it turns this:

<img src='some.png'>

into this:

<img src='some.png' />

with only a few lines of code:
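For example, a rough HtmlAgilityPack equivalent; OptionWriteEmptyNodes makes empty elements like <img> render in self-closed XML form:

using HtmlAgilityPack;

public static string CloseImgTags(string html)
{
    var doc = new HtmlDocument();
    // write empty elements as <img ... /> instead of <img ...>
    doc.OptionWriteEmptyNodes = true;
    doc.LoadHtml(html);
    return doc.DocumentNode.OuterHtml;
}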
Also of note - use SimpleParser.Parse() above as a general blueprint to additionally implement a custom ICSSResolver or ITagProcessorFactory, which is explained in the documentation. Now the issues described in the question should be taken care of. Called from an MVC Action Method:
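Something like this, assuming the markup arrives as a POST parameter (the action and parameter names are illustrative):

[HttpPost]
public ActionResult HtmlToPdf(string xHtml)
{
    using (var stream = new MemoryStream())
    {
        // parameterless SimpleParser: UriHelper resolves URIs
        // against the current request via HttpContext
        new SimpleParser().Parse(stream, xHtml);
        return File(stream.ToArray(), "application/pdf");
    }
}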
or from a Web Form that gets HTML from a server control:
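A sketch for the code-behind (the Panel ID pdfPanel and the handler name are made up; rendering a control outside the normal page lifecycle can need extra care):

protected void ExportPdf_Click(object sender, EventArgs e)
{
    // render a server control (here a Panel) to an HTML string
    string xHtml;
    using (var stringWriter = new StringWriter())
    using (var htmlWriter = new HtmlTextWriter(stringWriter))
    {
        pdfPanel.RenderControl(htmlWriter);
        xHtml = stringWriter.ToString();
    }

    Response.ContentType = "application/pdf";
    new SimpleParser().Parse(Response.OutputStream, xHtml);
    Response.End();
}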
or a simple HTML file with hyperlinks and images on the file system:
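For example (the paths are placeholders; note the explicit file:/// base URI, since there's no HttpContext here):

static void LocalHtmlToPdf()
{
    var html = File.ReadAllText(@"C:\temp\sample.html");
    using (var stream = new FileStream(@"C:\temp\sample.pdf", FileMode.Create))
    {
        new SimpleParser(@"file:///C:/temp/").Parse(stream, html);
    }
}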
or HTML from a remote web site:
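For example (the URL is a placeholder; the http base URI makes relative src/href attributes resolve against the live site):

static void RemoteHtmlToPdf()
{
    string html;
    using (var client = new WebClient())
    {
        html = client.DownloadString("http://example.com/somePage.html");
    }
    using (var stream = new FileStream(@"C:\temp\remote.pdf", FileMode.Create))
    {
        new SimpleParser("http://example.com/").Parse(stream, html);
    }
}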
The above two HTML snippets run from a console app:
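A minimal wrapper, assuming the previous two snippets are pasted in as the static methods shown:

using System;
using System.IO;
using System.Net;

class Program
{
    static void Main()
    {
        LocalHtmlToPdf();   // file system snippet above
        RemoteHtmlToPdf();  // remote web site snippet above
        Console.WriteLine("PDFs written.");
    }

    // LocalHtmlToPdf() and RemoteHtmlToPdf() are the two static
    // methods from the previous snippets
}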
Quite a long answer, but taking a look at questions here at SO tagged html, pdf, and itextsharp, as of this writing (2016-02-23) there are 776 results against 4,063 total tagged itextsharp - that's 19%.