I know how terribly wrong it is to (attempt to) parse HTML with Regex, which is why I'm trying really, really hard to avoid it.
I have an app that generates HTML emails. We're using a big fancy WYSIWYG drop in email editor that takes care of generating responsive emails and also generates the abysmal markup for clients like MS Outlook. It does that last bit using conditional comments, which look something like the following. Note that the <v:roundrect> has an href attribute, and wraps the <a> tag that non-mso clients will see.
<!--[if mso]>
<!-- irrelevant <table> layout here, removed for your sanity and mine -->
<v:roundrect href="https://google.com" irrelevant_attributes="snipped">
<w:anchorlock/>
<v:textbox inset="0,0,0,0">
<center style="snipped">
<![endif]-->
<a href="https://google.com" target="_blank" <!-- style="snipped" --> >
<!-- some span tags with styles on them -->
click me!
<!-- </spans> -->
</a>
<!--[if mso]>
</center>
</v:textbox>
</v:roundrect>
<!-- </table> layout stuff -->
<![endif]-->
Of course, this is just one of dozens (possibly hundreds?) of possible formattings that we need to work with.
Prior to the introduction of this editor, we asked our customers to generate their own HTML emails with a more rudimentary WYSIWYG HTML editor, but it was incumbent on them to make responsive templates and test their content in various clients. From their perspective, this new editor is a huge win.
As we're sending emails, it's important to track the link clicks via a tracking link that redirects through to the originally-intended link.
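To make the redirect idea concrete, here is a minimal sketch of wrapping an original link in a tracking link. The endpoint `https://track.example.com/r` and its `u` parameter are hypothetical placeholders, not our real tracker's URL scheme.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class TrackingLink {
    // Hypothetical redirect endpoint; the real tracker URL is not shown in this post.
    static final String TRACKER = "https://track.example.com/r?u=";

    // Wraps the original href so a click hits the tracker first,
    // which can then redirect to the originally intended link.
    static String wrap(String originalHref) {
        try {
            return TRACKER + URLEncoder.encode(originalHref, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        // prints https://track.example.com/r?u=https%3A%2F%2Fgoogle.com
        System.out.println(wrap("https://google.com"));
    }
}
```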
To date, we've used jSoup to parse the email content, find any anchor tags, and replace their href attribute contents. Because regex HTML parsing is evil, right?
Conditional comments have thrown a wrench in those gears.
Because they are comments, jSoup doesn't parse the markup inside them. As a result, the links that MS Outlook and other clients render from the <v:roundrect> markup haven't been rewritten to go through our link tracker, so those clicks don't get tracked. This is a problem for us.
First idea: replace the conditional comments with a custom tag
At first I hoped to pre-process the message body before letting jSoup have it. I would replace <!--[if mso]> with <adam> and <![endif]--> with </adam>. This was simple enough to do, even for complex forms of the conditions inside the comments. I used a regex to make some simple replacements:
- <!--[if mso]> became <adam orig="%3C%21%2D%2D%5Bif%20mso%5D%3E">
- <!--[if (!mso)&(!IE)]> became <adam orig="%3C%21%2D%2D%5Bif%20%28%21mso%29%26%28%21IE%29%5D%3E">
- etc
Notice that I URL-encoded the original comment in its entirety. URL-encoding it meant that I could easily use regex to find my markers and transform them back, without having to worry about a > inside the "orig" attribute content.
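For reference, a minimal sketch of that replacement and its reversal might look like the following. It is an illustration rather than my actual code: the class and method names are made up, the opening-comment pattern is an assumption about what the conditions look like, and the encoder is hand-rolled because `java.net.URLEncoder` would leave `-` unencoded and turn spaces into `+`.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ConditionalMarkers {
    // Assumed shape of an opening conditional comment, e.g. <!--[if mso]> or <!--[if (!mso)&(!IE)]>
    private static final Pattern OPEN = Pattern.compile("<!--\\[if [^\\]]*\\]>");
    // The placeholder tag produced by toMarkers()
    private static final Pattern MARKER = Pattern.compile("<adam orig=\"([^\"]*)\">");

    // Percent-encodes every byte that is not an ASCII letter or digit.
    static String encodeAll(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean alnum = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
            if (alnum) sb.append((char) c);
            else sb.append(String.format("%%%02X", c));
        }
        return sb.toString();
    }

    // Reverses encodeAll(): %XX becomes a byte, everything else is copied as-is.
    static String decodeAll(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch == '%') {
                out.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.write(ch);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Replaces each opening conditional comment with <adam orig="...">.
    static String toMarkers(String html) {
        Matcher m = OPEN.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String tag = "<adam orig=\"" + encodeAll(m.group()) + "\">";
            m.appendReplacement(out, Matcher.quoteReplacement(tag));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Restores the original comments from the orig attribute.
    static String fromMarkers(String html) {
        Matcher m = MARKER.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(out, Matcher.quoteReplacement(decodeAll(m.group(1))));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Because the encoded value contains only letters, digits, and `%`, the marker pattern can safely stop at the first `"`.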
This started to break down when I realized there were multiple possible ways the comments could be closed. I spent a little bit of time working on a similar approach for the closing tags.
- <!--<![endif]--> became </adam orig="%3C%21%2D%2D%3C%21%5Bendif%5D%2D%2D%3E">
- same approach for <![endif]--> and <!-->
I don't know if you can have attributes on a closing tag. I never tested it because I had another realization before I got to that point. The realization was that using <adam></adam> wasn't going to produce desirable output from jSoup because the resulting INPUT would often look like:
<adam><table><v:roundrect><center></adam>
<a></a>
<adam></center></v:roundrect></table></adam>
This is not tidy HTML and jSoup will try to correct it, changing the order of tags to make something that it thinks is more correct. When I realized that, I stopped what I was doing and started thinking about the problem again.
Second idea: the same thing, but with comments
If the (new) problem was that jSoup didn't like my tag nesting, what if I could expose the HTML from inside the conditional comments as if it weren't commented out, but keep some markers in as plain comments that I can later transform back into the conditional comments? The goal was to produce this:
<!-- adam --><table><v:roundrect><center><!-- /adam -->
<a></a>
<!-- adam --></center></v:roundrect></table><!-- /adam -->
This should parse as fairly tidy HTML, right? So I made the code modifications and gave it a shot.
Sadly, the documents that we're working with are far more complex than the simple example I started from above. Here are the first few lines of an actual sample document:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional //EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office">
<head>
<!--[if gte mso 9]><xml><o:OfficeDocumentSettings><o:AllowPNG/><o:PixelsPerInch>96</o:PixelsPerInch></o:OfficeDocumentSettings></xml><![endif]-->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="viewport" content="width=device-width">
<!--[if !mso]><!-->
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<!--<![endif]-->
<title></title>
<!--[if !mso]><!-->
<!--<![endif]-->
<style type="text/css">/* snip */</style>
<style type="text/css" id="media-query">/* snip */</style>
</head>
<body class="clean-body" style="margin: 0; padding: 0; -webkit-text-size-adjust: 100%; background-color: #FFFFFF;">
<style type="text/css" id="media-query-bodytag">
After the comment conversion, we've effectively dropped an <xml> block into the <head> block, of which jSoup is decidedly not a fan. This is what I get back for the above input, after converting the conditional comments into my plain marked comments, parsing with jSoup, and then converting my markers back to their conditional comments:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional //EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office">
<head>
<!--[if gte mso 9]>
</head>
<body class="clean-body" style="margin: 0; padding: 0; -webkit-text-size-adjust: 100%; background-color: #FFFFFF;">
<xml> <o:officedocumentsettings> <o:allowpng /> <o:pixelsperinch> 96 </o:pixelsperinch> </o:officedocumentsettings> </xml>
<![endif]-->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="viewport" content="width=device-width">
<!--[if !mso]><!-->
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<!--<![endif]-->
<title></title>
<!--[if !mso]><!--><!--<![endif]-->
<style type="text/css">/* snip */</style>
<style type="text/css" id="media-query">/* snip */</style>
<style type="text/css" id="media-query-bodytag">/* snip */</style>
There are some big problems here. The <head> block gets basically immediately closed. The <body> tag moves up to before the <xml> block, and everything that came after it moved down into the body. This isn't going to work.
Now what?
I feel like we're basically out of options.
- Do nothing and just don't count the clicks from MS Outlook/etc clients. In some cases we might be able to detect a click anyway via a downstream conversion on that email. (Even if we don't have record of you clicking the link, if you made a payment then we know you got there...)
- We could let our mail provider do the link tracking for us (experimentation required; not positive they would track the <v:roundrect> links either). Historically we started this system with a provider that didn't offer link tracking, so we had to roll our own. Our current provider offers it, but we've got years of existing code and processes that would have to be updated to support this change. We're keeping it in our back pocket if we can't figure something else out, but the prospect of changing ships mid-stream is ... not appealing.
- Or lastly... maybe... regex? (/me ducks) We could let jSoup do its thing for the normal HTML, and then use regex to replace any links that remain. This becomes a game of whack-a-mole with current and future markup. What might we run into aside from a <v:roundrect> in the future? ¯\_(ツ)_/¯ And we won't know what we're missing without regular manual reviews.
Unless there's another option that we haven't explored yet. So... are we stuck with nothing/regex?
We're on the JVM so anything Java is within reach, I guess.
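To show what the regex fallback from the last option would involve, here is a rough sketch that rewrites only the href on <v:roundrect> tags. The class name and the `toTrackingUrl` callback are hypothetical, and the pattern assumes double-quoted attribute values with no > inside them, which is exactly the kind of assumption that makes this whack-a-mole.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VmlHrefRewriter {
    // Groups: (1) tag text up to and including the opening quote of href,
    //         (2) the URL itself, (3) the closing quote.
    private static final Pattern VML_HREF = Pattern.compile(
            "(<v:roundrect\\b[^>]*?\\bhref\\s*=\\s*\")([^\"]*)(\")",
            Pattern.CASE_INSENSITIVE);

    // Rewrites the href of every <v:roundrect> tag, leaving all other markup untouched.
    static String rewrite(String html, UnaryOperator<String> toTrackingUrl) {
        Matcher m = VML_HREF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = m.group(1) + toTrackingUrl.apply(m.group(2)) + m.group(3);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

This would run on the raw message text before or after the jSoup pass, since it never touches plain <a> tags. It would also miss any other tag that carries a link inside a conditional comment.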