I've got a big mess of HTML extracted from a Kindle book, and it contains a lot of duplicate elements and duplicate substrings.
Long story short, Kindle DRM deleted 90% of my annotations, and I used the Location data it didn't delete to get them all back. But Amazon's location data is somewhat imprecise (each location corresponds to a 150-byte chunk), so I ended up with a lot of redundancy.
Example:
<html>
<body>
<p>
aesar”), at the Battle of Pavia (1525).
</p>
<div height="0em">
</div>
<mbp:pagebreak>
</mbp:pagebreak>
<a id="filepos97755">
</a>
<h1 align="center" height="2em">
<font size="5">
<b>
KNOW WHEN
<br/>
TO RETIRE
</b>
</font>
</h1>
<div height="3em">
</div>
<p align="justify" height="0em" width="1em">
</p>
</body>
</html>
<html>
<body>
<h1 align="center" height="2em">
<font size="5">
<b>
KNOW WHEN
<br/>
TO RETIRE
</b>
</font>
</h1>
<div height="3em">
</div>
<p align="justify" height="0em" width="1em">
Anything in motion must wax and wane. Some speak of states of movement, but they are anything but static.
</p>
<div height="0em">
</div>
<p height="0em">
</p>
</body>
</html>
<html>
<body>
<p align="justify" height="0em" width="1em">
Anything in motion must wax and wane. Some speak of states of movement, but they are anything but static.
</p>
<div height="0em">
</div>
<p align="justify" height="0em" width="1em">
It takes great foresight to predict the decline of a restless, relentless wheel. The sharpest gamblers know when to quit
</p>
</body>
</html>
Does anyone have any ideas on what might help?
Gosh, that is a mess. From the small bit of output you showed, it seems that the important stuff is in the paragraph tags. I would use Beautiful Soup, a Python library (http://www.crummy.com/software/BeautifulSoup/bs4/doc/), to pull the text out of all the <p> tags and then remove the redundant ones. If you also want to keep the other formatting, that is going to be a bear. I would only try Beautiful Soup after going back and convincing myself I couldn't export it in a better format.
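To make the "pull out the paragraphs, then drop the redundant ones" idea concrete, here is a rough sketch. To keep it dependency-free it uses Python's standard-library html.parser rather than Beautiful Soup, so treat it as a stand-in for the same approach. The names ParagraphCollector and unique_paragraphs are made up for illustration, and it only removes exact duplicate paragraphs (after collapsing whitespace). Partial overlaps from truncated chunks would need extra handling.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text found inside each <p>...</p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None means we are currently outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            # Collapse runs of whitespace so formatting differences don't hide duplicates.
            text = " ".join("".join(self._buffer).split())
            if text:  # skip empty spacer paragraphs
                self.paragraphs.append(text)
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def unique_paragraphs(html):
    """Return paragraph texts in first-seen order, with exact duplicates removed."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    # dict.fromkeys keeps the first occurrence of each key and preserves order.
    return list(dict.fromkeys(parser.paragraphs))


if __name__ == "__main__":
    sample = """
    <html><body>
    <p align="justify">Anything in motion must wax and wane.</p>
    <p height="0em"></p>
    </body></html>
    <html><body>
    <p align="justify">
    Anything in motion must wax and wane.
    </p>
    <p align="justify">The sharpest gamblers know when to quit</p>
    </body></html>
    """
    for paragraph in unique_paragraphs(sample):
        print(paragraph)
```

With Beautiful Soup installed, the collection step would presumably shrink to a loop over soup.find_all("p") and get_text(), and the order-preserving dedupe would stay the same.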