Extracting text from damaged HTML?


DRM is a plague even in the book industry. Last week I discovered that many of my Kindle annotations were missing because a publisher saw fit to limit annotations to 10% of the book.

I've found tools for converting the Mobi book file to HTML. I've also used the location data (thankfully this wasn't missing) to extract the appropriate chunks of raw HTML. My problem now is that I have a lot of incomplete markup to deal with.

Example:

></h1><div height="3em"></div> <p height="0em" width="1em" align="justify"><em>A Pocket Mirror for Heroes</em> is a book of stratagems for reaching excellence in a competitive world ruled by appearances and, often, deceit.</p><div height="0em"></div> <p height="0em" width="1em" align="justify">It is a <em>mirror</em> because it reflects &#x201C;the person you are or the one you ought to be.&#x201D; A <em>pocket</em> mirror because its author took the time to be brief. A mirror for <em>heroes</em> because it provides a vivid image of ethical and moral perfection. For the author, a hero is &#x201C;the consummate person, ripe and perfect: accurate in judgment, mature in taste, attentive in listening, wise in sayings, shrewd in deeds, the cente

This happens because Kindle location data only corresponds to 150-byte chunks of HTML data, so the extracted ranges are imprecise and often start or end mid-tag or mid-word.

I'd like to clean this up. Does anyone have any suggestions? I'd prefer to use Python if possible.

Edit: It might also make sense to use a tool that accepts character offsets and works out how to extract something legible from that range. Does something like that exist?

fferri (best answer)

BeautifulSoup can parse malformed HTML and is pretty robust: it closes any tags that were left open, as in this example.

>>> from bs4 import BeautifulSoup
>>> html = "<p>Para 1<p>Para 2<blockquote>Quote 1<blockquote>Quote 2"
>>> soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
>>> print(soup.prettify())
<p>
 Para 1
 <p>
  Para 2
  <blockquote>
   Quote 1
   <blockquote>
    Quote 2
   </blockquote>
  </blockquote>
 </p>
</p>
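
For chunks like the one in the question, which can begin or end in the middle of a tag, a rough sketch along these lines may help. The helper names `trim_partial_tags` and `chunk_to_text` are made up for illustration, and the trimming assumes that `<` and `>` only appear as tag delimiters (it would misfire on an unescaped `>` in the text or inside an attribute value). A word cut off at the chunk boundary stays cut off.

```python
from bs4 import BeautifulSoup

# Tags treated as paragraph-like; an assumption, adjust for your books.
BLOCK_TAGS = ["p", "div", "h1", "h2", "h3", "blockquote", "li"]


def trim_partial_tags(chunk: str) -> str:
    """Drop half a tag left at either end of an arbitrarily cut chunk."""
    first_lt = chunk.find("<")
    first_gt = chunk.find(">")
    # A '>' before the first '<' means the chunk starts inside a tag.
    if first_gt != -1 and (first_lt == -1 or first_gt < first_lt):
        chunk = chunk[first_gt + 1:]
    # A '<' after the last '>' means the chunk ends inside a tag.
    if chunk.rfind("<") > chunk.rfind(">"):
        chunk = chunk[:chunk.rfind("<")]
    return chunk


def chunk_to_text(chunk: str) -> str:
    """Return the readable text of a damaged HTML chunk, one block per line."""
    soup = BeautifulSoup(trim_partial_tags(chunk), "html.parser")
    # Mark the end of each block element so paragraphs do not run together.
    for block in soup.find_all(BLOCK_TAGS):
        block.append("\n")
    lines = (" ".join(line.split()) for line in soup.get_text().splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    sample = (
        '></h1><div height="3em"></div> <p align="justify"><em>A Pocket '
        "Mirror for Heroes</em> is a book of stratagems.</p> "
        "<p>It is a <em>mirror</em> because it reflects &#x201C;the person "
        "you are.&#x201D; For the author, a hero is the cente"
    )
    print(chunk_to_text(sample))
```

The stray `</h1>` and the unclosed final `<p>` are left for BeautifulSoup to sort out; only the half-tags at the edges are cut by hand, since how a parser treats those can vary.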