Remove Duplicate Substrings/Elements from Scraped HTML?

I have a large, messy batch of HTML extracted from a Kindle book, and it contains a lot of duplicate elements and duplicate substrings.

Long story short, Kindle DRM deleted 90% of my annotations, and I used the location data it didn't delete to get them all back. But Amazon's location data is somewhat imprecise (each location corresponds to a 150-byte chunk), so I ended up with a lot of redundancy.

Example:

<html>
 <body>
  <p>
   aesar”), at the Battle of Pavia (1525).
  </p>
  <div height="0em">
  </div>
  <mbp:pagebreak>
  </mbp:pagebreak>
  <a id="filepos97755">
  </a>
  <h1 align="center" height="2em">
   <font size="5">
    <b>
     KNOW WHEN
     <br/>
     TO RETIRE
    </b>
   </font>
  </h1>
  <div height="3em">
  </div>
  <p align="justify" height="0em" width="1em">
  </p>
 </body>
</html>

<html>
 <body>
  <h1 align="center" height="2em">
   <font size="5">
    <b>
     KNOW WHEN
     <br/>
     TO RETIRE
    </b>
   </font>
  </h1>
  <div height="3em">
  </div>
  <p align="justify" height="0em" width="1em">
   Anything in motion must wax and wane. Some speak of states of movement, but they are anything but static.
  </p>
  <div height="0em">
  </div>
  <p height="0em">
  </p>
 </body>
</html>



<html>
 <body>
  <p align="justify" height="0em" width="1em">
   Anything in motion must wax and wane. Some speak of states of movement, but they are anything but static.
  </p>
  <div height="0em">
  </div>
  <p align="justify" height="0em" width="1em">
   It takes great foresight to predict the decline of a restless, relentless wheel. The sharpest gamblers know when to quit
  </p>
 </body>
</html>

Does anyone have any ideas on what might help?

There is 1 answer

Answer by dstudeba (best answer)

Gosh, that is a mess. From the small sample you showed, the important content seems to be in the paragraph tags. I would use Beautiful Soup, a Python library (http://www.crummy.com/software/BeautifulSoup/bs4/doc/), to pull all the text out of the <p> tags and then remove the redundant entries. If you also want to keep the other formatting, that is going to be a bear. Before reaching for Beautiful Soup, though, I would go back and make sure the book really can't be exported in a better format.
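
As a rough illustration of that idea, here is a minimal sketch that collects the text of every <p> element and keeps only the first occurrence of each. To stay runnable without extra installs, it uses Python's standard-library html.parser as a stand-in for Beautiful Soup. The names ParagraphCollector and unique_paragraphs are hypothetical, and the sketch assumes that exact-match paragraphs (after whitespace normalization) are the duplicates worth dropping. It does not handle paragraphs that only partially overlap.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the whitespace-normalized text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._chunks = None  # None while we are outside a <p>

    def _flush(self):
        # Finish the paragraph currently being collected, if any.
        if self._chunks is not None:
            text = " ".join("".join(self._chunks).split())
            if text:  # ignore empty spacer paragraphs
                self.paragraphs.append(text)
            self._chunks = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # tolerate an unclosed previous <p>
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)


def unique_paragraphs(html):
    """Return paragraph texts in document order, dropping exact repeats."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    seen = set()
    result = []
    for text in parser.paragraphs:
        if text not in seen:
            seen.add(text)
            result.append(text)
    return result
```

Feeding the concatenated fragments from the question to unique_paragraphs should leave one copy of each repeated paragraph. With Beautiful Soup installed, the collection step could instead be written with BeautifulSoup(html, "html.parser").find_all("p") and get_text() on each tag; the de-duplication loop would stay the same.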