I need to scrape dozens of saved HTML documents for names and email addresses


I have dozens of saved HTML files containing website source code, and I need to scrape them for names and email addresses.

Each file contains hundreds of blocks that look like this:

              <ul class="specialfaa-results">

                        <li >
                            <div class="summary-heading">
                                <h3 class="adviser-name">Mr Joe Bloggs </h3><p class="distance">0.1mi</p>
                                <div class="clearboth"></div>
                                <p class="adviser-company mod-content">Joe Bloggs Company Ltd</p>
                            </div>


                            <div class="full-profile mg-tp-10" style="display:none; margin-left:3px;">
                                <div class="mod-content">

                                    <div class="fl-lf yui3-u-1-3">
                                                  <div class="yui3-u adv-item adv-map">
                                                      <a href="#mapcontainer" class="showGoogle" lng="-1.9111053" lat="52.4771906" title="Business">

                                                      </a>
                                                  </div>
                                    </div>

                                    <div class="fl-lf yui3-u-2-5">
                                            <div class="yui3-u adv-item adv-email">
                                                <a href="mailto:[email protected]">mailto:[email protected]</a>
                                            </div>
                                        <div class="yui3-u adv-item adv-webpage">
                                            <a href="http://www.joebloggs.co.uk" 

My thinking is that I need to isolate the names and email addresses using Python, or perhaps Excel. The end result should be an Excel document with the columns 'Name' (e.g. 'Joe Bloggs') and 'email address' (e.g. [email protected]). What kind of code or process should I use to get there?

Thanks, guys! I'm fairly new to this kind of thing and to this site, so any help would be hugely appreciated.

Hugh.
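
For reference, here is a minimal sketch of one way the task described above could be done in Python, using only the standard library. It is not taken from any answer. It assumes every adviser's `<h3 class="adviser-name">` appears before that adviser's `mailto:` link, as in the sample markup. The folder name `saved_pages`, the output name `advisers.csv` and the `TITLES` list are hypothetical placeholders. The result is a CSV file, which Excel opens directly.

```python
import csv
from html.parser import HTMLParser
from pathlib import Path

# Assumed list of honorifics to drop from the start of a name.
TITLES = {"mr", "mrs", "ms", "miss", "dr"}


def strip_title(name):
    """Turn 'Mr Joe Bloggs' into 'Joe Bloggs'; leave other names unchanged."""
    first, _, rest = name.partition(" ")
    return rest if rest and first.lower().rstrip(".") in TITLES else name


class AdviserParser(HTMLParser):
    """Collect [name, email] pairs from one saved results page."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one [name, email] list per adviser
        self._in_name = False   # True while inside <h3 class="adviser-name">
        self._name_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        href = attrs.get("href") or ""
        if tag == "h3" and "adviser-name" in classes:
            self._in_name = True
            self._name_parts = []
        elif tag == "a" and href.lower().startswith("mailto:"):
            email = href[len("mailto:"):].split("?")[0].strip()
            # Attach the address to the latest adviser that has none yet.
            if self.rows and not self.rows[-1][1]:
                self.rows[-1][1] = email

    def handle_data(self, data):
        if self._in_name:
            self._name_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self._in_name:
            self._in_name = False
            # Collapse runs of whitespace, then drop any leading title.
            name = " ".join("".join(self._name_parts).split())
            self.rows.append([strip_title(name), ""])


def scrape_file(path):
    """Return the [name, email] rows found in a single HTML file."""
    parser = AdviserParser()
    parser.feed(Path(path).read_text(encoding="utf-8", errors="replace"))
    parser.close()
    return parser.rows


def scrape_folder(folder, out_csv):
    """Write the rows from every *.html file in `folder` to one CSV file."""
    # utf-8-sig adds a byte-order mark so Excel detects the encoding.
    with open(out_csv, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(["Name", "email address"])
        for path in sorted(Path(folder).glob("*.html")):
            writer.writerows(scrape_file(path))


if __name__ == "__main__":
    scrape_folder("saved_pages", "advisers.csv")  # hypothetical paths
```

Advisers with no `mailto:` link get an empty email cell. If the real pages differ from the sample (for example, if the email link comes before the name), the pairing logic would need adjusting.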


There is 1 answer

Dmitrij Holkin On