I have dozens of HTML files containing website source code, and I need to scrape them for names and email addresses.
Each file has hundreds of blocks that look like this:
<ul class="specialfaa-results">
<li >
<div class="summary-heading">
<h3 class="adviser-name">Mr Joe Bloggs </h3><p class="distance">0.1mi</p>
<div class="clearboth"></div>
<p class="adviser-company mod-content">Joe Bloggs Company Ltd</p>
</div>
<div class="full-profile mg-tp-10" style="display:none; margin-left:3px;">
<div class="mod-content">
<div class="fl-lf yui3-u-1-3">
<div class="yui3-u adv-item adv-map">
<a href="#mapcontainer" class="showGoogle" lng="-1.9111053" lat="52.4771906" title="Business">
</a>
</div>
</div>
<div class="fl-lf yui3-u-2-5">
<div class="yui3-u adv-item adv-email">
<a href="mailto:[email protected]">mailto:[email protected]</a>
</div>
<div class="yui3-u adv-item adv-webpage">
<a href="http://www.joebloggs.co.uk"
My thinking is that I need to isolate the names and email addresses using Python or perhaps Excel. I'd like to end up with an Excel document with the headings 'Name' (e.g. 'Joe Bloggs') and 'email address' (e.g. [email protected]). What kind of code or process should I use to get these?
Thanks guys! I'm fairly new to this kind of thing and to this site, so any help would be hugely appreciated.
Hugh.
Try extracting the email addresses with a regex.
Extract emails from html using regex
https://gist.github.com/dideler/5219706
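For the markup in the question, a minimal sketch of the regex approach might look like the following. It assumes every adviser block starts with an `<h3 class="adviser-name">` heading written exactly as in the snippet, and that the email appears as a `mailto:` link somewhere before the next heading. The function names `extract_advisers` and `write_csv`, and the list of titles to strip, are my own and purely illustrative. It writes a CSV file, which Excel opens directly.

```python
import csv
import re
from pathlib import Path

# Address inside href="mailto:..."
EMAIL_RE = re.compile(r'href="mailto:([^"]+)"')
# Leading title to drop so "Mr Joe Bloggs" becomes "Joe Bloggs" (assumed list)
TITLE_RE = re.compile(r'^(?:Mr|Mrs|Ms|Miss|Dr)\.?\s+')
# Opening tag of the name heading, exactly as in the question's snippet
NAME_TAG = '<h3 class="adviser-name">'


def extract_advisers(html):
    """Return a list of (name, email) tuples found in one HTML string."""
    rows = []
    # Everything before the first heading is discarded; each remaining
    # chunk holds one adviser, so a name is paired with its own email.
    for chunk in html.split(NAME_TAG)[1:]:
        name_part, _, rest = chunk.partition('</h3>')
        # Collapse whitespace, then strip the leading title if present
        name = TITLE_RE.sub('', ' '.join(name_part.split()))
        match = EMAIL_RE.search(rest)
        email = match.group(1) if match else ''  # blank if no mailto link
        rows.append((name, email))
    return rows


def write_csv(folder, out_path):
    """Scan every .html file in `folder` and write one CSV with all rows."""
    with open(out_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Name', 'email address'])
        for path in sorted(Path(folder).glob('*.html')):
            html = path.read_text(encoding='utf-8', errors='replace')
            writer.writerows(extract_advisers(html))
```

Calling something like `write_csv('pages', 'advisers.csv')` (both names are placeholders) would process every `.html` file in that folder. Splitting on the heading, rather than collecting all names and all emails separately, keeps the columns aligned when an adviser has no email. Regex matching like this is brittle: if the class attribute or tag layout varies between files, the patterns will need adjusting.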