I have a fragment of an HTML page. I need to find all outbound links in it and replace each one with the marker <can_be_link>.
The following code does almost everything I want, but it fails on links that span several lines (rather than one) when those lines start with tabs; in my example this is the link to http://bad.com.
How can I solve this correctly?
# -*- coding: utf-8 -*-
import BeautifulSoup
import re

if __name__=="__main__":
    body = """
<a href="http://good.com" target="_blank">good link</a>
<ul>
    <li class="FOLLOW">
        <a href="http://bad.com" target="_blank">
            <em></em>
            <span>
                <strong class="FOLLOW-text">Follow On</strong>
                <strong class="FOLLOW-logo"></strong>
            </span>
        </a>
    </li>
</ul>
"""
    metka_link = '<can_be_link>'
    soup = BeautifulSoup.BeautifulSoup(body)
    hrefs = soup.findAll(name = 'a', attrs = { 'href': re.compile('\.*') })
    repl = {}
    for t in hrefs:
        line = str(t)
        # print '\n'*2, line
        if not t.has_key('href'):
            continue
        href = t['href'].lower()
        if href.find('http') == 0 or href.find('//') == 0:
            body = body.replace(line, metka_link)
    print body
The result is:
<can_be_link>
<ul>
    <li class="FOLLOW">
        <a href="http://bad.com" target="_blank">
            <em></em>
            <span>
                <strong class="FOLLOW-text">Follow On</strong>
                <strong class="FOLLOW-logo"></strong>
            </span>
        </a>
    </li>
</ul>
But the desired result is:
<can_be_link>
<ul>
    <li class="FOLLOW">
        <can_be_link>
    </li>
</ul>
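Use the replace_with() method on each matching tag instead of doing a string replacement on the raw markup. It swaps the whole <a> element, including its nested children, inside the parsed tree, so the result no longer depends on the serialized tag matching the original text character for character.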
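A minimal sketch of the replace_with() approach, assuming Beautiful Soup 4 is installed as the bs4 package and using Python's built-in html.parser. The function name replace_external_links and the extra /local link are illustrative additions, not part of the question. Passing formatter=None on output is one way to keep the marker's angle brackets from being escaped as &lt; and &gt;.

```python
from bs4 import BeautifulSoup

MARKER = "<can_be_link>"  # same marker text as in the question


def replace_external_links(html, marker=MARKER):
    """Return html with every external <a> element replaced by marker.

    Hypothetical helper for illustration; "external" follows the question's
    rule: the href starts with "http" or "//".
    """
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only <a> tags that actually carry an href attribute.
    for a in soup.find_all("a", href=True):
        href = a["href"].strip().lower()
        if href.startswith(("http", "//")):
            # Replace the whole element (children included) in the tree.
            a.replace_with(marker)
    # formatter=None writes strings as-is, so "<" and ">" are not escaped.
    return soup.decode(formatter=None)


body = """
<a href="http://good.com" target="_blank">good link</a>
<ul>
    <li class="FOLLOW">
        <a href="http://bad.com" target="_blank">
            <em></em>
            <span>
                <strong class="FOLLOW-text">Follow On</strong>
                <strong class="FOLLOW-logo"></strong>
            </span>
        </a>
    </li>
    <li><a href="/local">local link</a></li>
</ul>
"""

result = replace_external_links(body)
print(result)
```

Because the replacement happens on tree nodes, the multi-line link to http://bad.com is handled the same way as the single-line one, while the relative /local link is left untouched.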
Note the import statement: use Beautiful Soup 4 (the bs4 package), since Beautiful Soup 3 is no longer being developed and Beautiful Soup 4 is recommended for all new projects.