Analyze and edit links in HTML code with BeautifulSoup

I have a fragment of an HTML page. I need to find all outbound links in it and replace them with the marker <can_be_link>.

The following code does almost everything I want, but it fails on links that span several lines when those lines start with tabs (in my example, the link to http://bad.com).

How can I solve this correctly?

# -*- coding: utf-8 -*-
import BeautifulSoup
import re

if __name__=="__main__":
    body = """                  
    <a href="http://good.com" target="_blank">good link</a>
    <ul>
                        <li class="FOLLOW">
                            <a href="http://bad.com" target="_blank">
                                <em></em>
                                <span>
                                    <strong class="FOLLOW-text">Follow On</strong>
                                    <strong class="FOLLOW-logo"></strong>
                                </span>
                            </a>
                        </li>
    </ul>

"""
    metka_link = '<can_be_link>'
    soup = BeautifulSoup.BeautifulSoup(body)
    hrefs = soup.findAll(name = 'a', attrs = { 'href': re.compile('\.*') })
    repl = {}
    for t in hrefs:
        line = str(t)
        # print '\n'*2, line
        if not t.has_key('href'):
            continue
        href = t['href'].lower()
        if href.find('http') == 0 or href.find('//') == 0:
            body = body.replace(line, metka_link)

    print body

The result is

<can_be_link>
<ul>
                                        <li class="FOLLOW">
                                                <a href="http://bad.com" target="_blank">
                                                        <em></em>
                                                        <span>
                                                                <strong class="FOLLOW-text">Follow On</strong>
                                                                <strong class="FOLLOW-logo"></strong>
                                                        </span>
                                                </a>
                                        </li>
</ul>

But the desired result is

<can_be_link>
<ul>
                                        <li class="FOLLOW">
                                                <can_be_link>
                                        </li>
</ul>

1 answer

Answer by alecxe (best answer)

Use the replace_with() method. From the documentation:

"PageElement.replace_with() removes a tag or string from the tree, and replaces it with the tag or string of your choice."

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

body = """
<a href="http://good.com" target="_blank">good link</a>
<ul>
                    <li class="FOLLOW">
                        <a href="http://bad.com" target="_blank">
                            <em></em>
                            <span>
                                <strong class="FOLLOW-text">Follow On</strong>
                                <strong class="FOLLOW-logo"></strong>
                            </span>
                        </a>
                    </li>
</ul>

"""

soup = BeautifulSoup(body, 'html.parser')

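links = soup.find_all('a')
for link in links:
    # replace_with() edits the tree in place; its return value is not needed
    link.replace_with('<can_be_link>')

# formatter=None keeps the marker's angle brackets unescaped
print(soup.prettify(formatter=None))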

prints:

<can_be_link>
<ul>
 <li class="FOLLOW">
  <can_be_link>
 </li>
</ul>
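
The loop above replaces every <a> tag, whereas the question asks only for outbound links. Below is a minimal sketch of how that filter might be added, reusing the question's own test (an href starting with "http" or "//"). The helper names is_outbound and mark_outbound_links, and the sample URLs, are hypothetical and only serve the illustration.

```python
from bs4 import BeautifulSoup

MARK = "<can_be_link>"


def is_outbound(href):
    """Same check as in the question: absolute or protocol-relative URL."""
    return href.lower().startswith(("http", "//"))


def mark_outbound_links(html, mark=MARK):
    """Replace each outbound <a> element (with all its children) by `mark`."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips <a> tags that have no href attribute at all
    for link in soup.find_all("a", href=True):
        if is_outbound(link["href"]):
            link.replace_with(mark)
    # formatter=None leaves the marker's angle brackets unescaped
    return soup.decode(formatter=None)


# Hypothetical sample input: two outbound links and one local link
html = (
    '<a href="http://good.com">good</a>'
    '<ul><li><a href="/local">local</a></li>'
    '<li><a href="//cdn.example.com/x">\n\t<em></em>\n</a></li></ul>'
)
result = mark_outbound_links(html)
print(result)
```

Because the replacement happens on the parsed tree, it does not matter whether a link sits on one line or spans several indented lines; the local link is left untouched.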

Note the import statement: use Beautiful Soup 4, since Beautiful Soup 3 is no longer being developed and Beautiful Soup 4 is recommended for all new projects.
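
As for why the str.replace() approach in the question is fragile: it only works when the parser's serialization of a tag is character-for-character identical to the source text, which is not guaranteed. The sketch below shows one way the two can differ (attribute quoting); it is an illustration with a made-up input, not a confirmed diagnosis of the exact failure in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical input: the href is written with single quotes
source = "<p><a href='http://example.com'>x</a></p>"
soup = BeautifulSoup(source, "html.parser")

# Beautiful Soup writes attribute values back with double quotes,
# so the serialized tag is no longer a substring of the source
serialized = str(soup.a)
unchanged = source.replace(serialized, "<can_be_link>")

# Editing the tree does not depend on how the source was written
soup.a.replace_with("<can_be_link>")
edited = soup.decode(formatter=None)

print(serialized in source)
print(edited)
```

Here the string replacement silently does nothing, while the tree edit yields the marker inside the paragraph.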