Remove certain parts of web page using beautifulsoup

Question

378 views Asked by user3450783 At 17 June 2020 at 21:41

I am trying to read links from a page, but I am getting more links than desired. What I am doing is:

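import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, page = http.request('page address')
# parse_only=SoupStrainer('a') keeps only the <a> tags of the page
soup = BeautifulSoup(page, 'html.parser', parse_only=SoupStrainer('a'))
for link in soup:
    if link.has_attr('href'):
        print(link['href'])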

I inspected the page and noticed that it has two main components:

<div id="main">
<aside id="secondary">

The links that I don't want are coming from what is inside <aside id="secondary">. What is the easiest way to only get links from <div id="main">?

Thanks

There are 2 answers

Yaakov Bressler On 17 June 2020 at 21:54

I would suggest using the find_all method of BeautifulSoup. If the <a> tags themselves carry id="main":

my_links = soup.find_all("a", {"id":"main", "href":True})
my_links = [x["href"] for x in my_links]

Assuming your webpage contains links inside a parent div, you can do the following:

my_divs = soup.find_all("div", {"id":"main"})
# recursive=False only returns <a> tags that are direct children of each div
my_links = [x.find_all("a", {"href":True}, recursive=False) for x in my_divs]
# flatten the list of lists
my_links = [x for y in my_links for x in y]
# extract hrefs
my_links = [x["href"] for x in my_links]
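
As a rough sketch of the parent-div approach above: the snippet below looks up the div once and then collects the links inside it. The sample HTML and the helper name main_links are made up for illustration; only the ids "main" and "secondary" come from the question. It also shows what recursive=False changes: links nested deeper than one level (for example inside a <p>) are skipped when it is set.

```python
from bs4 import BeautifulSoup

# Hypothetical page that mirrors the structure described in the question.
HTML = """
<html><body>
<div id="main">
  <a href="/direct">direct child</a>
  <p><a href="/nested">nested link</a></p>
  <a name="no-href">anchor without href</a>
</div>
<aside id="secondary">
  <a href="/sidebar">sidebar link</a>
</aside>
</body></html>
"""


def main_links(html, direct_only=False):
    """Return the href values of <a> tags found inside <div id="main">.

    main_links is an illustrative helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    main = soup.find("div", id="main")
    if main is None:
        # No such div on the page: nothing to collect.
        return []
    # recursive=False restricts the search to direct children of the div.
    tags = main.find_all("a", href=True, recursive=not direct_only)
    return [a["href"] for a in tags]


print(main_links(HTML))                    # ['/direct', '/nested']
print(main_links(HTML, direct_only=True))  # ['/direct']
```

If the links you want can sit at any depth inside the div, leaving recursive at its default is probably the safer choice.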

**Andrej Kesely** · Accepted Answer · 2020-06-17T21:45:06+00:00

To select <a> links that are under <div id="main"> you can use a CSS selector:

for a in soup.select('div#main a'):
    print(a)

For only those links that have an href attribute:

for a in soup.select('div#main a[href]'):
    print(a['href'])
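
One thing to watch for, sketched below with made-up sample HTML: the soup in the question is built with parse_only=SoupStrainer('a'), which keeps only the <a> tags, so there is no <div id="main"> left for the selector to match. The selector needs a soup that still contains the div, either from a full parse or, as one possible alternative, from a strainer that keeps the main div itself. The variable names and link paths here are assumptions for illustration only.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical page that mirrors the structure described in the question.
HTML = """
<html><body>
<div id="main">
  <a href="/main-1">first</a>
  <p><a href="/main-2">second</a></p>
</div>
<aside id="secondary">
  <a href="/sidebar">sidebar</a>
</aside>
</body></html>
"""

# 1) Soup strained down to <a> tags, as in the question: the div is gone,
#    so the selector finds nothing.
links_only = BeautifulSoup(HTML, "html.parser", parse_only=SoupStrainer("a"))
strained_hits = [a["href"] for a in links_only.select("div#main a[href]")]

# 2) Full parse: the selector works as described in the answer.
full = BeautifulSoup(HTML, "html.parser")
full_hits = [a["href"] for a in full.select("div#main a[href]")]

# 3) Alternative: strain on the main div instead of on <a>, then take
#    every link that survived parsing.
main_only = BeautifulSoup(
    HTML, "html.parser", parse_only=SoupStrainer("div", id="main")
)
main_only_hits = [a["href"] for a in main_only.find_all("a", href=True)]

print(strained_hits)   # []
print(full_hits)       # ['/main-1', '/main-2']
print(main_only_hits)  # ['/main-1', '/main-2']
```

The third variant keeps the memory savings of a strainer while still dropping everything under <aside id="secondary">.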
