python - Retrieve and save links from webpage but only one per domain


I'm having a bit of trouble trying to save the links from a website into a list without repeating URLs from the same domain.

Example:
www.python.org/download and www.python.org/about

should save only the first one (www.python.org/download) and not repeat the domain later.


This is what I've got so far:

from bs4 import BeautifulSoup
import requests
from urllib.parse import urlparse

url = "https://docs.python.org/3/library/urllib.request.html#module-urllib.request"
result = requests.get(url)  # fetch the page
doc = BeautifulSoup(result.text, "html.parser")  # parse the HTML
atag = doc.find_all('a', href=True)  # every <a> element with an href attribute
links = []
# below should be some kind of for loop


1 Answer

Answered by bitflip:

As a one-liner:

links = {nl for a in doc.find_all('a', href=True) if (nl := urlparse(a["href"]).netloc) != ""}
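Note that this collects the unique domains (netlocs) rather than the full URLs, and the assignment expression `:=` requires Python 3.8 or newer.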

Explained:

links = set()  # define an empty set
for a in doc.find_all('a', href=True):  # loop over every <a> element with an href
    nl = urlparse(a["href"]).netloc  # extract the domain (netloc) from the URL
    if nl:  # relative links parse to an empty netloc, so skip them
        links.add(nl)  # the set handles deduplication

output:

{'www.w3.org', 'datatracker.ietf.org', 'www.python.org', 'requests.readthedocs.io', 'github.com', 'www.sphinx-doc.org'}
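
If you need the first full link per domain rather than just the domain names, which is what the question asks for, a small variation on the same loop works: key a dict by netloc and store only the first URL seen for each one. This is a sketch building on the question's `doc` and `url` variables; `urljoin` resolves relative hrefs against the page URL, which means the page's own domain will also be collected:

from urllib.parse import urljoin, urlparse

first_link = {}  # netloc -> first absolute URL seen for that domain
for a in doc.find_all('a', href=True):
    absolute = urljoin(url, a["href"])  # resolve relative hrefs against the page URL
    nl = urlparse(absolute).netloc
    if nl and nl not in first_link:  # skip schemes without a domain; keep only the first link per domain
        first_link[nl] = absolute

links = list(first_link.values())

Since dicts preserve insertion order (Python 3.7+), `links` keeps the URLs in the order their domains first appeared on the page.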