python - Retrieve and save links from webpage but only one per domain


I'm having a bit of trouble trying to save the links from a website into a list without repeating URLs from the same domain.

Example:
www.python.org/download and www.python.org/about

should save only the first one (www.python.org/download) and not repeat the domain later.


This is what I've got so far:

from bs4 import BeautifulSoup
import requests
from urllib.parse import urlparse

url = "https://docs.python.org/3/library/urllib.request.html#module-urllib.request"
result = requests.get(url)  # fetch the page
doc = BeautifulSoup(result.text, "html.parser")  # parse the HTML
atag = doc.find_all('a', href=True)  # every <a> element with an href attribute
links = []
# below should be some kind of for loop


1 Answer

Answered by bitflip:

As a one-liner:

links = {nl for a in doc.find_all('a', href=True) if (nl := urlparse(a["href"]).netloc) != ""}
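Note that this collects the unique domains (netlocs) rather than the full URLs, and the assignment expression `:=` requires Python 3.8 or newer.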

Explained:

links = set()  # define an empty set
for a in doc.find_all('a', href=True):  # loop over every <a> element with an href
    nl = urlparse(a["href"]).netloc  # extract the domain (netloc) from the URL
    if nl:  # relative links parse to an empty netloc, so skip them
        links.add(nl)  # the set handles deduplication

output:

{'www.w3.org', 'datatracker.ietf.org', 'www.python.org', 'requests.readthedocs.io', 'github.com', 'www.sphinx-doc.org'}
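
If you need the first full link per domain rather than just the domain names, which is what the question asks for, a small variation on the same loop works: key a dict by netloc and store only the first URL seen for each one. This is a sketch building on the question's `doc` and `url` variables; `urljoin` resolves relative hrefs against the page URL, which means the page's own domain will also be collected:

from urllib.parse import urljoin, urlparse

first_link = {}  # netloc -> first absolute URL seen for that domain
for a in doc.find_all('a', href=True):
    absolute = urljoin(url, a["href"])  # resolve relative hrefs against the page URL
    nl = urlparse(absolute).netloc
    if nl and nl not in first_link:  # skip schemes without a domain; keep only the first link per domain
        first_link[nl] = absolute

links = list(first_link.values())

Since dicts preserve insertion order (Python 3.7+), `links` keeps the URLs in the order their domains first appeared on the page.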