How to extract url from <a href="TextWithUrlBehind">Something</a> using BeautifulSoup?

I am trying to extract some links and their text from a web page and save them to a .json file.

I have parsed the HTML down to tbody > tr > td, and each td contains <a href="TextWithUrlBehind">Something</a>.

In Inspect Element, TextWithUrlBehind is clickable and has a link attached to it, but it is not the usual <a href="https://...">.

So in the .json file I get the href as the string TextWithUrlBehind and the text (also a string) as Something.

The code looks like this:

rows = test_results_table.find_all("tr")

# Iterate over each row and look for an anchor tag in its first cell
for row in rows:
    first_cell = row.find("td")
    if first_cell:
        anchor_tag = first_cell.find("a", href=True)
        self._debug_print("Anchor tag content:", anchor_tag)
        if anchor_tag:
            href = anchor_tag["href"]
            text = anchor_tag.get_text(strip=True)
            links.append({"href": href, "text": text})
            self._debug_print("Content extracted:", {"href": href, "text": text})
        else:
            self._debug_print("No anchor tag found in cell:", first_cell)
    else:
        self._debug_print("No table cell found in row:", row)

I do not understand how that link is attached in the HTML, and I don't know which BeautifulSoup built-in functions can help me get it.

There is 1 answer

Answered by Suramuthu R
from bs4 import BeautifulSoup as bs
import requests

# Replace <your url> with the URL you want to scrape
url = "<your url>"

# Download the page and parse it
r = requests.get(url)
soup = bs(r.content, "html.parser")

# Collect every anchor tag on the page
links = soup.find_all("a")

# Map each link's clickable text to its href value
dct = {}
for x in links:
    # get_text() also works for anchors with nested tags, where x.string can be None
    key = x.get_text(strip=True)
    val = x.get("href")
    dct[key] = val

print(dct)

The output is a dictionary whose keys are the clickable texts and whose values are the href values of those links. Anchors that share the same text overwrite one another, because dictionary keys are unique.
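
If the href really is just TextWithUrlBehind, one likely explanation (an assumption, since the page itself is not shown) is that it is a relative URL: the browser resolves it against the address of the page, which is why it is clickable in Inspect Element. BeautifulSoup returns the attribute exactly as written, so the resolution has to be done separately, for example with urljoin from the standard library. The sketch below uses a hypothetical PAGE_URL and illustrative entries in the same {"href", "text"} shape the question's loop collects; resolve_links is a helper name made up for this example.

```python
import json
from urllib.parse import urljoin

# Hypothetical address of the page the table was scraped from.
PAGE_URL = "https://example.com/reports/index.html"

# Illustrative entries in the shape the question's loop appends to `links`.
links = [
    {"href": "TextWithUrlBehind", "text": "Something"},
    {"href": "/logs/run2.html", "text": "Other"},
    {"href": "https://example.org/abs", "text": "Absolute"},
]


def resolve_links(entries, base_url):
    """Return copies of the entries with an added absolute "url" field."""
    return [{**entry, "url": urljoin(base_url, entry["href"])} for entry in entries]


resolved = resolve_links(links, PAGE_URL)

# Serialize the result the same way it could be written to a .json file.
print(json.dumps(resolved, indent=2))
```

Relative hrefs are joined to the directory of PAGE_URL, hrefs starting with / are joined to the site root, and absolute hrefs pass through unchanged. If the page declares a <base href="..."> tag, that value should be used as the base instead of the page address.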