How do I scrape a headline from a BBC article using Python and Beautiful Soup?

Question

Asked by Tom on 26 March 2024 at 22:16

I previously built a BBC scraper which, among other things, scrapes the headline from a given article such as this one. However, the BBC has recently changed its website, so I need to modify my scraper, which has proven difficult. For example, say I want to scrape the headline from the article mentioned above. Inspecting the HTML in Firefox, I find that the headline element carries the attribute data-component="headline-block" (see the blue marked line in the image).

If I want to extract the corresponding tag, I'll do this:

import requests

from bs4 import BeautifulSoup

url = 'https://www.bbc.com/news/world-africa-68504329'

# extract html
html = requests.get(url).text

# parse html
soup = BeautifulSoup(html, 'html.parser')

# extract headline from soup
head = soup.find(attrs={'data-component': 'headline-block'})

But when I print head, I get None, which means that Beautiful Soup can't find the tag. What am I missing? How do I solve this problem?
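
One way to narrow down a result like this is to check whether the attribute value occurs anywhere in the HTML that was actually downloaded, as opposed to the DOM the browser inspector shows. The sketch below is only an illustration: `diagnose` is a hypothetical helper, and the two HTML strings are made-up stand-ins, not real BBC markup.

```python
from bs4 import BeautifulSoup


def diagnose(html, attr, value):
    """Report whether `value` occurs in the raw HTML and whether a tag carries it."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "in_raw_html": value in html,
        "tag_found": soup.find(attrs={attr: value}) is not None,
    }


# Hypothetical stand-ins for a downloaded page (assumptions, not BBC markup).
static_page = '<h1 data-component="headline-block">Example</h1>'
scripted_page = '<div id="root"></div><script>render()</script>'

print(diagnose(static_page, "data-component", "headline-block"))
# {'in_raw_html': True, 'tag_found': True}
print(diagnose(scripted_page, "data-component", "headline-block"))
# {'in_raw_html': False, 'tag_found': False}
```

If both values come back False for the real page, the element is most likely created in the browser after the page loads, so no Beautiful Soup selector will match it in the downloaded HTML.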

There is 1 answer

**Andrej Kesely** · Accepted Answer · 2024-03-26T22:29:14+00:00

The data you see on the page is stored in JSON form inside the page, in the element with id __NEXT_DATA__, so searching the parsed HTML for the data-component attribute finds nothing. To get the headline and the article text, you can use this example:

import json

import requests
from bs4 import BeautifulSoup

url = "https://www.bbc.com/news/world-africa-68504329"

soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = json.loads(soup.select_one("#__NEXT_DATA__").text)

# print(json.dumps(data, indent=4))

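# take the first entry under "page" whose key starts with "@"
page = next(
    v for k, v in data["props"]["pageProps"]["page"].items() if k.startswith("@")
)
# match/case requires Python 3.10 or newer
for c in page["contents"]:
    match c["type"]:
        case "headline":
            # headline text, followed by a blank line
            print(c["model"]["blocks"][0]["model"]["text"])
            print()
        case "text":
            # body text, printed on one line separated by spaces
            print(c["model"]["blocks"][0]["model"]["text"], end=" ")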

print()

Prints:

Kuriga kidnap: More than 280 Nigerian pupils abducted

More than 280 Nigerian school pupils have been abducted in the north-western town of Kuriga, officials say.  The pupils were in the assembly ground around 08:30 (07:30 GMT) when dozens of gunmen on motorcycles rode through the school, one witness said. The students, between the ages of eight and 15, were taken away, along with a teacher, they added. Kidnap gangs, known as bandits, have seized thousands of people in recent years, especially the north-west. However, there had been a reduction in the mass abduction of children over the past year until this week. Those kidnapped are usually freed after a ransom is paid. The mass abduction was

...
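
The same approach can be tried offline. The sketch below is a variation on the code above, not a replacement for it: `extract_headline` is a hypothetical helper, and `SAMPLE_HTML` is a made-up miniature page that only mimics the key layout used above (`props`, `pageProps`, `page`, a key starting with `@`, `contents`). It returns None instead of raising an exception when the embedded JSON or the expected keys are missing.

```python
import json

from bs4 import BeautifulSoup

# Made-up payload mimicking the key layout used above (an assumption,
# not the real BBC data).
payload = {
    "props": {"pageProps": {"page": {"@example": {"contents": [
        {"type": "headline",
         "model": {"blocks": [{"model": {"text": "Example headline"}}]}},
        {"type": "text",
         "model": {"blocks": [{"model": {"text": "First paragraph."}}]}},
    ]}}}}
}
SAMPLE_HTML = (
    '<html><body><div id="__next"></div>'
    '<script id="__NEXT_DATA__" type="application/json">'
    + json.dumps(payload)
    + "</script></body></html>"
)


def extract_headline(html):
    """Return the first headline text from the embedded JSON, or None."""
    soup = BeautifulSoup(html, "html.parser")
    script = soup.find("script", id="__NEXT_DATA__")
    if script is None or not script.string:
        return None
    data = json.loads(script.string)
    pages = data.get("props", {}).get("pageProps", {}).get("page", {})
    for key, page in pages.items():
        if not key.startswith("@"):
            continue
        for block in page.get("contents", []):
            if block.get("type") == "headline":
                return block["model"]["blocks"][0]["model"]["text"]
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(soup.find(attrs={"data-component": "headline-block"}))  # None
print(extract_headline(SAMPLE_HTML))  # Example headline
```

On a live page, the HTML would come from `requests.get(url).text` as in the question. The JSON layout is not a documented interface and may change without notice, which is why the helper falls back to None.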
