Scrape book body text from Projekt Gutenberg-DE


I am new to Python and I am looking for a way to extract, with Beautiful Soup, the open-source books available on Gutenberg-DE, such as this one. I need them for further analysis and text mining.

I tried this code, found in a tutorial. It extracts the metadata, but instead of the body content it gives me a list of the "pages" I need to scrape the text from.

import requests
from bs4 import BeautifulSoup

# Make a request
page = requests.get(
    "https://www.projekt-gutenberg.org/keller/heinrich/")
soup = BeautifulSoup(page.content, 'html.parser')

# Extract title of page
page_title = soup.title

# Extract body of page
page_body = soup.body

# Extract head of page
page_head = soup.head

# print the result
print(page_title, page_head)

I suppose I could use that as a second step to extract the text? I am not sure how, though.

Ideally I would like to store the texts in tabular form and be able to save them as CSV, preserving the metadata (author, title, year, and chapter). Any ideas?


1 Answer

Answered by HedgeHog (accepted)

What happens?

First of all, you get a list of pages because you are not requesting the right URL. Change it to:

page = requests.get('https://www.projekt-gutenberg.org/keller/heinrich/hein101.html')
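
As a side note, that index page is still useful: instead of typing the chapter URLs by hand, you can harvest them from it. A minimal sketch, assuming the chapter links are ordinary anchors inside the index page's list markup (the 'ul a[href]' selector is a guess, so inspect the page and adjust):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

index_url = 'https://www.projekt-gutenberg.org/keller/heinrich/'
soup = BeautifulSoup(requests.get(index_url).content, 'html.parser')

# Turn the relative hrefs into absolute chapter URLs.
# 'ul a[href]' is an assumption about the index markup -- verify it.
chapter_urls = [urljoin(index_url, a['href']) for a in soup.select('ul a[href]')]
print(chapter_urls)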

I recommend that, if you are looping over all the URLs, you store the content in a list of dicts and then push it to CSV, pandas, or whatever you prefer.

Example

import requests
from bs4 import BeautifulSoup

data = []

# Make a request to a single chapter page
page = requests.get('https://www.projekt-gutenberg.org/keller/heinrich/hein101.html')
soup = BeautifulSoup(page.content, 'html.parser')

# Collect metadata and text in a dict; the [2:] slice skips the first
# two <p> tags, which hold navigation/front matter rather than body text
data.append({
    'title': soup.title.get_text(strip=True),
    'chapter': soup.h2.get_text(strip=True),
    'text': ' '.join(p.get_text(strip=True) for p in soup.select('body p')[2:])
})

print(data)
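
To take it all the way to CSV, here is a minimal sketch that loops over several chapter pages and writes one row per chapter with Python's built-in csv module. The heinNNN.html filename pattern and the range below are assumptions based on the single URL above; collect the real chapter links from the index page (for instance with the earlier snippet) before relying on them.

import csv
import requests
from bs4 import BeautifulSoup

BASE = 'https://www.projekt-gutenberg.org/keller/heinrich/'

# Hypothetical chapter list -- assumes the heinNNN.html pattern from the
# answer; harvest the real links from the index page instead.
chapter_files = ['hein{:03d}.html'.format(i) for i in range(101, 104)]

data = []
for name in chapter_files:
    page = requests.get(BASE + name)
    page.raise_for_status()  # fail loudly on a bad URL
    soup = BeautifulSoup(page.content, 'html.parser')
    data.append({
        'title': soup.title.get_text(strip=True) if soup.title else '',
        'chapter': soup.h2.get_text(strip=True) if soup.h2 else '',
        'text': ' '.join(p.get_text(strip=True) for p in soup.select('body p')[2:]),
    })

# One row per chapter, columns matching the dict keys
with open('heinrich.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'chapter', 'text'])
    writer.writeheader()
    writer.writerows(data)

If you already use pandas, pandas.DataFrame(data).to_csv('heinrich.csv', index=False) replaces the csv block in one line.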