I am new to Python and am looking for a way to use Beautiful Soup to extract open source books that are available on gutenberg-de, such as this one. I need them for further analysis and text mining.
I tried this code, found in a tutorial, and it extracts metadata, but instead of the body content it gives me a list of the "pages" I need to scrape the text from.
import requests
from bs4 import BeautifulSoup
# Make a request
page = requests.get(
"https://www.projekt-gutenberg.org/keller/heinrich/")
soup = BeautifulSoup(page.content, 'html.parser')
# Extract title of page
page_title = soup.title
# Extract body of page
page_body = soup.body
# Extract head of page
page_head = soup.head
# print the result
print(page_title, page_head)
I suppose I could use that as a second step to extract it then? I am not sure how, though.
Ideally I would like to store them in tabular form and be able to save them as CSV, preserving the metadata: author, title, year, and chapter. Any ideas?
What happens?
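First of all, you get a list of pages because you are not requesting the right url. Change it to the url of a page that actually holds the text, that is, one of the chapter pages linked from that list.
Recommendation: if you are looping over all the urls, store the content in a list of dicts and push it to csv or pandas or ...
Example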
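A minimal sketch of the recommendation above: collect the chapter links from the index page, request each one, keep one dict per chapter in a list, and write that list to a CSV file. The helper names (chapter_urls, parse_chapter, scrape_book, save_csv) are made up for illustration. The page layout is also an assumption that should be checked against the real HTML: the code assumes that chapter links are .html files below the index url, that author, title and year sit in meta tags named author, title and firstpub, and that the chapter heading is the first h1, h2 or h3 on the page.

```python
import csv
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Index page from the question; it lists the chapter pages.
INDEX_URL = "https://www.projekt-gutenberg.org/keller/heinrich/"
FIELDS = ["author", "title", "year", "chapter", "text", "url"]


def chapter_urls(index_html, base_url):
    """Return the absolute urls of the chapter links found on the index page."""
    soup = BeautifulSoup(index_html, "html.parser")
    urls = []
    for a in soup.find_all("a", href=True):
        url = urljoin(base_url, a["href"])
        # Assumption: chapter pages live below the index url and end in .html
        if url.startswith(base_url) and url.endswith(".html") and url not in urls:
            urls.append(url)
    return urls


def parse_chapter(chapter_html, url):
    """Turn one chapter page into a dict holding metadata and body text."""
    soup = BeautifulSoup(chapter_html, "html.parser")

    def meta(name):
        # Assumption: metadata is stored as <meta name="..." content="...">
        tag = soup.find("meta", attrs={"name": name})
        return tag.get("content", "") if tag else ""

    heading = soup.find(["h1", "h2", "h3"])  # assumed to be the chapter title
    paragraphs = (p.get_text(" ", strip=True) for p in soup.find_all("p"))
    return {
        "author": meta("author"),
        "title": meta("title"),
        "year": meta("firstpub"),  # assumed name of the publication-year field
        "chapter": heading.get_text(strip=True) if heading else "",
        "text": " ".join(paragraphs),
        "url": url,
    }


def scrape_book(index_url=INDEX_URL):
    """Request the index page and every chapter; return a list of dicts."""
    index = requests.get(index_url, timeout=30)
    index.raise_for_status()
    rows = []
    for url in chapter_urls(index.content, index_url):
        page = requests.get(url, timeout=30)
        page.raise_for_status()
        rows.append(parse_chapter(page.content, url))
    return rows


def save_csv(rows, path):
    """Write the list of dicts to a CSV file, one row per chapter."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


# Usage (performs network requests, so it is left commented out):
# rows = scrape_book()
# save_csv(rows, "heinrich.csv")
```

The same list of dicts can be handed to pandas.DataFrame(rows) if a DataFrame is preferred over a CSV file. If the metadata fields come back empty, the meta tag names assumed here probably do not match the page and need adjusting.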