Trying to get book summaries from Wikipedia of Project Gutenberg books


I have the complete Project Gutenberg English library as alphabetized CSV files with the columns id, title, and text, where id has the format /ebooks/15809. I am using the wikipedia Python package, which gives me the full text of a page along with many other details.

These are the first 10 books from Gutenberg:

    ['A Apple Pie',
     'A Apple Pie and Other Nursery Tales',
     'Aaron in the Wildwoods',
     'Aaron Rodd',
     "Aaron's Rod",
     'Aaron the Jew: A Novel',
     'Aaron Trow',
     'Abaft the Funnel',
     'Abandoned',
     'The Abandoned Country; or']

Now when I run pg = wikipedia.page('A Apple Pie'), I get the page for apple pie, the dessert, and not the book. Apparently, calling wikipedia.page('xxxx') first runs wikipedia.search('xxxx'), which returns a list of search results, and then returns the wiki page for the first result. In this case the search results are:

>>> wikipedia.search('A Apple Pie')
['Apple pie', 'Pie', 'Apple Pie ABC', 'American Pie (film)', 'Sam Apple Pie', "Mom's Apple Pie", 'Apple Pie Hill', 'Pie à la Mode', 'Apple crisp', 'Pieing']
>>> 

So I actually need the third result in the list. One approach I have come up with is to compare the categories of each entry on Gutenberg and on Wikipedia.
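To look at every search hit rather than only the first one, something like the following might work. This is only a sketch: the helper names wikipedia_categories and categories_for_hits are made up for illustration, and it assumes that passing auto_suggest=False to wikipedia.page loads the exact title instead of running another search.

```python
def wikipedia_categories(title):
    """Fetch the categories of one exact Wikipedia title (needs network access)."""
    import wikipedia  # third-party package, imported lazily so the helper below stays testable

    # auto_suggest=False asks for exactly this title instead of re-running a search.
    return wikipedia.page(title, auto_suggest=False).categories


def categories_for_hits(hits, categories_fn=wikipedia_categories):
    """Map each search hit to its category list, skipping hits that fail to load."""
    found = {}
    for hit in hits:
        try:
            found[hit] = categories_fn(hit)
        except Exception:
            # The package raises e.g. DisambiguationError or PageError for some titles;
            # this sketch simply skips those hits.
            continue
    return found
```

It would be called as categories_for_hits(wikipedia.search('A Apple Pie')), which gives one category list per candidate page to compare against the Gutenberg record.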

For the first book in Gutenberg, these are the categories it falls under:

import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.gutenberg.org/ebooks/15809'

page = requests.get(url)
soup = bs(page.content, 'html.parser')

# The bibliographic record is the table with class "bibrec".
bibrec_tbl = soup.find("table", {"class": "bibrec"})
for td in bibrec_tbl.find_all('td'):
    # Search the cell's lowercased HTML so attributes on nested tags are found too.
    lowered = str(td).lower()
    for attr in ('itemprop', 'property'):
        if attr in lowered:
            # Skip past attr=" to reach the attribute value, then cut at the closing quote.
            a = lowered[lowered.find(attr) + len(attr) + 2 :]
            b = a[: a.find('"')]
            print(attr, '\t', b, '\t', td.text.strip())
            break



itemprop     creator     Greenaway, Kate, 1846-1901
itemprop     headline    A Apple Pie
property     dcterms:subject     Children's poetry
property     dcterms:subject     Nursery rhymes
property     dcterms:subject     Alphabet rhymes
property     dcterms:subject     Alphabet
property     dcterms:type    Text
itemprop     datepublished   May 10, 2005
property     dcterms:rights      Public domain in the USA.
itemprop     interactioncount    188 downloads in the last 30 days.
itemprop     pricecurrency   $0.00

And for the third Wikipedia result:

pg = wikipedia.page('Apple Pie ABC')
print(pg.categories)

['Alphabet books',
 'Articles with short description',
 'British picture books',
 'CS1 maint: discouraged parameter',
 'Commons category link is on Wikidata',
 "English children's songs",
 'English folk songs',
 'English nursery rhymes',
 'Short description matches Wikidata',
 "Traditional children's songs"]

So what I could do is compute a cosine similarity between the two sets of categories, and hope that the score is high enough to match each title to the right page.
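Roughly, this is what I have in mind. It is a minimal sketch using plain word counts, not a tuned method: the function names bag_of_words and cosine_similarity are my own, and the tokenization (lowercased runs of letters) is an assumption.

```python
import math
import re
from collections import Counter


def bag_of_words(labels):
    """Count lowercased alphabetic tokens across a list of category strings."""
    counts = Counter()
    for label in labels:
        counts.update(re.findall(r"[a-z]+", label.lower()))
    return counts


def cosine_similarity(labels_a, labels_b):
    """Cosine similarity between the word-count vectors of two label lists."""
    a, b = bag_of_words(labels_a), bag_of_words(labels_b)
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Gutenberg subjects and Wikipedia categories copied from the output above.
gutenberg_subjects = ["Children's poetry", "Nursery rhymes", "Alphabet rhymes", "Alphabet"]
wikipedia_cats = [
    "Alphabet books", "Articles with short description", "British picture books",
    "CS1 maint: discouraged parameter", "Commons category link is on Wikidata",
    "English children's songs", "English folk songs", "English nursery rhymes",
    "Short description matches Wikidata", "Traditional children's songs",
]

score = cosine_similarity(gutenberg_subjects, wikipedia_cats)
print(score)
```

I would compute this score for every search result and keep the highest one. Maintenance categories such as 'Articles with short description' only add noise, so they probably need to be filtered out first.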

Is there a better or more efficient way to do this? Thanks.
