Scraping Google News with pygooglenews


I am trying to scrape Google News with pygooglenews. Because Google caps each query at 100 results, I want to collect more than 100 articles by changing the target dates in a for loop. Below is what I have so far, but I keep getting this error message:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-84-4ada7169ebe7> in <module>
----> 1 df = pd.DataFrame(get_news('Banana'))
      2 writer = pd.ExcelWriter('My Result.xlsx', engine='xlsxwriter')
      3 df.to_excel(writer, sheet_name='Results', index=False)
      4 writer.save()

<ipython-input-79-c5266f97934d> in get_titles(search)
      9 
     10     for date in date_list[:-1]:
---> 11         search = gn.search(search, from_=date, to_=date_list[date_list.index(date)])
     12         newsitem = search['entries']
     13 

~\AppData\Roaming\Python\Python37\site-packages\pygooglenews\__init__.py in search(self, query, helper, when, from_, to_, proxies, scraping_bee)
    140         if from_ and not when:
    141             from_ = self.__from_to_helper(validate=from_)
--> 142             query += ' after:' + from_
    143 
    144         if to_ and not when:

TypeError: unsupported operand type(s) for +=: 'dict' and 'str'

import pandas as pd
from pygooglenews import GoogleNews
import datetime

gn = GoogleNews()

def get_news(search):
    stories = []
    start_date = datetime.date(2021,3,1)
    end_date = datetime.date(2021,3,5)
    delta = datetime.timedelta(days=1)
    date_list = pd.date_range(start_date, end_date).tolist()
    
    for date in date_list[:-1]:
        search = gn.search(search, from_=date.strftime('%Y-%m-%d'), to_=(date+delta).strftime('%Y-%m-%d'))
        newsitem = search['entries']

        for item in newsitem:
            story = {
                'title':item.title,
                'link':item.link,
                'published':item.published
            }
            stories.append(story)

    return stories

df = pd.DataFrame(get_news('Banana'))

Thank you in advance.


There are 2 answers

Paul P On

It looks like you are correctly passing a string into get_news(), which then passes it on as the first argument (search) to gn.search().

However, you're reassigning search to the result of gn.search() in the line:

  search = gn.search(search, from_=date.strftime('%Y-%m-%d'), to_=(date+delta).strftime('%Y-%m-%d'))
# ^^^^^^
# gets overwritten with the result of gn.search()

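On the next iteration, this reassigned search (now the return value of gn.search(), not your original string) is passed back into gn.search() as the query.

Judging by the pygooglenews source, gn.search() returns a dict. That explains the error: the line query += ' after:' + from_ in the traceback is trying to append a str to a dict.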
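To see the failure without touching the network, here is a minimal sketch of the same pattern. fake_search is a hypothetical stand-in I made up for illustration; it only copies the one line from the traceback (query += ' after:' + from_) and returns a dict, which is what gn.search() appears to do.

```python
def fake_search(query, from_=None):
    """Hypothetical stand-in for gn.search(): appends to `query`, returns a dict."""
    query += ' after:' + from_  # same operation as the failing line in the traceback
    return {'entries': [], 'query': query}


search = 'Banana'
errors = []
for day in ['2021-03-01', '2021-03-02']:
    try:
        # First pass: `search` is a str, so this works and rebinds `search` to a dict.
        # Second pass: the dict is passed in as the query and += raises TypeError.
        search = fake_search(search, from_=day)
    except TypeError as exc:
        errors.append(str(exc))

print(errors)
```

The first iteration succeeds and the second one fails, which matches what you are seeing: the loop only breaks once the variable has been overwritten.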

To fix this, simply use a different variable, e.g.:

result = gn.search(search, from_=date.strftime('%Y-%m-%d'), to_=(date+delta).strftime('%Y-%m-%d'))
newsitem = result['entries']
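Putting it together, a corrected version of the whole function could look roughly like the sketch below. I have made two changes that are my own assumptions and not part of your code: the search function is passed in as a parameter (search_fn) so the example can run offline, and fake_search is a hypothetical stub that imitates the shape of the response (a dict with an 'entries' list whose items have title, link and published attributes).

```python
import datetime
from types import SimpleNamespace


def get_news(search_fn, query, start_date, end_date):
    """Collect stories one day at a time; `search_fn` is called like gn.search."""
    stories = []
    one_day = datetime.timedelta(days=1)
    day = start_date
    while day < end_date:
        # Store the response under its own name so `query` stays a string.
        result = search_fn(
            query,
            from_=day.strftime('%Y-%m-%d'),
            to_=(day + one_day).strftime('%Y-%m-%d'),
        )
        for item in result['entries']:
            stories.append(
                {'title': item.title, 'link': item.link, 'published': item.published}
            )
        day += one_day
    return stories


def fake_search(query, from_=None, to_=None):
    """Hypothetical stub standing in for gn.search so the sketch runs offline."""
    entry = SimpleNamespace(
        title=f'{query} news {from_}',
        link=f'https://example.com/{from_}',
        published=from_,
    )
    return {'entries': [entry]}


stories = get_news(
    fake_search, 'Banana', datetime.date(2021, 3, 1), datetime.date(2021, 3, 5)
)
print(len(stories))
```

With the real library you would pass gn.search (from gn = GoogleNews()) as search_fn and then build the DataFrame with pd.DataFrame(stories) as before.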
Fazi Alnjd On

As far as I know, a single pygooglenews search returns at most 100 articles, so you need to write a loop that scrapes each day separately.
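One way to organise the per-day loop is to generate the date windows first and then run one search per window. This is only a sketch using the standard library; daily_windows is a helper name I made up, and it assumes the from_ and to_ arguments take 'YYYY-MM-DD' strings as in the question.

```python
import datetime


def daily_windows(start, end):
    """Yield (from_, to_) date strings for each one-day window in [start, end)."""
    one_day = datetime.timedelta(days=1)
    day = start
    while day < end:
        # date.isoformat() produces 'YYYY-MM-DD'.
        yield day.isoformat(), (day + one_day).isoformat()
        day += one_day


windows = list(daily_windows(datetime.date(2021, 3, 1), datetime.date(2021, 3, 5)))
for from_, to_ in windows:
    print(from_, to_)
```

Each pair can then be passed to the search as from_ and to_, keeping the result in a separate variable from the query string.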