I am trying to return a list of URLs from a Google News search. I am using the GoogleNews module and a pandas DataFrame to organize the results, and then taking those URLs and downloading the webpages with pywebcopy.
Right now my for loop works through the results in groups of 9 instead of 1 at a time, which I believe is the problem when downloading each webpage with the save_webpage function, since save_webpage seems to handle only 1 URL at a time. I have no idea how to shorten the range of results returned.
I've tried adjusting the range, but (1, 1) seems to be the lowest it can go, and that still returns 9 URLs instead of 1.
Here is my code:
from GoogleNews import GoogleNews
from newspaper import Article
from pywebcopy import save_webpage
import pandas as pd

googlenews = GoogleNews(start='12/01/2021', end='12/31/2021')
googlenews.search('test search')
result = googlenews.result()
df = pd.DataFrame(result)

for i in range(1, 1):
    googlenews.getpage(i)
    result = googlenews.result()
    df = pd.DataFrame(result)

list = []
for ind in df.index:
    try:
        dict = {}
        article = Article(df['link'][ind])
        article.download()
        article.parse()
        dict['Article Title'] = article.title
        dict['Article Text'] = article.text
        url = str(df['link'])
        print(str(url))
        download_folder = 'C:\Test_Data'
        kwargs = {'bypass_robots': True, 'project_name': 'PROJECT'}
        save_webpage(url, download_folder, **kwargs)
        list.append(dict)
    except:
        pass
If you write the following loop, you'll actually get a loop which executes 0 times:
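for i in range(1, 1):
    print(i)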
If you run this, it will not print anything, because the loop runs 0 times. A shortcut for figuring out how many times a range() loop will run is to subtract the start from the end. So, for example, this loops 1 time, because 2 - 1 = 1:
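for i in range(1, 2):
    print(i)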
So, why are you getting ten results back? The GoogleNews library is designed to fetch one "page" of results at a time. This line fetches one page:
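googlenews.search('test search')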
Since that line is outside the loop, it still runs even though the loop body never executes, which is why you get a full page of results back even though your loop to grab more pages does nothing.
To fix this, I recommend looping over that one page of results and calling save_webpage() once per article, passing each article's own link rather than str(df['link']), which stringifies the whole column.
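Here is a minimal sketch of that restructuring, reusing the names and the save_webpage(url, download_folder, **kwargs) call from your code. It drops the getpage() loop entirely and just works with the single page that search() already fetched; the articles list, the raw-string path, and the per-row URL are my own small cleanups, so adjust them as needed:

from GoogleNews import GoogleNews
from newspaper import Article
from pywebcopy import save_webpage
import pandas as pd

googlenews = GoogleNews(start='12/01/2021', end='12/31/2021')
googlenews.search('test search')      # fetches one page of results
df = pd.DataFrame(googlenews.result())

download_folder = r'C:\Test_Data'     # raw string so the backslash is kept literally
articles = []

for ind in df.index:
    url = df['link'][ind]             # one article's URL, not the whole column
    try:
        article = Article(url)
        article.download()
        article.parse()

        # one save_webpage() call per article URL
        kwargs = {'bypass_robots': True, 'project_name': 'PROJECT'}
        save_webpage(url, download_folder, **kwargs)

        articles.append({'Article Title': article.title,
                         'Article Text': article.text})
    except Exception:
        # skip anything that fails to download or parse
        pass

You may also want to give each article a distinct project_name so the saved pages don't overwrite one another, depending on how pywebcopy lays out its project folder.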