for loop only incrementing in range of 10 instead of 1


I am trying to return a list of URLs from a Google News search. I am using the GoogleNews and pandas modules to organize the results in a DataFrame. I then take those URLs and download the webpages using pywebcopy.

Right now, my for loop increments in groups of 9 instead of 1 at a time, which I believe is the issue when downloading the webpage using the save_webpage function. I believe the save_webpage function can only handle 1 URL at a time. I have no clue how to shorten the range of results returned.

I've tried adjusting the range but (1,1) seems to be the lowest it can go, and that always returns 9 URLs instead of 1.

Here is my code:

from GoogleNews import GoogleNews
from newspaper import Article
import pandas as pd

googlenews=GoogleNews(start = '12/01/2021',end= '12/31/2021')
googlenews.search('test search')
result=googlenews.result()
df=pd.DataFrame(result)

for i in range(1,1):
    googlenews.getpage(i)
    result=googlenews.result()
    df=pd.DataFrame(result)
list = []

for ind in df.index:
    try:
        dict={}
        article = Article(df['link'][ind])
        article.download()
        article.parse()
        dict['Article Title'] = article.title
        dict['Article Text'] = article.text

        url = str(df['link'])
        print(str(url))

        download_folder = 'C:\Test_Data'

        kwargs = {'bypass_robots': True, 'project_name': 'PROJECT'}

        save_webpage(url, download_folder, **kwargs)
        list.append(dict)
    except:
        pass

Answer by Nick ODell

Quoting the question: "I've tried adjusting the range but (1,1) seems to be the lowest it can go, and that always returns 9 URLs instead of 1."

The following loop actually executes 0 times:

for i in range(1,1):
    print("Looping at index " + str(i))

If you run this, it prints nothing, because range(1,1) is empty. A shortcut for working out how many times a loop over range(start, stop) runs is to subtract the start from the stop, since the stop value itself is excluded. So, for example, this loops 1 time, because 2 - 1 = 1:

for i in range(1,2):
    print("Looping at index " + str(i))
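As a quick check of that rule, here is a small sketch (not part of the original post) that uses len() on the range objects to count iterations without running a loop. The variable names are just for illustration:

```python
# range(start, stop) yields stop - start values when stop > start,
# and nothing at all when stop <= start.
empty = range(1, 1)
single = range(1, 2)

print(len(empty))    # 0
print(len(single))   # 1
print(list(single))  # [1]
```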

So why are you getting a whole page of URLs back instead of one? The GoogleNews library is designed to fetch one "page" of results at a time. These lines retrieve one page:

result=googlenews.result()
df=pd.DataFrame(result)

Because these lines also appear before the loop, they still run even though the loop body never executes, so df ends up holding a full page of results.
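To illustrate, here is a minimal sketch that needs no network access. fetch_page() is a hypothetical stand-in for the search/result calls, and the three placeholder links are an arbitrary page size chosen for the example, not what GoogleNews actually returns:

```python
def fetch_page():
    # Hypothetical stand-in for googlenews.result(): one "page" of links.
    return ["https://example.com/article-%d" % n for n in range(1, 4)]

results = fetch_page()   # runs once, before the loop is even reached

loop_runs = 0
for i in range(1, 1):    # empty range, so the body below never executes
    results = fetch_page()
    loop_runs += 1

print(loop_runs)         # 0
print(len(results))      # 3
```

The results are already populated before the loop starts, so changing the range has no effect on how many URLs you see.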

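To fix this, I recommend looping over the page of results and calling save_webpage() once per article, passing it that article's own URL (df['link'][ind]) rather than str(df['link']), which is the string form of the entire column.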
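One way that could look, as a rough sketch rather than a tested solution: the helper save_each_article() and the fake_save() stand-in below are hypothetical names made up for this example, and the sample results are placeholders. I am assuming each result is a dict with a 'link' key, as the df['link'] usage in the question suggests. In real code you would pass pywebcopy's save_webpage in place of fake_save, called the same way as in the question.

```python
def save_each_article(results, save_fn, download_folder, **kwargs):
    """Call save_fn once per result, with a single URL string each time."""
    saved, failed = [], []
    for item in results:
        url = item["link"]  # one URL, not the whole 'link' column
        try:
            save_fn(url, download_folder, **kwargs)
            saved.append(url)
        except Exception as exc:
            # Record the failure instead of silently swallowing it.
            failed.append((url, exc))
    return saved, failed


# Hypothetical stand-in so the sketch runs without network access.
calls = []

def fake_save(url, folder, **kwargs):
    calls.append(url)


sample_results = [
    {"link": "https://example.com/a"},
    {"link": "https://example.com/b"},
]
saved, failed = save_each_article(
    sample_results, fake_save, r"C:\Test_Data",
    bypass_robots=True, project_name="PROJECT",
)
print(saved)
print(failed)
```

Collecting the failures, instead of using a bare except with pass, makes it much easier to see why a particular download did not work.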