Reddit search API not giving all results

import praw

def get_data_reddit(search):
    username=""
    password=""
    r = praw.Reddit(user_agent='')
    r.login(username,password,disable_warning=True)
    posts=r.search(search, subreddit=None,sort=None, syntax=None,period=None,limit=None)
    title=[]
    for post in posts:
        title.append(post.title)
    print len(title)


search="stackoverflow"
get_data_reddit(search)
        

Output=953

Why the limitation?

  1. The [documentation][1] mentions:

"We can at most get 1000 results from every listing, this is an upstream limitation by reddit. There is nothing we can do to go past this limit. But we may be able to get the results we want with the search() method instead."

Any workaround? I am hoping there is some way to overcome this within the API. I wrote a scraper for Twitter data and found that scraping is not the most efficient solution.

Same question: https://github.com/praw-dev/praw/issues/430. Please refer to that link for related discussion too.

[1]: https://praw.readthedocs.org/en/v2.0.15/pages/faq.html

There are 2 answers

16
Peter Brittain On BEST ANSWER

Limiting results on a search or list is a common tactic for reducing load on servers. The reddit API is clear that this is what it does (as you have already flagged). However it doesn't stop there...

The API also supports a variation of paged results for listings. Since it is a constantly changing database, they don't provide pages, but instead allow you to pick up where you left off by using the 'after' parameter. This is documented here.
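To make the 'after' mechanism concrete, here is a rough sketch of cursor-style paging done directly over HTTP rather than through PRAW. It assumes the public JSON endpoint https://www.reddit.com/search.json and the usual listing shape (a "data" object holding "children" and an "after" cursor). The function names, the user agent string and the injectable fetch_page argument are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public JSON search endpoint; adjust if you use OAuth.
SEARCH_URL = "https://www.reddit.com/search.json"


def fetch_search_page(params):
    """Fetch one page of search results and return the decoded JSON listing."""
    url = SEARCH_URL + "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(url, headers={"User-Agent": "paging sketch"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def iter_search_results(query, fetch_page=fetch_search_page, page_size=100):
    """Yield result dicts page by page, following the 'after' cursor."""
    after = None
    while True:
        params = {"q": query, "limit": page_size}
        if after:
            params["after"] = after  # resume where the previous page stopped
        listing = fetch_page(params)["data"]
        for child in listing["children"]:
            yield child["data"]
        after = listing.get("after")
        if not after:  # no cursor means there are no further pages
            break
```

Iterating over iter_search_results("stackoverflow") would then walk the pages one after another. This only shows how the cursor is passed along; it does not lift the upstream cap on how many results a listing returns.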

Now, while I'm not familiar with PRAW, I see that the reddit search API conforms to the listing syntax. I think you therefore only need to reissue your search, specifying the extra 'after' parameter (referring to your last result from the first search).

Having subsequently tried it out, it appears PRAW is genuinely returning you all the results you asked for.

As requested by OP, here's the code I wrote to look at the paged results.

import praw

def get_data_reddit(search, after=None):
    r = praw.Reddit(user_agent='StackOverflow example')
    params = {"q": search}
    if after:
        params["after"] = "t3_" + str(after.id)
    posts = r.get_content(r.config['search'] % 'all', params=params, limit=100)
    return posts

search = "stackoverflow"
post = None
count = 0
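while True:
    posts = get_data_reddit(search, post)
    fetched = 0
    for post in posts:
        print(str(post.id))
        count += 1
        fetched += 1
    print(count)
    if fetched == 0:  # an empty page means there is nothing left to fetch
        break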
5
NDevox On

I would simply loop through a predetermined set of search queries. I'm assuming period is a time period? I'm not sure what the format for it would be, so the code below is largely made up, but you should get the gist.

In that case it would be something like the following:

import praw

def get_data_reddit(search):
    username=""
    password=""
    r = praw.Reddit(user_agent='')
    r.login(username,password,disable_warning=True)
    title=[]

    periods = (time1, time2, time3, time4)  # declare a set of times to use in the search query to limit results

    for period in periods:  # loop through the different time periods and query the posts from each one.
        posts=r.search(search, subreddit=None,sort=None, syntax=None,period=period,limit=None)  # pass the current period so each query is limited to it.

        for post in posts:
            title.append(post.title)  # and append as usual.
    print len(title)


search="stackoverflow"
get_data_reddit(search)
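One caveat when combining several searches like this: if the periods overlap, the same post can come back more than once, so the final count would be inflated. A minimal sketch of de-duplicating by post id follows. It assumes each result exposes an id attribute, as PRAW submissions do, and merge_unique is a made-up helper name, not part of PRAW.

```python
def merge_unique(result_sets):
    """Combine several iterables of posts, keeping the first copy of each id."""
    seen = set()
    merged = []
    for posts in result_sets:
        for post in posts:
            if post.id in seen:  # already collected from an earlier search
                continue
            seen.add(post.id)
            merged.append(post)
    return merged
```

In the loop above you would collect the result of each r.search call into a list, pass that list to merge_unique, and then read the titles from the merged posts.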