Gathering all top-level comments from r/worldnews live thread


I'm a student trying to get all top-level comments from this r/worldnews live thread: https://www.reddit.com/r/worldnews/comments/1735w17/rworldnews_live_thread_for_2023_israelhamas/ for a school research project. I'm coding in Python, using PRAW (the Python Reddit API Wrapper) and the pandas library. Here's the code I've written so far:

    import praw
    import pandas as pd

    # Assumes an authenticated client, e.g.:
    # reddit = praw.Reddit(client_id=..., client_secret=..., user_agent=...)

    url = "https://www.reddit.com/r/worldnews/comments/1735w17/rworldnews_live_thread_for_2023_israelhamas/"
    submission = reddit.submission(url=url)
    comments_list = []

    def process_comment(comment):
        if isinstance(comment, praw.models.Comment) and comment.is_root:
            comments_list.append({
                'author': comment.author.name if comment.author else '[deleted]',
                'body': comment.body,
                'score': comment.score,
                'edited': comment.edited,
                'created_utc': comment.created_utc,
                'permalink': f"https://www.reddit.com{comment.permalink}"
            })

    submission.comments.replace_more(limit=None, threshold=0)
    for top_level_comment in submission.comments.list():
        process_comment(top_level_comment)

    comments_df = pd.DataFrame(comments_list)

But the code times out when limit=None, and using other limits (100, 300, 500) only returns ~700 comments. Any help gathering all the top-level comments from this Reddit thread would be greatly appreciated.

I've looked at probably hundreds of pages of documentation/Reddit threads and tried the following techniques:

  • Coding a "timeout" for the Reddit API, then, after the break, continuing on with gathering comments
  • Gathering comments in batches, then calling replace_more again, but to no avail
  • Reading the Reddit API rate-limit documentation, in hopes that there is a method to bypass these limits
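One variant of the batching idea that may be worth a try (a sketch, not tested against this thread): call replace_more with a small limit in a loop and sleep between batches, so no single call has to resolve everything at once. Per the PRAW docs, CommentForest.replace_more returns the MoreComments placeholders it did not replace, so the loop can run until that list is empty. The batch and pause values below are guesses you would need to tune against the rate limit.

```python
import time

def fetch_top_level(submission, batch=32, pause=2.0):
    """Expand the comment forest in small batches, pausing between
    batches so one huge replace_more call can't time out.

    replace_more returns the MoreComments placeholders it skipped;
    an empty return value means the forest is fully expanded.
    """
    while submission.comments.replace_more(limit=batch, threshold=0):
        time.sleep(pause)
    # Iterating the forest directly yields only top-level comments.
    return [c for c in submission.comments if c.is_root]
```

For a thread this size the loop will make many requests, so expect it to run for a while even when it no longer times out.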

There is 1 answer

jeffreyohene

I was able to pull in 190k+ comments using a recursive function instead of the replace_more method to bypass the timeout issue. Maybe this will help:

    url = "https://www.reddit.com/r/worldnews/comments/1735w17/rworldnews_live_thread_for_2023_israelhamas/"
    submission = reddit.submission(url=url)
    comments_list = []

    def process_comment(comment):
        if isinstance(comment, praw.models.Comment) and comment.is_root:
            comments_list.append({
                'author': comment.author.name if comment.author else '[deleted]',
                'body': comment.body,
                'score': comment.score,
                'edited': comment.edited,
                'created_utc': comment.created_utc,
                'permalink': f"https://www.reddit.com{comment.permalink}"
            })

    def gather_comments(comment_list):
        for comment in comment_list:
            if isinstance(comment, praw.models.MoreComments):
                try:
                    # Splice the fetched comments in place of the placeholder.
                    i = comment_list.index(comment)
                    comment_list = (
                        comment_list[:i]
                        + comment.comments()
                        + comment_list[i + 1:]
                    )
                except Exception as e:
                    print(f"Error replacing MoreComments: {e}")
            else:
                process_comment(comment)

        # Recurse until no MoreComments placeholders remain.
        if any(isinstance(comment, praw.models.MoreComments) for comment in comment_list):
            gather_comments(comment_list)

    # Materialize the forest as a plain list so it can be spliced into.
    top_level_comments = list(submission.comments)
    gather_comments(top_level_comments)

    # Create DataFrame
    comments_df = pd.DataFrame(comments_list)
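An alternative to the recursion above (a sketch, written with caller-supplied hooks so it isn't tied to PRAW): expand the forest iteratively with an explicit queue. This avoids Python's recursion limit on deeply nested placeholder chains and the repeated O(n) list.index splices. Here is_more and expand are hypothetical hook names you would wire up yourself.

```python
from collections import deque

def expand_forest(comments, is_more, expand):
    """Iteratively flatten a comment forest with an explicit queue.

    is_more(c) -> True when c is a placeholder (e.g. MoreComments)
    expand(c)  -> the comments hidden behind that placeholder
    """
    resolved = []
    queue = deque(comments)
    while queue:
        comment = queue.popleft()
        if is_more(comment):
            # Fetch the hidden comments and queue them for processing.
            queue.extend(expand(comment))
        else:
            resolved.append(comment)
    return resolved

# With PRAW this would be wired up roughly as:
#   resolved = expand_forest(
#       submission.comments,
#       is_more=lambda c: isinstance(c, praw.models.MoreComments),
#       expand=lambda m: m.comments(),
#   )
#   for comment in resolved:
#       process_comment(comment)
```

Each expand call still costs one API request, so the total request count is the same as the recursive version; the difference is purely in robustness and bookkeeping.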