Extracting Disqus comments using Python and Beautiful Soup

This question is similar to the one asked here, but the answer was not of much help.

I am trying to extract comments from a webpage which uses Disqus, but I am not able to access the comments section.

This is what I have so far; it's not much:

import urllib.request
from bs4 import BeautifulSoup

# Fetch the article page with a browser-like User-Agent header
site = "http://www.timesofmalta.com/articles/view/20161207/local/daphne-caruana-galizia-among-politicos-28-most-influential.633146"
hdr = {'User-Agent': 'Mozilla/5.0', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}
req = urllib.request.Request(site, headers=hdr)
page = urllib.request.urlopen(req)

# Parse the static HTML and print the page title
soup = BeautifulSoup(page, "html.parser")
title = soup.title.text
print(title)

Any hints as to how I could attempt to tackle this?

There is 1 answer

dyelamos

I had the same issue while trying to scrape an infinite-scroll page in Java. After trying a million things, including Beautiful Soup, I realized that the best way to tackle this problem was to debug with Chrome: find the URL of the request that is sent as the dynamic content loads, and then work out the pattern in that URL so that I could call it with different parameters.

So, for example, if you have the Chrome developer tools open when you trigger the infinite scroll, you will see an HTTP request (probably an HTTP GET) go out. If the URL has a structure like:

http://www.yourlink.com/get_comments/product/page_offset_numbertoload/

you will be able to build an HTTP request with Python, send it, and read the response, which contains the data you are looking for. Good luck, man!
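
As a rough sketch of that idea, the snippet below fills in a URL pattern, sends a GET request, and pages through the results. Everything specific in it is an assumption made for illustration: the URL_TEMPLATE is only one possible reading of the example URL above, the helper names (build_url, fetch_json, iter_comments) are made up, and the code assumes the endpoint returns a JSON list of comments. The real request URL and response format have to be read off the browser's developer tools.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical pattern based on the example URL above; replace it with the
# request URL that actually appears in the browser's developer tools.
URL_TEMPLATE = "http://www.yourlink.com/get_comments/{product}/{offset}_{count}/"
HEADERS = {"User-Agent": "Mozilla/5.0"}


def build_url(product, offset, count):
    """Fill the assumed URL pattern with concrete paging values."""
    return URL_TEMPLATE.format(product=product, offset=offset, count=count)


def fetch_json(url):
    """Send an HTTP GET and decode the body, assuming the endpoint returns JSON."""
    request = Request(url, headers=HEADERS)
    with urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def iter_comments(product, count=20, max_pages=5, fetch=fetch_json):
    """Yield comments page by page until a page comes back empty.

    Assumes each response is a JSON list of comments; adapt this to the
    real response format.
    """
    for page in range(max_pages):
        batch = fetch(build_url(product, page * count, count))
        if not batch:
            break
        yield from batch
```

Calling list(iter_comments("some-product")) would then request offsets 0, 20, 40, and so on, stopping at the first empty page or after max_pages requests. The fetch parameter is there so the paging logic can be tried with a stand-in function before pointing it at a real site.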