ScraperWiki: how to save data into one cell in a table


Here is my scraper code, which extracts each idea's URL and the corresponding comments from that page:

import scraperwiki
import lxml.html
from BeautifulSoup import BeautifulSoup
import urllib2
import re

for num in range(1,2):
    html_page = urllib2.urlopen("https://success.salesforce.com/ideaSearch?keywords=error&pageNo="+str(num))
    soup = BeautifulSoup(html_page)
    for i in range(0,10):
        for link in soup.findAll('a',{'id':'search:ForumLayout:searchForm:itemObj2:'+str(i)+':idea:recentIdeasComponent:profileIdeaTitle'}):
             pageurl = link.get('href')
             html = scraperwiki.scrape(pageurl)
             root = lxml.html.fromstring(html)

             for j in range(0,300):
                 for table in root.cssselect("span[id='ideaView:ForumLayout:ideaViewForm:cmtComp:ideaComments:cmtLoop:"+str(j)+":commentBodyOutput'] table"):
                     divx = table.cssselect("div[class='htmlDetailElementDiv']")
                     if len(divx)==1:
                         data = {
                             'URL' : pageurl,
                             'Comment' : divx[0].text_content()
                         }
                         print data


             scraperwiki.sqlite.save(unique_keys=['URL'], data=data)
             scraperwiki.sqlite.save(unique_keys=['Comment'], data=data)

When the data is saved to the ScraperWiki datastore, only the last comment from each URL is put into the table. What I would like is for each URL's row to hold all of that URL's comments: one column with the URL and a second column with all the comments from that URL, instead of just the last comment, which is what this code ends up with.
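A minimal sketch of the underlying problem, with made-up comment strings standing in for the scraped values: each pass through the inner loop rebinds data to a brand-new dict, so everything built earlier is discarded and only the last dict survives to the save step.

comments = ["first comment", "second comment", "third comment"]  # stand-ins for divx[0].text_content()
pageurl = "http://example.com/idea"  # hypothetical URL

for c in comments:
    data = {'URL': pageurl, 'Comment': c}  # rebinds data, discarding the previous dict

print data  # {'URL': 'http://example.com/idea', 'Comment': 'third comment'}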

There is 1 answer.

Answer by zhangyangyu:

As I can see from your code, you build data in the innermost for loop and assign it a new value on every iteration, so when the loops finish and execution reaches the save step, data holds only the last comment. I think you may use something like this instead (note that text_content must be called, and the list of comments has to be joined into a single string before it can be stored in SQLite):

for i in range(0, 10):
    for link in soup.findAll('a', {'id': 'search:ForumLayout:searchForm:itemObj2:' + str(i) + ':idea:recentIdeasComponent:profileIdeaTitle'}):
        pageurl = link.get('href')
        html = scraperwiki.scrape(pageurl)
        root = lxml.html.fromstring(html)
        # one dict per URL; the comments for this page accumulate in the list
        data = {'URL': pageurl, 'Comment': []}

        for j in range(0, 300):
            for table in root.cssselect("span[id='ideaView:ForumLayout:ideaViewForm:cmtComp:ideaComments:cmtLoop:" + str(j) + ":commentBodyOutput'] table"):
                divx = table.cssselect("div[class='htmlDetailElementDiv']")
                if len(divx) == 1:
                    # call text_content(); appending the bare method would store a function object
                    data['Comment'].append(divx[0].text_content())

        # SQLite cannot store a Python list, so join the comments into one string
        data['Comment'] = '\n'.join(data['Comment'])
        scraperwiki.sqlite.save(unique_keys=['URL'], data=data)
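As a side note, scraperwiki.sqlite.save upserts on the unique_keys columns: a second save with the same key value replaces the existing row rather than adding a new one, which is why the comments have to be accumulated and joined before the single save above. A small sketch with hypothetical values, assuming the default swdata table:

import scraperwiki

scraperwiki.sqlite.save(unique_keys=['URL'], data={'URL': 'http://example.com/idea', 'Comment': 'first'})
scraperwiki.sqlite.save(unique_keys=['URL'], data={'URL': 'http://example.com/idea', 'Comment': 'second'})

# only one row remains, and its Comment is 'second' -- the first save was overwritten
print scraperwiki.sqlite.select("* from swdata")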