How to loop over a response element in Scrapy?


I am trying to write a scraper in Python with Scrapy. At this point I am trying to get the name of the webpage and all of the outbound links within the page. The output should be a dictionary like this:

        {'link': [u'Link1'], 'title': [u'Page title']}

I have created this code:

from scrapy.spider import Spider
from scrapy import Selector
from socialmedia.items import SocialMediaItem

class MySpider(Spider):
    name = 'smm'
    allowed_domains = ['*']
    start_urls = ['http://en.wikipedia.org/wiki/Social_media']
    def parse(self, response):
        items =[]
        for link in response.xpath("//a"):
            item = SocialMediaItem()
            item['title'] = link.xpath('text()').extract()
            item['link'] = link.xpath('@href').extract()
            items.append(item)
            yield items

Could anyone help me get this result? I adapted the code from this page, http://mherman.org/blog/2012/11/05/scraping-web-pages-with-scrapy/, updating it to avoid the deprecated functions. Thank you so much!

Dani

1 Answer

Accepted answer by alecxe:

If I understand correctly, you want to iterate over all of the links on the page and extract each link's URL and title.

Get all a tags via the //a XPath, then extract text() and @href from each one:

def parse(self, response):
    for link in response.xpath("//a"):
        item = SocialMediaItem()
        item['title'] = link.xpath('text()').extract()
        item['link'] = link.xpath('@href').extract()
        yield item

This yields:

{'link': [u'#mw-navigation'], 'title': [u'navigation']}
{'link': [u'#p-search'], 'title': [u'search']}
...
{'link': [u'/wiki/Internet_forum'], 'title': [u'Internet forums']}
...
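Note that the extracted hrefs are relative to the page, and some (like #mw-navigation) are same-page anchors rather than outbound links. As a sketch of one way to post-process the yielded items using only the standard library (the normalize_links helper and its base_url argument are my own illustration, not part of Scrapy):

```python
from urllib.parse import urljoin

def normalize_links(items, base_url):
    """Resolve relative hrefs against the page URL and drop
    in-page fragment links such as '#mw-navigation'."""
    result = []
    for item in items:
        links = item.get('link') or []
        if not links:
            continue  # anchor had no href attribute
        href = links[0]
        if href.startswith('#'):
            continue  # same-page anchor, not an outbound link
        result.append({'title': item.get('title'),
                       'link': [urljoin(base_url, href)]})
    return result

# Items shaped like the spider's output above:
items = [
    {'link': ['#mw-navigation'], 'title': ['navigation']},
    {'link': ['/wiki/Internet_forum'], 'title': ['Internet forums']},
]
print(normalize_links(items, 'http://en.wikipedia.org/wiki/Social_media'))
```

In a real spider this cleanup could equally be done inline in parse() before yielding each item; the standalone function just keeps the example runnable on its own.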

Also, note that there are Link Extractors built into Scrapy:

LinkExtractors are objects whose only purpose is to extract links from web pages (scrapy.http.Response objects) which will be eventually followed.