I am trying to write a scraper with Scrapy for Python. At this point, I am trying to get the name of the webpage and all the outbound links within the page. The output should be a dictionary like this:
{'link': [u'Link1'], 'title': [u'Page title']}
I have created this code:
    from scrapy.spider import Spider
    from scrapy import Selector
    from socialmedia.items import SocialMediaItem

    class MySpider(Spider):
        name = 'smm'
        allowed_domains = ['*']
        start_urls = ['http://en.wikipedia.org/wiki/Social_media']

        def parse(self, response):
            items = []
            for link in response.xpath("//a"):
                item = SocialMediaItem()
                item['title'] = link.xpath('text()').extract()
                item['link'] = link.xpath('@href').extract()
                items.append(item)
            yield items
Could anyone help me get this result? I've adapted the code from this page: http://mherman.org/blog/2012/11/05/scraping-web-pages-with-scrapy/
and updated it to avoid the deprecated functions. Thank you so much!
Dani
If I understand correctly, you want to iterate over all of the links on the page and extract each link's URL and title.
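Get all a tags via the //a XPath expression and extract text() and @href from each one. This yields one item per link, with the title and link fields requested in the question.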
Also, note that Scrapy has built-in Link Extractors.