I am trying to extract information from certain links, but the spider never follows them. It only extracts from the start URL, and I am not sure why.
Here is my code:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from tutorial.items import DmozItem
from scrapy.selector import HtmlXPathSelector

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python"
    ]
    rules = [Rule(SgmlLinkExtractor(allow=[r'Books']), callback='parse')]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = DmozItem()
        # Extract links
        item['link'] = hxs.select("//li/a/text()").extract()  # Xpath selector for tag(s)
        print item['title']
        for cont, i in enumerate(item['link']):
            print "link: ", cont, i

I don't get the links from "http://www.dmoz.org/Computers/Programming/Languages/Python/Books"; instead, I get the links from "http://www.dmoz.org/Computers/Programming/Languages/Python".
Why?

For rules to work, you need to subclass CrawlSpider, not the generic scrapy.Spider.

Also, you need to rename your parsing callback to something other than parse. Otherwise, you will be overriding an important method of CrawlSpider and the rules will not work. See the warning in the docs: http://doc.scrapy.org/en/0.24/topics/spiders.html?highlight=rules#crawlspider

Your code was scraping the links from "http://www.dmoz.org/Computers/Programming/Languages/Python" because the rules attribute was being ignored by the generic Spider. This code should work: