I'm trying to scrape the title and URL of all Khan Academy pages under the math, science, and economics sections. However, it currently outputs only an open bracket, and before this happened it scraped only the start URL.
from openbar_index.items import OpenBarIndexItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class OpenBarSpider(CrawlSpider):
    """
    scrapes website URLs from educational websites and commits urls/webpage names/text to a document
    """
    name = 'openbar'
    allowed_domains = 'khanacademy.org'
    start_urls = [
        "https://www.khanacademy.org"
    ]
    rules = [
        Rule(SgmlLinkExtractor(allow=['/math/']), callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=['/science/']), callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=['/economics-finance-domain/']), callback='parse_item', follow=True)
    ]

    def parse_item(self, response):
        item = OpenBarIndexItem()
        url = response.url
        item['url'] = url
        item['title'] = response.xpath('/html/head/title/text()').extract()
        yield item
Does anyone have an idea why this is happening or tips on how to fix it?
The problem is the assignment to allowed_domains. According to the documentation, this must be a list, not a string. With a string, Scrapy finds no valid domain, so the results are probably being filtered out as offsite requests. Adding square brackets, i.e. allowed_domains = ['khanacademy.org'], should fix it.
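To see why a plain string can behave this way, here is a minimal sketch. It assumes the offsite filter builds one host pattern by iterating over allowed_domains; build_host_regex is a hypothetical helper written for illustration, not Scrapy's actual code, and the real implementation may differ between versions. Iterating over a string yields single characters, so no real hostname can match.

```python
import re


def build_host_regex(allowed_domains):
    # Hypothetical helper (an assumption, not Scrapy's API): join every
    # element of allowed_domains into one alternation of allowed hosts.
    alternatives = '|'.join(re.escape(d) for d in allowed_domains)
    return re.compile(r'^(.*\.)?(%s)$' % alternatives)


host = 'www.khanacademy.org'

# A string is iterated character by character: 'k', 'h', 'a', ...
as_string = build_host_regex('khanacademy.org')
# A list is iterated domain by domain: 'khanacademy.org'
as_list = build_host_regex(['khanacademy.org'])

string_matches = bool(as_string.search(host))
list_matches = bool(as_list.search(host))

print(string_matches, list_matches)
```

Under this assumption, the string version only accepts hosts that end in a single allowed character, so every link from the start page looks offsite, while the list version accepts www.khanacademy.org as expected.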