Scrapy only outputting an open bracket


I'm trying to scrape the title and URL of every Khan Academy page under the math, science, and economics sections. Currently the output file contains only an opening bracket. Before that, the spider scraped only the start URL.

from openbar_index.items import OpenBarIndexItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class OpenBarSpider(CrawlSpider):
    """
    scrapes website URLs from educational websites and commits urls/webpage names/text to a document
    """

    name = 'openbar'
    allowed_domains = 'khanacademy.org'
    start_urls = [

        "https://www.khanacademy.org"

    ]

    rules = [
        Rule(SgmlLinkExtractor(allow=['/math/']), callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=['/science/']), callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=['/economics-finance-domain/']), callback='parse_item', follow=True)
    ]

    def parse_item(self, response):
        item = OpenBarIndexItem()
        url = response.url
        item['url'] = url
        item['title'] = response.xpath('/html/head/title/text()').extract()
        yield item

Does anyone have an idea why this is happening or tips on how to fix it?

Answered by Frank Martin

The problem is the assignment to allowed_domains. According to the documentation, this must be a list, not a string. With a string, Scrapy finds no valid domain, so the extracted links are filtered out as offsite requests and parse_item is never called for them.

Adding square brackets, as in the next line, should fix it:

    allowed_domains = ['khanacademy.org']
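
To see why a bare string cannot work, here is a minimal sketch of a domain check. It is not Scrapy's actual implementation; is_allowed is a hypothetical helper written only for illustration. It assumes the check loops over allowed_domains, which is enough to show the effect: looping over a Python string yields single characters, so no real host ever matches.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Hypothetical helper (not Scrapy's code): True if the URL's host
    equals one of the allowed domains or is a subdomain of one."""
    host = urlparse(url).hostname or ""
    # Iterating over a str yields one character at a time, so a bare
    # string is treated as a collection of one-letter "domains".
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


url = "https://www.khanacademy.org/math/algebra"
print(is_allowed(url, "khanacademy.org"))    # False: compared against 'k', 'h', 'a', ...
print(is_allowed(url, ["khanacademy.org"]))  # True: compared against the whole domain
```

With the list form, the whole domain is compared against the host, so pages under www.khanacademy.org pass the check.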