Scrapy - how do I load data from the database in ItemLoader before sending it to the pipeline?


I have a PostgreSQL table brands with columns such as id and name, among others.

My (simplified) code - MySpider.py:

import scrapy

import DB
# Car and CarLoader are defined in items.py and itemsloaders.py (see below)

class MySpider(scrapy.Spider):
  db = DB.connect()

  def start_requests(self):
    urls = ['https://www.website.com']
    for url in urls:
      yield scrapy.Request(url=url, callback=self.parse)

  def parse(self, response):
    cars = response.css('...')
    for car in cars:
      loader = CarLoader(item=Car(), selector=car)
      # at this point the scraped brand name is stored, not brands.id
      loader.add_value('brand_id', car.css('...').get())
      ...

items.py:

import scrapy
class Car(scrapy.Item):
    name = scrapy.Field()
    brand_id = scrapy.Field()
    established = scrapy.Field()
    ...

itemsloaders.py:

from itemloaders.processors import TakeFirst, MapCompose
from scrapy.loader import ItemLoader

class CarLoader(ItemLoader):
    default_output_processor = TakeFirst()

When I save a new item to the database (this happens in pipeline.py), I don't want to store the car's brand name (BMW, Audi, etc.) in the cars.brand_id column, but the brand's ID (which is stored in brands.id).

What's the proper way to do that? I need to look up the brand name in the brands table and save the matching ID to cars.brand_id, but where should this operation live so that it is logical and Scrapy-correct?

I have considered doing this in MySpider.py or in pipeline.py (and have tried both), but it feels a bit dirty, as if it does not belong there.

It seems that this functionality should be placed in itemsloaders.py, but the purpose of this file is still a bit of a mystery to me. How do I resolve this?
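
To make the lookup concrete, here is a minimal sketch of one possible shape for it, not a definitive answer. It assumes the brands table is small enough to load into memory once. BrandLookup is a hypothetical helper name, and an in-memory sqlite3 database stands in for PostgreSQL so that the snippet runs on its own; with a real PostgreSQL driver the query would typically go through a cursor instead.

```python
import sqlite3


class BrandLookup:
    """Hypothetical helper: maps a brand name to its brands.id.

    The whole brands table is read once and cached in a dict, so no
    query is issued per scraped item (assumes the table is small).
    """

    def __init__(self, connection):
        rows = connection.execute("SELECT id, name FROM brands").fetchall()
        # Normalise names so that " bmw " and "BMW" hit the same entry.
        self._ids = {name.strip().lower(): brand_id for brand_id, name in rows}

    def __call__(self, brand_name):
        # Returns None for an unknown brand.
        return self._ids.get(brand_name.strip().lower())


# Stand-in database for the example; the real project would pass its
# own connection here instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE brands (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO brands (id, name) VALUES (?, ?)",
    [(1, "BMW"), (2, "Audi")],
)

lookup = BrandLookup(conn)
print(lookup(" bmw "), lookup("Audi"), lookup("Tesla"))

# In itemsloaders.py the callable could then be attached as a
# field-specific input processor (needs Scrapy, so shown as a comment):
#
#   class CarLoader(ItemLoader):
#       default_output_processor = TakeFirst()
#       brand_id_in = MapCompose(lookup)
```

With this wiring the spider keeps calling add_value('brand_id', ...) with the scraped brand name, and the loader converts it to the ID before the item reaches the pipeline. Since MapCompose ignores None results, an unknown brand would simply leave brand_id unset, which the pipeline would then have to handle.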
