Using Sidekiq with Nokogiri for scraping


I'm using Rails with Nokogiri. I have some heavy scraping tasks that I would like to execute in the background with Sidekiq.

The problem is, I followed the three steps mentioned on sidekiq.org but nothing happened. What am I missing?

Below is one of my scrapes without Sidekiq. It works fine, but a couple of scrapes like this make the page load very slowly.

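#my controller
require 'open-uri'

# URI.open replaces the bare open(url) call, which Ruby 3.0 no longer supports
doc = Nokogiri::HTML(URI.open("http://www.example.com"))
@head = {}
# Map each link's text to its href
doc.xpath('//div[5]/h3/a').each do |link|
  @head[link.text.strip] = link['href']
end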

#my view
<% if @head %>
  <% @head.each do |key, value| %>
    <a href="<%= value %>" target='_blank'><%= key %></a>
  <% end %>
<% end %>

What follows is my attempt to use Sidekiq:

#my controller
class HomeController < ApplicationController
  HardWorker.index_async('index', 1)
end

#my hard_worker
class HardWorker
  include Sidekiq::Worker
  def index
    doc = Nokogiri::HTML(open("http://www.example.com"))
    @head = {}
    doc.xpath('//div[5]/h3/a').each do |link|
      @head[link.text.strip] = link['href']
    end
  end
end

#my view
the same

There is 1 answer

Answer by the Tin Man (best answer)

If you're on a *nix host, I'd recommend running a separate, non-Rails Ruby script that is allowed to talk to the database and update a summary table containing the information you need to return to clients. There is no reason to have it run inside Rails or even to load the Rails stack.

You can use rails runner to run Ruby code:

runner runs Ruby code in the context of Rails non-interactively.

The code will have access to Active Record and can use all the same Rails-like configuration and methods. It just won't load the web side of the stack, which makes it much lighter and faster to load.

Use cron to periodically fire off that separate Ruby script. The script loops through a table or YAML file containing the URLs to process, then inserts the results.