The following program does almost everything I want, but it won't write the scraped image files to disk. The latest error is "No such file or directory" for the basename of one of the image files I am trying to download: Error: No such file or directory - h3130gy1-3-7ec5.jpg. It should be creating the new file, so I guess I'm doing something wrong. Ideally the program would write each image to disk, named after the basename of the absolute URL it was fetched from. I would also like the spreadsheet to record the basename of each scraped image in the output file that is being compiled.
require "capybara/dsl"
require "spreadsheet"
require "fileutils"
require "open-uri"
LOCAL_DIR = 'data-hold/images'
FileUtils.makedirs(LOCAL_DIR) unless File.exist?(LOCAL_DIR)
Capybara.run_server = false
Capybara.default_driver = :selenium
Capybara.default_selector = :xpath
Spreadsheet.client_encoding = 'UTF-8'
class Tomtop
  include Capybara::DSL
  def initialize
    @excel = Spreadsheet::Workbook.new
    @work_list = @excel.create_worksheet
    @row = 0
  end
  def go
    visit_main_link
  end
  def visit_main_link
    visit "http://www.example.com/clothing-accessories?dir=asc&limit=72&order=position"
    results = all("//h5/a[contains(@onclick, 'analyticsLog')]")
    item = []
    results.each do |a|
      item << a[:href]
    end
    item.each do |link|
      visit link
      save_item
    end
    @excel.write "inventory.csv"
  end
  def save_item
    data = all("//*[@id='content-wrapper']/div[2]/div/div")
    data.each do |info|
      @work_list[@row, 0] = info.find("//*[@id='productright']/div/div[1]/h1").text
      price = info.first("//div[contains(@class, 'price font left')]")
      @work_list[@row, 1] = (price.text.to_f * 1.33).round(2) if price
      @work_list[@row, 2] = info.find("//*[@id='productright']/div/div[11]").text
      @work_list[@row, 3] = info.find("//*[@id='tabcontent1']/div/div").text.strip
      color = info.all("//dd[1]//select[contains(@name, 'options')]//*[@price='0']")
      @work_list[@row, 4] = color.collect(&:text).join(', ')
      size = info.all("//dd[2]//select[contains(@name, 'options')]//*[@price='0']")
      @work_list[@row, 5] = size.collect(&:text).join(', ')
      imagelink = info.all("//*[@rel='lightbox[rotation]']")
      @work_list[@row, 6] = imagelink.map { |link| link['href'] }.join(', ')
      image = imagelink.map { |link| link['href'] }
      File.open (File.basename("#{LOCAL_DIR}/#{image}", 'w')) do |f|
        f.write(open(image).read)
      end
      @row = @row + 1
    end
  end
end
tomtop = Tomtop.new
tomtop.go
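It appears that you have a misplaced parenthesis. This line:
File.open (File.basename("#{LOCAL_DIR}/#{image}", 'w')) do |f|
should be this:
File.open(File.basename("#{LOCAL_DIR}/#{image}"), 'w') do |f|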
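To see why the misplaced parenthesis produces "No such file or directory", here is a small sketch. The URL is a made-up example, and the temporary directory is only there so the snippet does not touch your working folder. With the parenthesis in the wrong place, 'w' is handed to File.basename as its optional suffix argument instead of being the mode for File.open, so File.open receives just a filename and falls back to read mode.

```ruby
require "tmpdir"

# Made-up URL, used only for illustration.
url = "http://www.example.com/media/h3130gy1-3-7ec5.jpg"

# Misplaced parenthesis: 'w' becomes File.basename's suffix argument.
# The name does not end in "w", so nothing is stripped and only the
# last path component comes back.
misparsed = File.basename("data-hold/images/#{url}", 'w')
puts misparsed

# File.open now gets a path and no mode, so it tries to READ a file
# that has not been created yet and raises Errno::ENOENT.
error = nil
Dir.mktmpdir do |dir|
  begin
    File.open(File.join(dir, misparsed)) { |f| f.read }
  rescue Errno::ENOENT => e
    error = e
  end
end
puts error.class
```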
But on further investigation of your code, File.basename is being applied to the wrong string here. Once I got your code to run, it wrote the images into the folder containing scraper.rb rather than into LOCAL_DIR, because the basename drops the directory part. So what I think you really want for that line is this:
File.open("#{LOCAL_DIR}/#{File.basename(image)}", 'w') do |f|
After making that change, I hit the next problem: 'image' is an array containing many URLs (it is the result of imagelink.map), so it cannot be passed to File.basename or open directly.
Depending on what you are trying to achieve, you may need to do some additional filtering to narrow it down to a single image, or rename it to 'images' and use the following code:
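One way that loop could look, as a sketch rather than a definitive implementation. The helper name save_images and its fetch parameter are my own additions, not part of your script: fetch exists only so the download step can be swapped out, and by default it reads the URL with open-uri's URI.open (plain open(url) no longer accepts URLs on Ruby 3.0 and later). I also open the file in 'wb' mode on the assumption that the images are binary data.

```ruby
require "fileutils"
require "open-uri"

# Hypothetical helper: saves every URL in `images` into `dir`, naming each
# file after the basename of its URL. Returns the basenames it wrote.
# `fetch` is an assumption added for illustration; it takes a URL and
# returns the body as a string.
def save_images(images, dir, fetch: ->(url) { URI.open(url, &:read) })
  FileUtils.makedirs(dir)
  images.map do |url|
    name = File.basename(url)
    # Join the directory and the basename, then open for binary writing.
    File.open(File.join(dir, name), 'wb') { |f| f.write(fetch.call(url)) }
    name
  end
end

# Possible use inside save_item (untested against the real site):
#   images = imagelink.map { |link| link['href'] }
#   names  = save_images(images, LOCAL_DIR)
#   @work_list[@row, 6] = names.join(', ')
```

Because the helper returns the basenames, the same call can also feed the spreadsheet column, which covers the second part of your question.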