I'm trying to scrape the source images from Google's Earth View page, while renaming them to meaningful filenames so that each filename contains the city or country, in addition to its file number.
I achieved this by going through the JSON files: I found that I can look up each file by number and get redirected to the appropriately slugged file. For example, /_api/1003.json redirects to /_api/australia-1003.json.
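A minimal check of that lookup, in case it helps anyone reproduce this (assuming the site still behaves the way it did for me):

require 'net/http'
require 'uri'

# The numbered lookup answers with a 301 pointing at the slugged file.
res = Net::HTTP.get_response(URI.parse("http://earthview.withgoogle.com/_api/1003.json"))
puts res.code        # "301"
puts res['location'] # ".../_api/australia-1003.json"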
However, my tool breaks on files 1354, 1355, 2071, 2090, 2297, 2299, 5597, 6058, raising the following error:
/user/.rvm/rubies/ruby-2.2.1/lib/ruby/2.2.0/open-uri.rb:354:in `rescue in
open_http': 301 Moved Permanently (Invalid Location URI)
(OpenURI::HTTPError)
Checking the files manually, I find that these eight are not automatically redirected in the browser (Chrome). Each returns a "Moved Permanently" message pointing to a new location. Interestingly, each has non-Latin characters in its place name, like عندل-yemen and vegaøyan-norway. A quick skim of the ~1500 successfully scraped images doesn't turn up any similar special characters.
Do the non-Latin names break the redirects, or did Earth View disable the redirects for these entries because of the special characters? How do I incorporate these eight skipped files?
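For what it's worth, requesting one of the broken IDs without following the redirect suggests the Location header carries the raw, unencoded slug, which I suspect is what open-uri trips over. A quick check, using 2071 as the example:

require 'net/http'
require 'uri'

# Inspect the raw redirect for one of the failing IDs.
res = Net::HTTP.get_response(URI.parse("http://earthview.withgoogle.com/_api/2071.json"))
puts res.code        # "301"
puts res['location'] # presumably the slug with the non-Latin characters unencoded
begin
  URI.parse(res['location'])
rescue URI::InvalidURIError => e
  puts "open-uri can't follow this: #{e.message}"
end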
The code:
require 'open-uri'
require 'json'
require 'net/http'

@earthview_range = 1003..7023

def scrape_away
  @earthview_range.each do |json_id|
    response = Net::HTTP.get_response(URI.parse("http://earthview.withgoogle.com/_api/#{json_id}.json"))
    if response.code.to_i < 400 # Filter 404s etc. but allow redirects
      # Collect the necessary information
      data_hash = JSON.parse(open("http://earthview.withgoogle.com/_api/#{json_id}.json").read)
      new_file_name = data_hash["slug"]
      photo_url = data_hash["photoUrl"]
      clean_url = photo_url[0..3] + photo_url[5..-1] # Drop the 5th character (the "s" in "https")
      # Create the properly named files
      File.open(new_file_name + '.jpg', 'wb') do |f|
        f.write open(clean_url).read
      end
      puts "[✓]"
    else
      puts "--not found--"
    end
  end
  puts ''
  puts "Task complete. Have fun!"
  exit
end
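Until the redirects are handled properly, wrapping the open-uri call in a rescue would at least let the loop log and skip the offending IDs instead of aborting. A minimal sketch of that idea (fetch_record is just a hypothetical helper, not part of my script):

require 'open-uri'
require 'json'

# Hypothetical helper: fetch one Earth View JSON record, or nil if it 404s
# or (as with the eight problem IDs) open-uri can't follow the redirect.
def fetch_record(json_id)
  JSON.parse(open("http://earthview.withgoogle.com/_api/#{json_id}.json").read)
rescue OpenURI::HTTPError => e
  puts "--skipped #{json_id}: #{e.message}--"
  nil
end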
Edit: I've cobbled together a serviceable workaround thanks to this post. I manually supplied URL paths for the eight special cases, while using the Addressable gem to normalize the non-Latin characters.
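On its own, the normalization just percent-encodes the non-ASCII characters; a quick sketch with one of the problem slugs:

require 'addressable/uri'

uri = Addressable::URI.parse("http://earthview.withgoogle.com/_api/vegaøyan-norway-2297.json")
puts uri.normalize.to_s
# => "http://earthview.withgoogle.com/_api/vega%C3%B8yan-norway-2297.json"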
The updated code, as it stands now:
require 'open-uri'
require 'json'
require 'net/http'
require 'addressable/uri'

# Range starts at 0 so array indexes match the IDs 1-to-1 for the exceptions below.
@earthview_ary = (0..7200).to_a

# Special exception forwarding addresses:
@earthview_ary[1354] = "sanlıurfa-merkez-turkey-1354"
@earthview_ary[1355] = "asagıkaravaiz-turkey-1355"
@earthview_ary[2071] = "عندل-yemen-2071"
@earthview_ary[2090] = "weißwasser-germany-2090"
@earthview_ary[2297] = "vegaøyan-norway-2297"
@earthview_ary[2299] = "herøy-nordland-norway-2299"
@earthview_ary[5597] = "زابل-iran-5597"
@earthview_ary[6058] = "mysove-мисове-crimea-6058"

def scrape_away
  # To save time, skip past the gaps I know to be empty.
  # Earth View's numbering starts at 1000 and skips from 2450 to 5000.
  scrape_ary = @earthview_ary.reject { |n| n.to_i.between?(1, 1000) || n.to_i.between?(2450, 5000) }
  # Begin
  scrape_ary.each do |json_id|
    host = "http://earthview.withgoogle.com"
    path = "/_api/#{json_id}.json"
    uri = Addressable::URI.parse(host + path)
    # The Addressable gem lets us normalize the non-Latin characters.
    # Very important!
    response = Net::HTTP.get_response(URI.parse(host + uri.normalized_path))
    if response.code.to_i < 400 # Filter 404s etc., but allow redirects
      # Collect the necessary information
      data_hash = JSON.parse(open(host + uri.normalized_path).read)
      new_file_name = data_hash["slug"]
      print "#{new_file_name}.jpg... "
      photo_url = data_hash["photoUrl"]
      clean_url = photo_url[0..3] + photo_url[5..-1] # Drop the 5th character (the "s" in "https")
      # Create the properly named files
      File.open(new_file_name + '.jpg', 'wb') do |f|
        f.write open(clean_url).read
      end
      puts "[✓]"
    else
      puts "--not found--"
    end
  end
  puts ''
  puts "Task complete. Have fun!"
  exit
end
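A cleaner approach I haven't fully tested would be to drop the hard-coded exceptions and follow the 301 manually, normalizing whatever Location the server sends before re-requesting it. A rough sketch (follow_json is a made-up helper name):

require 'net/http'
require 'json'
require 'addressable/uri'

HOST = "http://earthview.withgoogle.com"

# Hypothetical helper: resolve an ID by following the redirect by hand,
# normalizing the Location header so non-Latin slugs don't break URI.parse.
def follow_json(json_id)
  response = Net::HTTP.get_response(URI.parse("#{HOST}/_api/#{json_id}.json"))
  return nil unless response.is_a?(Net::HTTPRedirection)
  target = Addressable::URI.join(HOST, response['location']).normalize.to_s
  final = Net::HTTP.get_response(URI.parse(target))
  return nil unless final.is_a?(Net::HTTPSuccess)
  JSON.parse(final.body)
end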