Non-Latin characters or 301 handling? Scraping jpgs and json keys from Google Earth View


I'm trying to scrape the source images from Google's Earth View page, while renaming them to meaningful filenames so that each filename contains the city or country, in addition to its file number.

I achieved this by going through the JSON files. I found that I could request each file by number and get redirected to the fully named one. For example, /_api/1003.json redirects to /_api/australia-1003.json.

However, my tool breaks on files 1354, 1355, 2071, 2090, 2297, 2299, 5597, 6058, raising the following error:

/user/.rvm/rubies/ruby-2.2.1/lib/ruby/2.2.0/open-uri.rb:354:in `rescue in
open_http': 301 Moved Permanently (Invalid Location URI)
(OpenURI::HTTPError)

Checking the files manually, I find that these eight are not automatically redirected in the browser (using Chrome). Each has the message "Moved Permanently" and points to a new location. Interestingly, they each have non-Latin characters in their place names, like عندل-yemen and vegaøyan-norway. A quick skim of the ~1500 successfully scraped images does not find any similarly special characters.

Do the non-Latin names stop the redirects from working? Or did Earth View disable the redirects because of the special characters?
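A likely reading of the error text, offered as a sketch rather than a confirmed diagnosis: the server still answers with a 301, but open-uri cannot parse the Location header it receives. Ruby's URI.parse rejects any string containing non-ASCII characters, which would explain the "(Invalid Location URI)" suffix. The helper name parseable_by_ruby? below is made up for illustration, and the second URL is assembled from the slug in the question, not copied from a real response header.

```ruby
require 'uri'

# Hypothetical helper: reports whether Ruby's URI parser accepts the string.
def parseable_by_ruby?(location)
  URI.parse(location)
  true
rescue URI::InvalidURIError
  false
end

# An all-ASCII redirect target parses without complaint.
puts parseable_by_ruby?('http://earthview.withgoogle.com/_api/australia-1003.json')

# A target containing a raw "ø" (U+00F8) is rejected by URI.parse.
puts parseable_by_ruby?("http://earthview.withgoogle.com/_api/vega\u00F8yan-norway-2297.json")
```

If that is what happens here, the non-Latin names do not stop the redirect itself; they only stop Ruby from following it.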

How do I incorporate these skipped eight?

The code:

require 'open-uri'
require 'json'
require 'net/http'

@earthview_range = 1003..7023

def scrape_away
  @earthview_range.each do |json_id|
    response = Net::HTTP.get_response(URI.parse("http://earthview.withgoogle.com/_api/#{json_id.to_s}.json"))
    if response.code.to_i < 400   # Filter 404s etc but allow redirects

      # Collect the necessary information
      data_hash = JSON.parse(open("http://earthview.withgoogle.com/_api/#{json_id.to_s}.json").read)
      new_file_name = data_hash["slug"]  
      photo_url = data_hash["photoUrl"]  
      # Drop the character at index 4, so "https://..." becomes "http://..."
      cleanUrl = photo_url[0..3] + photo_url[5..-1]

      # Create the properly named files
      File.open(new_file_name + '.jpg', 'wb') do |f|
        f.write open(cleanUrl).read
      end
      puts "[✓]"
    else
      puts "--not found--"
    end
  end
  puts ''
  puts "Task complete. Have fun!"
  exit
end

Edit: I've cobbled together a serviceable workaround thanks to this post. I manually supplied the URL paths for the eight special cases and used the Addressable gem to normalize the non-Latin characters.

The updated code, as it stands now:

require 'open-uri'
require 'json'
require 'net/http'
require 'addressable/uri'

# Range starts at 0 allowing simple 1-to-1 ID matching for exceptions.
@earthview_ary = (0..7200).to_a

# Special Exception forwarding addresses:
@earthview_ary[1354] = "sanlıurfa-merkez-turkey-1354"
@earthview_ary[1355] = "asagıkaravaiz-turkey-1355"
@earthview_ary[2071] = "عندل-yemen-2071"
@earthview_ary[2090] = "weißwasser-germany-2090"
@earthview_ary[2297] = "vegaøyan-norway-2297"
@earthview_ary[2299] = "herøy-nordland-norway-2299"
@earthview_ary[5597] = "زابل-iran-5597"
@earthview_ary[6058] = "mysove-мисове-crimea-6058"

def scrape_away
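  # To save time, we'll skip past the gaps I know to be empty.
  # Earth View's numbering starts at 1000 and skips from 2450 to 5000.
  # The special-case slugs are Strings, and String#to_i returns 0 for them,
  # so they fall outside both ranges and are kept.
  scrape_ary = @earthview_ary.reject { |n| n.to_i.between?(1, 1000) || n.to_i.between?(2450, 5000) }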

  # Begin
  scrape_ary.each do |json_id|
    host = "http://earthview.withgoogle.com"
    path = "/_api/#{json_id.to_s}.json"
    uri = Addressable::URI.parse(host + path)
    # The Addressable gem lets us normalize (percent-encode) the
    # non-Latin characters. Very important!
    response = Net::HTTP.get_response(URI.parse(host + uri.normalized_path))
    if response.code.to_i < 400   # Filter 404s etc, but allow redirects

      # Collect the necessary information
      data_hash = JSON.parse(open(host + uri.normalized_path).read)
      new_file_name = data_hash["slug"]
      print "#{new_file_name}.jpg... "
      photo_url = data_hash["photoUrl"]
      cleanUrl = photo_url[0..3] + photo_url[5..-1]

      # Create the properly named files
      File.open(new_file_name + '.jpg', 'wb') do |f|
        f.write open(cleanUrl).read
      end
      puts "[✓]"
    else
      puts "--not found--"
    end
  end
  puts ''
  puts "Task complete. Have fun!"
  exit
end
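A possible way to avoid hard-coding the eight slugs, sketched with the standard library only: follow the redirect by hand and percent-encode the non-ASCII bytes of the Location header before parsing it. The names escape_non_ascii and fetch_following_redirects are hypothetical, and the sketch assumes the server accepts the percent-encoded form of a slug, which the Addressable workaround above suggests but which I have not verified for every file.

```ruby
require 'net/http'
require 'uri'

# Percent-encode every non-ASCII byte; ASCII bytes pass through unchanged,
# so existing slashes, dots and escapes are left alone.
def escape_non_ascii(location)
  location.bytes.map { |b| b < 0x80 ? b.chr : format('%%%02X', b) }.join
end

# Hypothetical helper: fetch a URL, following up to `limit` redirects by hand
# instead of relying on open-uri to parse the Location header.
def fetch_following_redirects(url, limit = 5)
  raise ArgumentError, 'too many redirects' if limit.zero?

  uri = URI.parse(escape_non_ascii(url))
  response = Net::HTTP.get_response(uri)
  location = response['location']
  return response unless response.is_a?(Net::HTTPRedirection) && location

  # URI#merge resolves a relative Location against the current URL.
  target = uri.merge(escape_non_ascii(location)).to_s
  fetch_following_redirects(target, limit - 1)
end
```

With this in place, the loop could call fetch_following_redirects("http://earthview.withgoogle.com/_api/#{json_id}.json") once per ID and parse response.body as JSON, which would also avoid requesting each file twice.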
