No schema supplied and other errors with using requests.get()

I'm learning Python by following Automate the Boring Stuff. This program is supposed to go to http://xkcd.com/ and download all the images for offline viewing.

I'm on Python 2.7 on a Mac.

For some reason, I'm getting errors like "No schema supplied" and other errors from requests.get() itself.

Here is my code:

# Saves the XKCD comic page for offline read

import requests, os, bs4, shutil

url = 'http://xkcd.com/'

if os.path.isdir('xkcd') == True: # If xkcd folder already exists
    shutil.rmtree('xkcd') # delete it
else: # otherwise
    os.makedirs('xkcd') # Creates the xkcd folder.


while not url.endswith('#'): # When there are no more posts, the URL will end with '#'; exit the while loop
    # Download the page
    print 'Downloading %s page...' % url
    res = requests.get(url) # Get the page
    res.raise_for_status() # Check for errors

    soup = bs4.BeautifulSoup(res.text) # Parse the page
    # Find the URL of the comic image
    comicElem = soup.select('#comic img') # Any #comic img it finds will be saved as a list in comicElem
    if comicElem == []: # if the list is empty
        print 'Couldn\'t find the image!'
    else:
        comicUrl = comicElem[0].get('src') # Get the first element in comicElem (the image)
        # and save its src attribute to comicUrl

        # Download the image
        print 'Downloading the %s image...' % (comicUrl)
        res = requests.get(comicUrl) # Get the image. Getting something will always use requests.get()
        res.raise_for_status() # Check for errors

        # Save image to ./xkcd
        imageFile = open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb')
        for chunk in res.iter_content(10000):
            imageFile.write(chunk)
        imageFile.close()
    # Get the Prev btn's URL
    prevLink = soup.select('a[rel="prev"]')[0]
    # The Prev button looks like: <a rel="prev" href="/1535/" accesskey="p">&lt; Prev</a>
    url = 'http://xkcd.com/' + prevLink.get('href')
    # adds /1535/ to http://xkcd.com/

print 'Done!'

Here are the errors:

Traceback (most recent call last):
  File "/Users/XKCD.py", line 30, in <module>
    res = requests.get(comicUrl) # Get the image. Getting something will always use requests.get()
  File "/Library/Python/2.7/site-packages/requests/api.py", line 69, in get
    return request('get', url, params=params, **kwargs)
  File "/Library/Python/2.7/site-packages/requests/api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "/Library/Python/2.7/site-packages/requests/sessions.py", line 451, in request
    prep = self.prepare_request(req)
  File "/Library/Python/2.7/site-packages/requests/sessions.py", line 382, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "/Library/Python/2.7/site-packages/requests/models.py", line 304, in prepare
    self.prepare_url(url, params)
  File "/Library/Python/2.7/site-packages/requests/models.py", line 362, in prepare_url
    to_native_string(url, 'utf8')))
requests.exceptions.MissingSchema: Invalid URL '//imgs.xkcd.com/comics/the_martian.png': No schema supplied. Perhaps you meant http:////imgs.xkcd.com/comics/the_martian.png?

The thing is, I've read the section of the book about this program multiple times, read the Requests docs, and looked at other questions on here. My syntax looks right.

Thanks for your help!

Edit:

This didn't work:

comicUrl = ("http:"+comicElem[0].get('src')) 

I thought adding http: in front would get rid of the "No schema supplied" error.

There are 6 answers

6
Ajay On BEST ANSWER

Change your comicUrl to this:

comicUrl = comicElem[0].get('src').lstrip('/') # drop the leading '//' from the protocol-relative src
comicUrl = "http://" + comicUrl
if 'xkcd' not in comicUrl: # a relative src such as '/comics/foo.png' needs the host inserted
    comicUrl = comicUrl[:7] + 'xkcd.com/' + comicUrl[7:]

print "comic url", comicUrl
5
John On

"No schema" means you haven't supplied the http:// or https:// part of the URL. Supply it and it will do the trick.

Edit: Look at this URL string:

URL '//imgs.xkcd.com/comics/the_martian.png':
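
For the src in this traceback the '//' is already present, so only the scheme name itself needs to be prepended. A minimal sketch:

src = '//imgs.xkcd.com/comics/the_martian.png'  # protocol-relative URL from the img tag
url = 'http:' + src  # gives 'http://imgs.xkcd.com/comics/the_martian.png'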

0
Shiva Gupta On

Explanation:

A few XKCD pages have special content that isn't a simple image file. That's fine; you can just skip those. If the selector doesn't find any elements, soup.select('#comic img') will return an empty list.

Working Code:

import requests, os, bs4, shutil

url='http://xkcd.com'

# make a fresh xkcd folder
if os.path.isdir('xkcd'):
    shutil.rmtree('xkcd')
os.makedirs('xkcd')      # recreate it so the image writes below don't fail


# scraping information
while not url.endswith('#'):
    print('Downloading Page %s.....' % (url))
    res = requests.get(url)          # getting the page
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    comicElem = soup.select('#comic img')     # getting the img tag under the #comic division
    if not comicElem:                         # if not found, print an error
        print('could not find comic image')

    else:
        try:
            comicUrl = 'http:' + comicElem[0].get('src')             # getting the comic url and then downloading its image
            print('Downloading image %s.....' % (comicUrl))
            res = requests.get(comicUrl)
            res.raise_for_status()

        except requests.exceptions.MissingSchema:
            # skip if not a normal image file
            prev = soup.select('a[rel="prev"]')[0]
            url = 'http://xkcd.com' + prev.get('href')
            continue

        imageFile = open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb')     # write the downloaded image to disk
        for chunk in res.iter_content(10000):
            imageFile.write(chunk)
        imageFile.close()

    # get the previous link and update url (outside the else, so an empty comicElem can't loop forever)
    prev = soup.select('a[rel="prev"]')[0]
    url = 'http://xkcd.com' + prev.get('href')


print('Done...')
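
One optional refinement, not in the original answer: requests can stream the response body so the whole image is never held in memory at once, which pairs naturally with the iter_content loop above. A sketch:

res = requests.get(comicUrl, stream=True)   # stream=True defers fetching the body
res.raise_for_status()
with open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb') as imageFile:
    for chunk in res.iter_content(10000):   # read and write in 10 kB chunks
        imageFile.write(chunk)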
0
Pepe is Sad On

I have a similar issue. It somehow takes the Response (status code 400) as the URL to request, so it's obvious that the URL is invalid. Here are my code and error:

import cloudscraper  # to bypass Cloudflare, which blocks plain requests-module traffic
import time
import random
import json
import socket
from collections import OrderedDict
from requests import Session
 
 
with open("conf.json") as conf:
    config = json.load(conf)
    addon_api = config.get("Addon API")
    addonapi_url = config.get("Addon URL")
    addonapi_ip = config.get("Addon IP")
    addonapi_agent = config.get("Addon User-agent")
 
 
    # getip = socket.getaddrinfo("https://my.url.com", 443)
    # (family, type, proto, canonname, (address, port)) = getip[0]
    # family, type, proto, canonname, (address, port)) = getip[0]
 
    session = Session()
    headers = OrderedDict({
        'Accept-Encoding': 'gzip, deflate, br',
        'Host': addonapi_ip,
        'User-Agent': addonapi_agent
    })
    session.headers = headers
 
    # define the Data we will post to the Website
    data = {
        "apikey": addon_api,
        "action": "get_user_info",
        "value": "username"
    }
 
    try:  # try-block to handle exceptions if the request Failed
        randomsleep1 = random.randint(10, 30)
        randomsleep2 = random.randint(10, 30)
        randomsleep_total = randomsleep1 + randomsleep2
 
 
        data_variable = data
        headers_variable = headers
        payload = {"key1": addonapi_ip, "key2": data_variable, "key3": headers_variable}
 
        getrequest = session.get(url=addonapi_ip, data=data_variable, headers=headers_variable, params = payload)
        postrequest = session.get(url=addonapi_ip, data=data_variable, headers=headers_variable, params = payload)  # sending Data to the Website
        print(addonapi_ip)
 
        scraper = cloudscraper.create_scraper()  # returns a CloudScraper instance
        print(f"Sleeping for {randomsleep1} Seconds before posting Data to API!")
        time.sleep(randomsleep1)
        session.get(postrequest)  # sending Data to the Website
        print(f"Sleeping for {randomsleep2} Seconds before getting Data from API!")
        time.sleep(randomsleep2)
        print(f"Total Seconds i slept during the Request: {randomsleep_total}")
        session.post(postrequest)
        print(f"Data sent: {postrequest}")
        print(f"Data recived: {getrequest}")  # printing the output from the Request into our Terminal
 
 
    #    post = requests.post(addonapi_url, data=data, headers=headers)
    #    print(post.status_code)
    #    print(post.text)
 
    except Exception as e:
        raise e
        # print(e)  # print the error if one occurred
# =========================================== #
Sleeping for 15 Seconds before posting Data to API!
Traceback (most recent call last):
  File "C:\Users\You.Dont.See.My.Name\PythonProjects\addon_bot\addon.py", line 69, in <module>
    raise e
  File "C:\Users\You.Dont.See.My.Name\PythonProjects\addon_bot\addon.py", line 55, in <module>
    session.get(postrequest)  # sending Data to the Website
  File "P:\Documents\IT\Python\lib\site-packages\requests\sessions.py", line 546, in get
    return self.request('GET', url, **kwargs)
  File "P:\Documents\IT\Python\lib\site-packages\requests\sessions.py", line 519, in request
    prep = self.prepare_request(req)
  File "P:\Documents\IT\Python\lib\site-packages\requests\sessions.py", line 452, in prepare_request
    p.prepare(
  File "P:\Documents\IT\Python\lib\site-packages\requests\models.py", line 313, in prepare
    self.prepare_url(url, params)
  File "P:\Documents\IT\Python\lib\site-packages\requests\models.py", line 387, in prepare_url
    raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '<Response [400]>': No schema supplied. Perhaps you meant http://<Response [400]>?
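
The traceback pinpoints the cause: session.get(postrequest) passes a Response object where a URL string is expected, so requests stringifies it into '<Response [400]>'. A minimal sketch of the fix, assuming addonapi_url from the config holds the actual endpoint:

# pass the URL string, not an earlier Response object
postrequest = session.post(addonapi_url, data=data, headers=headers)
print(postrequest.status_code)  # inspect the Response here instead of re-requesting it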
0
Rahul_Ramachandran On

Actually, this is not a big deal. You can see the comicUrl looks something like this: //imgs.xkcd.com/comics/acceptable_risk.png

The only thing you need to add is http: (remember, it is http: and not http://, as some folks said earlier) because the URL already contains the double slashes. So please change the code to:

res = requests.get('http:' + comicElem[0].get('src'))

or

comicUrl = 'http:' + comicElem[0].get('src')

res = requests.get(comicUrl)
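
If there is any chance the src is already an absolute URL, a guarded version of the same idea (a sketch, not from the original answer) avoids prefixing twice:

src = comicElem[0].get('src')
comicUrl = src if src.startswith('http') else 'http:' + src  # only prefix protocol-relative srcs
res = requests.get(comicUrl)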

Happy coding

0
easy_c0mpany80 On

I'd just like to chime in here that I had this exact same error and used @Ajay's recommended answer above, but even after adding that I was still getting problems; right after the program downloaded the first image it would stop and return this error:

ValueError: Unsupported or invalid CSS selector: "a[rel"

This was referring to one of the last lines in the program, where it uses the Prev button's link to go to the next image to download.

Anyway, after going through the bs4 docs I made a slight change as follows, and it seems to work just fine now:

prevLink = soup.select('a[rel^="prev"]')[0]

Someone else might run into the same problem, so I thought I'd add this comment.
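
For anyone who wants to check which selector their bs4 install accepts, here is a tiny self-contained test (the HTML snippet is made up for illustration):

import bs4

html = '<a rel="prev" href="/1535/" accesskey="p">&lt; Prev</a>'
soup = bs4.BeautifulSoup(html, 'html.parser')
prevLink = soup.select('a[rel^="prev"]')[0]  # '^=' matches on the prefix of the attribute value
print(prevLink.get('href'))  # prints /1535/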