Saving / Editing Scraped URLs to Directory

I have successfully scraped links from a website and I want to save them to an existing local folder called "HerHoops" for parsing later. I have done this successfully in the past, but this website's links need a little more cleaning up.

So far, this is my code. I want to keep everything after "box_score" in each link so that the saved filename includes the date and the teams playing, and save each page in write mode ("w+").

url = f"https://herhoopstats.com/stats/wnba/schedule_date/2004/6/1/"
data = requests.get(url)
soup = BeautifulSoup(data.text)
matchup_table = soup.find_all("div", {"class": "schedule"})[0]

links = matchup_table.find_all('a')
links = [l.get("href") for l in links]
links = [l for l in links if '/box_score/' in l]

box_scores_urls = [f"https://herhoopstats.com{l}" for l in links]

for box_scores_url in box_scores_urls:
      data = requests.get(box_scores_url)
      # within loop opening up page and saving to folder in write mode
      with open("HerHoops/{}".format(box_scores_url[46:]), "w+") as f:
         # write to the files
         f.write(data.text) 
      time.sleep(3)

The error is

FileNotFoundError: [Errno 2] No such file or directory: 'HerHoops/2004/06/01/new-york-liberty-vs-charlotte-sting/'
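
For reference, the slice box_scores_url[46:] is just a positional way of taking everything after "/box_score/". A small sketch of the same idea with str.split, assuming the full links look like the path in the error above:

# hypothetical example URL, reconstructed from the error message above
box_score_url = "https://herhoopstats.com/stats/wnba/box_score/2004/06/01/new-york-liberty-vs-charlotte-sting/"

# everything after "/box_score/" -> "2004/06/01/new-york-liberty-vs-charlotte-sting/"
suffix = box_score_url.split("/box_score/", 1)[1]
print(suffix)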

1 Answer

Answered by Abhay Chaudhary (accepted answer):

From the error itself, it's clear that you are trying to write to the file 'HerHoops/2004/06/01/new-york-liberty-vs-charlotte-sting/', but part of that directory path does not exist. We can create the necessary directories with the os.makedirs() function before writing to the file.
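
A minimal sketch of that pattern, using a hypothetical target path shaped like the one in the error (note that the trailing slash has to be dropped and a filename added, otherwise open() will still fail):

import os

# hypothetical output path, based on the path in the error message
out_path = "HerHoops/2004/06/01/new-york-liberty-vs-charlotte-sting.html"
# create HerHoops/2004/06/01/ (and any missing parents) before opening the file
os.makedirs(os.path.dirname(out_path), exist_ok=True)
with open(out_path, "w+") as f:
    f.write("...")

In the full code below I instead flatten everything into a single HerHoops/ folder and build the filename from the date and the team names.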

Full code

import os
import time
import requests
from bs4 import BeautifulSoup
import re
from datetime import datetime

url = f"https://herhoopstats.com/stats/wnba/schedule_date/2004/6/1/"
data = requests.get(url)
soup = BeautifulSoup(data.text, "html.parser")
matchup_table = soup.find_all("div", {"class": "schedule"})[0]

links = matchup_table.find_all('a')
links = [l.get("href") for l in links]
links = [l for l in links if '/box_score/' in l]

box_scores_urls = [f"https://herhoopstats.com{l}" for l in links]

for box_scores_url in box_scores_urls:
    data = requests.get(box_scores_url)
    # build the date string from the digits of the schedule URL, e.g. "2004-06-01"
    # (every box score on this schedule page shares the same date)
    date_str = datetime.strptime(re.sub(r'\D', '', url), "%Y%m%d").strftime("%Y-%m-%d")
    # the team names are whatever follows the last run of digits in the box score URL
    match = re.search(r'\d+(?!.*\d)', box_scores_url.replace('/', ''))
    teams_str = box_scores_url.replace('/', '')[match.end():]
    # create the directory if it doesn't exist
    directory = f"HerHoops/"
    os.makedirs(directory, exist_ok=True)
    # within loop opening up page and saving to folder in write mode
    with open(f"{directory}{date_str}-{teams_str}.html", "w+") as f:
        # write to the file
        f.write(data.text)
    time.sleep(3)
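
Since the pages are saved for parsing later, they can be read back in from the HerHoops folder afterwards. A quick sketch, assuming the .html naming used above and that you want BeautifulSoup again for the parsing step:

import os
from bs4 import BeautifulSoup

for filename in os.listdir("HerHoops"):
    if not filename.endswith(".html"):
        continue
    with open(os.path.join("HerHoops", filename)) as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    # e.g. count the stat tables in each saved box score page
    print(filename, len(soup.find_all("table")))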