Scrape Dynamic Data in Fangraphs


I know very little about programming. I know that web scraping exists, but I can't write the code for it myself, so I tried to do it with the help of ChatGPT.


I want to scrape this information from https://www.fangraphs.com/tools/wpa-inquirer. The tool has five dropdowns, from Run Environment to Run Differential, and the gray-background values below them change with the selections. I would like to collect those values for every combination of the five conditions.

I asked ChatGPT, got the code, and ran it. (The screenshot I attached showed an example of the result I want.)

However, even though the code ran, I could not get the desired result. The best I achieved was that it executed without errors, but the Home/Away win probability and LI values never changed, probably because the dropdown values were not actually being changed.

The code using Selenium was roughly as follows.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import pandas as pd

chrome_options = Options()
driver = webdriver.Chrome(options=chrome_options)

url = 'https://www.fangraphs.com/tools/wpa-inquirer'
driver.get(url)

def select_dropdown(dropdown_id, value):
    # Write the desired value directly into the combobox's text input via JavaScript.
    input_element = driver.find_element(By.CSS_SELECTOR, f"#{dropdown_id}_Input")
    driver.execute_script("arguments[0].value = arguments[1];", input_element, value)

def get_leverage_index(run_env, base_situation, inning, outs, run_differential):
    # Set all five dropdowns, then read the Leverage Index cell from the results table.
    select_dropdown('rcbRun', run_env)
    select_dropdown('rcbBase', base_situation)
    select_dropdown('rcbInning', inning)
    select_dropdown('rcbOuts', outs)
    select_dropdown('rcbScore', run_differential)

    leverage_index = driver.find_element(
        By.XPATH, '//td[text()="Leverage Index"]/following-sibling::td').text
    return leverage_index

data = []

run_env_values = ['3.0', '3.5', '4.0', '4.5', '5.0', '5.5', '6.0', '6.5']
base_situation_values = ['_ _ _', '1 _ _', '_ 2 _', '1 2 _', '_ _ 3', '1 _ 3', '_ 2 3', '1 2 3']
inning_values = ['1 (Top)', '1 (Bottom)', '2 (Top)', '2 (Bottom)', '3 (Top)', '3 (Bottom)',
                 '4 (Top)', '4 (Bottom)', '5 (Top)', '5 (Bottom)', '6 (Top)', '6 (Bottom)',
                 '7 (Top)', '7 (Bottom)', '8 (Top)', '8 (Bottom)', '>= 9 (Top)', '>= 9 (Bottom)']
outs_values = ['0', '1', '2']
run_differential_values = ['-10', '-9', '-8', '-7', '-6', '-5', '-4', '-3', '-2', '-1', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10']

progress_count = 0
total = (len(run_env_values) * len(base_situation_values) * len(inning_values)
         * len(outs_values) * len(run_differential_values))

# Iterate over every combination of the five dropdown values.
for run_env in run_env_values:
    for base_situation in base_situation_values:
        for inning in inning_values:
            for outs in outs_values:
                for run_differential in run_differential_values:
                    progress_count += 1
                    print(f'({progress_count}/{total})')
                    leverage_index = get_leverage_index(run_env, base_situation, inning, outs, run_differential)
                    data.append([run_env, base_situation, inning, outs, run_differential, leverage_index])

driver.quit()

df = pd.DataFrame(data, columns=['Run Environment', 'Base Situation', 'Inning', 'Outs', 'Run Differential', 'Leverage Index'])
df.to_excel('leverage_index_data.xlsx', index=False)
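
If the cause is that writing the value with execute_script never fires the events the page listens for, a variant of select_dropdown that also dispatches input/change events might behave differently. This is only an untested sketch; whether the Fangraphs page reacts to these events (or instead needs a real click on the dropdown and its option list) is an assumption I have not verified.

import time

def select_dropdown(dropdown_id, value):
    # Reuses the driver created above.
    input_element = driver.find_element(By.CSS_SELECTOR, f"#{dropdown_id}_Input")
    # Set the value, then fire input/change events so any script listening
    # on the combobox has a chance to react (assumption: the page listens
    # for these events rather than only for clicks).
    driver.execute_script(
        """
        arguments[0].value = arguments[1];
        arguments[0].dispatchEvent(new Event('input', { bubbles: true }));
        arguments[0].dispatchEvent(new Event('change', { bubbles: true }));
        """,
        input_element, value)
    time.sleep(0.5)  # give the page a moment to recalculate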

1 Answer

Answered by Benjamin Breton

When you are ready to extract the data, you can get the HTML content of the page and parse the table.

Please let me know if this is what you are looking for:

from bs4 import BeautifulSoup as bs
import pandas as pd

# Parse the rendered page and grab the last table on it.
html_content = driver.page_source
soup = bs(html_content, 'html.parser')
tbls = soup.find_all('table')
tbl = tbls[-1]

data = []

for row in tbl.find_all('tr'):
    row_data = []
    for cell in row.find_all('td'):
        # Dropdown cells keep their current value in an <input class="rcbInput">;
        # plain cells just contain text.
        input_tag = cell.find('input', class_='rcbInput')
        if input_tag and 'value' in input_tag.attrs:
            row_data.append(input_tag['value'].strip())
        else:
            row_data.append(cell.get_text(strip=True))
    data.append(row_data)

# Drop completely empty rows.
data = [row for row in data if any(row)]
df = pd.DataFrame(data)
print(df)
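
As a usage sketch, the parsing above could be wrapped in a helper and called after each dropdown change in your loop. read_inquirer_table is just an illustrative name, and assuming the last table on the page is the right one in every state:

def read_inquirer_table(driver):
    # Re-parse the page and return the last table as a list of rows,
    # using the same logic as the snippet above.
    soup = bs(driver.page_source, 'html.parser')
    tbl = soup.find_all('table')[-1]
    rows = []
    for row in tbl.find_all('tr'):
        cells = []
        for cell in row.find_all('td'):
            input_tag = cell.find('input', class_='rcbInput')
            if input_tag and 'value' in input_tag.attrs:
                cells.append(input_tag['value'].strip())
            else:
                cells.append(cell.get_text(strip=True))
        rows.append(cells)
    return [r for r in rows if any(r)]

# Inside the loop from the question, after changing the dropdowns:
# rows = read_inquirer_table(driver)
# data.append([run_env, base_situation, inning, outs, run_differential, rows])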