Scraping MLB daily lineups from Rotowire using Python


I am trying to scrape the MLB daily lineup information from here: https://www.rotowire.com/baseball/daily-lineups.php

I am trying to use Python with requests, BeautifulSoup, and pandas.

My ultimate goal is to end up with two pandas data frames.

First is a starting pitching data frame:

date        game_time   pitcher_name     team  lineup_throws
2024-03-29  1:40 PM ET  Spencer Strider  ATL   R
2024-03-29  1:40 PM ET  Zack Wheeler     PHI   R

Second is a starting batter data frame:

date        game_time   batter_name     team  pos  batting_order  lineup_bats
2024-03-29  1:40 PM ET  Ronald Acuna    ATL   RF   1              R
2024-03-29  1:40 PM ET  Ozzie Albies    ATL   2B   2              S
2024-03-29  1:40 PM ET  Austin Riley    ATL   3B   3              R
2024-03-29  1:40 PM ET  Kyle Schwarber  PHI   DH   1              L
2024-03-29  1:40 PM ET  Trea Turner     PHI   SS   2              R
2024-03-29  1:40 PM ET  Bryce Harper    PHI   1B   3              L

This would cover all games on a given day.

I've tried adapting this answer to my needs, but can't quite get it to work: Scraping Web data using BeautifulSoup

Any help or guidance is greatly appreciated.

Here is the code from that answer, which I have been trying to adapt without success:

import pandas as pd
import requests
from bs4 import BeautifulSoup


url = "https://www.rotowire.com/baseball/daily-lineups.php"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

weather = []

for tag in soup.select(".lineup__bottom"):
    header = tag.find_previous(class_="lineup__teams").get_text(
        strip=True, separator=" vs "
    )
    rain = tag.select_one(".lineup__weather-text > b")
    forecast_info = rain.next_sibling.split()
    temp = forecast_info[0]
    wind = forecast_info[2]

    weather.append(
        {"Header": header, "Rain": rain.text.split()[0], "Temp": temp, "Wind": wind}
    )


df = pd.DataFrame(weather)
print(df)

The information I want seems to be contained in lineup__main and not in lineup__bottom.
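One way to confirm where the player rows live is to count the class names that appear under each container. The sketch below runs on a small hand-written fragment rather than the live page; the markup, the placeholder weather text, and the helper name `class_counts` are assumptions for illustration, so the real page may nest things differently.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical fragment: class names follow the ones mentioned in this post,
# but the structure is a guess and not copied from the live page.
SAMPLE_HTML = """
<div class="lineup">
  <div class="lineup__main">
    <ul class="lineup__list is-visit">
      <li class="lineup__player-highlight"><a>Freddy Peralta</a> <span>R</span></li>
      <li class="lineup__player"><div>RF</div><a>J. Chourio</a> <span>R</span></li>
    </ul>
  </div>
  <div class="lineup__bottom">
    <div class="lineup__weather-text">placeholder weather text</div>
  </div>
</div>
"""


def class_counts(html, container_selector):
    """Count every CSS class used by tags inside the first matching container."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.select_one(container_selector)
    counts = Counter()
    if container is None:
        return counts
    for tag in container.find_all(True):  # True matches every tag
        counts.update(tag.get("class", []))
    return counts


main_counts = class_counts(SAMPLE_HTML, ".lineup__main")
bottom_counts = class_counts(SAMPLE_HTML, ".lineup__bottom")
print(main_counts)
print(bottom_counts)
```

Running the same helper against the downloaded page (passing `requests.get(url).content` instead of `SAMPLE_HTML`) should show which container actually holds the player rows.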

Answer by HedgeHog (best answer)

You have to iterate over the lineup boxes and select each field you need.

import pandas as pd
import requests
from bs4 import BeautifulSoup


url = "https://www.rotowire.com/baseball/daily-lineups.php"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

data_pitching = []
data_batter = []
team_type = ''

for e in soup.select('.lineup__box ul li'):
    # The last class of the parent <ul> is used as the team marker;
    # restart the batting order whenever it changes.
    if team_type != e.parent.get('class')[-1]:
        order_count = 1
        team_type = e.parent.get('class')[-1]

    if e.get('class') and 'lineup__player-highlight' in e.get('class'):
        # Highlighted rows are treated as the starting pitchers.
        data_pitching.append({
            'date': e.find_previous('main').get('data-gamedate'),
            'game_time': e.find_previous('div', attrs={'class': 'lineup__time'}).get_text(strip=True),
            'pitcher_name': e.a.get_text(strip=True),
            'team': e.find_previous('div', attrs={'class': team_type}).next_element.strip(),
            'lineup_throws': e.span.get_text(strip=True)
        })
    elif e.get('class') and 'lineup__player' in e.get('class'):
        # Regular player rows are batters, listed in batting order.
        data_batter.append({
            'date': e.find_previous('main').get('data-gamedate'),
            'game_time': e.find_previous('div', attrs={'class': 'lineup__time'}).get_text(strip=True),
            'batter_name': e.a.get_text(strip=True),
            'team': e.find_previous('div', attrs={'class': team_type}).next_element.strip(),
            'pos': e.div.get_text(strip=True),
            'batting_order': order_count,
            'lineup_bats': e.span.get_text(strip=True)
        })
        order_count += 1

df_pitching = pd.DataFrame(data_pitching)
df_batter = pd.DataFrame(data_batter)

df_pitching:

    date        game_time    pitcher_name    team     lineup_throws
0   2024-03-29  1:40 PM ET   Freddy Peralta  Brewers  R
1   2024-03-29  1:40 PM ET   Jose Quintana   Mets     L
..
19  2024-03-29  10:10 PM ET  Bobby Miller    Dodgers  R

df_batter:

     date        game_time    batter_name   team     pos  batting_order  lineup_bats
0    2024-03-29  1:40 PM ET   J. Chourio    Brewers  RF   1              R
1    2024-03-29  1:40 PM ET   W. Contreras  Brewers  C    2              R
...
178  2024-03-29  10:10 PM ET  E. Hernandez  Dodgers  CF   8              R
179  2024-03-29  10:10 PM ET  Gavin Lux     Dodgers  2B   9              L
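The loop above walks every `li` on the page and tracks the current team through a shared variable. A possible variant, sketched below, handles one game box at a time so the batting order restarts naturally for each team. It runs on a hand-written fragment, not the live page: the markup, the `is-visit` / `is-home` class names, and the function name `parse_lineups` are assumptions made for illustration and may not match what Rotowire serves.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped after the selectors used in the answer above.
# Only the visiting side has batters here, to keep the sample short.
SAMPLE_HTML = """
<main data-gamedate="2024-03-29">
  <div class="lineup">
    <div class="lineup__time">1:40 PM ET</div>
    <div class="lineup__box">
      <div class="lineup__team is-visit">Brewers</div>
      <div class="lineup__team is-home">Mets</div>
      <ul class="lineup__list is-visit">
        <li class="lineup__player-highlight"><a>Freddy Peralta</a> <span>R</span></li>
        <li class="lineup__player"><div>RF</div><a>J. Chourio</a> <span>R</span></li>
        <li class="lineup__player"><div>C</div><a>W. Contreras</a> <span>R</span></li>
      </ul>
      <ul class="lineup__list is-home">
        <li class="lineup__player-highlight"><a>Jose Quintana</a> <span>L</span></li>
      </ul>
    </div>
  </div>
</main>
"""


def parse_lineups(html):
    """Return (pitcher_rows, batter_rows) as lists of dicts, one box at a time."""
    soup = BeautifulSoup(html, "html.parser")
    pitchers, batters = [], []
    for box in soup.select(".lineup__box"):
        date = box.find_parent("main").get("data-gamedate")
        game_time = box.find_previous("div", class_="lineup__time").get_text(strip=True)
        for side in ("is-visit", "is-home"):  # assumed side markers
            team = box.select_one(f"div.{side}").get_text(strip=True)
            base = {"date": date, "game_time": game_time, "team": team}
            order = 0  # restarts for every team list
            for li in box.select(f"ul.{side} > li"):
                classes = li.get("class", [])
                if "lineup__player-highlight" in classes:
                    pitchers.append({
                        **base,
                        "pitcher_name": li.a.get_text(strip=True),
                        "lineup_throws": li.span.get_text(strip=True),
                    })
                elif "lineup__player" in classes:
                    order += 1
                    batters.append({
                        **base,
                        "batter_name": li.a.get_text(strip=True),
                        "pos": li.div.get_text(strip=True),
                        "batting_order": order,
                        "lineup_bats": li.span.get_text(strip=True),
                    })
    return pitchers, batters


pitchers, batters = parse_lineups(SAMPLE_HTML)
print(pitchers)
print(batters)
```

Both lists can be passed straight to `pd.DataFrame(...)` to get the two frames. If the real page uses different class names, only the selectors should need adjusting.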