Drop duplicate rows that match the Header Row


Background: I'm pulling NFL data in from a URL year by year and the header row is duplicated multiple times.

Partial Solution: I originally tried .drop_duplicates(), but it left me with two "header" rows (the 2023 season hasn't finished, so future games have a different header row). I assume that because it was dropping duplicates, once there were only two left, they were considered "unique"?
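
For illustration, a minimal toy sketch (made-up values, not the real table) of why that happens: .drop_duplicates() keeps the first occurrence of each distinct row, so one copy of each header variant survives.

    import pandas as pd

    # Toy frame: two header variants, each repeated, mixed in with data rows
    df = pd.DataFrame({
        'Week': ['Week', '1', 'Week', '2', 'Week', 'Week'],
        'Day':  ['Day', 'Sun', 'Day', 'Sun', 'Date', 'Date'],
    })

    # drop_duplicates() keeps the first copy of every distinct row,
    # so one row of each header variant remains alongside the data rows
    print(df.drop_duplicates())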

Ugly Solution: I have solved my problem using a keyword filter, but I'm sure there's a better way to do this.

The line I'm using is as follows:

    scorings_df = scorings_df[~scorings_df['Week'].str.contains("week", case=False, na=False)]

Note that the Week column can contain both numbers and text, depending on the week.
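
As a self-contained sketch of what that filter does on toy values (the real column holds week numbers as strings plus playoff labels; the values below are assumed for illustration):

    import pandas as pd

    # Toy Week column: numeric weeks, a playoff-style label, and repeated header rows
    df = pd.DataFrame({'Week': ['1', '2', 'Week', 'WildCard', 'Week']})

    # Keep rows whose Week does NOT contain "week" (case-insensitive);
    # na=False prevents NaN entries from breaking the boolean mask
    mask = ~df['Week'].str.contains("week", case=False, na=False)
    print(df[mask])   # the two 'Week' header rows are dropped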

The full code is below. Note that it pulls 25 years of data, so if you're going to run it you might want to reduce that to 2 or 3 years, as you won't need all of it.

    import pandas as pd
    import datetime

    current_year = 2023

    # Scoring history: base URL, completed with the year inside the loop
    url_scores = 'https://www.pro-football-reference.com/years/'

    # Create an empty dataframe to accumulate the per-year url data
    Scoring_df = pd.DataFrame()

    # Loop over the required number of years, starting with the current year
    for year in range(current_year, current_year - 25, -1):
        # Generate the url string for each year
        year_url_stats = f"{url_scores}{year}/games.htm"

        # Fetch the data from that url
        Scores = pd.read_html(year_url_stats)
        scorings_df = Scores[0]    # Ensure we're taking the first table on the page

        # Remove the header-row duplicates, including the future-games variant
        # (.drop_duplicates() was not sufficient)
        scorings_df = scorings_df[~scorings_df['Week'].str.contains("week", case=False, na=False)]

        # Convert the 'Date' column to datetime for filtering, then drop future games
        scorings_df['TempDate'] = pd.to_datetime(scorings_df['Date'], errors='coerce')
        scorings_df = scorings_df[scorings_df['TempDate'] <= datetime.datetime.now()]
        scorings_df.drop(columns=['TempDate'], inplace=True)

        # Add a column for 'Year' so we can filter later
        scorings_df['Year'] = year

        # Clean up team names: keep only the last word of each name
        scorings_df['Winner/tie'] = scorings_df['Winner/tie'].str.split().str[-1]
        scorings_df['Loser/tie'] = scorings_df['Loser/tie'].str.split().str[-1]

        # Reorder columns (reduces from 15 to 13)
        columns_order = ['Year', 'Week', 'Day', 'Date', 'Time', 'Winner/tie', 'Loser/tie',
                         'PtsW', 'PtsL', 'YdsW', 'TOW', 'YdsL', 'TOL']
        scorings_df = scorings_df[columns_order]

        # Append this year's games to the combined dataframe
        Scoring_df = pd.concat([Scoring_df, scorings_df], ignore_index=True)

    print(Scoring_df.head())
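
Incidentally, re-concatenating the growing Scoring_df on every iteration copies the accumulated data each pass; a common alternative is to collect the per-year frames in a list and call pd.concat once at the end. A minimal sketch of that pattern, reusing the same URL scheme (with a shortened year range so it stays quick):

    import pandas as pd

    current_year = 2023
    url_scores = 'https://www.pro-football-reference.com/years/'

    # Collect each year's table in a list, then concatenate once at the end
    frames = []
    for year in range(current_year, current_year - 3, -1):   # 3 years for the sketch
        year_df = pd.read_html(f"{url_scores}{year}/games.htm")[0]
        year_df['Year'] = year
        frames.append(year_df)

    Scoring_df = pd.concat(frames, ignore_index=True)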

1 Answer

Mahboob Nur (best answer):

You can use pandas' duplicate detection in a different way, flagging repeated rows with duplicated() instead of filtering on a keyword:

    import pandas as pd
    import datetime

    # current_year and url_scores as defined in the question
    current_year = 2023
    url_scores = 'https://www.pro-football-reference.com/years/'

    scoring_df = pd.DataFrame()

    for year in range(current_year, current_year - 25, -1):
        year_url_stats = f"{url_scores}{year}/games.htm"
        scores = pd.read_html(year_url_stats)
        scorings_df = scores[0]
        # Drop every repeated row, keeping only the first copy of each; the surviving
        # header row is then removed by the date filter below, since its 'Date' is not parseable
        scorings_df = scorings_df[~scorings_df.apply(tuple, axis=1).duplicated()]
        scorings_df['TempDate'] = pd.to_datetime(scorings_df['Date'], errors='coerce')
        scorings_df = scorings_df[scorings_df['TempDate'] <= datetime.datetime.now()]
        scorings_df.drop(columns=['TempDate'], inplace=True)
        scorings_df['Year'] = year
        scorings_df['Winner/tie'] = scorings_df['Winner/tie'].str.split().str[-1]
        scorings_df['Loser/tie'] = scorings_df['Loser/tie'].str.split().str[-1]
        # DataFrame.append() was removed in pandas 2.x, so concatenate instead
        scoring_df = pd.concat([scoring_df, scorings_df], ignore_index=True)
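
For what it's worth, DataFrame.duplicated() also works on whole rows directly, so the apply(tuple, axis=1) step can usually be skipped (assuming the row values are hashable, which they are for this table of strings and numbers). A toy sketch with made-up values:

    import pandas as pd

    # Toy frame with a repeated header-style row
    df = pd.DataFrame({'Week': ['Week', '1', 'Week', '2'],
                       'Day':  ['Day', 'Sun', 'Day', 'Mon']})

    # duplicated() flags repeated rows directly (keep='first' by default),
    # equivalent here to the apply(tuple, axis=1).duplicated() idiom above
    print(df[~df.duplicated()])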