How to convert a scraped HTML document to a DataFrame?


I am trying to scrape football players' data from the website FBRef. I got the data from the website as a bs4.element.ResultSet object.

Code:

import re

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

res = requests.get("https://fbref.com/en/comps/9/stats/Premier-League-Stats")

# Strip the HTML comment markers so tables hidden inside comments get parsed
comp = re.compile("<!--|-->")
soup = BeautifulSoup(comp.sub("", res.text), 'lxml')
all_data = soup.find_all("tbody")

# The third <tbody> holds the player rows
player_data = all_data[2]

The data is as follows:

<tr><th class="right" **...** href="/en/players/774cf58b/Max-Aarons">Max Aarons</a></td><td **...** data-stat="position">DF</td><td class="left" data-stat="team"><a href="/en/squads/4ba7cbea/Bournemouth-Stats">Bournemouth</a></td><td class="center" data-stat="age">24-084</td><td class="center" data-stat="birth_year">2000</td><td**...** </a></td></tr>

<tr><th class="right" **...** href="/en/players/77816c91/Benie-Adama-Traore">Bénie Adama Traore</a></td><td **...** data-stat="position">FW,MF</td><td class="left" data-stat="team"><a href="/en/squads/1df6b87e/Sheffield-United-Stats">Sheffield Utd</a></td><td class="center" data-stat="age">21-119</td><td class="center" data-stat="birth_year">2002 **...** </a></td></tr>
**...**

I want to create a Pandas data frame from this such as:

**Name                Position    Team              Age      Birth Year** **...**

Max Aarons            DF          Bournemouth       24       2000

Benie Adama Traore    FW          Sheffield Utd     21       2002
**...**

I looked at similar questions here and tried to apply the solutions, but couldn't make them work.


There are 2 answers

Daniel Crompton

To create a Pandas DataFrame from the scraped data, iterate over the table rows, extract the relevant cells from each row, and append them to a list. Then build the DataFrame from that list. Here's how you can do it:

import re

import requests
from bs4 import BeautifulSoup
import pandas as pd

res = requests.get("https://fbref.com/en/comps/9/stats/Premier-League-Stats")

# Strip the comment markers first (as in the question); otherwise the
# table index below may not point at the player table
soup = BeautifulSoup(re.sub("<!--|-->", "", res.text), 'lxml')

player_data = soup.find_all("tbody")[2]

data = []

for row in player_data.find_all("tr"):
    link = row.find("a")
    if link is None:
        # Skip rows without a player link, e.g. repeated header rows
        continue
    name = link.text
    position = row.find("td", {"data-stat": "position"}).text
    team = row.find("td", {"data-stat": "team"}).text
    age = row.find("td", {"data-stat": "age"}).text
    birth_year = row.find("td", {"data-stat": "birth_year"}).text

    data.append([name, position, team, age, birth_year])

df = pd.DataFrame(data, columns=['Name', 'Position', 'Team', 'Age', 'Birth Year'])
print(df)

This code will create a DataFrame with columns 'Name', 'Position', 'Team', 'Age', and 'Birth Year' from the scraped data.
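The scraped values keep the site's raw formats: Age comes through as years-days (for example 24-084) and Position can list several positions (for example FW,MF), whereas the question asks for 24 and FW. A minimal sketch of one way to clean these up, using two sample rows copied from the question's HTML excerpt rather than a live request. Keeping only the first listed position is an assumption based on the desired output in the question.

```python
import pandas as pd

# Sample rows copied from the question's HTML excerpt (not a live scrape)
df = pd.DataFrame(
    [
        ["Max Aarons", "DF", "Bournemouth", "24-084", "2000"],
        ["Bénie Adama Traore", "FW,MF", "Sheffield Utd", "21-119", "2002"],
    ],
    columns=["Name", "Position", "Team", "Age", "Birth Year"],
)

# Keep only the first listed position: "FW,MF" -> "FW"
df["Position"] = df["Position"].str.split(",").str[0]

# Keep only the years part of the age: "24-084" -> 24
# errors="coerce" turns anything unparseable (e.g. an empty cell) into NaN
df["Age"] = pd.to_numeric(df["Age"].str.split("-").str[0], errors="coerce")

# Birth year arrives as text; convert it to a number as well
df["Birth Year"] = pd.to_numeric(df["Birth Year"], errors="coerce")

print(df)
```

If the real table contains empty age or birth-year cells, the coerced columns will hold NaN and become floating point, so a nullable integer dtype may be worth considering at that point.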

Andrej Kesely

I suggest using pd.read_html to read the HTML directly into a DataFrame:

import re
from io import StringIO

import pandas as pd
import requests

res = requests.get("https://fbref.com/en/comps/9/stats/Premier-League-Stats")

comp = re.compile("<!--|-->")
df = pd.read_html(StringIO(comp.sub("", res.text)))[2]  # <-- locate the right table

print(df)

Prints:

    Unnamed: 0_level_0       Unnamed: 1_level_0 Unnamed: 2_level_0 Unnamed: 3_level_0 Unnamed: 4_level_0 Unnamed: 5_level_0 Unnamed: 6_level_0 Playing Time                     Performance                                        Expected                      Progression             Per 90 Minutes                                                               Unnamed: 36_level_0
                    Rk                   Player             Nation                Pos              Squad                Age               Born           MP  Starts   Min   90s         Gls  Ast  G+A  G-PK  PK  PKatt  CrdY  CrdR       xG  npxG  xAG  npxG+xAG        PrgC  PrgP  PrgR            Gls   Ast   G+A  G-PK  G+A-PK    xG   xAG  xG+xAG  npxG  npxG+xAG             Matches
0                    1               Max Aarons            eng ENG                 DF        Bournemouth             24-085               2000           14      12  1085  12.1           0    1    1     0   0      0     1     0      0.0   0.0  0.8       0.8          19    40    22           0.00  0.08  0.08  0.00    0.08  0.00  0.07    0.07  0.00      0.07             Matches
1                    2       Bénie Adama Traore             ci CIV              FW,MF      Sheffield Utd             21-120               2002            8       3   387   4.3           0    0    0     0   0      0     0     0      0.3   0.3  0.5       0.8           7     9    14           0.00  0.00  0.00  0.00    0.00  0.06  0.13    0.19  0.06      0.19             Matches
2                    3              Tyler Adams             us USA                 MF        Bournemouth             25-044               1999            1       0    20   0.2           0    0    0     0   0      0     0     0      0.0   0.0  0.0       0.0           0     1     0           0.00  0.00  0.00  0.00    0.00  0.00  0.00    0.00  0.00      0.00             Matches
3                    4         Tosin Adarabioyo            eng ENG                 DF             Fulham             26-187               1997           15      13  1173  13.0           1    0    1     1   0      0     1     0      0.6   0.6  0.1       0.6           5    39     3           0.08  0.00  0.08  0.08    0.08  0.04  0.01    0.05  0.04      0.05             Matches
4                    5           Elijah Adebayo            eng ENG                 FW         Luton Town             26-082               1998           23      13  1162  12.9           9    0    9     9   0      0     1     0      5.6   5.6  0.7       6.3          14    19    85           0.70  0.00  0.70  0.70    0.70  0.43  0.05    0.49  0.43      0.49             Matches
5                    6            Simon Adingra             ci CIV                 FW           Brighton             22-088               2002           21      16  1446  16.1           6    1    7     6   0      0     2     0      3.1   3.1  2.3       5.4          72    32   199           0.37  0.06  0.44  0.37    0.44  0.19  0.14    0.34  0.19      0.34             Matches

...
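The printed frame has two header levels, so the columns the question asks for sit under "Unnamed: ..." group labels. A minimal sketch of one way to flatten the header and keep only those columns follows. It uses a small hand-built frame that mimics the output above instead of a live request. The repeated header row in the sample data is an assumption about what the real table body may contain, and the names `raw`, `flat`, and `players` are illustrative.

```python
import pandas as pd

# Small stand-in for the read_html result, with two header levels as printed above
columns = pd.MultiIndex.from_tuples(
    [
        ("Unnamed: 1_level_0", "Player"),
        ("Unnamed: 3_level_0", "Pos"),
        ("Unnamed: 4_level_0", "Squad"),
        ("Unnamed: 5_level_0", "Age"),
        ("Unnamed: 6_level_0", "Born"),
        ("Playing Time", "MP"),
    ]
)
raw = pd.DataFrame(
    [
        ["Max Aarons", "DF", "Bournemouth", "24-085", "2000", "14"],
        ["Bénie Adama Traore", "FW,MF", "Sheffield Utd", "21-120", "2002", "8"],
        # Assumed: a header row repeated inside the table body
        ["Player", "Pos", "Squad", "Age", "Born", "MP"],
    ],
    columns=columns,
)

# Drop the top header level so columns are addressed by their short names
flat = raw.droplevel(0, axis=1)

# Remove any repeated header rows, then keep and rename the wanted columns
flat = flat[flat["Player"] != "Player"]
players = flat[["Player", "Pos", "Squad", "Age", "Born"]].rename(
    columns={"Player": "Name", "Pos": "Position", "Squad": "Team", "Born": "Birth Year"}
)

print(players)
```

On the full table, dropping the top level leaves some short names duplicated (Gls, Ast, and others appear under both Performance and Per 90 Minutes). The five columns selected here are unique, so the selection still works, but duplicated names would need the two-level form to tell them apart.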