Create a pandas DataFrame from unstructured data (end-of-season football tables)

79 views Asked by At

basically I have a lot of data that represents the final standings table for each La Liga season starting from 1950. The structure of the table remains relatively consistent across different seasons. It looks something like this

1.FC Barcelona                  38  28  9  1  99-29  93       Champions
 2.Atlético de Madrid            38  23 10  5  58-22  79
 3.Real Madrid CF                38  22 10  6  94-44  76  [C]
 4.Valencia CF                   38  22  7  9  65-38  73
 5.Villarreal CF                 38  18  7 13  57-50  61
 6.Real Betis Balompié (Sevilla) 38  18  6 14  60-61  60
 7.Sevilla FC                    38  17  7 14  49-58  58
 8.Getafe CF                     38  15 10 13  42-33  55  [P]
 9.SD Eibar                      38  14  9 15  44-50  51       [2 2 0 0 8-2 6]
10.Girona FC                     38  14  9 15  50-59  51  [P]  [2 0 0 2 2-8 0]
11.RCD Espanyol (Barcelona)      38  12 13 13  36-42  49       [4 2 2 0 7-5 8]
12.Real Sociedad (San Sebastián) 38  14  7 17  66-59  49       [4 1 1 2 6-7 4]
13.RC Celta (Vigo)               38  13 10 15  59-60  49       [4 1 1 2 7-8 3]
14.Deportivo Alavés (Vitoria)    38  15  2 21  40-50  47
15.Levante UD (Valencia)         38  11 13 14  44-58  46  [P]
16.Athletic de Bilbao            38  10 13 15  41-49  43       [2 1 0 1 2-1 3]
17.CD Leganés                    38  12  7 19  34-51  43       [2 1 0 1 1-2 3]**
--------------------------------------------------------**
18.RC Deportivo (La Coruña)      38   6 11 21  38-76  29       Relegated
19.UD Las Palmas                 38   5  7 26  24-74  22       Relegated
20.Málaga CF                     38   5  5 28  24-61  20       Relegated

I want to convert this data (unstructured text) into pandas Data Frame, and I am primarily interested in three columns: position, team and points. So what I hope the outcome to look would be this: -> Expected outcome

I can create the first two columns without issues because they follow a clear structure: integer -> '.' -> text. But I'm struggling extracting 'end-of-season points' from all the other numbers that are irrelevant.

Maybe someone knows how to approach this. Thanks!

Here's my logic:

import re
from tabulate import tabulate
import pandas as pd

# Raw data provided
raw_data = my_unstructured_text

# Remove lines of hyphens
raw_data = re.sub(r'-{40,}', '', raw_data)

# Remove brackets and content inside them
raw_data = re.sub(r'\[.*?\]', '', raw_data)

# Remove "Champions" and "Relegated" words
raw_data = raw_data.replace('Champions', '').replace('Relegated', '')

# Split the data into rows
rows = raw_data.strip().split('\n')

# Split each row into columns
cleaned_data = [re.split(r'\s{2,}', row.strip()) for row in rows]

# Split the first column into position and team name
cleaned_data = [col[0].split('.', 1) + col[1:] for col in cleaned_data if col]

# Define column headers
headers = ['pos', 'tm', 'played', 'a', 'b', 'c', 'gd', 'pts']

# Create a DataFrame
df = pd.DataFrame(cleaned_data, columns=headers)

print(tabulate(df, headers='keys', tablefmt='psql'))
2

There are 2 answers

0
Timeless On BEST ANSWER

You can findall the relevant columns (using this regex), then make a DataFrame :

import re

pat = (
    r"(\d+)\.(.+?)"  # get pos & team
    r"\s+(?=\d).+?"  # skip middle columns
    r"(\d+)(?=$|\s+(?:\[|[A-Z]))"  # get points
)

df = pd.DataFrame(
    re.findall(pat, my_unstructured_text, flags=re.M),
    columns=["pos", "tm", "pts"],
)

Output :

print(df.to_string(index=False))

pos                            tm pts
  1                  FC Barcelona  93
  2            Atlético de Madrid  79
  3                Real Madrid CF  76
  4                   Valencia CF  73
  5                 Villarreal CF  61
  6 Real Betis Balompié (Sevilla)  60
  7                    Sevilla FC  58
  8                     Getafe CF  55
  9                      SD Eibar  51
 10                     Girona FC  51
 11      RCD Espanyol (Barcelona)  49
 12 Real Sociedad (San Sebastián)  49
 13               RC Celta (Vigo)  49
 14    Deportivo Alavés (Vitoria)  47
 15         Levante UD (Valencia)  46
 16            Athletic de Bilbao  43
 17                    CD Leganés  43
 18      RC Deportivo (La Coruña)  29
 19                 UD Las Palmas  22
 20                     Málaga CF  20
0
e-motta On

You should change your approach and parse regex groups instead. Then you can use the groups to extract the data you need. You can read more about here.

import pandas as pd
import re

# Define the regex pattern to get all the regex groups in the data
pattern = r"^(\s*\d+)\.([^\d]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)-(\d+)\s+(\d+)(?:\s+\[.*\])?\s*"

# Initialize lists to store data
positions = []
teams = []
total_points = []

# Iterate through each row of data
for row in data.strip().split("\n"):
    match = re.match(pattern, row)
    if match:
        # Extract data from the match object
        position = int(match.group(1))
        team = match.group(2).strip()
        points = int(match.group(9))

        # Append the extracted data to the respective lists
        positions.append(position)
        teams.append(team)
        total_points.append(points)

# Create a pandas DataFrame
df = pd.DataFrame(
    {
        "Position": positions,
        "Team": teams,
        "Points": total_points,
    }
)

print(tabulate(df, headers="keys", tablefmt="psql"))

The result will be:

+----+------------+-------------------------------+----------+
|    |   Position | Team                          |   Points |
|----+------------+-------------------------------+----------|
|  0 |          1 | FC Barcelona                  |       93 |
|  1 |          2 | Atlético de Madrid            |       79 |
|  2 |          3 | Real Madrid CF                |       76 |
|  3 |          4 | Valencia CF                   |       73 |
|  4 |          5 | Villarreal CF                 |       61 |
|  5 |          6 | Real Betis Balompié (Sevilla) |       60 |
|  6 |          7 | Sevilla FC                    |       58 |
|  7 |          8 | Getafe CF                     |       55 |
|  8 |          9 | SD Eibar                      |       51 |
|  9 |         10 | Girona FC                     |       51 |
| 10 |         11 | RCD Espanyol (Barcelona)      |       49 |
| 11 |         12 | Real Sociedad (San Sebastián) |       49 |
| 12 |         13 | RC Celta (Vigo)               |       49 |
| 13 |         14 | Deportivo Alavés (Vitoria)    |       47 |
| 14 |         15 | Levante UD (Valencia)         |       46 |
| 15 |         16 | Athletic de Bilbao            |       43 |
| 16 |         17 | CD Leganés                    |       43 |
| 17 |         18 | RC Deportivo (La Coruña)      |       29 |
| 18 |         19 | UD Las Palmas                 |       22 |
| 19 |         20 | Málaga CF                     |       20 |
+----+------------+-------------------------------+----------+

You can use the other groups to get the rest of the data.