basically I have a lot of data that represents the final standings table for each La Liga season starting from 1950. The structure of the table remains relatively consistent across different seasons. It looks something like this
1.FC Barcelona 38 28 9 1 99-29 93 Champions
2.Atlético de Madrid 38 23 10 5 58-22 79
3.Real Madrid CF 38 22 10 6 94-44 76 [C]
4.Valencia CF 38 22 7 9 65-38 73
5.Villarreal CF 38 18 7 13 57-50 61
6.Real Betis Balompié (Sevilla) 38 18 6 14 60-61 60
7.Sevilla FC 38 17 7 14 49-58 58
8.Getafe CF 38 15 10 13 42-33 55 [P]
9.SD Eibar 38 14 9 15 44-50 51 [2 2 0 0 8-2 6]
10.Girona FC 38 14 9 15 50-59 51 [P] [2 0 0 2 2-8 0]
11.RCD Espanyol (Barcelona) 38 12 13 13 36-42 49 [4 2 2 0 7-5 8]
12.Real Sociedad (San Sebastián) 38 14 7 17 66-59 49 [4 1 1 2 6-7 4]
13.RC Celta (Vigo) 38 13 10 15 59-60 49 [4 1 1 2 7-8 3]
14.Deportivo Alavés (Vitoria) 38 15 2 21 40-50 47
15.Levante UD (Valencia) 38 11 13 14 44-58 46 [P]
16.Athletic de Bilbao 38 10 13 15 41-49 43 [2 1 0 1 2-1 3]
17.CD Leganés 38 12 7 19 34-51 43 [2 1 0 1 1-2 3]**
--------------------------------------------------------**
18.RC Deportivo (La Coruña) 38 6 11 21 38-76 29 Relegated
19.UD Las Palmas 38 5 7 26 24-74 22 Relegated
20.Málaga CF 38 5 5 28 24-61 20 Relegated
I want to convert this data (unstructured text) into pandas Data Frame, and I am primarily interested in three columns: position, team and points. So what I hope the outcome to look would be this: -> Expected outcome
I can create the first two columns without issues because they follow a clear structure: integer -> '.' -> text. But I'm struggling extracting 'end-of-season points' from all the other numbers that are irrelevant.
Maybe someone knows how to approach this. Thanks!
Here's my logic:
import re
from tabulate import tabulate
import pandas as pd
# Raw data provided
raw_data = my_unstructured_text
# Remove lines of hyphens
raw_data = re.sub(r'-{40,}', '', raw_data)
# Remove brackets and content inside them
raw_data = re.sub(r'\[.*?\]', '', raw_data)
# Remove "Champions" and "Relegated" words
raw_data = raw_data.replace('Champions', '').replace('Relegated', '')
# Split the data into rows
rows = raw_data.strip().split('\n')
# Split each row into columns
cleaned_data = [re.split(r'\s{2,}', row.strip()) for row in rows]
# Split the first column into position and team name
cleaned_data = [col[0].split('.', 1) + col[1:] for col in cleaned_data if col]
# Define column headers
headers = ['pos', 'tm', 'played', 'a', 'b', 'c', 'gd', 'pts']
# Create a DataFrame
df = pd.DataFrame(cleaned_data, columns=headers)
print(tabulate(df, headers='keys', tablefmt='psql'))
You can
findall
the relevant columns (using this regex), then make aDataFrame
:Output :