I have a set of dataframes/entity set for rugby league/sports data: players, teams, venues, games, team_stats and player_stats
players: player_id, player_name
teams: team_id, team_name
games: game_id, venue_id
venues: venue_id, venue_coordinates
team_stats: game_id, team_id + relevant team statistics for that game
player_stats: game_id, team_id, player_id + relevant player statistics for that game
The end goal is to generate features for a given team in a given game based on aggregated historical player statistics. e.g. game_id, team_id and for example sum of the player mean of a given statistic
As a worked example if we had 2 players in a team who both had historical values of [1, 2] for a given statistic I would like that to show as sum of mean statistic being 3 for the team.
What would be the most efficient way to achieve this?
I have tried using group by trans primitives to no avail. I'm not sure if I need to add team_id into the players dataframe?
If I understand your question correctly, what you are trying to do is aggregate player stats from the
player_statstable to theteam_statstable. One way to do this would be to create a relationship between those two tables and use various aggregation primitives.That relationship can be defined by the combination of
(team_id, game_id)in the two tables. Unfortunately, Featuretools does not support defining relationships using multiple columns in one table, so you probably need to create a new column that can be used to define the relationship.Here is one way you could do this:
This will generate a dataframe like this: