Computing standard deviation by groups by year and conference

48 views Asked by At

First, I am new to R programming, so much of my problem may be a misunderstanding of the basics.

I am working on research in college football and am trying to automate the calculation of the standard deviation within each conference by year.

My current dataframe is formatted with these variables:

Year, Conference, College, Wins, Losses, & w_pct (which is a calculation from wins and losses).

Here is a sample:

enter image description here

My question is primarily about grouping and making the standard deviation calculation of w_pct within each group (Year/Conference).

I have attempted group_by many times and ways, but when I add the stats::SD function to it, it either returns an error, or calculates one standard deviation for the entire data instead of by year and conference.

Is there a better/easier/more efficient way to do this? Or do I really need to create separate dataframes for each year/conference?

Any help is greatly appreciated!

Thanks!

Chris

1

There are 1 answers

0
Avraham On

Since we don't have real data, I will make up some sample data. This will not be "true" win-loss records which requires defining winning and losing teams, etc., but the idea and code should apply just as well to proper win-loss records as to random data.

Sample Data Creation

First, create five years of data for four conferences, each with six teams. Therefore, each year will have 24 entries.

set.seed(14L)
# Five years with 24 entries each 
Year <- rep(2020:2024, each = 24L)
# Four conferences with six teams each for the five years
Conference <- rep(c("North", "South", "East", "West"), each = 6L, times = 5L)
# 24 teams
Teams <- LETTERS[1:24]
# Wins for each team in each year (120 in total)
Wins <- sample(0:8, 120, replace = TRUE)
Losses = 8 - Wins
WP <- Wins / 8
Hist <- data.frame(Year = Year,
                   Conference = Conference,
                   Teams = Teams,
                   Wins = Wins,
                   Losses = Losses,
                   WP = WP)

Extract:

> head(Hist)
  Year Conference Teams Wins Losses    WP
1 2020      North     A    8      0 1.000
2 2020      North     B    8      0 1.000
3 2020      North     C    3      5 0.375
4 2020      North     D    3      5 0.375
5 2020      North     E    0      8 0.000
6 2020      North     F    8      0 1.000

> tail(Hist)
    Year Conference Teams Wins Losses    WP
115 2024       West     S    5      3 0.625
116 2024       West     T    8      0 1.000
117 2024       West     U    2      6 0.250
118 2024       West     V    8      0 1.000
119 2024       West     W    8      0 1.000
120 2024       West     X    1      7 0.125

data.table method

The easiest approach, in my opinion, would be to use data.table.

Prep

library(data.table)
setDT(Hist)

By Year

> Hist[, .(mean = mean(WP), sd = sd(WP)), keyby = Year]
   Year      mean        sd
1: 2020 0.5104167 0.3356464
2: 2021 0.5052083 0.3387942
3: 2022 0.5416667 0.2823299
4: 2023 0.5416667 0.3142163
5: 2024 0.5104167 0.3553104
> 

By Conference

> Hist[, .(mean = mean(WP), sd = sd(WP)), keyby = Conference]
   Conference      mean        sd
1:       East 0.4666667 0.3043034
2:      North 0.5958333 0.3500051
3:      South 0.4833333 0.3021713
4:       West 0.5416667 0.3255190

By Year By Conference

> Hist[, .(mean = mean(WP), sd = sd(WP)), keyby = c("Year", "Conference")]
    Year Conference      mean        sd
 1: 2020       East 0.6250000 0.2622022
 2: 2020      North 0.6250000 0.4330127
 3: 2020      South 0.2916667 0.2922613
 4: 2020       West 0.5000000 0.2958040
 5: 2021       East 0.4791667 0.2671220
 6: 2021      North 0.4791667 0.4063301
 7: 2021      South 0.6458333 0.3825626
 8: 2021       West 0.4166667 0.3322900
 9: 2022       East 0.5625000 0.2931510
10: 2022      North 0.4375000 0.2709935
11: 2022      South 0.4583333 0.2188988
12: 2022       West 0.7083333 0.3227486
13: 2023       East 0.4583333 0.3322900
14: 2023      North 0.8541667 0.2002602
15: 2023      South 0.4375000 0.2931510
16: 2023       West 0.4166667 0.2457980
17: 2024       East 0.2083333 0.2813657
18: 2024      North 0.5833333 0.3415650
19: 2024      South 0.5833333 0.2700309
20: 2024       West 0.6666667 0.4005205

base R methods

aggregate method

In base R, one can use aggregate. The syntax is a little bit more cumbersome due to the need for grouping variable to be lists. Also, I believe one can only pass one function at a time. Below will use sd.

By Year

aggregate(Hist$WP, by = list(Hist$Year), sd)
> aggregate(Hist$WP, by = list(Hist$Year), sd)
  Group.1         x
1    2020 0.3356464
2    2021 0.3387942
3    2022 0.2823299
4    2023 0.3142163
5    2024 0.3553104

By Conference

> aggregate(Hist$WP, by = list(Hist$Conference), sd)
  Group.1         x
1    East 0.3043034
2   North 0.3500051
3   South 0.3021713
4    West 0.3255190

By Year By Conference

Note that the quickest changing variable here is first where in data.table the slowest one is first.

> aggregate(Hist$WP, by = list(Hist$Conference, Hist$Year), sd)
   Group.1 Group.2         x
1     East    2020 0.2622022
2    North    2020 0.4330127
3    South    2020 0.2922613
4     West    2020 0.2958040
5     East    2021 0.2671220
6    North    2021 0.4063301
7    South    2021 0.3825626
8     West    2021 0.3322900
9     East    2022 0.2931510
10   North    2022 0.2709935
11   South    2022 0.2188988
12    West    2022 0.3227486
13    East    2023 0.3322900
14   North    2023 0.2002602
15   South    2023 0.2931510
16    West    2023 0.2457980
17    East    2024 0.2813657
18   North    2024 0.3415650
19   South    2024 0.2700309
20    West    2024 0.4005205

tapply method

If structure is less important one can also use tapply.

By Year

> tapply(Hist$WP, Hist$Year, sd)
     2020      2021      2022      2023      2024 
0.3356464 0.3387942 0.2823299 0.3142163 0.3553104 

By Conference

> tapply(Hist$WP, Hist$Conference, sd)
     East     North     South      West 
0.3043034 0.3500051 0.3021713 0.3255190

By Year By Conference

> tapply(Hist$WP, list(Hist$Year, Hist$Conference), sd)
          East     North     South      West
2020 0.2622022 0.4330127 0.2922613 0.2958040
2021 0.2671220 0.4063301 0.3825626 0.3322900
2022 0.2931510 0.2709935 0.2188988 0.3227486
2023 0.3322900 0.2002602 0.2931510 0.2457980
2024 0.2813657 0.3415650 0.2700309 0.4005205