First, I am new to R programming, so much of my problem may be a misunderstanding of the basics.
I am working on research in college football and am trying to automate the calculation of the standard deviation within each conference by year.
My current dataframe is formatted with these variables:
Year, Conference, College, Wins, Losses, & w_pct (which is a calculation from wins and losses).
Here is a sample:
My question is primarily about grouping and making the standard deviation calculation of w_pct within each group (Year/Conference).
I have attempted group_by many times and in many ways, but when I add the stats::sd function to it, it either returns an error or calculates one standard deviation for the entire dataset instead of one per year and conference.
Is there a better/easier/more efficient way to do this? Or do I really need to create separate dataframes for each year/conference?
Any help is greatly appreciated!
Thanks!
Chris
Since we don't have real data, I will make up some sample data. These will not be "true" win-loss records, which would require defining winning and losing teams, etc., but the idea and code should apply just as well to proper win-loss records as to random data.
Sample Data Creation
First, create five years of data for four conferences, each with six teams. Therefore, each year will have 24 entries.
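The original data-creation code is not shown here, so the following is only a minimal sketch of one way to build data of that shape. The conference and team labels, the 12-game season, the year range, and the seed are all assumptions made for illustration.

```r
# Hypothetical sample data: 5 years x 4 conferences x 6 teams = 120 rows.
# Labels, season length, years, and seed are illustrative assumptions.
set.seed(42)   # arbitrary seed so the random records are reproducible
games <- 12L   # assumed number of games per season

dat <- expand.grid(
  College    = paste0("Team", 1:6),
  Conference = paste0("Conf", LETTERS[1:4]),
  Year       = 2011:2015,
  stringsAsFactors = FALSE
)

# Make team names unique across conferences by prefixing the conference.
dat$College <- paste(dat$Conference, dat$College, sep = "_")

# Random wins; losses are the remainder of the season.
dat$Wins   <- sample(0:games, nrow(dat), replace = TRUE)
dat$Losses <- games - dat$Wins
dat$w_pct  <- dat$Wins / (dat$Wins + dat$Losses)

# Reorder columns to match the layout described in the question.
dat <- dat[, c("Year", "Conference", "College", "Wins", "Losses", "w_pct")]

head(dat)
```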
Extract:
data.table method
The easiest approach, in my opinion, would be to use data.table.
Prep
By Year
By Conference
By Year By Conference
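The prep step and the three groupings above could be sketched roughly as follows. This assumes the data.table package is installed; the small random data frame is a stand-in rebuilt here so the block runs on its own, and the output column name sd_w_pct is made up for illustration.

```r
library(data.table)

# Stand-in data (assumption): 5 years x 4 conferences x 6 teams, random w_pct.
set.seed(42)
dat <- expand.grid(College = 1:6, Conference = LETTERS[1:4], Year = 2011:2015)
dat$w_pct <- runif(nrow(dat))

# Prep: convert the data frame to a data.table.
DT <- as.data.table(dat)

# By Year: one standard deviation per year.
sd_year <- DT[, .(sd_w_pct = sd(w_pct)), by = Year]

# By Conference: one standard deviation per conference.
sd_conf <- DT[, .(sd_w_pct = sd(w_pct)), by = Conference]

# By Year By Conference: one standard deviation per year/conference pair.
# keyby also sorts the result by the grouping columns.
sd_both <- DT[, .(sd_w_pct = sd(w_pct)), keyby = .(Year, Conference)]

sd_both
```

Several summaries can be computed in the same call by adding more entries inside .( ), for example mean(w_pct) alongside sd(w_pct).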
base R methods
aggregate method
In base R, one can use aggregate. The syntax is a little more cumbersome because the grouping variables need to be passed as a list. Also, I believe one can only pass one function at a time. The examples below use sd.
By Year
By Conference
By Year By Conference
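A rough sketch of the three aggregate calls follows, using only base R. The random data frame is a stand-in rebuilt here so the block is self-contained; the names given inside list() are my own choice of labels for the output columns.

```r
# Stand-in data (assumption): 5 years x 4 conferences x 6 teams, random w_pct.
set.seed(42)
dat <- expand.grid(College = 1:6, Conference = LETTERS[1:4], Year = 2011:2015)
dat$w_pct <- runif(nrow(dat))

# By Year: the grouping variable must be wrapped in a list.
agg_year <- aggregate(dat$w_pct, by = list(Year = dat$Year), FUN = sd)

# By Conference.
agg_conf <- aggregate(dat$w_pct, by = list(Conference = dat$Conference), FUN = sd)

# By Year By Conference: both grouping variables go in the same list.
# In the result, the first listed variable (Year) varies fastest.
agg_both <- aggregate(dat$w_pct,
                      by  = list(Year = dat$Year, Conference = dat$Conference),
                      FUN = sd)

# The computed standard deviation is returned in a column named "x".
agg_both
```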
Note that in the aggregate output the first grouping variable is the one that changes quickest, whereas in the data.table output the first grouping variable is the one that changes slowest.
tapply method
If structure is less important, one can also use tapply.
By Year
By Conference
By Year By Conference
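The tapply versions might look something like this, again in base R with a stand-in random data frame rebuilt so the block runs by itself. Note that tapply returns a named vector for one grouping variable and a matrix (years in rows, conferences in columns) for two, rather than a data frame.

```r
# Stand-in data (assumption): 5 years x 4 conferences x 6 teams, random w_pct.
set.seed(42)
dat <- expand.grid(College = 1:6, Conference = LETTERS[1:4], Year = 2011:2015)
dat$w_pct <- runif(nrow(dat))

# By Year: named vector with one element per year.
tap_year <- tapply(dat$w_pct, dat$Year, sd)

# By Conference: named vector with one element per conference.
tap_conf <- tapply(dat$w_pct, dat$Conference, sd)

# By Year By Conference: matrix with years as rows, conferences as columns.
tap_both <- tapply(dat$w_pct, list(dat$Year, dat$Conference), sd)

tap_both
```

A single cell can then be looked up by name, for example tap_both["2011", "A"].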