Here's what I've done so far; I'm having difficulty figuring out the regression line.
- Before we get started, we want to generate two tables: one for 2002 and another for the average of the 1999-2001 seasons. We want to define per-plate-appearance statistics, keeping only players with at least 100 plate appearances. Here is how we create the 2002 table, followed by a similar table with rates computed over 1999-2001.
library(Lahman)
library(dplyr)
data("Batting")

# 2002 table: per-plate-appearance singles and BB rates, players with at least 100 PA
dat <- Batting %>%
  filter(yearID == 2002) %>%
  mutate(pa = AB + BB,
         singles = (H - X2B - X3B - HR) / pa,
         bb = BB / pa) %>%
  filter(pa >= 100) %>%
  select(playerID, singles, bb)

# 1999-2001 table: each player's per-season rates averaged across the three seasons
avg <- Batting %>%
  filter(yearID %in% 1999:2001) %>%
  mutate(pa = AB + BB,
         singles = (H - X2B - X3B - HR) / pa,
         bb = BB / pa) %>%
  filter(pa >= 100) %>%
  group_by(playerID) %>%
  summarise(avg_singles = mean(singles), avg_bb = mean(bb))
- Compute the correlation between 2002 and the previous seasons for singles and BB.
# join the 2002 rates to the 1999-2001 averages and compute the correlations
dat <- inner_join(dat, avg, by = "playerID")
rdat <- dat %>%
  summarise(singles_r = cor(singles, avg_singles), bb_r = cor(bb, avg_bb))
rdat
- Note that the correlation is higher for BB. To quickly get an idea of the uncertainty associated with these correlation estimates, we will fit a linear model and compute confidence intervals for the slope coefficient. However, first make scatterplots to confirm that fitting a linear model is appropriate.
library(ggplot2)
# scatterplots with the 1999-2001 average on the x-axis, since it is the predictor
dat %>%
  ggplot(aes(avg_singles, singles)) +
  geom_point(alpha = 0.5)

dat %>%
  ggplot(aes(avg_bb, bb)) +
  geom_point(alpha = 0.5)
- Now fit a linear model for each metric and use the confint function to compare the estimates.
I would use the lm function to solve this question, regressing the 2002 rate on the 1999-2001 average and passing the fit to confint, and likewise for bb. An example of what I have in mind is sketched below.
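Here is a minimal sketch of that step, assuming the dat table built above (columns singles, bb, avg_singles, avg_bb); the model formulas and the fit_singles/fit_bb names are my reading of the prompt rather than a confirmed solution.

# regress the 2002 singles rate on the 1999-2001 average and get a 95% CI for the slope
fit_singles <- lm(singles ~ avg_singles, data = dat)
confint(fit_singles)

# same idea for walks (BB) per plate appearance
fit_bb <- lm(bb ~ avg_bb, data = dat)
confint(fit_bb)

If the interval for the avg_bb slope is tighter relative to its estimate than the one for avg_singles, that is consistent with the higher correlation seen for BB; summary(fit_singles) and summary(fit_bb) also report the standard errors directly.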