I want to compute predictions for a fixed effects ols regression by group with the marginaleffects package. When I generate new data for the prediction and do not specify the specific level of the fixed effects variables that I want the prediction to use, the prediction function calculates the values at am==0, gear==0 and carb==4 (see below).
Where do these values come from? What's the most sensible way to calculate average predictions in the presence of fixed effects in this example?
library(tidyverse)
library(fixest)
library(marginaleffects)
dat <- mtcars
m1 <- dat %>%
mutate(cyl = as_factor(cyl)) %>%
feols(mpg ~ hp*cyl | am + gear + carb)
etable(m1)
#> m1
#> Dependent Var.: mpg
#>
#> hp -0.1316 (0.0504)
#> cyl6 -47.87 (27.42)
#> cyl8 -21.31 (14.80)
#> hp x cyl6 0.4434 (0.2619)
#> hp x cyl8 0.1471 (0.0665)
#> Fixed-Effects: ----------------
#> am Yes
#> gear Yes
#> carb Yes
#> _______________ ________________
#> S.E.: Clustered by: am
#> Observations 32
#> R2 0.88892
#> Within R2 0.38565
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pred1 <- predictions(m1,
newdata = datagrid(
hp = c(min(dat$hp):max(dat$hp)),
cyl = levels(dat$cyl)))
pred1
#>
#> Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 % cyl am gear carb hp
#> 9.76 8.72 1.119 0.263 1.9 -7.33 26.9 8 0 3 4 52
#> 9.78 8.60 1.136 0.256 2.0 -7.09 26.6 8 0 3 4 53
#> 9.79 8.49 1.154 0.249 2.0 -6.84 26.4 8 0 3 4 54
#> 9.81 8.37 1.172 0.241 2.1 -6.60 26.2 8 0 3 4 55
#> 9.82 8.25 1.190 0.234 2.1 -6.35 26.0 8 0 3 4 56
#> --- 274 rows omitted. See ?avg_predictions and ?print.marginaleffects ---
#> 14.09 23.89 0.590 0.555 0.8 -32.73 60.9 8 0 3 4 331
#> 14.11 24.01 0.588 0.557 0.8 -32.94 61.2 8 0 3 4 332
#> 14.12 24.12 0.586 0.558 0.8 -33.15 61.4 8 0 3 4 333
#> 14.14 24.24 0.583 0.560 0.8 -33.37 61.6 8 0 3 4 334
#> 14.16 24.36 0.581 0.561 0.8 -33.58 61.9 8 0 3 4 335
#> Columns: rowid, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, mpg, cyl, am, gear, carb, hp
Created on 2023-08-22 with reprex v2.0.2
You ask:
(I think you meant gear=6. Must be a typo.)
When a variable is not explicitly specified by name in
datagrid, the value is picked by a function which depends on the variable class. Please read?datagridand check out theFUN_character,FUN_numeric, and other arguments to that function.In
fixestmodels, the variables after the|are treated as factor variables, so the default is to take the mode. When there is more than one mode, the choice is arbitrary (I think it’s the first one to appear in the dataset, but not 100% sure). The user can of course customize this by specifying the relevantFUN_*argument.You ask:
In my opinion, it is typically most sensible to not specify the
newdataargument at all when computing average predictions by group. When you leave,newdata=NULL, the function will compute predictions for each row of the original data, and then average them by group. This default behavior gives you predicted values averaged across the empirical distribution of the covariates.That said, this is a statistical question whose answer may vary depending on your specific research context and application. If I were you, I would consult a local statistician or open a new question on Cross Validated (not Stack Overflow) with a lot more detail about the specific research question that you hope to answer.