I want to write a function that takes in data and runs a multinomial regression (using nnet::multinom
), then extracts a focal prediction (using Effects::effect
). While I'm able to get it done with regular code, a custom function fails.
Example
Background
I conduct a research to find which type of color people like the most: red, green, or blue. I sample 200 people, and ask them to choose the one color they like the most. Because I suspect that some variables might confound the results, I measure them as well: (1) sex, (2) color blindness, and (3) age.
Method
I will run a multinomial regression using nnet::multinom
and then extract a focal prediction from that model (using Effects::effect
) which will account for specific values for sex, color blindness, and age.
Data
library(tidyverse)
set.seed(2020)
df <-
data.frame(person_id = 1:200,
chosen_color = sample(c("red", "green", "blue"), size = 200, replace = TRUE),
age = sample(18:80, size = 200, replace = TRUE),
is_colorblind = sample(c(0, 1), prob = c(0.2, 0.8), size = 200, replace = TRUE),
is_female = sample(c(0, 1), prob = c(0.3, 0.7), size = 200, replace = TRUE)
)
as_tibble(df)
## # A tibble: 200 x 5
## person_id chosen_color age is_colorblind is_female
## <int> <chr> <int> <dbl> <dbl>
## 1 1 blue 57 1 0
## 2 2 blue 51 1 0
## 3 3 blue 38 1 1
## 4 4 red 30 1 1
## 5 5 green 78 1 1
## 6 6 red 72 1 0
## 7 7 green 63 1 1
## 8 8 green 69 0 0
## 9 9 red 57 1 0
## 10 10 blue 20 0 1
## # ... with 190 more rows
What is the proportion of popularity for each color?
(A) The simple yet likely inaccurate method
Just find the most frequent color in chosen color
:
df %>%
group_by(chosen_color) %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n))
## # A tibble: 3 x 3
## chosen_color n freq
## <chr> <int> <dbl>
## 1 blue 76 0.38
## 2 green 60 0.3
## 3 red 64 0.32
Since I want to find insights that are general to the entire population, I have little faith in the accuracy of the table I got. This is because my sample is not representative. In my sample, 20% of the people are colorblinded and 70% are female. If I have a reason to believe that sex and colorblindedness might influence color popularity, then this sample is problematic.
(B) Accounting and correcting for sample (un)representativeness
Using regression I can: (1) model the relationship between color preference and demographics variables, and (2) predict a "corrected" average response based on the demographic values that occur in the population (but not necessarily in my sample). Since my variable of interest is nominal, I'm using a multinomial regression (with `nnet::multinom`).1. Fit the model
library(nnet)
fit <-
nnet::multinom(chosen_color ~ age + is_colorblind + is_female,
data = df)
2. Define a vector with the "corrected" values as they happen to be in the population level, to use in the prediction step.
- age -- I know that the average age in the population is 45.
- sex -- I know that sex is roughly at 50% split, thus 0.5.
- color blindness -- I know that on average, 2% of the population is color blinded (say). Hence 0.02.
one_average_person <-
c(age = 45,
is_female = 0.5,
is_colorblind = 0.02
)
3. Use a prediction function to get a focal prediction for each color, given the values in one_average_person
.
I've found only effects::Effect
to play well with a model generated from nnet::multinom
. Still, since I couldn't find a straightforward way to get a focal prediction for the values I specify, I ended up with a workaround. In the following code, age
is the "focal" predictor, but I also specify the other variables using the given.values
argument. Furthermore, I can't just ask for age = 45
because Effect
cannot take a single value, so I ask for a prediction for both age = 45
and age = 90
. Then I remove the prediction for 90
because I don't need it.
library(effects)
prediction <-
effects::Effect("age",
fit,
given.values = one_average_person,
xlevels = list(age = c(45,90)))
wrangled_prediction_data <-
data.frame(prediction$prob, prediction$lower.prob, prediction$upper.prob) %>%
slice(1) %>% ## <----- here I remove the unnecessary prediction for age = 90
pivot_longer(., cols = everything(),
names_to = c(".value", "response"),
names_pattern = "(.*)\\.(.*$)") %>%
rename("lower_ci" = "L.prob",
"upper_ci" = "U.prob",
"estimate" = "prob")
> wrangled_prediction_data
## # A tibble: 3 x 4
## response estimate lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl>
## 1 blue 0.474 0.328 0.625
## 2 green 0.290 0.172 0.445
## 3 red 0.236 0.129 0.391
The values in the table reflect the popularity of each color, when accounting for the situation in the population level.
Writing a function to streamline the regression + prediction process above
While I had to do some gymnastics with Effect
to get what I need (please give feedback on that if you see a better way than my awkward code), I want to write a function to make this work more concise.
My unsuccessful function
As you can see, I'm limited to using age
as a predictor, so I ended up building the function around age
. In reality, this is far from ideal, because not always will I have age in my data. But my function is not working regardless of that. The reason for this difficulty is that "age" is entered as a string in focal.predictors
argument, but as a variable in xlevels
(within a list). I tried using double curly brackets (of tidy evaluation), but still unsuccessful.
require(dplyr)
require(nnet)
require(effects)
analyze_multiple_choice_w_age <-
function(data,
vars_demog,
vars_dv,
age_var_for_Effect,
ave_age,
one_ave_person_vec) {
fit <-
data %>%
nnet::multinom(
data = .,
formula = as.formula(
paste(
vars_dv,
paste(names(select({{ data }}, vars_demog )), collapse = " + "),
sep = " ~ "
))
)
prediction <-
effects::Effect(
focal.predictors = age_var_for_Effect,
mod = fit,
given.values = one_average_person,
xlevels = list(age_var_for_Effect = c(ave_age, 90)
)
)
return(prediction)
}
Any ideas on making this function to work?
Here is a version of your function that works if you provide all variable names as strings:
What I changed:
as.formula
can directly work with strings, so I simplified thisrlang
, I use!!
to force the evaluation ofage_var_for_Effect
to use this as the variable name in the list. You can use:=
fromrlang
to assign a (forced) name as the variable name of a list, however this does not work in the normallist
but inrlang::list2