How to write a custom function for extracting predictions from `effects::Effect()`

213 views Asked by At

I want to write a function that takes in data and runs a multinomial regression (using nnet::multinom), then extracts a focal prediction (using Effects::effect). While I'm able to get it done with regular code, a custom function fails.

Example

Background

I conduct a research to find which type of color people like the most: red, green, or blue. I sample 200 people, and ask them to choose the one color they like the most. Because I suspect that some variables might confound the results, I measure them as well: (1) sex, (2) color blindness, and (3) age.

Method

I will run a multinomial regression using nnet::multinom and then extract a focal prediction from that model (using Effects::effect) which will account for specific values for sex, color blindness, and age.

Data

library(tidyverse)

set.seed(2020)

df <-
  data.frame(person_id = 1:200,
             chosen_color = sample(c("red", "green", "blue"), size = 200, replace = TRUE),
             age = sample(18:80, size = 200, replace = TRUE),
             is_colorblind = sample(c(0, 1), prob = c(0.2, 0.8), size = 200, replace = TRUE),
             is_female = sample(c(0, 1), prob = c(0.3, 0.7), size = 200, replace = TRUE)
           )

as_tibble(df)

## # A tibble: 200 x 5
##    person_id chosen_color   age is_colorblind is_female
##        <int> <chr>        <int>         <dbl>     <dbl>
##  1         1 blue            57             1         0
##  2         2 blue            51             1         0
##  3         3 blue            38             1         1
##  4         4 red             30             1         1
##  5         5 green           78             1         1
##  6         6 red             72             1         0
##  7         7 green           63             1         1
##  8         8 green           69             0         0
##  9         9 red             57             1         0
## 10        10 blue            20             0         1
## # ... with 190 more rows

What is the proportion of popularity for each color?

(A) The simple yet likely inaccurate method

Just find the most frequent color in chosen color:

df %>%
  group_by(chosen_color) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))

## # A tibble: 3 x 3
##   chosen_color     n  freq
##   <chr>        <int> <dbl>
## 1 blue            76  0.38
## 2 green           60  0.3 
## 3 red             64  0.32

Since I want to find insights that are general to the entire population, I have little faith in the accuracy of the table I got. This is because my sample is not representative. In my sample, 20% of the people are colorblinded and 70% are female. If I have a reason to believe that sex and colorblindedness might influence color popularity, then this sample is problematic.

(B) Accounting and correcting for sample (un)representativeness

Using regression I can: (1) model the relationship between color preference and demographics variables, and (2) predict a "corrected" average response based on the demographic values that occur in the population (but not necessarily in my sample). Since my variable of interest is nominal, I'm using a multinomial regression (with `nnet::multinom`).

1. Fit the model

library(nnet)

fit <-
  nnet::multinom(chosen_color ~ age + is_colorblind + is_female,
                 data = df)

2. Define a vector with the "corrected" values as they happen to be in the population level, to use in the prediction step.

  • age -- I know that the average age in the population is 45.
  • sex -- I know that sex is roughly at 50% split, thus 0.5.
  • color blindness -- I know that on average, 2% of the population is color blinded (say). Hence 0.02.
one_average_person <- 
  c(age = 45,
    is_female = 0.5,
    is_colorblind = 0.02
  )

3. Use a prediction function to get a focal prediction for each color, given the values in one_average_person.

I've found only effects::Effect to play well with a model generated from nnet::multinom. Still, since I couldn't find a straightforward way to get a focal prediction for the values I specify, I ended up with a workaround. In the following code, age is the "focal" predictor, but I also specify the other variables using the given.values argument. Furthermore, I can't just ask for age = 45 because Effect cannot take a single value, so I ask for a prediction for both age = 45 and age = 90. Then I remove the prediction for 90 because I don't need it.

library(effects)

prediction <- 
  effects::Effect("age", 
                  fit, 
                  given.values = one_average_person, 
                  xlevels = list(age = c(45,90)))


wrangled_prediction_data <-
  data.frame(prediction$prob, prediction$lower.prob, prediction$upper.prob) %>% 
  slice(1) %>%  ## <----- here I remove the unnecessary prediction for age = 90
  pivot_longer(., cols = everything(), 
               names_to = c(".value", "response"), 
               names_pattern = "(.*)\\.(.*$)") %>%
  rename("lower_ci" = "L.prob",
         "upper_ci" = "U.prob",
         "estimate" = "prob")


> wrangled_prediction_data

## # A tibble: 3 x 4
##   response estimate lower_ci upper_ci
##   <chr>       <dbl>    <dbl>    <dbl>
## 1 blue        0.474    0.328    0.625
## 2 green       0.290    0.172    0.445
## 3 red         0.236    0.129    0.391

The values in the table reflect the popularity of each color, when accounting for the situation in the population level.

Writing a function to streamline the regression + prediction process above

While I had to do some gymnastics with Effect to get what I need (please give feedback on that if you see a better way than my awkward code), I want to write a function to make this work more concise.

My unsuccessful function

As you can see, I'm limited to using age as a predictor, so I ended up building the function around age. In reality, this is far from ideal, because not always will I have age in my data. But my function is not working regardless of that. The reason for this difficulty is that "age" is entered as a string in focal.predictors argument, but as a variable in xlevels (within a list). I tried using double curly brackets (of tidy evaluation), but still unsuccessful.

require(dplyr)
require(nnet)
require(effects)

analyze_multiple_choice_w_age <-
  function(data,
           vars_demog,
           vars_dv,
           age_var_for_Effect,
           ave_age,
           one_ave_person_vec) {
    fit <-
      data %>%
      nnet::multinom(
        data = .,
        formula = as.formula(
        paste(
          vars_dv,
          paste(names(select({{ data }}, vars_demog )), collapse = " + "),
          sep = " ~ "
        )) 
        )
    
    prediction <-
      effects::Effect(
        focal.predictors = age_var_for_Effect,
        mod = fit,
        given.values = one_average_person,
        xlevels = list(age_var_for_Effect = c(ave_age, 90)
        )
      )
    
    return(prediction)

  }

Any ideas on making this function to work?

1

There are 1 answers

0
starja On BEST ANSWER

Here is a version of your function that works if you provide all variable names as strings:

set.seed(2020)

df <-
  data.frame(person_id = 1:200,
             chosen_color = sample(c("red", "green", "blue"), size = 200, replace = TRUE),
             age = sample(18:80, size = 200, replace = TRUE),
             is_colorblind = sample(c(0, 1), prob = c(0.2, 0.8), size = 200, replace = TRUE),
             is_female = sample(c(0, 1), prob = c(0.3, 0.7), size = 200, replace = TRUE)
  )

require(dplyr)
require(nnet)
require(effects)
library(rlang)

analyze_multiple_choice_w_age <-
  function(data,
           vars_demog,
           vars_dv,
           age_var_for_Effect,
           ave_age,
           one_ave_person_vec) {
    fit <-
      data %>%
      nnet::multinom(
        data = .,
        formula = as.formula(
          paste(
            vars_dv,
            paste(vars_demog, collapse = " + "),
            sep = " ~ "
          )) 
      )
    
    prediction <-
      effects::Effect(
        focal.predictors = age_var_for_Effect,
        mod = fit,
        given.values = one_ave_person_vec,
        xlevels = list2(!!age_var_for_Effect := c(ave_age, 90)
        )
      )
    
    return(prediction)
    
  }

test <- analyze_multiple_choice_w_age(
  data = df,
  vars_demog = c("age", "is_colorblind", "is_female"),
  vars_dv = "chosen_color",
  age_var_for_Effect = "age",
  ave_age = 45,
  one_ave_person_vec = c(age = 45,
                         is_female = 0.5,
                         is_colorblind = 0.02
  )
)


test

age effect (probability) for blue
age
       45        90 
0.3030466 0.2604459 

age effect (probability) for green
age
       45        90 
0.3992617 0.5270109 

age effect (probability) for red
age
       45        90 
0.2976917 0.2125432 

What I changed:

  • as.formula can directly work with strings, so I simplified this
  • from rlang, I use !! to force the evaluation of age_var_for_Effect to use this as the variable name in the list. You can use := from rlang to assign a (forced) name as the variable name of a list, however this does not work in the normal list but in rlang::list2