Using mlogit in R with variables that only apply to certain alternatives


I am attempting to use mlogit in R to estimate a transportation mode choice model. The problem is that I have a variable that only applies to certain alternatives.

To be more specific, I am attempting to predict the probability of using auto, transit and non-motorized modes of transportation. My predictors are: distance, transit wait time, number of vehicles in the household and in-vehicle travel time.

It works when I format it this way:

> amres<-mlogit(mode~ivt+board|distance+nveh,data=AMLOGIT)

However, the result I get for in-vehicle travel time (ivt) does not make sense:

> summary(amres)

Call:
mlogit(formula = mode ~ ivt + board | distance + nveh, data = AMLOGIT, 
    method = "nr", print.level = 0)

Frequencies of alternatives:
    auto   tansit nonmotor 
 0.24654  0.28378  0.46968 

nr method
5 iterations, 0h:0m:2s 
g'(-H)^-1g = 6.34E-08 
gradient close to zero 

Coefficients :
                        Estimate  Std. Error  t-value  Pr(>|t|)    
tansit:(intercept)    7.8392e-01  8.3761e-02   9.3590 < 2.2e-16 ***
nonmotor:(intercept)  3.2853e+00  7.1492e-02  45.9532 < 2.2e-16 ***
ivt                   1.6435e-03  1.2673e-04  12.9691 < 2.2e-16 ***
board                -3.9996e-04  1.2436e-04  -3.2161  0.001299 ** 
tansit:distance       3.2618e-04  2.0217e-05  16.1336 < 2.2e-16 ***
nonmotor:distance    -2.9457e-04  3.3772e-05  -8.7224 < 2.2e-16 ***
tansit:nveh          -1.5791e+00  4.5932e-02 -34.3799 < 2.2e-16 ***
nonmotor:nveh        -1.8008e+00  4.8577e-02 -37.0720 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Log-Likelihood: -10107
McFadden R^2:  0.30354 
Likelihood ratio test : chisq = 8810.1 (p.value = < 2.22e-16)

As you can see, the stats look great, but ivt should have a negative coefficient, not a positive one. My guess is that the non-motorized alternative, for which ivt is always 0, is affecting it. I believe what I have to do is use the third part of the formula, as seen below:

> amres<-mlogit(mode~board|distance+nveh|ivt,data=AMLOGIT)

However, this results in:

Error in solve.default(H, g[!fixed]) : 
  Lapack routine dgesv: system is exactly singular: U[10,10] = 0

I believe this is, again, because the variable is all 0's for non-motorized but I am unsure how to fix this. How do I include an alternative specific variable if it does not apply to all alternatives?
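
A quick way to test that suspicion, sketched on made-up data since AMLOGIT itself is not shown (the column names `id`, `alt` and `ivt` are assumptions about a long-format layout): if ivt never varies within an alternative, a coefficient specific to that alternative multiplies a column of zeros and cannot be estimated. With one ivt coefficient per alternative the model would have ten parameters (two intercepts, board, two each for distance and nveh, three for ivt), which would be consistent with the `U[10,10]` in the error, though that is an inference rather than something the output confirms.

```r
# Made-up long-format data: one row per (trip, alternative).
# The column names `id`, `alt`, and `ivt` are assumptions, not taken from AMLOGIT.
toy <- data.frame(
  id  = rep(1:4, each = 3),
  alt = rep(c("auto", "transit", "nonmotor"), times = 4),
  ivt = c(10, 25, 0,  12, 30, 0,  8, 20, 0,  15, 35, 0)
)

# Spread (max - min) of ivt within each alternative; a spread of 0 means
# the variable carries no information for that alternative.
ivt_spread <- vapply(split(toy$ivt, toy$alt),
                     function(v) diff(range(v)),
                     numeric(1))
print(ivt_spread)

# An alternative-specific coefficient multiplies ivt only on that
# alternative's rows, so an all-zero block gives an all-zero column.
x_nonmotor <- toy$ivt * (toy$alt == "nonmotor")
all(x_nonmotor == 0)
```

Running the same two checks on the real data would show whether the non-motorized rows are the only ones affected.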

There are 2 answers

ako On BEST ANSWER

I am not well versed in the various implementations of logit models, but I imagine it has to do with making sure you have variation across both persons and alternatives, so that the model matrix can be properly determined. What do you get from the following?

amres<-mlogit(mode~distance| nveh | ivt+board,data=AMLOGIT)

mlogit separates the formula into groups with pipes. As I understand it, the first part holds alternative-specific variables that share one generic coefficient, the second part holds variables that don't vary across alternatives (i.e. person-specific ones such as gender or income; I think nveh should be here), and the third part holds alternative-specific variables whose coefficients differ by alternative.
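
One possible workaround for a variable that is identically zero for one alternative, offered as a minimal sketch and not tested on the asker's data: interact ivt with dummies for the alternatives where it is defined, and put those columns in the first part of the formula. The data frame and the column names `alt`, `ivt_auto` and `ivt_transit` below are hypothetical; the data preparation is plain base R, and the mlogit call is left as a comment because AMLOGIT is not available.

```r
# Made-up long-format data; `alt` and `ivt` are assumed column names.
toy <- data.frame(
  id  = rep(1:3, each = 3),
  alt = rep(c("auto", "transit", "nonmotor"), times = 3),
  ivt = c(10, 25, 0,  12, 30, 0,  8, 20, 0)
)

# One column per alternative where ivt is actually defined.
toy$ivt_auto    <- toy$ivt * (toy$alt == "auto")
toy$ivt_transit <- toy$ivt * (toy$alt == "transit")

# Count the nonzero entries: neither new column is all zero,
# unlike ivt on the nonmotor rows.
colSums(toy[, c("ivt_auto", "ivt_transit")] != 0)

# Hypothetical call, not run here: both columns go in the first part,
# so no coefficient is requested for ivt on nonmotor.
# amres <- mlogit(mode ~ board + ivt_auto + ivt_transit | distance + nveh,
#                 data = AMLOGIT)
```

Each interacted column is nonzero for a single alternative, so it effectively gets its own coefficient without the model asking for one where ivt is always 0.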

Ken Train, incidentally, has a set of vignettes on mlogit specifically that might be helpful. Viton mentions the partition with pipes.

Ken Train's Vignettes

Philip Viton's Vignettes

Yves Croissant's Vignettes

dardisco On

Looks like you may have perfect separation. Have you checked this by e.g. looking at crosstables of the variables? (Can't fit a model if one combination of predictors allows for perfect prediction...) Would be helpful to know size of dataset in this regard - you may be over-fitting for the amount of data you have. This is a general problem in modelling, not specific to mlogit.
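
A minimal sketch of that cross-table check, on invented data (the variable names `mode` and `nveh` mirror the question, but the values are made up): tabulate the outcome against a discrete predictor and list the empty cells.

```r
# Made-up wide-format data; the values are invented for illustration.
toy <- data.frame(
  mode = c("auto", "auto", "transit", "nonmotor", "nonmotor", "transit"),
  nveh = c(2, 1, 0, 0, 0, 1)
)

# Cross-tabulate the outcome against a discrete predictor.
tab <- xtabs(~ nveh + mode, data = toy)
print(tab)

# Empty cells are combinations never observed in the data.
zero_cells <- which(tab == 0, arr.ind = TRUE)
nrow(zero_cells)
```

Empty cells in a small toy table are expected; in a large dataset they are worth a closer look, since a predictor value that only ever occurs with one outcome is what produces separation.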

You say "the stats look great", but the values for Pr(>|t|) and the likelihood ratio test look implausibly significant, which would be consistent with this problem. This means the estimates of the coefficients are likely to be inaccurate. (Are they similar to the coefficients produced by univariate modelling?) Perhaps a simpler model would be more appropriate.

Edit @user3092719 :

You're fitting a generalized linear model, which can easily be overfit (as the outcome variable is discrete or nominal - i.e. has a restricted no. of values). mlogit is an extension of logistic regression; here's a simple example of the latter to illustrate:

> df1 <- data.frame(x=c(0, rep(1, 3)),
                    y=rep(c(0, 1), 2))
> xtabs( ~ x + y, data=df1)
   y
x   0 1
  0 1 0
  1 1 2

Note the zero in the top right corner. This shows 'perfect separation', which means that if x=0 you know for sure that y=0 based on this set. So a probabilistic predictive model doesn't make much sense. Some output from

> summary(glm(y ~ x, data=df1, binomial(link = "logit")))

gives

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -18.57    6522.64  -0.003    0.998
x              19.26    6522.64   0.003    0.998

Here the Std. Errors are suspiciously large relative to the values of the coefficients. You should also be alerted by Number of Fisher Scoring iterations: 17 - the large number of iterations needed to fit suggests numerical instability.

Your solution seems to involve ensuring that this problem of complete separation does not occur in your model, although it is hard to be sure without a minimal working example.
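
The same warning signs can be pulled out of the fitted object rather than read off the summary. This is a small sketch on the toy `df1` from above; the names `fit` and `se` are just illustrative.

```r
# Same toy data as in the example above.
df1 <- data.frame(x = c(0, rep(1, 3)),
                  y = rep(c(0, 1), 2))

# glm() may warn about fitted probabilities of 0 or 1 here; that warning
# is itself a symptom, so it is suppressed only to keep the output short.
fit <- suppressWarnings(
  glm(y ~ x, data = df1, family = binomial(link = "logit"))
)

fit$iter                      # number of Fisher scoring iterations
se <- sqrt(diag(vcov(fit)))   # standard errors of the coefficients
se / abs(coef(fit))           # ratios far above 1 are a red flag
min(fitted(fit))              # essentially 0 for the x = 0 row
```

A fitted probability that is numerically 0 or 1, together with standard errors hundreds of times larger than the estimates, points to the same separation problem as the zero cell in the table.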