Translate Logistic Regression from SAS to R

1.2k views Asked by At

Here's my issue of the day:

At the moment I'm teaching myself Econometrics and making use of logistic regression. I have some SAS code and I want to be sure I understand it well first before trying to convert it to R. (I don't have and I don't know SAS). In this code, I want to model the probability for one person to be an 'unemployed employee'. By this I mean "age" between 15 and 64, and "tact" = "jobless". I want to try to predict this outcome with the following variables: sex, age and idnat (nationality number). (Other things being equal).

SAS code :

/* Unemployment rate : number of unemployment amongst the workforce */

proc logistic data=census;
class sex(ref="Man") age idnat(ref="spanish") / param=glm;
class tact (ref=first);
model tact = sex age idnat / link=logit;
where 15<=age<=64 and tact in ("Employee" "Jobless");
weight weight;
format age ageC. tact $activity. idnat $nat_dom. inat $nationalty. sex $M_W.;

lsmeans sex / obsmargins ilink;
lsmeans idnat / obsmargins ilink;
lsmeans age / obsmargins ilink;
run;

This is a sample of what the database should looks like :

      idnat     sex     age  tact      
 [1,] "english" "Woman" "42" "Employee"
 [2,] "french"  "Woman" "31" "Jobless" 
 [3,] "spanish" "Woman" "19" "Employee"
 [4,] "english" "Man"   "45" "Jobless" 
 [5,] "english" "Man"   "34" "Employee"
 [6,] "spanish" "Woman" "25" "Employee"
 [7,] "spanish" "Man"   "39" "Jobless" 
 [8,] "spanish" "Woman" "44" "Jobless" 
 [9,] "spanish" "Man"   "29" "Employee"
[10,] "spanish" "Man"   "62" "Retired" 
[11,] "spanish" "Man"   "64" "Retired" 
[12,] "english" "Woman" "53" "Jobless" 
[13,] "english" "Man"   "43" "Jobless" 
[14,] "french"  "Man"   "61" "Retired" 
[15,] "french"  "Man"   "50" "Employee"

This is the kind of result I wish to get :

Variable    Modality    Value   ChiSq   Indicator
Sex         Women       56.6%   0.00001 -8.9%
            Men         65.5%       
Nationality 
            1:Spanish   62.6%       
            2:French    51.2%   0.00001 -11.4%
            3:English   48.0%   0.00001 -14.6%
Age 
            <25yo       33.1%   0.00001 -44.9%
        Ref:26<x<54yo   78.0%       
            55yo=<      48.7%   0.00001 -29.3%

(I interpret the above as follows: other things being equal, women have -8.9% chance of being employed vs men and those aged less than 25 have a -44.9% chance of being employed than those aged between 26 and 54).

So if I understand well, the best approach would be to use a binary logistic regression (link=logit). This uses references "male vs female"(sex), "employee vs jobless"(from 'tact' variable)... I presume 'tact' is automatically converted to a binary (0-1) variable by SAS.

Here is my 1st attempt in R. I haven't check it yet (need my own PC) :

### before using multinom function 
### change all predictors to factors and relevel 
recens$sex <- relevel(factor(recens$sex), ref = "Man")
recens$idnat <- relevel(factor(recens$idnat), ref = "spanish")  
recens$TACT <- relevel(factor(recens$TACT), ref = "employee")

### Calculations of the probabilities with function multinom, 
### formatted variables, and conditions with subset 
glm1 <- glm(TACT ~ sex + age + idnat, data=census, 
+ weights = weight, subset=age[(15<=recens$age|recens$age<=64)] & TACT %in% 
+ c("Employee","Jobless"), family=binomial())

My questions :

For the moment, it seems there are many functions to carry out a logistic regression in R like glm which seems to fit.

However after visiting many forums it seems a lot of people recommend not trying to exactly reproduce SAS PROC LOGISTIC, particularly the function LSMEANS functions. Dr Franck Harrel, (author of package:rms) for one.

That said, I guess my big issue is LSMEANS and its options Obsmargins and ILINK. Even after reading over its description repeatedly I can hardly understand how it works.

So far, what I understand of Obsmargin is that it respects the structure of the total population of the database (i.e. calculations are done with proportions of the total population). ILINK appears to be used to obtain the predicted probability value (jobless rate, employment rate) for each of the predictors (e.g. female then male) rather than the value found by the (exponential) model?

In short, how could this be done through R, with rms functions like lrm?

I'm really lost in all of this. If someone could explain it to me better and tell me if I'm on the right track it would make my day.

Thank you for your help and sorry for all the mistakes my English is a bit rusty.

Binh

1

There are 1 answers

3
IRTFM On

That is not a multinomial logistic regression problem, since the outcome is binary. Furthermore the output you desire appears to be a set of two-way tables. The author of rms is Frank Harrell (and he also happens to be the original author of Proc LOGISTIC.) Just using lrm in rms to produce a set of two way tables would seem to be a waste of energy. This is an example of its use in presenting a multivariate analysis:

 require(rms)
 lrm(tact ~ idnat+sex+as.numeric(age), data=dat)
  #----------
Logistic Regression Model

lrm(formula = tact ~ idnat + sex + as.numeric(age), data = dat)

                     Model Likelihood     Discrimination    Rank Discrim.    
                        Ratio Test            Indexes          Indexes       
Obs            15    LR chi2     15.19    R2       0.725    C       0.903    
 Employee       6    d.f.            4    g        3.583    Dxy     0.806    
 Jobless        6    Pr(> chi2) 0.0043    gr      35.981    gamma   0.806    
 Retired        3                         gp       0.420    tau-a   0.552    
max |deriv| 1e-04                         Brier    0.147                     

              Coef     S.E.   Wald Z Pr(>|Z|)
y>=Jobless     -9.2553 3.8673 -2.39  0.0167  
y>=Retired    -13.5303 5.2031 -2.60  0.0093  
idnat=french    1.3199 1.8969  0.70  0.4865  
idnat=spanish   1.7379 1.5479  1.12  0.2616  
sex=Woman      -0.0033 1.3792  0.00  0.9981  
age             0.2213 0.0849  2.61  0.0091  

To get teh full power of regression functions in pkg:rms, you will need to create datadist objects and set the dd option. This question is rather outside the boundary of what is usually acceptable in SO, since you don't really know what you are doing. You may want to consider posting followup questions on CrossValidated to work out your conceptual gaps in understanding.