Here's my issue of the day:
At the moment I'm teaching myself Econometrics and making use of logistic regression. I have some SAS code and I want to be sure I understand it well first before trying to convert it to R. (I don't have and I don't know SAS). In this code, I want to model the probability for one person to be an 'unemployed employee'. By this I mean "age" between 15 and 64, and "tact" = "jobless". I want to try to predict this outcome with the following variables: sex, age and idnat (nationality number). (Other things being equal).
SAS code :
/* Unemployment rate : number of unemployment amongst the workforce */
proc logistic data=census;
class sex(ref="Man") age idnat(ref="spanish") / param=glm;
class tact (ref=first);
model tact = sex age idnat / link=logit;
where 15<=age<=64 and tact in ("Employee" "Jobless");
weight weight;
format age ageC. tact $activity. idnat $nat_dom. inat $nationalty. sex $M_W.;
lsmeans sex / obsmargins ilink;
lsmeans idnat / obsmargins ilink;
lsmeans age / obsmargins ilink;
run;
This is a sample of what the database should looks like :
idnat sex age tact
[1,] "english" "Woman" "42" "Employee"
[2,] "french" "Woman" "31" "Jobless"
[3,] "spanish" "Woman" "19" "Employee"
[4,] "english" "Man" "45" "Jobless"
[5,] "english" "Man" "34" "Employee"
[6,] "spanish" "Woman" "25" "Employee"
[7,] "spanish" "Man" "39" "Jobless"
[8,] "spanish" "Woman" "44" "Jobless"
[9,] "spanish" "Man" "29" "Employee"
[10,] "spanish" "Man" "62" "Retired"
[11,] "spanish" "Man" "64" "Retired"
[12,] "english" "Woman" "53" "Jobless"
[13,] "english" "Man" "43" "Jobless"
[14,] "french" "Man" "61" "Retired"
[15,] "french" "Man" "50" "Employee"
This is the kind of result I wish to get :
Variable Modality Value ChiSq Indicator
Sex Women 56.6% 0.00001 -8.9%
Men 65.5%
Nationality
1:Spanish 62.6%
2:French 51.2% 0.00001 -11.4%
3:English 48.0% 0.00001 -14.6%
Age
<25yo 33.1% 0.00001 -44.9%
Ref:26<x<54yo 78.0%
55yo=< 48.7% 0.00001 -29.3%
(I interpret the above as follows: other things being equal, women have -8.9% chance of being employed vs men and those aged less than 25 have a -44.9% chance of being employed than those aged between 26 and 54).
So if I understand well, the best approach would be to use a binary logistic regression (link=logit). This uses references "male vs female"(sex), "employee vs jobless"(from 'tact' variable)... I presume 'tact' is automatically converted to a binary (0-1) variable by SAS.
Here is my 1st attempt in R. I haven't check it yet (need my own PC) :
### before using multinom function
### change all predictors to factors and relevel
recens$sex <- relevel(factor(recens$sex), ref = "Man")
recens$idnat <- relevel(factor(recens$idnat), ref = "spanish")
recens$TACT <- relevel(factor(recens$TACT), ref = "employee")
### Calculations of the probabilities with function multinom,
### formatted variables, and conditions with subset
glm1 <- glm(TACT ~ sex + age + idnat, data=census,
+ weights = weight, subset=age[(15<=recens$age|recens$age<=64)] & TACT %in%
+ c("Employee","Jobless"), family=binomial())
My questions :
For the moment, it seems there are many functions to carry out a logistic regression in R like glm
which seems to fit.
However after visiting many forums it seems a lot of people recommend not trying to exactly reproduce SAS PROC LOGISTIC
, particularly the function LSMEANS
functions. Dr Franck Harrel, (author of package:rms
) for one.
That said, I guess my big issue is LSMEANS
and its options Obsmargins
and ILINK
. Even after reading over its description repeatedly I can hardly understand how it works.
So far, what I understand of Obsmargin
is that it respects the structure of the total population of the database (i.e. calculations are done with proportions of the total population). ILINK
appears to be used to obtain the predicted probability value (jobless rate, employment rate) for each of the predictors (e.g. female then male) rather than the value found by the (exponential) model?
In short, how could this be done through R, with rms
functions like lrm
?
I'm really lost in all of this. If someone could explain it to me better and tell me if I'm on the right track it would make my day.
Thank you for your help and sorry for all the mistakes my English is a bit rusty.
Binh
That is not a multinomial logistic regression problem, since the outcome is binary. Furthermore the output you desire appears to be a set of two-way tables. The author of
rms
is Frank Harrell (and he also happens to be the original author of Proc LOGISTIC.) Just usinglrm
in rms to produce a set of two way tables would seem to be a waste of energy. This is an example of its use in presenting a multivariate analysis:To get teh full power of regression functions in pkg:rms, you will need to create datadist objects and set the dd option. This question is rather outside the boundary of what is usually acceptable in SO, since you don't really know what you are doing. You may want to consider posting followup questions on CrossValidated to work out your conceptual gaps in understanding.