Welcome to something new at SAEG: an ongoing series called 'Questions From You'. When you ask us a question, whether through LinkedIn or at our address (enduranalyst@yahoo.com), we will do all we can to answer it in a professional and timely way. And the rule is: you can feel safe asking us any of your analysis questions, no matter the topic and no matter how silly you may think the question is. One of the values we seek to uphold at SAEG is respect for you, respect for where you are on the learning curve in your career, and respect for you as a professional. In no way should you feel demeaned by asking us questions, and we hope we never make you feel uncomfortable asking them. Statistics can be a complicated science, and I've seen some people at very high levels treat those at lower levels in a disrespectful and sometimes dehumanizing way. Rest assured, you will be safe at SAEG. It is our promise to you.
Also, one caveat we have at SAEG: if we don't readily know the answer to your question, we will research it for you (and point you in the right direction as appropriate). Further, let us know whether we may use your name in the blog or whether you'd rather remain anonymous. Thank you!
Without further ado, I am happy to start 'Questions From You' with Episode 1.
Questions From You, Episode 1: On C-STAT and our thoughts on Multinomial Logistic Models
These questions came in from Jing Li, one of our followers:
1. C-STAT is a very commonly used criterion to
evaluate how well a predictive model performs. But I have sometimes faced
the situation where some variables improve the model fit statistics (for
example, the Hosmer-Lemeshow lack-of-fit statistic) but don't improve
the C-STAT or, even worse, decrease it a little bit. What would we say about
these variables? Should they be included in the model or not?
2. Do you think it is a good idea or a bad idea to consider multinomial
logistic regression as predictive models? I have come across some strong
criticism against that.
Our answer:
Kirk: Hi Jing. As far as #1 goes, I have not used C-STAT, but Ben has (Ben Nutter is one of our Partners at SAEG). Here was his answer to you:
Ben: The c-statistic is a great tool, but it isn’t one I consider when
evaluating a model’s fit. The purpose of the c-statistic is to measure
how well a model discriminates between an event and a non-event.
Sometimes, you can improve the c-statistic by degrading the traditional
fit of the model, but if your goal is prediction, that’s a sacrifice
worth making. This isn’t to say that we should ignore model fit when
attempting to predict outcomes—for instance, we definitely don’t want to
overfit the model—but we might tolerate some multicollinearity for the
sake of improving our prediction.
Ultimately, AIC, Hosmer-Lemeshow, and the number of degrees of freedom
relative to the number of events are better methods for evaluating fit
than the C-statistic itself.
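To make Ben's distinction a bit more concrete, here is a minimal sketch in plain Python. It computes the c-statistic as the share of (event, non-event) pairs in which the event received the higher predicted probability, with ties counted as one half. The outcomes and probabilities are made up for illustration, and the function name c_statistic is our own. The sketch also squares the predictions to show that the c-statistic depends only on how the predictions are ranked, while the average predicted probability (a rough calibration check) moves away from the observed event rate.

```python
from itertools import product


def c_statistic(y_true, y_score):
    """Share of (event, non-event) pairs ranked correctly; ties count as 0.5."""
    events = [s for y, s in zip(y_true, y_score) if y == 1]
    non_events = [s for y, s in zip(y_true, y_score) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for e, n in product(events, non_events):
        if e > n:
            concordant += 1.0
        elif e == n:
            concordant += 0.5
    return concordant / (len(events) * len(non_events))


# Hypothetical outcomes (1 = event) and predicted probabilities.
y = [1, 0, 1, 0, 1, 0, 0, 1]
p = [0.9, 0.2, 0.7, 0.4, 0.35, 0.1, 0.6, 0.8]

c_original = c_statistic(y, p)

# Squaring is monotone on [0, 1], so the ranking of patients is unchanged...
p_squared = [v ** 2 for v in p]
c_squared = c_statistic(y, p_squared)

# ...but the average predicted probability drifts from the observed rate.
event_rate = sum(y) / len(y)
mean_p = sum(p) / len(p)
mean_p_squared = sum(p_squared) / len(p_squared)

print(c_original, c_squared)
print(event_rate, mean_p, mean_p_squared)
```

The two c-statistics match even though the squared predictions are clearly worse calibrated, which is one way to see why discrimination and fit are separate questions.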
Kirk: As far as #2, can I ask in what context you are using the multinomial
logistic model? Ben has not worked with them; however, I have seen them used
before for Asset/Liability models. In my readings on them, I understand
that they are very similar to logistic models, except that they
predict an outcome with more than two categories. Let me know what your
context is and I can better answer your question.
Jing's reply...
Kirk,
Thanks for your reply.
The predictive modelling that I am doing is related to hospital
admission. So the dependent variable would be based on a certain threshold
of days. For example, whether the patient would get admitted within 30 days
or not.
By multinomial, I originally was thinking of creating the dependent
variable as categories: 0-30 days, 30-60 days, etc. But someone has criticized
this approach by pointing out that the nature of time is not suitable for
this type of categorization. Please let me know what you think.
Thanks a lot for your help, and feel free to post this question to the forum if you feel that is more appropriate.
Best regards,
Jing
Our reply...
Jing, here's a start. I am also checking with other people and will get back to you...
I was reading this article about multinomial logit models, and it suggests
that the dependent variable should have 'no natural order'. In your case, I
think the categories do have one, because there is an order to the time and
someone can go from one time group to another.
http://kurt.schmidheiny.name/teaching/multinomialchoice2up.pdf
Further, it frames the model in terms of someone 'choosing' based on
characteristics (or categories). Other things I've read talk about
'placement' into a membership. From what you are doing, it seems these are
day groups that people 'fall into', rather than groups they choose or are placed into.
Further, something I find useful whenever I'm trying to decide whether
or not to use a given model or how to prepare its variables is to look
at the assumptions. Are any of the assumptions violated if you pursue a
certain path? If you look at the assumptions of multinomial logistic
regression, for example (check out this article:
http://www.unt.edu/rss/class/Jon/Benchmarks/MLR_JDS_Aug2011.pdf
), one is the assumption of independence among the
dependent variable choices. The article says: "This assumption states that
the choice of or membership in one category is not related to
the choice or membership of another category (i.e., the dependent
variable). The assumption of independence can be tested with the
Hausman-McFadden test."
I found a nice article about this Hausman-McFadden test:
http://home.comcast.net/~alan.clayton-matthews/pp745/IIA_Hausman_Test.pdf
Basically, it's a test that, in that article's example, "tests the null
hypothesis that the inclusion of the unskilled occupation category does not
change the odds ratio of the other pairs of choices". I would recommend you
try this test. My guess is that whether someone falls into 0-30 affects
whether they can fall into 30-60 or 60-90, and that this would go against
the assumption the multinomial logit model requires of the dependent variable.
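To show what that assumption (often called independence of irrelevant alternatives, or IIA) builds into the model, here is a small sketch in plain Python. It is not the Hausman-McFadden test itself, only an illustration of the property the test checks. The linear predictor values are made-up numbers for one hypothetical patient, and the function name mnl_probabilities is our own. Under a multinomial logit, the odds between two categories come out the same whether or not a third category is in the choice set.

```python
import math


def mnl_probabilities(linear_predictors):
    """Multinomial logit probabilities for a dict of category -> linear predictor."""
    largest = max(linear_predictors.values())
    # Subtracting the maximum keeps exp() numerically stable.
    exps = {k: math.exp(v - largest) for k, v in linear_predictors.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


# Hypothetical linear predictors for one patient (illustrative values only).
predictors = {"0-30 days": 0.4, "30-60 days": -0.1, "60-90 days": -0.7}

full = mnl_probabilities(predictors)
odds_full = full["0-30 days"] / full["30-60 days"]

# Drop the third category and refit the probabilities on the remaining two.
two_only = {k: v for k, v in predictors.items() if k != "60-90 days"}
restricted = mnl_probabilities(two_only)
odds_restricted = restricted["0-30 days"] / restricted["30-60 days"]

print(odds_full, odds_restricted)
```

The model forces these two odds to agree by construction. The Hausman-McFadden test asks whether the data agree as well, by comparing coefficients estimated with and without a category. With time-ordered groups, it is plausible that they would not.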
You might consider using a different modeling technique that does not
have this stringent assumption. For example, I would suggest going the
decision tree route (e.g. CHAID, CRT, etc.), as you may have better luck.
Jing's reply...
Kirk,
Thank you very much for your comments. They are definitely helpful.
Regarding the multinomial regression, you are right that membership in
our categories is NOT independent, because the categories
follow a time sequence. Survival analysis or something similar might
be a more appropriate option.
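For readers wondering what the survival analysis route looks like, here is a minimal Kaplan-Meier sketch in plain Python. It is only one basic survival technique, swapped in by us for illustration, and not necessarily what Jing ended up using. The data are made up, and the function name kaplan_meier is our own. The idea is to treat days-to-admission as a time rather than a set of bins, and to let patients whose follow-up ended without an admission count as 'censored' instead of forcing them into a category.

```python
def kaplan_meier(durations, observed):
    """Return [(time, probability of no admission yet)] at each admission time.

    durations: days until admission, or until follow-up ended
    observed:  True if an admission was seen, False if censored
    """
    if len(durations) != len(observed):
        raise ValueError("durations and observed must have the same length")
    event_times = sorted({t for t, seen in zip(durations, observed) if seen})
    survival = 1.0
    curve = []
    for t in event_times:
        # Everyone still under follow-up at time t is at risk.
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, seen in zip(durations, observed) if seen and d == t)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical patients: days observed, and whether an admission occurred.
days = [12, 25, 25, 40, 58, 75, 90, 90]
admitted = [True, True, False, True, False, True, False, False]

curve = kaplan_meier(days, admitted)
for day, prob in curve:
    print(day, round(prob, 3))
```

Each row gives the estimated share of patients not yet admitted by that day, so a 30-day or 60-day admission probability can be read off the same curve without treating the time groups as unrelated categories.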
Aside: upon further research, I was able to find, through a library database, an academic article for Jing that looked to be about the exact same problem she was trying to model. Here is the reference for that article, for those interested...
This leads me to an Effective Analyst Thought...
If you're modeling something new or have questions about your approach, try delving into academic articles! If you find a good source of vetted articles (e.g. I really like JSTOR), chances are that you can find someone who has done something similar to what you are doing, and their work can provide enlightenment on your path forward.
