Thursday, March 7, 2013

Questions From You, Episode 1

Welcome to something new at SAEG: an ongoing 'Questions From You' series.  When you ask us a question, whether through LinkedIn or at our address (enduranalyst@yahoo.com), we will do all we can to answer it in a professional and timely way.  And the rule is...you can feel safe asking us any of your analysis questions, no matter the topic and no matter how silly you may think the question is.  One of the values we seek to uphold at SAEG is respect for you, respect for where you are on the learning curve in your career, and respect for you as a professional.  In no way should you feel demeaned by asking us questions, and we hope that we never make you feel uncomfortable asking them.  Statistics can be a complicated science, and I've seen some people at very high levels treat those at lower levels in a disrespectful and sometimes dehumanizing way.  Rest assured, you will be safe at SAEG...it is our promise to you.

Also, the caveat that we have at SAEG is that if we don't readily know the answer to your question, we will research it for you (and point you in the right direction as appropriate).   Further, let us know whether we may use your name in the blog or whether you'd rather remain anonymous.  Thank you!

Without further ado, I am happy to start 'Questions From You' with Episode 1.

Questions From You, Episode 1:  On C-STAT and our thoughts on Multinomial Logistic Models

These questions came in from Jing Li, one of our followers:

1. C-STAT is a very commonly used criterion to evaluate how well a predictive model performs. But I have sometimes faced the situation where some variables improve the model fit statistics (for example, the Hosmer-Lemeshow lack-of-fit statistic) but don't improve the C-STAT or, even worse, decrease it a little. What would we say about these variables? Should they be included in the model or not?

2. Do you think it is a good idea or a bad idea to consider multinomial logistic regression as predictive models? I have come across some strong criticism against that. 


Our answer:
 
Kirk:  Hi Jing. As far as #1 goes, I have not used C-STAT, but Ben has (Ben Nutter is one of our Partners at SAEG). Here was his answer to you:

Ben:  The c-statistic is a great tool, but it isn’t one I consider when evaluating a model’s fit. The purpose of the c-statistic is to measure how well a model discriminates between an event and a non-event. Sometimes, you can improve the c-statistic by degrading the traditional fit of the model, but if your goal is prediction, that’s a sacrifice worth making. This isn’t to say that we should ignore model fit when attempting to predict outcomes—for instance, we definitely don’t want to overfit the model—but we might tolerate some multicollinearity for the sake of improving our prediction.

Ultimately, AIC, Hosmer-Lemeshow, and the number of degrees of freedom relative to the number of events are better methods for evaluating fit than the C-statistic itself.
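
To make Ben's point about discrimination concrete, here is a minimal Python sketch (ours, not Ben's code) of what the c-statistic measures: the share of event/non-event pairs in which the event received the higher predicted probability, with ties counted as half. The function name `c_statistic` and the toy data are made up for illustration. Because only the ordering of the predictions matters, squaring every predicted probability changes how well calibrated they are but leaves the c-statistic unchanged, which hints at why a fit statistic and the c-statistic can disagree.

```python
from itertools import product


def c_statistic(y_true, y_score):
    """Share of (event, non-event) pairs where the event has the higher score.

    Tied scores count as half a concordant pair.
    """
    events = [s for y, s in zip(y_true, y_score) if y == 1]
    non_events = [s for y, s in zip(y_true, y_score) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")

    concordant = 0.0
    for e, n in product(events, non_events):
        if e > n:
            concordant += 1.0
        elif e == n:
            concordant += 0.5
    return concordant / (len(events) * len(non_events))


# Made-up outcomes (1 = event) and predicted probabilities.
y = [0, 0, 1, 1, 0, 1]
p = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]

c_original = c_statistic(y, p)

# A monotone transformation keeps the ranking, so the c-statistic is identical
# even though the predicted probabilities themselves are now quite different.
c_squared = c_statistic(y, [x ** 2 for x in p])

print(c_original, c_squared)
```

In this toy data, 8 of the 9 event/non-event pairs are ordered correctly, so both calls return 8/9.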


Kirk:  As far as #2--can I ask in what context you are using the multinomial logistic model? Ben has not used them; however, I have seen them used before for Asset/Liability models. From my reading, I understand that they are very similar to logistic models, except that the outcome has more than two categories. Let me know what your context is and I can better answer your question.

Jing's reply...

Kirk,
Thanks for your reply.

The predictive modelling that I am doing is related to hospital admission. So the dependent variable would be based on a certain threshold of days. For example, whether the patient would get admitted within 30 days or not.

By multinomial, I originally was thinking to create the dependent variable as 0-30 days, 30-60 days, etc. But someone has criticized this approach by pointing out that the nature of time is not suitable for this type of categorization. Please let me know what you think.

Thanks a lot for your help and feel free to post this question to the forum if you feel more appropriate.

Best regards,
Jing  


Our reply...

Jing--here's a start--am also checking with other people and will get back to you...

I was reading this article about multinomial logit models, and it suggests that the categories of the dependent variable have 'no natural order'. In your case, I think they do, because there is an order to the time groups and someone can go from one time group to another.
http://kurt.schmidheiny.name/teaching/multinomialchoice2up.pdf

Further, it puts it in terms of someone 'choosing' based on characteristics (or categories). Other things I've read talk about 'placement' into a membership. From what you are doing, it seems these are day groups that people 'fall into', rather than groups they choose or are placed into.

Further, something I find useful whenever I'm trying to decide whether or not to use a given model or how to prepare its variables is to look at the assumptions. Are any of the assumptions violated if you pursue a certain path? If you look at the assumptions of the multinomial logistic regression for example (check out this article:
http://www.unt.edu/rss/class/Jon/Benchmarks/MLR_JDS_Aug2011.pdf ), one is the assumption of independence among the dependent variable choices. It says that "This assumption states that the choice of or membership in one category is not related to the choice or membership of another category (i.e., the dependent variable). The assumption of independence can be tested with the Hausman-McFadden test."

I found a nice article about this Hausman-McFadden test:
http://home.comcast.net/~alan.clayton-matthews/pp745/IIA_Hausman_Test.pdf

Basically, it's a test that (in that article's example) "tests the null hypothesis that the inclusion of the unskilled occupation category does not change the odds ratio of the other pairs of choices". I would recommend you try this test. My guess is that since someone having been in 0-30 would affect someone going into 30-60, and would also affect someone going into 60-90, this would go against the independence assumption that the multinomial logit model requires of the dependent variable.
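
To show what that independence assumption means in practice, here is a small Python sketch. It is not the Hausman-McFadden test itself, only an illustration of the property the test checks. The helper `mnl_probs` and the linear-predictor values are hypothetical numbers chosen for the example. Under a multinomial logit model, the odds between two categories come out the same whether or not a third category is available, and that is exactly what seems implausible for consecutive time groups.

```python
import math


def mnl_probs(utilities):
    """Multinomial logit (softmax) probabilities from a dict of linear predictors."""
    largest = max(utilities.values())  # subtract the max for numerical stability
    exp_u = {k: math.exp(v - largest) for k, v in utilities.items()}
    total = sum(exp_u.values())
    return {k: v / total for k, v in exp_u.items()}


# Hypothetical linear predictors for one patient (made-up values).
full = mnl_probs({"0-30": 0.2, "30-60": -0.5, "60-90": -1.0})

# Same patient, but with the 60-90 category removed from the choice set.
reduced = mnl_probs({"0-30": 0.2, "30-60": -0.5})

odds_full = full["0-30"] / full["30-60"]
odds_reduced = reduced["0-30"] / reduced["30-60"]

# The model forces these two odds to be equal: exp(0.2 - (-0.5)) in both cases.
print(odds_full, odds_reduced)
```

If dropping a category from real data noticeably shifts the estimated odds among the remaining categories, that is evidence against the assumption.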

You might consider using a different modeling technique that does not have this stringent assumption. For example, I would suggest going the decision tree route (e.g. CHAID, CRT, etc.), as you may have better luck.


Jing's reply...

Kirk,
Thank you very much for your comments. They are definitely helpful.
Regarding the multinomial regression, you are right that our categories are NOT independent because they follow a time sequence. Survival analysis or something similar might be a more appropriate option. 
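
For readers curious about the survival analysis route Jing mentions, here is a minimal Python sketch of the Kaplan-Meier estimator, one of the simplest survival techniques. It treats days to admission as a continuous time rather than cutting it into groups, and it handles patients who were never admitted during follow-up (censored). The function name `kaplan_meier` and the data are invented for illustration; this is not the model Jing ended up using.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time in days for each patient
    observed:  1 if the patient was admitted at that time, 0 if censored
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve


# Made-up follow-up data: days until admission or end of follow-up.
days = [12, 25, 25, 40, 58, 75, 90, 90]
admitted = [1, 1, 0, 1, 1, 0, 0, 0]

curve = kaplan_meier(days, admitted)

# Estimated probability of being admitted within 30 days:
# one minus the survival probability at the last event time on or before day 30.
survival_at_30 = [s for t, s in curve if t <= 30][-1]
prob_admitted_within_30 = 1.0 - survival_at_30

print(curve)
print(prob_admitted_within_30)
```

The same curve answers the 30-day, 60-day, and 90-day questions at once, without assuming that the time groups are unrelated to one another.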


Aside--upon further research, I was able to find an academic article for Jing through a library database that looked to be about the exact same problem she was trying to model.  Here is the reference for that article, for those interested...









This leads me to an Effective Analyst Thought...

If you're modeling something new or have questions about your approach, try delving into academic articles!  If you find a good source of vetted articles (I really like JSTOR, for example), chances are that you can find someone who has done something similar to what you are doing, and their work can shed light on your path forward.


 
 
