Saturday, January 26, 2013

Using the AIC to Fit a Continuous Variable for a Logistic Model

One of the assumptions that must be met for continuous variables in a logistic model is linearity with the transformed dependent variable (the log odds, or logit)*.  A useful method, shared with me by one of our SAEG partners (Benjamin Nutter) and one I have since tested and found very useful, is to use the AIC to compare various transformations of the variable against the dependent variable in order to find the best-fitting relationship.  Here is the method we worked out together...

First run a simple logistic regression of the continuous variable against the dependent 1,0 variable.
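For readers who want to try this step outside a statistics package, here is a minimal Python sketch.  It is not the software we used: the function names (fit_simple_logistic, predict_probabilities) and the toy data are made up for illustration, and the fit is a plain Newton-Raphson loop on the log-likelihood rather than any particular package's routine.

```python
import math

def fit_simple_logistic(x, y, iterations=25):
    """Fit P(y=1) = 1 / (1 + exp(-(b0 + b1*x))) by Newton-Raphson.

    Illustrative helper only; returns the intercept b0 and slope b1.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Accumulate the gradient (g) and the 2x2 information matrix (h).
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if abs(det) < 1e-12:
            break  # information matrix is (nearly) singular; stop updating
        # Solve the 2x2 system h * step = g.
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < 1e-10:
            break
    return b0, b1

def predict_probabilities(x, b0, b1):
    """Predicted probability of y=1 for each observation."""
    return [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]

# Toy data (made up): a continuous variable and a 1,0 dependent variable.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_simple_logistic(x, y)
probs = predict_probabilities(x, b0, b1)
for xi, pi in zip(x, probs):
    print(f"x={xi:4.1f}  predicted probability={pi:.3f}")
```

The (x, predicted probability) pairs printed at the end are what you would then graph.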

Then, graph the predicted probabilities against the variable.  Here are some examples of the shapes my output produced (drawn by connecting the observations with a line):




As you can see, the shape of the relationship between the variable and the predicted probability (which you obtain from the simple logistic regression) can vary.  Further, what is not shown here is that the concentration of the observations along a given shape can differ.  So, for example, one of these shapes may look like this when you look at the individual observations along it (I've included the connecting line just as a visual aid):



Here, most of the observations form a fairly straight line up to a point where there is a deviation from the pattern most observations follow.  If you produce a Q-Q plot and a distribution plot of the variable, this deviation can also be seen there (which is evidence of an outlier).  While removing outliers to improve a variable's linearity is not discussed in this article, taking out the outliers before running the next steps is worth considering.

The main thing to observe is that if the shape of the relationship between the predicted probability and the continuous variable is not a straight line (most I saw were bowed, even if slightly), a transformation of the variable can be used to improve the fit.

After running several diagnostics to understand the range of a given continuous variable, I would create a package of transformations to try.  For variables with no negative observations (all positive, possibly with some zeroes) I might try 'square root', 'log base 10', 'natural log', 'reciprocal', 'second order polynomial' and '3rd order polynomial'.  Just remember that where the variable is 0, the logs and the reciprocal are undefined, so you have to set the result to zero yourself (your software will likely make the result missing).  For variables that had negative and positive values (with some 0's) I would choose a more limited package, say 'reciprocal', 'second order polynomial', and '3rd order polynomial'.  Similarly, for the reciprocal transformation I would make sure that any zero set to missing by the calculation was reset to 0.
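As a rough Python sketch of such a package (the helper names safe and build_transformations are hypothetical, not from any library), the zero handling described above might look like this.  The polynomial entries follow the footnoted definition literally, as a single combined variable; that is my reading of it, not a general recommendation.

```python
import math

def safe(fn, value):
    """Apply fn to value, returning 0.0 where the result is undefined
    (log of 0, 1 divided by 0) instead of a missing value."""
    try:
        return fn(value)
    except (ValueError, ZeroDivisionError):
        return 0.0

def build_transformations(x):
    """Return a dict mapping transformation name to the transformed values."""
    # Limited package: usable whether or not the variable has negative values.
    package = {
        "reciprocal": [safe(lambda v: 1.0 / v, v) for v in x],
        "second order polynomial": [v + v ** 2 for v in x],
        "third order polynomial": [v + v ** 2 + v ** 3 for v in x],
    }
    # Roots and logs are only added when there are no negative observations.
    if all(v >= 0 for v in x):
        package["square root"] = [math.sqrt(v) for v in x]
        package["log base 10"] = [safe(math.log10, v) for v in x]
        package["natural log"] = [safe(math.log, v) for v in x]
    return package

# Made-up example values: one non-negative variable, one with mixed signs.
positive_package = build_transformations([0.0, 1.0, 4.0, 100.0])
mixed_package = build_transformations([-2.0, 0.0, 2.0])

print(sorted(positive_package))
print(sorted(mixed_package))
```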

Next, after creating the transformations, I would run each through separate simple logistic regressions (one for each transformation), this time keeping my residuals.  With the residuals, you can calculate the AIC using this formula**:

AIC = n * Ln(RSS/n) + 2k

Where...
n=number of total observations in your dataset
RSS is derived by just squaring each residual and coming up with its sum (the residual sum of squares)
k=the number of parameters in your model (in this testing case it is two, the independent transformation variable and the constant)
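The formula above can be sketched in a few lines of Python.  The function name aic_from_residuals and the residual values are made up for illustration, and I am assuming each residual is the observed 1,0 value minus the predicted probability.

```python
import math

def aic_from_residuals(residuals, k=2):
    """AIC = n * ln(RSS / n) + 2k, computed from a list of residuals."""
    n = len(residuals)                   # number of observations
    rss = sum(r * r for r in residuals)  # residual sum of squares
    return n * math.log(rss / n) + 2 * k

# Made-up residuals from one hypothetical simple logistic regression.
residuals = [0.5, -0.5, 0.5, -0.5]
aic = aic_from_residuals(residuals, k=2)
print(round(aic, 4))
```

Here n = 4 and RSS = 1.0, so the result is 4 * ln(0.25) + 4.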

Once you have run an AIC for each transformation, the one with the best fit is where the AIC is minimized.  You will also want to run the AIC for the non-transformed variable as a way of comparison.  Here are some sample results and the transformation chosen based on the results:




Transformation chosen: 2nd order polynomial***
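The comparison itself is easy to automate.  Below is a small Python sketch of the selection step; the residuals are invented purely to show the mechanics (they are not the results shown above), and the names aic_from_residuals and residuals_by_candidate are hypothetical.

```python
import math

def aic_from_residuals(residuals, k=2):
    """AIC = n * ln(RSS / n) + 2k, as in the formula given earlier."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return n * math.log(rss / n) + 2 * k

# Invented residuals, one list per candidate regression.  The untransformed
# variable is included so there is a baseline to compare against.
residuals_by_candidate = {
    "untransformed": [0.6, -0.4, 0.5, -0.5, 0.4, -0.6],
    "square root": [0.5, -0.4, 0.4, -0.5, 0.4, -0.5],
    "natural log": [0.5, -0.5, 0.5, -0.5, 0.5, -0.5],
}

aic_by_candidate = {
    name: aic_from_residuals(res) for name, res in residuals_by_candidate.items()
}

# The best fit is the candidate with the smallest AIC.
best = min(aic_by_candidate, key=aic_by_candidate.get)

for name, value in sorted(aic_by_candidate.items(), key=lambda item: item[1]):
    print(f"{name:15s} AIC = {value:.3f}")
print("Chosen:", best)
```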



Useful References
* Discovering Statistics Using SPSS, 3rd edition, by Andy Field, p. 273, assumptions of logistic regression: 1) Linearity 2) Independence of errors 3) Multicollinearity (or rather, the absence of multicollinearity in your data)
**If you want a source for this formula, I found it in this presentation, page 9:
http://www4.ncsu.edu/~shu3/Presentation/AIC.pdf

*** A second order polynomial transformation is simply X + X^2 where X=the observed variable.  A third order polynomial is X+X^2+X^3.  Note that I do not know how to write the notation for squared and cubed in this blog, so the ^ implies 'raised to the power of'

1 comment:

  1. A practically useful guidance for fitting continuous variables
