Thursday, April 19, 2012

The more data from different sources you get, the better you can understand the picture

 By Kirk Harrington

Let's say I give you part of a picture...





Analyzing this piece alone you could say that it has dark colors, varying between black and brown.

If I gave you more....











You would see there's more to the picture.  There is some sort of extension coming from the darkness, forming a tubular pattern with something wrinkled...is this wood...is this fabric...you don't really know because you can't see beyond it.

But if I gave you more...












You would see that the tubular extension is actually attached to hands...so you know it's fabric now and that the picture must be of a person.  But are these a woman's hands or a man's?  Where is this person?  Without more information, you wouldn't know.  You can also tell, more clearly than before, that this is a painting.

Let me give you more...














This picture certainly gives you more.  The person looks more like a woman than a man, especially since the hair falls below the neckline.  And it looks like this person is outside (given what look to be mountains and a path in the background)...or could they be on a balcony?  Is this person happy or sad?  You can't tell without seeing more.

Now, if I give you the full picture...
 ...you would recognize it as the famous Mona Lisa...posing happily for Leonardo da Vinci.  Or would you really know she was posing for him, since you don't see him?  You'd have to do more research to find out.  You've only heard that, and you ought to verify it with outside data.

The point of this exercise is to show that the more data you have about something, the better you know what it really looks like and the better you can predict what it is.  The more data sources you can draw from, the more powerful a view you get of what you're trying to predict.

For example...

With credit risk data, you could look at credit scores alone...but add customer behavior, demographics, geography, and economic environment, and you get a clearer picture of what might be causing a default or delinquency.

With marketing data, you could learn something about a customer's age and income, but what about their preferences, where they typically shop, their shopping behavior and its seasonal patterns, and their family situation (e.g., do they have children, are they married)?  If you're trying to predict whether they will purchase your product, the more information you have about them from various sources, the better you can identify the factors that affect that purchase.
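
To make the credit risk example concrete, here is a minimal sketch of pulling several sources together into one record per customer. Everything in it is an assumption made for illustration: the customer ids, the field names, and the helper combine_sources are hypothetical, and a real project would more likely join tables in a database or a data-frame library.

```python
def combine_sources(scores, *other_sources):
    """Merge per-customer dicts into one feature record per customer.

    `scores` maps customer id -> credit score. Each item in `other_sources`
    maps customer id -> dict of extra fields. A customer missing from a
    source simply lacks those fields in the combined record.
    """
    combined = {}
    for customer_id, score in scores.items():
        record = {"credit_score": score}
        for source in other_sources:
            record.update(source.get(customer_id, {}))
        combined[customer_id] = record
    return combined


# Hypothetical data from three separate sources.
credit_scores = {"A100": 710, "A200": 640}
behavior = {"A100": {"late_payments_12m": 0}, "A200": {"late_payments_12m": 3}}
geography = {"A100": {"state": "UT"}}

features = combine_sources(credit_scores, behavior, geography)
print(features["A100"])
# {'credit_score': 710, 'late_payments_12m': 0, 'state': 'UT'}
```

Notice that customer A200 has no geography record, so the combined view for that account is thinner. Gaps like that are worth knowing about before you model.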

~~~~~~~~~~~
Effective Analyst Thought....

Surprise your client or customer with a sense of humor.  A little ice-breaking can go a long way towards them accepting you as a valued co-worker AND it can go a long way towards them accepting your analysis as valuable.

Wednesday, April 11, 2012

Dependent variable normalization 'in time'


 By Kirk Harrington

The dependent variable in a model is extremely important; however, its accuracy can sometimes be overlooked (particularly when it involves timing and is meant to explain behavior).  What is predicted is a key part of the model, and it drives the results that ultimately affect any predictions that come out.  An incorrectly constructed dependent variable can also affect the accuracy of the independent-variable estimates built from it.  Consider a dependent variable for a default model (logistic time series):



If your objective is to explain account behavior leading up to the end of the behavior cycle, that end point is the best timing for the dependent variable (in this example, it could be the last time an account reaches 90 days past due before it starts the default cycle of >90 days).  This point varies by account, but the independent variables leading up to it are the most reflective of the account's behavior at that 'time'.

If you take the dependent variable farther out, say to when an account falls off the books, the timing is in neither the account's control nor the organization's control (the organization doing the charge-off), and any appended independent variables meant to predict the dependent variable would have trouble coalescing around positive and negative correlations with it.

After the point of 'no control' (in this case, when an account falls off the books during the default cycle), many things can happen.  There can be a 'blitz' of charge-offs at the end of a given month, quarter, or year.  A charge-off can take a long time to finally happen because of legal matters and processing.  The organization may not have a set policy for charge-offs, so the timing of when one happens can vary widely.  'Normalizing' the dependent variable around the end of behavior cycle point, rather than when the account 'falls off the books', will improve your model's ability to capture account behavior before it reaches the point of 'no control'.
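
As a rough sketch of this normalization, the function below finds the end of behavior cycle point for one account. It rests on my own assumptions rather than a fixed rule: the account history is a date-sorted list of (observation date, days past due) pairs, the default cycle is the final uninterrupted run of observations above 90 days, and the name end_of_behavior_cycle and the sample dates are hypothetical.

```python
from datetime import date


def end_of_behavior_cycle(history, threshold=90):
    """Return the date of the last observation at or below `threshold` days
    past due before the final uninterrupted run above it.

    `history` is a list of (observation_date, days_past_due) sorted by date.
    Returns None if the account does not end in a default cycle, or if there
    is no observation before the default cycle starts.
    """
    if not history or history[-1][1] <= threshold:
        return None  # account is not in a default cycle at the end
    # Walk backwards through the final run of observations above the threshold.
    i = len(history) - 1
    while i >= 0 and history[i][1] > threshold:
        i -= 1
    if i < 0:
        return None  # the whole history is inside the default cycle
    return history[i][0]


# Hypothetical month-end history for a single account.
history = [
    (date(2011, 1, 31), 0),
    (date(2011, 2, 28), 30),
    (date(2011, 3, 31), 60),
    (date(2011, 4, 30), 90),
    (date(2011, 5, 31), 120),
    (date(2011, 6, 30), 150),
]
# Hypothetical charge-off date, driven by processing rather than behavior.
charge_off_date = date(2011, 12, 31)

anchor = end_of_behavior_cycle(history)
print(anchor)                           # 2011-04-30
print((charge_off_date - anchor).days)  # 245
```

In this made-up account, the charge-off lands 245 days after the behavior cycle ends. A dependent variable timed to the charge-off would pick up that delay instead of the account's behavior.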

~~~
Effective Analyst Thought:
If you are in a position to consult with a client as an analyst, don't just ask them how they want something measured and leave it unquestioned (especially if the person you are dealing with does not have a modeling background).  Ask more questions, learn their data a bit more, and work together with the client to determine the best measurement that is statistically sound.  It is better to hash this out in the beginning than later, after the model is built and already being used.