Tuesday, February 11, 2014

Take time to write out your modeling process

Something that has helped me, particularly when answering the question 'What is your modeling process?', is to have the process written out.  That way, you can use it as a guide not only when you talk to someone about your process, but also as a reference while you are modeling.  Further, as I have learned and understood more about modeling in my field (since those days in class at SUNY UB), my process has improved and evolved with me.  Keeping a record of my process and updating it regularly helps me keep it fresh and vibrant when I need it the most.

So, you might be interested in knowing what my process is?  I present it to you here with the understanding that while my process may not be your process...for what I do in financial services it has served me well.

1--Decide what the purpose of your model is

This speaks not only to your dependent variable, but also the question that you're hoping a model you create will answer.  If you are creating a model for another party (i.e. a business line you serve), this would be a good time to meet with them to understand what they want the model for.  In working with your business line, don't be afraid to ask lots of questions.  I have found that the better picture you can get of what you are trying to do for your business line (or the question you are trying to answer) the more confidence they will have in what you are doing for them, and the end-product will be more useful and pertinent to them.

2--Decide on the type of model/approach

To me, choosing the type of model is dependent on the dependent variable you are trying to predict.  Is your dependent variable continuous or binary?  Are you trying to predict one or more outcomes?  Is the final report you are providing results for segmented in any way?  Do you have enough observations to create a meaningful model at all, or do you need to utilize a more trended/averages approach?  If a regression model is not necessary, is there a mathematical formula that could be of help to solve a problem?

I'd like to share an example in the spirit of the last question.  Once I was asked to provide analysis to understand if increasing money spent on a given campaign would result in more deposits coming in.  The answer was needed quickly so that my business line could get the necessary funds to do their promotion.  Once I knew what inputs I had and an understanding of what would answer it, I was able to choose a method (in this case I used a calculation of multiplier effect from Economics) to answer their question.  They were happy with the results and their funds were secured for the promotion.  Could I have chosen a more complex method?  Of course I could have.  But, knowing the deadlines and how soon they needed the answer led me down the road to a much more simple, yet meaningful approach.

3--Set reasonable deadlines, adjust your approach accordingly

In a textbook, they may not talk a lot about this step.  And for good reason.  In academics it may seem that you have all the time in the world to build a model.  In the movie Indiana Jones and the Last Crusade, Indiana's father (played by Sean Connery) dedicated his whole life to the pursuit and research of the Holy Grail. 

In business it's just not like that.  The business lines you serve have deadlines and (even if it's not apparent to you) you are in competition with other organizations and analysts trying to keep ahead of the analysis and success curve.  And, whether you realize it or not, how long you take to build a model can tell others how effective you are as an analyst compared to those you work or compete with.  This is another good reason to write out your process and to adjust and improve it over time so you gain efficiency.  Further, if you followed step 1 well, you will know when your model is needed and just how much time and effort you need to put into it to bring it to light.

Let me provide an example...Once I was tasked to create a model in anticipation of regulator visits to our organization.  I did not have a lot of time.  Plus, I was new to the data, new to the software used to create the model (I had used a different modeling program prior) and it was unclear how much data I really had access to.  In order to set a reasonable timeline and decide if my approach was feasible (I had originally intended to do a logistic regression), I first did a survey of the data available to me that would most likely be useful to predict my dependent variable.  As I surveyed the internal data at my disposal, I noticed a lot of blank and incomplete data.  With this understanding, I expanded my search for data to complement it; I was able to access data from a third party vendor that had proven useful to other analysts in my group.  Then,
  • I knew I had to create predictions for two specific groups that would be used as an input to a calculation downstream
  • I knew that the CHAID approach dealt well with missing values and did not require a lot of time intensive data preparation (like the logistic regression did)
  • I knew I could create CHAID quickly and it would be palatable to the end users in my organization
  • With this knowledge, I was able to create a reasonable timeline and an improved approach
Now, some might accuse me of 'taking the easy way out', but I don't see it this way at all.  There have been times when a more complicated approach is necessary...but in this case, the CHAID approach was realistic and still meaningful.  Plus, what was important is that I made my deadline and I made my organization and the regulators happy.

4--Prepare your dependent and independent variables, Discovery phase

This is my favorite part of the model building process.  This is the time to discover, to understand, to build and create what your model will ultimately be.  This can also be the part where you can see the weaknesses and limitations of your model (particularly in terms of how well you might be able to explain your dependent variable).

Here are some ideas/thoughts for this step:

Load variables into the model that make sense or that have been seen to have a positive or negative correlation with the dependent variable.  This could take some research and really trying to understand what predicts what you are trying to answer.  Once I know what data I have access to, I typically run several significance tests (e.g. Mann-Whitney, Chi-square, t-test, etc.) or correlation tests to get a sense of what may be useful and what is not.  Plus, I can eliminate variables that I know will not be predictive (e.g. based on how they are segmented or structured in terms of the dependent).  Just remember that when you run your hypothesis tests, you should test your assumptions and make sure you are using the right tests (e.g. you might need to run a Shapiro-Wilk test for normality).
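As a rough illustration of this kind of screening, here is a minimal sketch of a chi-square test of independence between a binary candidate variable and a binary dependent variable.  It is not the exact procedure I run in my software; the function name and the counts are made up for the example, and it uses only the Python standard library (no continuity correction).

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test of independence for a 2x2 table of counts.

    Layout:
                   target=1   target=0
        flag=1        a          b
        flag=0        c          d

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom the chi-square tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts: customers with/without some flag vs. whether the event occurred.
stat, p = chi_square_2x2(a=30, b=70, c=15, d=85)
print(f"chi-square = {stat:.2f}, p = {p:.4f}")
```

A small p-value suggests the variable is worth carrying forward; a large one is a hint (not proof) that it may not add much on its own.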

Create meaningful interactions.  This is also fun.  With this step, I typically have taken the approach of trying to create interactions myself...based on what I know might be predictive of the outcome.  Further, if I have found that some variables are not meaningful by themselves, I will try interacting them with variables that are.  I have found that some of the best interactions created were ones that I created (vs. ones that my software program created for me).

Ask the business line for variables they think might be predictive.  I have found this extremely useful.  People on the front lines of business (or at least closer to them than you are) know and understand behaviors because they work with them every day.  If you can understand what behaviors they think lead to your outcome and you are able to quantify that into a variable for a model, it will only serve to enhance your model that much more.  Plus, it will create a better relationship with your business line as they will feel like they are an integral part of your process (you will become a partner with them instead of them just waiting and being a spectator to your work).

Determine how many variables you can use for your model.  You can measure this using a degrees of freedom approach.  Or, if you know your business well enough, you will get a sense of how many variables are a good set to try and build into your model.  For most of what I do, I try to come up with a subset of 200-300 variables as a goal.  Through the process of forward or backward elimination this will usually go down to 7-10 variables.  For everyone it will be different.

Make sure your dependent variable is accurate.  I have found that it's easy to overlook the dependent variable.  I don't know if that's because it's just one variable and the independents typically take more time to get through.  If you read about modeling, I have found the material mostly focuses on the independent variables as well.  Let me just say this...if the dependent variable is not constructed right, you may not be answering the question you want to answer and therefore will create an unintended weakness in your model.  That weakness can cause other problems downstream from the model that you may not have expected.  When I look at a dependent variable I typically look at two things:
1:  Its timing.  For example, if you are doing a behavior model, you might focus on how much of a forecast period you need...your dependent would then lie within that period.
2:  The event or condition you are trying to predict.  For a binary model, you may not just want to target one generic group...you may want to target an enhanced version of this group.  I like to use the example of targeting loan applications in general or targeting loan applications that are also approved.  Or, savings accounts in general, or targeting savings accounts with high average balances that are more profitable.
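To make the two points concrete, here is a minimal sketch of building a binary dependent variable that respects both the timing (a forecast window after a snapshot date) and the enhanced event (an application that is also approved).  The field names, dates and the 90-day window are assumptions for illustration only.

```python
from datetime import date, timedelta

def build_target(applications, snapshot, window_days=90):
    """Return {customer_id: 0 or 1}.

    A customer gets 1 only if they have an application that is BOTH
    approved and dated inside (snapshot, snapshot + window_days].
    """
    window_end = snapshot + timedelta(days=window_days)
    target = {}
    for app in applications:
        cust = app["customer_id"]
        target.setdefault(cust, 0)
        in_window = snapshot < app["applied_on"] <= window_end
        if in_window and app["approved"]:
            target[cust] = 1
    return target

# Hypothetical records.
applications = [
    {"customer_id": "A", "applied_on": date(2014, 1, 15), "approved": True},
    {"customer_id": "B", "applied_on": date(2014, 2, 1), "approved": False},
    {"customer_id": "C", "applied_on": date(2014, 6, 1), "approved": True},  # after the window
]
target = build_target(applications, snapshot=date(2013, 12, 31), window_days=90)
print(target)
```

Customer B applied but was not approved, and customer C was approved outside the window, so neither counts as an event under this definition.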

For more on this subject see another blog I wrote:  http://enduranalyst.blogspot.com/2012/04/dependent-variable-normalization-in.html

Take care of your exclusions (if applicable).  For more on that, read:  http://enduranalyst.blogspot.com/2012/07/the-model-exclusion.html

Transform variables as needed for better fit or to make it easier for the model to use them.  I could do an entire article just on this topic, but let me at least put a few of the things I might do for this:
  • If you are doing a logistic model, you will typically need to transform continuous variables so they function better with your dependent variable.  See this article for ideas on this:  http://enduranalyst.blogspot.com/2013/01/using-aic-to-fit-continuous-variable.html
  • Logistic and linear models will not take variables that contain text or that are grouped into worded segments.  With these kinds of variables you will need to make them binary (remember, the number of groups minus 1 rule) or, if they have an intrinsic order, you could assign them scores based on a perceived value order.  For example, if your variable is low, medium, medium-high, high...you can make it 5, 10, 15, 20--with the assumption that the levels are evenly spaced.  Here is some more on segmentation and modeling:  http://enduranalyst.blogspot.com/2012/05/advantages-of-segmentation-for-modeling.html
  • Group variables to maximize the use of the dependent variable.  Sometimes data can be ordered in a way that is not statistically significant if left as is.  To improve this, you can experiment by grouping together groups that make sense, therefore causing more of the dependent variable to appear in a larger population.
  • Take care of outliers.  Sometimes variables are not useful because of outliers.  I typically use Q-Q plots to first discover outliers.  Then, I use a method to deal with them.  That could mean either taking them out or setting them to a reasonable average statistic.
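The text-variable transformations in the list above can be sketched in a few lines.  This is only an illustration under my own assumptions: the helper names and sample values are hypothetical, and the 5/10/15/20 scores are the ones from the example, which assume the levels are evenly spaced.

```python
def dummy_code(values, reference):
    """k-1 dummy coding: one 0/1 column per level, except the reference level."""
    if reference not in values:
        raise ValueError("reference level must appear in the data")
    levels = sorted(set(values) - {reference})
    return {f"is_{lvl}": [1 if v == lvl else 0 for v in values] for lvl in levels}

# Scores for an ordered variable (assumes evenly spaced levels).
ORDINAL_SCORES = {"low": 5, "medium": 10, "medium-high": 15, "high": 20}

def ordinal_score(values, scores=ORDINAL_SCORES):
    """Replace each ordered text level with its numeric score."""
    return [scores[v] for v in values]

# Hypothetical data: an unordered segment and an ordered rating.
region = ["east", "west", "north", "east"]
dummies = dummy_code(region, reference="east")   # 3 groups -> 2 columns

rating = ["low", "high", "medium", "medium-high"]
scored = ordinal_score(rating)

print(dummies)
print(scored)
```

The reference level is the group left out; its effect is absorbed by the intercept, which is why only the number of groups minus 1 columns are needed.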

5--Run the model, back-test, refine

Here is where I rely on different statistical measures to help me choose which models to use or not.  For example, I may use AIC or BIC to do model comparisons.  This also suggests an approach of building competing models.  Here I would also use the Variance Inflation Factor (VIF) to test for multi-collinearity and remove variables that are extraneous (to read more about this go to:  http://enduranalyst.blogspot.com/2012/05/variance-inflation-factor-vif.html).  I will also rely on measures like the KS statistic and the Gini Coefficient to determine which model provides the best lift.
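For readers who have not computed them by hand, here is a minimal sketch of the KS statistic and the Gini Coefficient for a binary model.  It assumes the common definitions (KS as the largest gap between the cumulative share of events and non-events when sorted by score, and Gini as 2 x AUC - 1); your software may report them slightly differently, and the data are made up.

```python
def ks_and_gini(labels, scores):
    """Return (KS, Gini) for 0/1 labels and model scores (higher = more likely)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one event and one non-event")
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    cum_pos = cum_neg = 0
    ks = 0.0
    auc = 0.0
    i = 0
    while i < len(pairs):
        # Handle tied scores as one group so ties do not favor either class.
        j = i
        pos_here = neg_here = 0
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            pos_here += pairs[j][1]
            neg_here += 1 - pairs[j][1]
            j += 1
        # Trapezoid under the ROC curve for this group.
        auc += (neg_here / n_neg) * (cum_pos + pos_here / 2) / n_pos
        cum_pos += pos_here
        cum_neg += neg_here
        ks = max(ks, abs(cum_pos / n_pos - cum_neg / n_neg))
        i = j
    return ks, 2 * auc - 1

# Hypothetical outcomes and scores from a candidate model.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
ks, gini = ks_and_gini(labels, scores)
print(f"KS = {ks:.3f}, Gini = {gini:.3f}")
```

Running the same function on each competing model's scores gives a like-for-like comparison, as long as every model is scored on the same records.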

Here is where I will also back-test the model.  This part is extremely important and, to me, gives extra confidence and comfort to those planning to use the model.  If you have enough data for the current period you are modeling on, you can split the data, build the model on one set and do a clean test run on the other.  Otherwise, you can take the model built on your model set and use it to score an out-of-time, out-of-sample group in the past.  You can then see how close your predictions are to the actuals for this set.  If something is very wrong with the results of this out-of-sample test, you may need to rewrite the model or adjust the model's results with a factor to deal with the discrepancy.
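Here is a minimal sketch of the two mechanics described above: splitting the data into a build set and a holdout set, and comparing predictions to actuals.  The function names, the 30% holdout share and the sample numbers are assumptions, and the "factor" shown (actual rate divided by predicted rate) is just one simple way to read the adjustment I mention.

```python
import random

def split_sample(records, holdout_share=0.3, seed=42):
    """Shuffle reproducibly, then split into (build, holdout) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * (1 - holdout_share)))
    return shuffled[:cut], shuffled[cut:]

def predicted_vs_actual(predictions, actuals):
    """Return (mean predicted rate, actual event rate, adjustment factor)."""
    if not predictions or len(predictions) != len(actuals):
        raise ValueError("predictions and actuals must be non-empty and aligned")
    predicted_rate = sum(predictions) / len(predictions)
    actual_rate = sum(actuals) / len(actuals)
    # Factor that would scale predictions to match actuals on average.
    factor = actual_rate / predicted_rate if predicted_rate else None
    return predicted_rate, actual_rate, factor

# Hypothetical usage: ten record ids, then three scored holdout records.
build, holdout = split_sample(range(10))
pred_rate, act_rate, factor = predicted_vs_actual([0.2, 0.4, 0.6], [0, 1, 0])
print(len(build), len(holdout), round(pred_rate, 3), round(act_rate, 3), round(factor, 3))
```

A factor far from 1 on the out-of-time group is the signal that the model may need to be rewritten rather than just adjusted.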

6--Present the results

If you document as you go (see step 7), this part is easier.  And it will be easy to formalize the model and its results to those you present it to.  Plus, if you know your audience, you will be able to word the document in such a way that it is understandable to those you present it to.  Here is the typical format I use for showing my model results:

A.  Executive Summary
  • Here I would list the purpose of the model and any exclusions I did
  • I will note the strengths and weaknesses of the model
  • I might include instructions on how to implement the model and who to work with on my team in doing it
B.  Time Period the model was built on (in graphic form)
  • This shows the population the model was built on and I like to show the event and the performance window of the dependent variable.
C.  Which variables were tried and ultimately used
  • I also suggest ranking them in importance to the model.  Many modeling programs will provide these statistics for you.  Otherwise, I use the size of the coefficients and their significance level as a guide for ranking.
D.  Back-test results, Model Strength statistics

E.  Next Steps/Expectations
  • Here I might state things that can be done to enhance the model in the future
  • I might also go into more detail around what needs to happen for implementation
  • With any model, I think it's also important to make clear the assumptions a model has and its weaknesses.  I usually state that while no model is or ever will be perfect...this model will get us closer than we've ever been to providing value on predicting what we want.
F.  Appendix
  • Basically, anything that is granular, interesting but not necessary to present in A-E, I will include here.
  • The Appendix could be referred to during the presentation to resolve any concerns as you present. 
    • On this note, one thing I like to do before any presentation is to anticipate the kinds of concerns people in the group might have and be prepared to answer them with results.  These kinds of results definitely belong in the Appendix.
  • Another thing I like to include in the Appendix (which usually someone will bring up) is reasoning behind the approach I chose.  Here I might use sources I discovered through academic research, or outline my thought process to choosing the methodology I did.

7--Document as you go
  • Throughout each step of the model journey I've outlined, I take careful notes with the results presentation (Step 6) in mind.  Further, it's helpful to organize your spreadsheets and supporting documentation in such a way that they will be easy to access as needed.

