The Statistical Analyst Effectiveness Group (SAEG): The Model Exclusion

Model exclusions are populations that are left out of your model build. They are typically applied for one of two reasons: the observations are not pertinent to the modeling exercise (i.e. there is no need to predict on them), or they create undue noise that inhibits the model from making accurate predictions.

Here is a general list of model exclusions to consider:

· Populations that may not exist today. For example, I once built a model where I had to remove new account originations from the modeling period because they did not exist in more current periods.

· Records with an abnormal amount of missing data

· Populations that are out of behavioral sync

o This refers to populations that are in your data but whose behavior is ‘out-of-sync’ with more current populations. I ran across this once when I was validating a model built by a third-party vendor. They did not have the release date of a new product, and our data happened to include observations of an old product from before the new product’s release. That release was a significant change that affected the entire product line, and the old products were changed to the new one. It was also clear from the data that the populations before and after the release date (with their differing attributes from the two time periods) behaved very differently. To make the model more effective on current populations, they should have excluded any observations from before the new product's release.

· Populations that are not specific to the goal or objective of the model. For example, if the model is meant for Population A and your data contains Populations A, B and C, your model exclusions would be Populations B and C.

· Populations with a status that does not make sense to include. For example, you would not want to include accounts that are closed, customers that are deceased, or observations that are already in the late stages of what you are trying to predict.

· Duplicate records

· Records with missing key identifiers. I include this one because when you are creating a model, you typically use key identifiers to ‘match’ records and append data for dependent and independent variables. If a key identifier for matching is missing, the data you try to append will not be found, and the resulting blanks will cause modeling problems.

· Observations that, for whatever reason, show values that are likely data errors. For example, a balance that is negative for accounting reasons but should never be negative within the normal behavior time window.

· Here are a few for time series:

o The first observation, if it does not make sense to include. I ran across this once where a model was built for balance change and the first balance observation was 0. This was likely due to an 'accounting' period in which the account was open but was not holding balances yet.

o Observations that occur 'after the predicted event' that may be there for accounting reasons.

o Accounts where a static date variable (or type variable) should remain static but doesn't, and the reason behind the change is vague and not explainable. An example of this might be origination dates that change for an account throughout a series. This is likely a data problem. To remedy it, you would either research and fix the problem or remove the account and all of its observations entirely.
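Several of the exclusions above can be sketched in a few lines of Python. This is only an illustration, not a recipe: the field names (account_id, obs_date, status, balance, orig_date), the product release cutoff, the 50% missing-data threshold, the definition of a duplicate as a repeated (account, date) pair, and the sample records are all assumptions made up for the example. It tags each record with the first exclusion reason that applies, then drops every observation of any account whose origination date changes within its series.

```python
from collections import Counter
from datetime import date

# All thresholds and field names below are hypothetical assumptions.
PRODUCT_RELEASE = date(2011, 1, 1)      # assumed behavioral-sync cutoff
EXCLUDED_STATUSES = {"closed", "deceased"}
MAX_MISSING_SHARE = 0.5                 # assumed "abnormal" missing share


def row(acct, obs, status, balance, orig):
    return {"account_id": acct, "obs_date": obs, "status": status,
            "balance": balance, "orig_date": orig}


# Made-up sample data: one record per account per observation date.
records = [
    row("A1", date(2012, 1, 31), "open", 500.0, date(2011, 6, 1)),
    row("A1", date(2012, 2, 29), "open", 520.0, date(2011, 6, 1)),
    row("A1", date(2012, 2, 29), "open", 520.0, date(2011, 6, 1)),   # duplicate
    row(None, date(2012, 1, 31), "open", 100.0, date(2011, 1, 1)),   # no key
    row("A2", date(2012, 1, 31), "closed", 0.0, date(2010, 3, 1)),
    row("A3", date(2012, 1, 31), "open", -40.0, date(2011, 2, 1)),
    row("A4", date(2012, 1, 31), "open", 300.0, date(2011, 5, 1)),
    row("A4", date(2012, 2, 29), "open", 310.0, date(2011, 7, 1)),   # orig_date moved
    row("A5", date(2012, 1, 31), None, None, None),                  # mostly missing
    row("A6", date(2010, 12, 31), "open", 200.0, date(2010, 1, 1)),  # pre-release
]


def exclusion_reason(rec, seen_keys):
    """Return the first exclusion reason that applies, or None to keep the record."""
    if rec["account_id"] is None:
        return "missing key identifier"
    key = (rec["account_id"], rec["obs_date"])
    if key in seen_keys:
        return "duplicate record"
    seen_keys.add(key)
    if sum(v is None for v in rec.values()) / len(rec) > MAX_MISSING_SHARE:
        return "too much missing data"
    if rec["obs_date"] is not None and rec["obs_date"] < PRODUCT_RELEASE:
        return "before product release"
    if rec["status"] in EXCLUDED_STATUSES:
        return "excluded status"
    if rec["balance"] is not None and rec["balance"] < 0:
        return "likely data error"
    return None


def apply_exclusions(rows):
    """Apply record-level rules, then drop accounts with unstable origination dates."""
    seen, kept, reasons = set(), [], Counter()
    for rec in rows:
        reason = exclusion_reason(rec, seen)
        if reason is None:
            kept.append(rec)
        else:
            reasons[reason] += 1
    # A static field should have exactly one value per account.
    orig_dates = {}
    for rec in kept:
        orig_dates.setdefault(rec["account_id"], set()).add(rec["orig_date"])
    unstable = {acct for acct, dates in orig_dates.items() if len(dates) > 1}
    final = [rec for rec in kept if rec["account_id"] not in unstable]
    reasons["origination date changes"] += len(kept) - len(final)
    return final, reasons


final, reasons = apply_exclusions(records)
for reason, count in reasons.items():
    print(f"{reason}: {count}")
print(f"kept for modeling: {len(final)} of {len(records)}")
```

Counting records by exclusion reason, as the sketch does, makes it easier to show reviewers how the modeling population was reached and to question any rule that removes more than expected.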

An Effective Analyst Thought

When doing my work for a company, I often keep in mind where the company is on its learning curve with data analysis and modeling, and I create opportunities to teach those I work with to help them along the curve. This may entail teaching those without a statistics background how to use simple statistical techniques to enhance the reporting they do for senior management. The farther along a company is on its learning curve with analysis, the more competitive it can be with its peers in the industry, and YOU can be a driving force to get it there.

Tuesday, July 31, 2012
