Here is a general list of model exclusions to consider:
· Populations that may not exist today. For example, I once built a model where I had to remove new account originations from the modeling period because they did not exist in more current periods.
· Records with an abnormal amount of missing data
· Out of behavioral sync
o This refers to populations that may be in your data but whose behavior is ‘out of sync’ compared with more current populations. I ran across this once when I was validating a model built by a third-party vendor. They did not have information about the release date of a new product, and our data happened to include observations of an old product prior to the new product’s release. The new product’s release was a significant change that affected the entire product line, and the old products were changed to the new one. It was also clear from the data that the populations before and after the product release date (with their differing attributes from the two time periods) behaved very differently. To make the model more effective on current populations, they should have excluded any populations from before the new product's release.
· Populations that are not specific to the goal or objective of the model. For example, if the model is meant for Population A and your data contains Populations A, B, and C, your model exclusions would be Populations B and C.
· Populations with a status that does not make sense to include. For example, you would not want to include accounts that are closed, customers that are deceased, or observations that are already in the late stages of what you are trying to predict (to name a few examples).
· Duplicate records
· Records with missing key identifiers. I include this one because, when you are creating a model, there are typically key identifiers that you use to ‘match’ and append data for dependent and independent variables. If a key identifier for matching is missing, the data you try to append may not exist, and the match will generate blanks that cause modeling problems.
· Observations that, for whatever reason, show values that are likely data errors. For example, a negative balance that is likely negative for accounting reasons but should not exist in the normal behavior time window.
· Here are a few for time series:
o A first observation that does not make sense to include. I ran across this once where a model was built for balance change and the first balance observation was 0. This was likely due to an 'accounting' period where the account was open but was not holding balances yet.
o Observations that occur 'after the predicted event' that may be there for accounting reasons.
o A static date variable (or type variable) that should remain static but doesn't, where the reason behind the change is vague and cannot be explained. An example of this might be origination dates that change for an account throughout a series. This is likely a data problem; to remedy it, you would either research and fix the problem or remove the account and all its observations entirely.
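To make a few of these exclusions concrete, here is a minimal sketch in plain Python. Everything specific in it is an assumption for illustration only: the field names (account_id, status, balance, score, origination_date, event_flag), the excluded status values, the 50% missing-data threshold, and the choice to drop every leading zero-balance observation rather than just the first. It is one possible way to code the checks, not a definitive implementation, and it tracks how many records each rule removes so the exclusions can be documented.

```python
from collections import Counter

# Hypothetical field names and thresholds, chosen only for illustration.
KEY_FIELD = "account_id"
MAX_MISSING_SHARE = 0.5
EXCLUDED_STATUSES = {"closed", "deceased"}


def apply_exclusions(records):
    """Return (kept_records, counts_by_exclusion_reason)."""
    kept, reasons, seen = [], Counter(), set()
    for rec in records:
        # Share of fields that are missing (None) in this record.
        missing_share = sum(v is None for v in rec.values()) / max(len(rec), 1)
        # A hashable snapshot of the record, used to spot exact duplicates.
        fingerprint = tuple(sorted(rec.items(), key=lambda kv: kv[0]))
        if rec.get(KEY_FIELD) is None:
            reasons["missing key identifier"] += 1
        elif fingerprint in seen:
            reasons["duplicate record"] += 1
        elif missing_share > MAX_MISSING_SHARE:
            reasons["too much missing data"] += 1
        elif rec.get("status") in EXCLUDED_STATUSES:
            reasons["excluded status"] += 1
        else:
            kept.append(rec)
        seen.add(fingerprint)
    return kept, reasons


def clean_series(observations, event_field="event_flag"):
    """Clean one account's observations, assumed sorted by period.

    Returns None when the origination date changes within the series,
    meaning the whole account should be researched or removed.
    """
    if len({obs["origination_date"] for obs in observations}) > 1:
        return None
    # Skip leading zero-balance rows (account open but not holding balances).
    start = 0
    while start < len(observations) and observations[start]["balance"] == 0:
        start += 1
    cleaned = []
    for obs in observations[start:]:
        cleaned.append(obs)
        if obs[event_field]:
            break  # exclude anything recorded after the predicted event
    return cleaned


# Made-up example data.
records = [
    {"account_id": 1, "status": "open", "balance": 100.0, "score": 700},
    {"account_id": 1, "status": "open", "balance": 100.0, "score": 700},
    {"account_id": None, "status": "open", "balance": 50.0, "score": 650},
    {"account_id": 2, "status": "closed", "balance": 0.0, "score": 600},
    {"account_id": 3, "status": None, "balance": None, "score": None},
    {"account_id": 4, "status": "open", "balance": 250.0, "score": 720},
]
kept, reasons = apply_exclusions(records)
print([rec["account_id"] for rec in kept], dict(reasons))

series = [
    {"period": 1, "origination_date": "2020-01-15", "balance": 0.0, "event_flag": False},
    {"period": 2, "origination_date": "2020-01-15", "balance": 500.0, "event_flag": False},
    {"period": 3, "origination_date": "2020-01-15", "balance": 450.0, "event_flag": True},
    {"period": 4, "origination_date": "2020-01-15", "balance": 0.0, "event_flag": False},
]
print([obs["period"] for obs in clean_series(series)])
```

Each record is counted under the first rule it fails, so the order of the checks matters if you report exclusion counts. Exclusions that depend on business knowledge, such as populations outside the model's objective or from before a product change, would be added as further rules in the same pattern.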
An Effective Analyst Thought
When doing my work for a company, I often keep in mind where the company is on its learning curve with data analysis and modeling, and I create opportunities to teach those I work with to help them along the curve. This may entail teaching those without a statistics background how to use simple statistical techniques to enhance the reporting they do for senior management. The farther along a company is on its learning curve with analysis, the more competitive it can be with its peers in the industry, and YOU can be a driving force to get it there.