Friday, April 11, 2014

The Value of an Index


I would put this topic in the pretty basic, but useful to remember category.  Computing an index value is simple; interpreting it, or understanding how to interpret it, is where the value of an index gets more interesting.

The way to calculate an index is as follows:

Index = (Value / Base Value) * 100

where the base value is whatever you choose to compare against (in the example below, the mean of the sample).
The main thing you need before calculating an index is a data set.  Let's use population by state as an example.  Say I have a sample of 10 states and I want to prepare an index value based on the mean of their populations (so that if there is a new value, I can see where it stands compared to the other states in my sample).  Here is a sample of 10 states, all based on real data from the April 1, 2010 US Census Bureau counts (Annual Estimates of the Population for the United States, Regions, States, and Puerto Rico, NST-EST2013-01, link:  http://www.census.gov/popest/data/state/totals/2013/), with some basic descriptive statistics:

Alabama:  4,779,736
Florida:  18,801,310
Idaho:  1,567,582
Utah:  2,763,885
New York:  19,378,102
California:  37,253,956
Ohio:  11,536,504
Vermont:  625,741
West Virginia:  1,852,994
Pennsylvania:  12,702,379

Mean:  11,126,219
Median:  8,158,120
Minimum:  625,741
Maximum:  37,253,956

Now, if you want to create an index based on this sample, using its mean as the base (or comparison) value, so you can see how any new state compares to it, take each value in your sample, divide it by the mean, and multiply by 100.  You will get the following results:

Alabama:  42.96
Florida:  168.98
Idaho:  14.09
Utah:  24.84
New York:  174.17
California:  334.83
Ohio:  103.69
Vermont:  5.62
West Virginia:  16.65
Pennsylvania:  114.17
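If you want to reproduce the calculation, here is a minimal Python sketch using the population figures listed above (nothing here is specific to populations; any numeric data set works the same way):

# Index each state's population against the sample mean (mean = 100).
populations = {
    "Alabama": 4779736,
    "Florida": 18801310,
    "Idaho": 1567582,
    "Utah": 2763885,
    "New York": 19378102,
    "California": 37253956,
    "Ohio": 11536504,
    "Vermont": 625741,
    "West Virginia": 1852994,
    "Pennsylvania": 12702379,
}

base = sum(populations.values()) / len(populations)   # the sample mean, about 11,126,219

for state, population in populations.items():
    index = population / base * 100                   # value / base value * 100
    print(f"{state:15}{index:>10.2f}")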
It is also helpful to graph the distribution of your index values:

[Graph: distribution of the index values for the 10 states]
As you can tell, your index ranges from 5.62 (Vermont) on the low end to 334.83 (California) on the high end.  To interpret this, take Ohio's index of 103.69: subtract 100, and you can say that Ohio is 3.69% higher than the average (11,126,219).  Math-ing it out: 103.69 - 100 = 3.69, then 11,126,219 + (0.0369 * 11,126,219) = est. 11,536,776.  Obviously, this is not exact because the index was rounded to two decimal places.  The further you go out in decimal places, the closer you will get to the 11,536,504 actual (you get the picture).  This also works if your state is below the mean.  Vermont, for example: 5.62 - 100 = -94.38, meaning that its population is 94.38 percent LESS than the average.  Math-ing that out comes to 11,126,219 - (11,126,219 * 0.9438) = 11,126,219 - 10,500,925.5 = 625,293.5.  Again, this is not exactly the actual value of 625,741 because the index only went out two decimal places (it is close enough for example purposes, though).

Now let's say you want to bring in a new state, say Texas, population 25,145,561.  If you take its value divided by your average, times 100, its index value is 226.00 (or 126.00% higher than the average).  That is somewhere between Florida and California in your sample.  Thus, any new state can be brought in and measured against your original sample average of 11,126,219.
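A short sketch of that, plus the reverse direction (turning an index back into an estimated population), assuming the same figures as above:

# Index a new value against the original base, and convert an index back
# into an approximate population figure.
base = 11126219.0                     # sample mean of the original 10 states (rounded)

texas = 25145561
texas_index = texas / base * 100      # about 226.00
print(round(texas_index, 2))

def estimate_from_index(index_value, base_value):
    # 100 = the base itself; each point above or below 100 is 1% of the base
    return base_value * index_value / 100

print(round(estimate_from_index(103.69, base)))   # close to Ohio's actual 11,536,504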

Note that you can really use any 'base' value to compare against, depending on your purpose.  If you used the minimum, you would be creating an index that compares every state against Vermont.  If you used the median, you would be creating an index against the midpoint value of the states in your sample.  If you wanted to, you could take an average of all states and use that as the base value for comparing against the population of a region within another country (for example).  Really, your choice of base depends on your purpose, and you will know what that is.
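To make that concrete, a quick sketch that indexes the same value (Ohio) against three different bases, reusing the figures from above:

# The choice of base only changes what an index of 100 means.
import statistics

values = [4779736, 18801310, 1567582, 2763885, 19378102,
          37253956, 11536504, 625741, 1852994, 12702379]
ohio = 11536504

for label, base in [("mean", statistics.mean(values)),
                    ("median", statistics.median(values)),
                    ("minimum (Vermont)", min(values))]:
    print(f"Ohio indexed against the {label}: {ohio / base * 100:,.2f}")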

Now, let's say you want to simplify things and create index values with less variation (especially since, in our sample, California is so high and Vermont is so low).  To accomplish this, you can transform the populations first, then calculate your base value from the transformed figures and create an adjusted index (understanding that if you want to end up with a value you can reasonably interpret, you have to un-transform it afterwards).  For simplicity's sake, I will only look at the square root and log transformations.  The results and adjusted indexes come out as follows:

[Table: square root and log transformed populations for each state, with adjusted index values based on the transformed means]
You can also tell from graphs of the adjusted indexes that there is less variation (notice also that the means are closer to the medians with the transformations):

[Graphs: distributions of the square root and log adjusted index values]
Interpreting the differences from your base works a little differently here.  Let's say you use the square root transformation and your state is Ohio.  Its index value under the square root transformation is 116.90.  That means it is about 16.90% higher than the average of the square roots of the populations.  Math-ing that out: 2906 + (2906 * 0.169) = 3397.114.  Notice how closely this matches the actual square root of Ohio's population (about 3,396.5).  To convert this back to the original population figure, of course you square it, which gives roughly 11,540,384 (again, it is the rounding that gives you an estimated number, not the exact one).
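If you want to see those mechanics in code, here is a sketch of the square root version (the log version works the same way, except you exponentiate to un-transform).  The gap between the estimate and the actual figure comes purely from rounding the index to two decimal places:

# Adjusted index using a square root transformation, then un-transforming back
# to the original population scale.
import math

values = [4779736, 18801310, 1567582, 2763885, 19378102,
          37253956, 11536504, 625741, 1852994, 12702379]
ohio = 11536504

sqrt_base = sum(math.sqrt(v) for v in values) / len(values)       # about 2,906

ohio_sqrt_index = round(math.sqrt(ohio) / sqrt_base * 100, 2)     # about 116.90
print(ohio_sqrt_index)

# Work back from the rounded index, then square to return to the original scale.
estimated_population = (sqrt_base * ohio_sqrt_index / 100) ** 2
print(round(estimated_population))                                # near the actual 11,536,504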

Monday, April 7, 2014

Some Model Validation Thoughts

Some Model Validation Thoughts - Part 1, A Structured Approach idea

As I mentioned, I wanted to share some model validation tips I have picked up over the years from working with financial services regulators, reading regulatory guidelines, and developing my own approach.  In the below, I mainly focus on OCC Bulletin 2000-16 (which can be found at http://ithandbook.ffiec.gov/media/resources/3676/occ-bl2000-16_risk_model_validation.pdf).  I will likely share more on this subject, but I hope you find this a good start (and useful if you are in the position of validating a model).

Kirk Harrington, SAEG Partner

If I had to outline my model validation approach, it would be as follows:

DATA PIECE → METHODOLOGY PIECE → MODEL STRENGTH PIECE → CODING/INPUT CHECKS → ISSUES RESOLVE

Why DATA PIECE?

In OCC 2000-16, “Validating the Model Inputs Component” is its own section.  It says “It is possible that data inputs contain major errors while the other components of the model are error free.  When this occurs, the model outputs become useless, but even an otherwise sound validation will not necessarily reveal the errors.  Hence, auditing of the data inputs is an indispensable and separate element of a sound model-validation process, and should be explicitly included in the bank’s policy.”

In my validation of the DATA, I focus on replicating/understanding:

·         Summary and key result tables.  What I mean by key result tables is that if a variable is being used in the model and has a specific population proportion stated in the documentation, I validate that proportion.

·         Figures with key rates, i.e., percentages, averages, etc.  Mainly I try to focus on figures that, if off, would affect the model's results.

·         Exclusions.  Are the exclusions appropriate, and do the numbers of exclusions match?  If too many exclusions happen, this could erode the model's effectiveness.

In my mind, it is crucial to understand the population that is used prior to running the model.  If this inputs piece is inaccurate, anything run after it is highly questionable.
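As a rough illustration of the kind of reconciliation I mean, here is a sketch; the file name, column names, and the "documented" figures are hypothetical placeholders, not from any real model:

# Reconcile figures stated in the model documentation against the raw input data.
# File name, column names, and documented values below are hypothetical placeholders.
import pandas as pd

data = pd.read_csv("model_input_extract.csv")              # hypothetical input extract

documented_default_rate = 0.042                            # proportion stated in the documentation (hypothetical)
computed_default_rate = data["default_flag"].mean()        # proportion actually in the data

documented_exclusions = 1250                               # exclusion count stated in the documentation (hypothetical)
computed_exclusions = int((data["exclude_flag"] == 1).sum())

print(f"Default rate: documented {documented_default_rate:.3f}, computed {computed_default_rate:.3f}")
print(f"Exclusions:   documented {documented_exclusions}, computed {computed_exclusions}")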

Why METHODOLOGY PIECE?

Models can be complex, and which formulas are used, how they are translated into the implementation, and whether a given approach is appropriate at all are highly critical questions.

The OCC 2000-16 says this:  “Implementing a computer model usually requires the modeler to resolve several questions in statistical and economic theory.  Generally, the answer to those theoretical questions is a matter of judgment, though the theoretical implementation is also prone to conceptual and logical error.”  Later, it continues: “Regardless of the qualifications of the model developers, an essential element of model validation is independent review of the theory that the bank uses.”  It also goes on to talk about how comparing to other models, either at the bank or publicly available, is useful.

My approach to this piece typically involves:

·         Doing outside research to see how and where a given methodology is used.  For this, I may go through a digital database containing papers on financial models (for example) or talk to friends I have in the industry who may be using similar approaches (to get their feedback).  If I can't find anything, I might also go through the exercise of breaking down the formula to understand its various components and whether those components are reasonable.

·         Understanding whether the variables used in prediction are reasonable (and whether the coefficients and their signs make sense)

·         Understanding the dependent variable, how it is calculated, and whether it is appropriate for the type of model used

·         Understanding how the model is translated into the implementation and whether that translation is accurate given the type of model it is.

Why MODEL STRENGTH PIECE?

While not explicitly stated in OCC 2000-16, this piece helps address key matters it mentions, like model results, code, and mathematics.  There are various metrics and tests that can be produced for a given model, and those metrics can speak to a model's fit, its specification (whether it includes enough variables to give a complete picture of what predicts the dependent variable), and its strengths (or weaknesses).  For example, a low R-squared can be a sign of weak predictive power.  A pseudo R-squared, however, is harder to interpret; in that case, it is helpful to look at similarly built models to judge the pseudo R-squared more accurately.  Here is an article that speaks to this matter:  http://www.ats.ucla.edu/stat/mult_pkg/faq/general/Psuedo_RSquareds.htm
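For what it is worth, McFadden's pseudo R-squared is simple to compute if you have the model's log-likelihoods.  A sketch using statsmodels on made-up data (the data is random and only there to show the mechanics):

# McFadden's pseudo R-squared for a logistic model, on made-up data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 2))
y = (x[:, 0] + 0.5 * x[:, 1] + rng.normal(size=500) > 0).astype(int)

model = sm.Logit(y, sm.add_constant(x)).fit(disp=0)

# McFadden: 1 - (log-likelihood of the fitted model / log-likelihood of an intercept-only model)
pseudo_r2 = 1 - model.llf / model.llnull
print(round(pseudo_r2, 3))            # statsmodels also reports this as model.prsquared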

More important than judging a model on its R-squared and fit statistics alone is looking at out-of-sample tests or backtests performed at the time the model was developed.  OCC 2000-16 also suggests that “model developers and validators should compare its results against those of comparable models, market prices, or other available benchmarks.”  I typically look to see whether these kinds of tests are in the model documentation and analyze them when they are.  I am especially wary of results that look to be ‘in-sample’ with an exact fit.
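As a bare-bones example of the kind of out-of-sample check I look for (on made-up data; a real backtest would use the model's own holdout sample or later observations):

# Fit on one portion of the data, then measure prediction error on a holdout portion.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=(600, 2))
y = 2.0 + 1.5 * x[:, 0] - 0.8 * x[:, 1] + rng.normal(scale=0.5, size=600)

train_x, test_x = sm.add_constant(x[:400]), sm.add_constant(x[400:])
train_y, test_y = y[:400], y[400:]

fit = sm.OLS(train_y, train_x).fit()
in_sample_rmse = np.sqrt(np.mean((fit.predict(train_x) - train_y) ** 2))
out_of_sample_rmse = np.sqrt(np.mean((fit.predict(test_x) - test_y) ** 2))

# A large gap between the two is the kind of result I would raise for discussion.
print(round(in_sample_rmse, 3), round(out_of_sample_rmse, 3))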

Why CODING/INPUT CHECKS?

This is synonymous with checking the ‘Model Processing Component’.  For me, this entails line by line proofreading of the model code (see the Code and Mathematics section of 2000-16) and checking the correctness of the mathematics and formulas used.  Something new I’ve added to my process (if feasible) is to construct an identical model to check coefficients and significance levels against those stated.  Because I’ve already prepared the data to check against tables and key figures, this part is not very difficult to accomplish.  OCC 2000-16 does state that constructing an identical model is useful, especially if the model is simple (i.e. constructed from spreadsheets).  For more complex models, the alternative approaches it suggests are line by line reading of the code and benchmarking against available models.
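Here is a sketch of what that coefficient check can look like; the "documented" coefficients and the tolerance are hypothetical placeholders, and X and y stand in for the prepared model inputs:

# Refit the model from the prepared inputs and compare the coefficients against
# those stated in the documentation. Documented values and tolerance are hypothetical.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = sm.add_constant(rng.normal(size=(300, 2)))        # stand-in for the prepared model inputs
y = X @ np.array([1.0, 0.6, -0.3]) + rng.normal(scale=0.2, size=300)

refit = sm.OLS(y, X).fit()

documented = np.array([1.0, 0.6, -0.3])               # coefficients stated in the documentation (hypothetical)
tolerance = 0.05                                      # agreed-upon tolerance (hypothetical)

for name, estimated, stated in zip(["intercept", "var_1", "var_2"], refit.params, documented):
    status = "OK" if abs(estimated - stated) <= tolerance else "FLAG"
    print(f"{name:10} refit {estimated:+.4f}  documented {stated:+.4f}  {status}")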

 ISSUES RESOLVE

If any issues are discovered in my process, I will bring these to the attention of the model owner first, then the model creator.  I will typically only go to the creator if I am unsatisfied with the answers I receive from the owner, or if it is in regard to a matter that I know only the model creator can answer.  Here are some general rules I go by:

·         With data, if there is an X% discrepancy (with X agreed upon with the model owner), I will flag it as an issue (see the short sketch at the end of this post)

·         With model strength, I will raise it as an issue more so for informational purposes so that the model owner understands the weaknesses and strengths of the model

·         With coding and input, I will check if signs, formulas, and coefficients are correct.  I also like to focus on whether variables are being ‘prepped’ properly before going into the main model formula

·         I will typically put any issue I find (that is not resolved immediately between the model owner, creator, and myself) on an issues log.  I use this log to follow the resolution of issues throughout the validation process.
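Tying back to the first bullet above, a small sketch of how the data discrepancy flag and the issues log can work together; the threshold stands in for whatever X% has been agreed with the model owner, and the figures are hypothetical:

# Flag a documented-versus-computed discrepancy over the agreed threshold and log it.
# The threshold stands in for the agreed-upon X%; the example figures are hypothetical.
def check_discrepancy(item, documented, computed, threshold_pct, issues_log):
    discrepancy_pct = abs(computed - documented) / documented * 100
    if discrepancy_pct > threshold_pct:
        issues_log.append({
            "item": item,
            "documented": documented,
            "computed": computed,
            "discrepancy_pct": round(discrepancy_pct, 2),
            "status": "open",
        })

issues_log = []
check_discrepancy("default rate", documented=0.042, computed=0.047,
                  threshold_pct=5, issues_log=issues_log)
print(issues_log)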