Sunday, May 27, 2012

The advantages of segmentation for modeling

 By Kirk Harrington, with special thanks to Ben Nutter (SAEG partner)

Segmentation can be advantageous for modeling in two ways: 1) explaining a population for descriptive purposes, or 2) identifying groups that behave a certain way (in the context of the dependent variable).

When I approach segmentation for modeling, I use this pattern:

  • Determine whether you want to explain a population in general OR define a behavior
  • Create useful segmentations based on statistical analyses of the population you are working with
  • Decide on (or experiment with) different ways to implement the segmentation in your modeling


Segmentation to explain

Using a tree analysis
With no dependent variable in mind, this type of segmentation simply seeks to explain or describe a population you want to understand.  If, for example, there is a population in your data that you want to understand, you can create a binary variable to label its members (1 = population of interest).  Then you can perform a decision tree analysis (such as CHAID) with that flag as your dependent variable.  For this purpose, you can also expand the tree beyond one level so that you capture more variables.
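To make this concrete, here is a minimal sketch in Python.  CHAID itself is not in scikit-learn, so a CART tree (DecisionTreeClassifier) stands in for it; the file name and every column name here are hypothetical.

```python
# A sketch only: CHAID is not in scikit-learn, so a CART tree stands in.
# "customers.csv" and all column names are hypothetical.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("customers.csv")
df["of_interest"] = (df["segment"] == "target").astype(int)   # 1 = population of interest

features = ["age", "tenure_months", "avg_monthly_spend"]      # hypothetical predictors
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=50)   # more than one level deep
tree.fit(df[features], df["of_interest"])

# The printed rules show which variables best separate the 1s from everyone else.
print(export_text(tree, feature_names=features))
```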

With the results of the tree analysis, if a variable describes something unique about the population, you can use those results to create a segmentation that explains a particular aspect of your population of interest.

One-off analyses
The limitation of a tree analysis for explaining a population is that it focuses only on the most statistically significant variables that help describe the 1s.  If you simply want to make observations about a population without regard to this, several one-off analyses of the population (recording the percentage breakdowns of each variable) can be useful.  Then, if you have other samples of that population back in time (or can hold samples forward in time), you can validate the observations you made with your starting population to see whether the percentages hold.  Out of this analysis, you will identify segmentation that is consistent over time and segmentation that varies over time.  Of course, keep statistical significance testing in mind if you plan to share your results officially.
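If you want to formalize that significance check, a two-sample proportion test is one option.  A minimal sketch, assuming hypothetical counts of members showing some attribute in two time samples of the same population:

```python
# A sketch only: hypothetical counts of members showing an attribute
# in two samples of the same population taken at different times.
from statsmodels.stats.proportion import proportions_ztest

count = [420, 389]     # members with the attribute, current vs. prior sample
nobs = [1000, 1000]    # sample sizes

stat, pvalue = proportions_ztest(count, nobs)
print(f"z = {stat:.2f}, p = {pvalue:.4f}")   # a small p suggests the percentage shifted
```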

Using a population distribution
Another simple method to segment a population (particularly on continuous variables) is to run a percentile distribution (or even a scatter plot) to understand how the population is dispersed on the variable.  Then, if you notice 'cuts' that make sense based on the dispersion, you can write segmentation code that encodes your observations.
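A minimal sketch of this approach, reusing the hypothetical df from the tree sketch above (the variable name and the cut points are assumptions for illustration):

```python
# A sketch only, reusing the hypothetical df from the tree example.
import numpy as np
import pandas as pd

spend = df["avg_monthly_spend"]                       # hypothetical continuous variable
print(spend.quantile([0.1, 0.25, 0.5, 0.75, 0.9]))    # the percentile distribution

# Suppose the distribution shows natural breaks near 50 and 200:
df["spend_segment"] = pd.cut(spend, bins=[0, 50, 200, np.inf],
                             labels=["low", "mid", "high"])
```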

Using statistical clustering methods
One common method, which will eventually be its own topic on SAEG, is the K-means clustering algorithm.  Without going into detail about the algorithm, the basic things to understand about it are...
  • You first tell the algorithm how many clusters you want and how the initial clusters should be defined.  The algorithm works on Euclidean distance between individuals in your dataset.  One way to define the initial clusters, for example, is to choose the two individuals that are farthest from each other in terms of that distance.
  • Each initial cluster's location 'in space' becomes a centroid, and other individuals are allocated to clusters based on their distance to it.  Once a new clustering is formed, a new mean centroid is computed, and individuals are checked again to determine whether that new centroid is the one they are actually closest to.
  • If an individual is not closest to its current cluster's mean centroid, it is re-allocated to the cluster it is closest to, and the distance checking continues until no re-allocations occur.
There are obvious limitations to K-means clustering (particularly if there are outliers in your population), but it can be a very useful tool for segmenting based on a rigorous mathematical procedure.  We will cover the method in more depth in its own special topic.
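For those who want to try it now, here is a minimal K-means sketch with scikit-learn, reusing the hypothetical df and features from the earlier sketches.  Standardizing first matters because the algorithm works on Euclidean distance:

```python
# A sketch only, reusing the hypothetical df and features from above.
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(df[features])      # distance-based, so standardize
km = KMeans(n_clusters=4, n_init=10, random_state=0)  # you choose k up front
df["cluster"] = km.fit_predict(X)

print(df["cluster"].value_counts())    # a very small cluster may just be outliers
```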


Behavioral segmentation
Using a tree analysis
You can also use a tree analysis to create segmentations for behavioral-type variables.  Here, the dependent variable is the behavior flag (1/0).

Using a logistic regression
Using cross-sectional logistic regression, I've had success creating behavior segments from sums of a behavior event over varying time windows.  Here's how it works:

Basically, each variable you create is the sum of the behavior events within a given time window: for example, the last 30 days, last 60 days, last 90 days, and so on.  In deciding how far back to go, it is important to understand the cycle time leading up to the event you are trying to predict.  How far back does the behavior go for a given individual?  Assuming there are no other events (or actions taken by the individual) prior to the one you are trying to predict, what is the population distribution of the behavior leading up to the predicted event?  How far back does it make sense to go given the behavior being predicted?  Do you have historical data where you could study this behavior leading up to a similar event, to get a better sense of the timing?  These and other questions can be answered through analysis so that a reasonable time window can be established.
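Here is a minimal sketch of the window-sum feature construction feeding a logistic regression.  The file, column names, snapshot date, and target definition are all assumptions for illustration:

```python
# A sketch only: "events.csv", the columns, the snapshot date, and the
# target definition are all hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression

events = pd.read_csv("events.csv", parse_dates=["event_date"])   # one row per behavior event
asof = pd.Timestamp("2012-05-01")                                # snapshot date

frame = events.groupby("member_id")["purchased"].max().rename("target").reset_index()
for days in (30, 60, 90):    # one sum-of-events variable per time window
    window = events[events["event_date"] >= asof - pd.Timedelta(days=days)]
    frame[f"events_{days}d"] = frame["member_id"].map(
        window.groupby("member_id").size()).fillna(0)

X = frame[["events_30d", "events_60d", "events_90d"]]
model = LogisticRegression().fit(X, frame["target"])
```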


One-off analyses
As with segmentation to explain, several one-off analyses, with attention to validating historically (with statistical significance testing) the behavior leading up to the predicted event, can be useful in discovering segmentation for modeling and analysis.  This is a little trickier, though, since behaviors can vary and there may be more instances of someone migrating from one segment to another.  The key is to find segments that give a good separation on the dependent variable and that can be shown to be stable in their behavior over a period of time.

Using your segmentation in a model
Depending on the purpose of a model, a segmentation can be used in various ways.
  • For a decision tree model, a segmentation can simply be included as one of your independent variables (even if it has missing values, which might represent a group you were unsure how to define).  Further, I have found it useful to experiment with many different cuts of a particular segmentation, so the tree has many opportunities to 'latch onto' a version it prefers.  Some programs (SPSS, for example) require you to declare the variable's type (i.e., ordinal or nominal); if it is not defined correctly, you may get results that do not make sense.
  • For a logistic or linear regression, it's a bit trickier.  You could certainly create a binary variable for each segment, but the segments may have differing proportions (or weights) within the population as a whole, and each group may behave differently with respect to the other independent variables in your regression.  To get around this, you can create interaction terms: each segment dummy multiplied by the other independent variables.  That way, the regression line differs depending on which segment is active in the formula (see the sketch below).  Also, if you determine that two or three populations act very differently, such that the set of useful explanatory variables differs for each, you can build a separate model for each segment so that their probabilities are more stable as a result.
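A minimal sketch of the dummy-and-interaction approach, reusing hypothetical columns from the earlier sketches (the separate-models option would simply fit this on each segment's rows instead):

```python
# A sketch only, reusing hypothetical columns from the earlier examples.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# One binary variable per segment (dropping one level avoids perfect collinearity)
dummies = pd.get_dummies(df["spend_segment"], prefix="seg", drop_first=True)

# Interaction terms: segment dummy times another independent variable,
# so the fitted line can differ by segment.
for col in list(dummies.columns):
    dummies[f"{col}_x_tenure"] = dummies[col] * df["tenure_months"]

X = pd.concat([df[["tenure_months"]], dummies], axis=1)
model = LogisticRegression().fit(X, df["of_interest"])
```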
An Effective Analyst Thought:
 Anticipating the needs of your client is like having a regression ongoing in your head while you work for them...once you've observed their behavior and understand what it is that increases their satisfaction and defines their needs, you can learn to predict what it is they need and produce accordingly.

Friday, May 18, 2012

BONUS topic! Try analyzing something different!

 By Kirk Harrington, SAEG Partner

As some of you know, I created and help run an online campaign which seeks to raise awareness of my favorite band, and to promote their induction to the Rock and Roll Hall of Fame in Cleveland, Ohio
(where I currently reside).  As part of this effort, we have sites dedicated to fans on Facebook, Blogger, Twitter, Youtube, and Myspace.  We've been around for about 2 years now and one thing we struggled with was the flow of new fans onto our Facebook page and their interest and 'engagement' in our posts.  Further, this is the page I see as our highest visibility point to drive fans to our other sites and especially to our petition (to see that petition, you can go to www.ddttrh.info and click 'Sign Petition').

To help mitigate these struggles, I decided to perform an analysis of this space, with particular attention to our posts.  The nice thing about being a statistical analyst is that you can take almost any topic you know reasonably well and run a useful analysis to drive successful results.  My pattern was simple...

Decide the purpose of the analysis-->Determine available data-->Determine the best approach to analyze the data-->Prepare and clean the data-->Perform analysis, interpretation, and provide results in a geared-to-my-audience format.

In regard to providing results in a geared-to-my-audience format, I had to keep my end audience in mind.  I work with an amazing group of people who are dedicated to DDTTRH, yet their backgrounds differ.  One is an IT person, another is an artist (our publicist), and another has skills in project management.  When I wrote up my analysis, I had to think of them and the best way to show them the kind of results that could lead to actionable items for improving the engagement of our fans.

Another exciting part of this analysis was determining the kind of data (and its limitations) that Facebook provides for analyzing posts.  One limitation is that I could not go back more than a year (I was hoping to go back to inception).  Also, there were certain assumptions I had to make about the available data and how far it could go toward explaining a person's behavior.  For example, I had to count a 'like' or a 'comment' as an engagement, though it could be argued that someone could read a posted article and never like or comment at all.

Also, I was hoping to do a linear regression, but because of the data limitation I settled on a regular trend analysis, presenting the results in a meaningful way that would give us a guide for shooting a little closer to the target.  This is something I share with the people I do analysis for: sometimes analysis (particularly for marketing) does not have to be perfect and extremely refined the first time out.  If you think of a dart board, sometimes you could be shooting off the board and scoring no points, and a fair initial analysis could get you at least on the board, closer to target.  Then, as you start to understand the data, its limitations, and so on, you can find additional data sources to 'light your way' and can plan to retain the necessary data for a more refined analysis next time.

As part of the analysis as well, I did an experiment where I posted something that I guessed (based on my analysis) would generate engagement, and I'm happy to say that my experiment was a success.
"If you think of a dart board, sometimes you could be shooting off the board and scoring no points and a fair initial analysis could get you at least on the board closer to target."

So, the end result, once I compiled everything, was to share the findings with my staff and to discuss any follow-up actions.  So far, the results have been good.  Not only were we able to discover what kinds of posts spark engagement among our fans, but, because I added a time dimension, we were also able to discover the peak times when fans are most engaged.  I have seen noticeable improvements in the quality of our posts and in additional fan traffic from using this analysis to our benefit.

If you would like to see this analysis, please email me at enduranalyst@yahoo.com and I can Google Docs it to you.

I for one really enjoyed doing it.  Analyzing something different opened up a perspective and adventure that I would not normally have found if I didn't try it.  So get out there analysts!  Analyze something you enjoy and take an adventure...you will find it healthy and meaningful, I promise!

Kirk

Tuesday, May 1, 2012

The Variance Inflation Factor (VIF)

 By Kirk Harrington, special thanks to Benjamin Nutter (my biostats friend and partner in SAEG)

To check and correct for multi-collinearity in a linear-type regression, a great tool to use is the variance inflation factor.  Here's how to do it.

First, regress each predictor in your model on all of the other predictors.  Each of these auxiliary regressions gives you an R-squared for the corresponding predictor.  Then calculate a VIF from each R-squared.  The formula is VIF = 1 / (1 - R-squared).

Then determine a suitable target VIF.  For example, a good target for your industry might be 5, or 3 if you need more refinement.  A biostats friend of mine said you can even go up to 10, depending on the level of precision required in your field.  An engineer or pharmaceutical scientist may want 3, whereas someone in credit risk may want 5, and someone in marketing may be willing to accept 10.

To correct for multicollinearity, start by taking out the variable with the highest VIF (among those above your target).  Then rerun the VIFs for the remaining predictors (you should see them decline).  Repeat this process until all of your variables are within the target VIF.
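Here is a minimal sketch of that iterative procedure using statsmodels; the predictor set and the target VIF of 5 are assumptions:

```python
# A sketch only: the predictor set and the target VIF of 5 are assumptions.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = df[["age", "tenure_months", "avg_monthly_spend"]].copy()
TARGET_VIF = 5

while True:
    Xc = sm.add_constant(X)    # intercept for each auxiliary regression
    vifs = {col: variance_inflation_factor(Xc.values, i)
            for i, col in enumerate(Xc.columns) if col != "const"}
    worst = max(vifs, key=vifs.get)
    if vifs[worst] <= TARGET_VIF:
        break
    print(f"dropping {worst} (VIF = {vifs[worst]:.1f})")
    X = X.drop(columns=worst)
```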

------------------

An Effective Analyst Thought

A good tool you can build for your modeling adventures is an inventory of the types of predictors you have found useful (or interesting) as you have built or validated models.  You can build a spreadsheet recording the name of each predictor, the type of model it was used in, any formulas associated with it, and the effect it was trying to explain in the model.  Once you build this inventory up enough, it is great for generating ideas as you build other models, for improving models after validating them, for talking through during job interviews, or for sharing with other analysts in their modeling efforts.