Segmentation can be advantageous for modeling in two ways: 1) explaining a population for descriptive purposes, or 2) identifying groups that behave a certain way (in the context of the dependent variable).
When I approach segmentation for modeling, I use this pattern:
- Determine whether you want to explain a population in general OR define a behavior
- Create useful segmentations based on statistical analyses of the population you are working with
- Decide on (or experiment with) different ways to implement the segmentation in your modeling
Segmentation to explain
Using a tree analysis
With no dependent variable in mind, this type of segmentation simply seeks to explain or describe a population you want to understand. If, for example, you have such a population in your data, you can create a binary variable to label it (1 = population of interest). Then you can run a decision tree analysis (such as CHAID) with that flag as the dependent variable. For this purpose, you could also expand the tree beyond one level so that you capture more variables.
With the results of the tree analysis, if a variable stands out as describing something unique about the population, you can take those splits and create a segmentation that explains a particular aspect of your population of interest.
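Here is a minimal sketch of that idea in Python. It assumes a pandas DataFrame `df` with an `interest_flag` column marking the population of interest and numeric (or dummy-encoded) describing variables; scikit-learn does not ship CHAID, so a CART tree (`DecisionTreeClassifier`) stands in for it here.

```python
# Sketch: describe a population of interest with a shallow decision tree.
# Assumes `df` is a pandas DataFrame where `interest_flag` is 1 for the
# population you want to explain and the other columns are candidate
# descriptors (already numeric or dummy-encoded).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

X = df.drop(columns=["interest_flag"])   # candidate describing variables
y = df["interest_flag"]                  # 1 = population of interest

# Expanding past one level (max_depth > 1) lets more variables enter the tree.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=100, random_state=0)
tree.fit(X, y)

# The printed rules are the raw material for a descriptive segmentation.
print(export_text(tree, feature_names=list(X.columns)))
```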
One-off analyses
The limitation of a tree analysis for explaining a population is that it focuses only on the most statistically significant variables that describe the 1s. If you simply want to make observations about a population without that constraint, several one-off analyses of the population (recording percentages for the variables of interest) can be useful. Then, if you have other samples of that population from earlier points in time (or can hold a sample out forward in time), you can validate the observations you made on your starting population and see whether the percentages hold. Out of this analysis, you will identify segmentation that is consistent over time and segmentation that varies over time. Of course, keep statistical significance testing in mind if you plan to share your results officially.
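A small sketch of that validation step, assuming two snapshots of the same population in DataFrames `df_now` and `df_past` and an illustrative binary column `has_trait`; the two-proportion z-test is one reasonable choice for the significance check.

```python
# Sketch: check whether an observed percentage holds across two snapshots
# of the same population. Frame and column names are illustrative.
from statsmodels.stats.proportion import proportions_ztest

count = [df_now["has_trait"].sum(), df_past["has_trait"].sum()]
nobs = [len(df_now), len(df_past)]

# Two-proportion z-test: a large p-value suggests the percentage is stable.
stat, pval = proportions_ztest(count, nobs)
print(f"now: {count[0]/nobs[0]:.1%}, past: {count[1]/nobs[1]:.1%}, p-value: {pval:.3f}")
```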
Using a population distribution
Another simple method to segment a population (particularly on continuous variables) is to run a percentile distribution (or even a scatter plot) to understand how the population is dispersed on the variable. Then, if you notice 'cuts' that make sense based on the dispersion, you can write segmentation code that reflects your observations.
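A quick sketch of that workflow, with an illustrative DataFrame `df`, a continuous column `spend`, and made-up cut points:

```python
# Sketch: inspect the percentile distribution of a continuous variable,
# then turn the cuts you observe into a segment.
import numpy as np
import pandas as pd

# Look at the spread first (deciles here; a scatter plot works as well).
print(df["spend"].quantile(np.arange(0.1, 1.0, 0.1)))

# Suppose the distribution suggests natural breaks at 100 and 1000:
df["spend_segment"] = pd.cut(
    df["spend"],
    bins=[-np.inf, 100, 1000, np.inf],
    labels=["low", "mid", "high"],
)
print(df["spend_segment"].value_counts())
```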
Using statistical clustering methods
One common method, which will be its own topic on SAEG eventually, is the k-means clustering algorithm. Without going into detail about the algorithm, the basic things to understand about it are listed below (a short code sketch follows the list)...
- You first need to tell the algorithm how many clusters you want and how the initial clusters should be defined. The algorithm is based on Euclidean distance from one individual in your dataset to another. One way to define the initial clusters is to choose the two individuals that are farthest apart in terms of that distance.
- The initial clusters' locations 'in space' become centroids, and other individuals are allocated to them based on their distance to each one. Once a new cluster is formed, a new mean centroid is calculated and individuals are checked again to determine whether that cluster is actually the one they are closest to.
- If they are not closest to the new mean centroid, they are re-allocated to the cluster they are closest to, and the distance checking continues until no new re-allocations occur.
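The sketch below uses scikit-learn's KMeans on an illustrative DataFrame `df` with made-up feature names; the `init="k-means++"` option spreads the starting centroids apart, playing the same role as hand-picking the two farthest individuals.

```python
# Sketch: k-means clustering. You supply the number of clusters up front;
# standardizing first matters because the algorithm works on Euclidean distance.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

features = ["age", "income", "tenure"]          # illustrative column names
X = StandardScaler().fit_transform(df[features])

km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0)
df["cluster"] = km.fit_predict(X)

# Final mean centroids (on the standardized scale) for each cluster.
print(pd.DataFrame(km.cluster_centers_, columns=features))
```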
Behavioral segmentation
Using a tree analysis
You can also use a tree analysis to create segmentations for behavioral-type variables. In that case, the dependent variable is the behavior variable (1/0).
Using a logistic regression
Using cross-sectional logistic regression, I've had success creating behavior segments from sums of an event over varying time periods. Here's an illustration:
Basically, each variable you create is the sum of the behavior events within a given time window, for example the last 30 days, last 60 days, last 90 days, and so on. In deciding how far back to go for your time periods, it is important to understand the cycle time leading up to the event you are trying to predict. How far back does the behavior go for a given individual? Excluding cases where other events (or actions taken by the individual) occur prior to the one you are trying to predict, what does the population distribution of the behavior look like leading up to the predicted event? How far back does it make sense to go given the behavior being predicted? Do you have historical data where you could study this behavior leading up to a similar event, so you can get a better sense of the timing involved? These and other questions can be answered through analysis so that a reasonable time window can be established.
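A sketch of building those window-sum features and fitting the regression, assuming an event-level DataFrame `events` (one row per behavior event, with datetime columns) and an individual-level DataFrame `outcomes` holding the as-of date and the dependent variable; all names and the 30/60/90-day windows are illustrative.

```python
# Sketch: "sum of behavior events in the last N days" features feeding a
# cross-sectional logistic regression.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Join each event to its individual's as-of date and compute how long ago it was.
merged = events.merge(outcomes[["customer_id", "as_of_date"]], on="customer_id")
merged["days_back"] = (merged["as_of_date"] - merged["event_date"]).dt.days

feats = outcomes.set_index("customer_id")
for window in (30, 60, 90):
    counts = (
        merged[merged["days_back"].between(0, window)]
        .groupby("customer_id")
        .size()
    )
    feats[f"events_last_{window}d"] = counts      # NaN = no events in window

feats = feats.fillna(0)
X = feats[[f"events_last_{w}d" for w in (30, 60, 90)]]
y = feats["responded"]                            # illustrative dependent variable

model = LogisticRegression(max_iter=1000).fit(X, y)
print(dict(zip(X.columns, model.coef_[0])))
```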
One-off analyses
As with segmentation to explain, several one-off analyses, with attention to historically validating (with statistical significance testing) the behavior leading up to the predicted event, can be useful in discovering segmentation for modeling and analysis. This is a little trickier, though, since behaviors can vary and there may be more instances of someone migrating from one segment to another. The key is to find segments that give good separation on the dependent variable and that can be shown to be stable in their behavior over a period of time.
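Two quick checks on a candidate behavioral segmentation are sketched below: separation of the dependent variable, and how much individuals migrate between segments across two snapshots. The frame and column names (`snap_t0`, `snap_t1`, `segment`, `responded`, `customer_id`) are illustrative.

```python
# Sketch: separation and stability checks for a behavioral segmentation.
import pandas as pd
from scipy.stats import chi2_contingency

# 1) Separation: does the event rate differ meaningfully across segments?
rates = snap_t0.groupby("segment")["responded"].agg(["mean", "count"])
chi2, pval, dof, expected = chi2_contingency(
    pd.crosstab(snap_t0["segment"], snap_t0["responded"])
)
print(rates, f"chi-square p-value: {pval:.3f}", sep="\n")

# 2) Stability: how often does the same individual change segments over time?
both = snap_t0.merge(snap_t1, on="customer_id", suffixes=("_t0", "_t1"))
migration = pd.crosstab(both["segment_t0"], both["segment_t1"], normalize="index")
print(migration)
```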
Using your segmentation in a model
Depending on the purpose of a model, a segmentation can be used in various ways.
- For a decision tree model, a segmentation can simply be included as one of your independent variables (even if it has missing values, which might be a group you were unsure how to define). Further, I have found it useful to experiment with many different cuts of a particular segmentation so that the tree has many opportunities to 'latch onto' the version it prefers. Some programs, though (SPSS, for example), require you to declare what type the variable is (i.e., ordinal or nominal). If it is not defined correctly, you may get results that do not make sense.
- For a logistic or linear regression it's a bit trickier. Certainly you could create a binary variable for each segment; however, the segments may have differing proportions (or weights) within the population as a whole. Plus, each group may behave differently with respect to the other independents in your regression. To get around this, you could create interaction terms for each segment times the other independent variables (see the sketch after this list). That way, the regression line differs depending on which segment is being evaluated within the formula. Also, if you determine that two or three populations act very differently and the set of explanatory independents would be very different for each, you could build separate models for each segment so that their probabilities are more stable as a result.
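Here is a minimal sketch of the interaction-term approach, assuming a DataFrame `df` with an illustrative `segment` column, two made-up independents `x1` and `x2`, and a binary outcome `responded`.

```python
# Sketch: segment dummies plus segment-by-variable interaction terms, so
# each segment gets its own slope on the other independents.
import pandas as pd
from sklearn.linear_model import LogisticRegression

dummies = pd.get_dummies(df["segment"], prefix="seg", drop_first=True, dtype=int)

X = pd.concat([df[["x1", "x2"]], dummies], axis=1)
for seg_col in dummies.columns:
    for var in ("x1", "x2"):
        # Interaction: the variable's effect is allowed to differ by segment.
        X[f"{seg_col}_x_{var}"] = dummies[seg_col] * df[var]

model = LogisticRegression(max_iter=1000).fit(X, df["responded"])
print(dict(zip(X.columns, model.coef_[0])))
```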
Anticipating the needs of your client is like having a regression running in your head while you work for them... once you've observed their behavior and understood what increases their satisfaction and defines their needs, you can learn to predict what they need and produce accordingly.

