Tuesday, July 31, 2012

The Model Exclusion

Model exclusions are populations that are excluded when doing your model build.  Model exclusions are typically made because they strip out observations that are not pertinent to the modeling exercise (i.e. there is no need to predict on them) or that create undue noise that inhibits the model from making accurate predictions.

Here is a general list of model exclusions to consider (a short code sketch applying several of them as filters follows the list):


·         Populations that may not exist today (for example--I once built a model where I had to remove new account originations from the modeling period because they did not exist in more current periods)

·         Records with an abnormal amount of missing data

·         Out of behavioral sync

o    This refers to populations that may be in your data but whose behavior is 'out of sync' with more current populations.  I ran across this once when I was validating a model built by a third-party vendor. They did not have information about the release date of a new product, and our data happened to include observations of an old product prior to the new product's release. The new product's release was a significant change that affected the entire product line, and the old products were converted to the new one.  It was also clear from the data that the populations before and after the product release date (with their differing attributes from the two time periods) behaved very differently. To make the model more effective on current populations, they should have excluded any populations prior to the new product's release.

·         Populations that are not specific to the goal or objective of the model. For example, if the model is meant for Population A and your data contains Populations A, B, and C, your model exclusions would be Populations B and C.

·         Populations with a status that does not make sense to include. For example, you would not want to include accounts that are closed, customers that are deceased, observations that are already in the late stages of what you are trying to predict (to name a few examples).

·         Duplicate records

·         Records with missing key identifiers.  I include this one because when you are creating a model, there are typically key identifiers that you use to 'match' on when appending data for dependent and independent variables.  If a key identifier used for matching is missing, the data you try to append may not exist and will generate blanks that will cause modeling problems.

·         Observations that for whatever reason show values that are likely data errors. For example, a negative balance that is likely negative for accounting reasons but should not exist in the normal behavior time window.

·         Here are a few for time series...

o    If the first observation does not make sense to include. I ran across this once where a model was done for balance change and the first balance observation was 0. This was likely due to an 'accounting' period where the account was open, but was not holding balances yet.

o    Observations that occur 'after the predicted event' that may be there for accounting reasons.

o    If a static date variable (or similar type of variable) should remain static but doesn't, and the reason behind the change is vague and not explainable. An example of this might be an account whose origination date changes throughout a series. This is likely a data problem; to remedy it, you would either do research to fix the problem or remove the account and all its observations entirely.
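To make this concrete, here is a minimal Python/pandas sketch of applying several of these exclusions as filters before a model build.  The file and column names (account_id, status, balance) are made up for illustration, not from any particular dataset:

import pandas as pd

# Hypothetical modeling dataset; column names are illustrative only.
df = pd.read_csv("model_build_data.csv")

# Drop exact duplicate records.
df = df.drop_duplicates()

# Drop records missing the key identifier used for matching/appending data.
df = df[df["account_id"].notna()]

# Drop records with an abnormal amount of missing data (here, more than half the fields).
df = df[df.isna().mean(axis=1) <= 0.5]

# Drop statuses that do not make sense to include (closed accounts, deceased customers, etc.).
df = df[~df["status"].isin(["CLOSED", "DECEASED"])]

# Drop likely data errors, e.g., negative balances that only exist for accounting reasons.
df = df[df["balance"] >= 0]

print(f"{len(df)} observations remain after exclusions")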

An Effective Analyst Thought

When doing my work for a company, I often keep in mind where the company is on its learning curve with data analysis and modeling, and I create opportunities to teach those I work with to help them along the curve.  This may entail teaching those without a statistics background how to do simple statistical techniques to enhance the reporting they do for senior management.  The farther along a company is on its learning curve with analysis, the more competitive it can be with its peers in the industry....and YOU can be a driving force to get them there.

Sunday, May 27, 2012

The advantages of segmentation for modeling

 By Kirk Harrington, with special thanks to Ben Nutter (SAEG partner)

Segmentation can be advantageous for modeling either 1) to explain a population for descriptive purposes or 2) to identify groups that behave a certain way (in the context of the dependent variable).

When I approach segmentation for modeling, I use this pattern:

**Determine whether you want to explain a population in general OR define a behavior
**Create useful segmentations based on statistical analyses of the population you are working with
**Decide on (or experiment with) different ways to implement the segmentation into your modeling


Segmentation to explain

Using a tree analysis
With no dependent variable in mind, this type of segmentation just seeks to explain or describe a population that you want to understand.  If, for example, you have a population in your data that you want to understand, you can create a binary variable to label it (1 = population of interest).  Then you could perform a decision tree analysis (like CHAID), making that population of interest your dependent variable.  For this purpose, you could also expand the tree beyond one level so that you are capturing more variables.

With the results of the tree analysis, if there is a variable that best describes something unique about the population, you can take those results and create a segmentation that explains a particular aspect of your population of interest.
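As a rough sketch of this idea in Python: CHAID itself isn't in scikit-learn, so the example below uses an ordinary CART tree as a stand-in, with made-up column names, just to show the mechanics of flagging the population of interest and reading off the rules:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("population_data.csv")

# Binary flag for the population of interest (1 = population of interest).
df["target_pop"] = (df["segment_label"] == "population_of_interest").astype(int)

# Illustrative predictors; assumed to be numeric or already encoded.
features = ["age", "tenure_months", "avg_balance", "region_code"]

# A shallow tree (a few levels deep) keeps the description interpretable
# while still surfacing more than one explanatory variable.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=200)
tree.fit(df[features], df["target_pop"])

# The printed rules become candidate segment definitions.
print(export_text(tree, feature_names=features))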

One-off analyses
The limitation of doing a tree analysis to explain a population is that it just focuses on the most statistically significant variables that help describe the 1s.  If you simply want to make observations about a population without regard to this, doing several one-off analyses of the population (and recording percentages of variable stats) can be useful.  Then, if you have other samples of that population back in time (or you can hold them forward in time), you can validate the observations you made with your beginning population to see if the percentages hold.  Out of this analysis, you will identify segmentation that is consistent over time and segmentation that varies over time.  Of course, you must keep statistical significance testing in mind if you plan to share your results officially.

Using a population distribution
Another simple method to segment a population (particularly with continuous variables) is to run out a population percentile distribution (or even a scatter plot) to understand how the population is dispersed on the variable.  Then, if you notice 'cuts' that make sense based on the dispersion, you can create segmentation code that factors in your observations.
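A minimal Python/pandas sketch of this, with a made-up variable name (avg_balance) and cut points that you would choose from your own distribution:

import pandas as pd

df = pd.read_csv("population_data.csv")

# Run out the percentile distribution of a continuous variable to see how it is dispersed.
print(df["avg_balance"].describe(percentiles=[0.1, 0.25, 0.5, 0.75, 0.9, 0.99]))

# If the distribution suggests natural 'cuts', encode them as a segmentation.
cuts = [-float("inf"), 1000, 10000, 50000, float("inf")]   # cut points chosen from the distribution
labels = ["low", "medium", "high", "very_high"]
df["balance_segment"] = pd.cut(df["avg_balance"], bins=cuts, labels=labels)

print(df["balance_segment"].value_counts())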

Using statistical clustering methods
One common method, which will eventually be its own topic on SAEG, is the K-means clustering algorithm.  Without going into detail about the algorithm, the basic things to understand about it are...
  • You first need to tell the algorithm how many clusters you want and how you want the initial clusters to be defined.  The algorithm is based on Euclidean distance from one individual in your dataset to another.  One example of defining the initial clusters is to choose the two individuals that are farthest away from each other in terms of that distance.
  • Each initial cluster's location 'in space' becomes a centroid, and other individuals are allocated to clusters based on their distance to those centroids.  Once a new clustering is formed, a new mean centroid is computed and individuals are checked again to determine whether the new cluster is what they are actually closest to.
  • If they are not closest to their current cluster's mean centroid, they are re-allocated to the cluster they are closest to, and the distance checking continues until no new re-allocations occur.
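If you want to try this outside of a stats package, here is a minimal Python sketch using scikit-learn's KMeans (its k-means++ option handles the starting centroids); the feature names are made up for illustration:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("population_data.csv")
features = ["avg_balance", "tenure_months", "txn_count"]  # illustrative continuous variables

# Standardize first: K-means is distance-based, so variables on large scales
# (and outliers) will otherwise dominate the clustering.
X = StandardScaler().fit_transform(df[features])

# You choose the number of clusters up front; the algorithm then alternates between
# assigning points to the nearest centroid and recomputing the mean centroids
# until no more reassignments occur (or a tolerance is reached).
kmeans = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=42)
df["cluster"] = kmeans.fit_predict(X)

# Profile the resulting segments.
print(df.groupby("cluster")[features].mean())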
There are obvious limitations to K-means clustering (particularly if there are outliers in your population), but it can be a very useful tool to segment based on a rigorous mathematical procedure.  If you would like to read more about this method before we do a special topic on it, check out these websites:


Behavioral segmentation
Using a tree analysis
You can also use a tree analysis to create segmentations for behavioral-type variables.  The dependent variable is the behavior variable (1,0).

Using a logistic regression
Using cross-sectional logistic regression, I've had success creating behavior segments with varying sums of an event over varying time periods.  Here's an illustration:

Basically, each variable you create is the sum of the behavior events within a given time window.  For example, this could be the last 30 days, last 60 days, last 90 days, etc.  In deciding how far back to go for your time periods, it is important to understand the cycle time leading up to the event you are trying to predict.  How far back does the behavior event go for a given individual?  As long as there are no other events (or actions taken by the individual) prior to the one you are trying to predict for the study, what is the population distribution of the event leading up to the predicted event?  How far does it make sense to go back based on the behavior being predicted?  Do you have data in the past where you could study this behavior leading up to a similar event, so you can get a better sense of timing and of the behavior events leading up to the event?  These and other questions can be answered through analysis so that a reasonable time window can be established.
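As a rough illustration of building those time-window sums in Python/pandas (the file and column names are hypothetical), you could do something like this before fitting the logistic regression:

import pandas as pd
from sklearn.linear_model import LogisticRegression

# events: one row per behavior event (account_id, event_date)
# obs: one row per account with an observation date and the dependent variable (0/1)
events = pd.read_csv("behavior_events.csv", parse_dates=["event_date"])
obs = pd.read_csv("observations.csv", parse_dates=["obs_date"])

def window_count(row, events, days):
    """Count behavior events in the `days` window ending at the observation date."""
    acct_events = events[events["account_id"] == row["account_id"]]
    start = row["obs_date"] - pd.Timedelta(days=days)
    mask = (acct_events["event_date"] > start) & (acct_events["event_date"] <= row["obs_date"])
    return mask.sum()

# One summed-event variable per time window.
for days in (30, 60, 90):
    obs[f"events_last_{days}d"] = obs.apply(window_count, axis=1, events=events, days=days)

X = obs[["events_last_30d", "events_last_60d", "events_last_90d"]]
y = obs["target"]
model = LogisticRegression().fit(X, y)
print(dict(zip(X.columns, model.coef_[0])))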


One-off analyses
As with segmentation to explain, several one-off analyses, with attention to historically validating (with statistical significance testing) the behavior leading up to the predicted event, can be useful in discovering useful segmentation for modeling and analysis.  This is a little trickier, though, since behaviors can vary and there may be more instances of someone migrating from one segment to another.  The key is to find segments that represent a good separation of the dependent variable and that can be shown to be stable in their behavior over a period of time.

Using your segmentation in a model
Depending on the purpose of a model, a segmentation can be used in various ways.
  • For a decision tree model, a segmentation can just be included as one of your independent variables (even if it has missing values, which might be a group you were unsure of in defining).  Further, I have found it useful to experiment with many different cuts of a particular segmentation so that the tree has many opportunities to 'latch onto' a segmentation it prefers.  Some programs, though (like SPSS, for example), require you to decide what type the variable is (i.e. ordinal, nominal).  If it is not defined correctly, you may get results that do not make sense.
  • For a logistic or linear regression it's a bit trickier.  Certainly you could create a binary variable for each segment; however, the segments may have differing population proportions (or weights) within the population as a whole.  Plus, each group may behave differently with respect to the other independents in your regression.  To get around this, you could create interaction terms for each segment * (times) the other independent variables.  That way, the regression line would be different depending on which segment is being accessed within the formula.  Also, if you are able to determine that two or three populations act very differently and the set of explanatory independents would therefore be very different for each, you could build separate models for each segment so that their probabilities would be more stable as a result (see the sketch below).
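Here is a rough Python sketch of both options from that last bullet: segment dummies with segment-times-predictor interaction terms in a single logistic regression, and separate models per segment.  The column names (segment, target, utilization, tenure_months) are made up for illustration:

import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("model_data.csv")
predictors = ["utilization", "tenure_months"]          # illustrative independents

# Option 1: one model, with segment dummies and segment * predictor interactions,
# so the fitted relationship can differ by segment.
dummies = pd.get_dummies(df["segment"], prefix="seg", drop_first=True)
X = pd.concat([df[predictors], dummies], axis=1)
for seg_col in dummies.columns:
    for p in predictors:
        X[f"{seg_col}_x_{p}"] = dummies[seg_col] * df[p]
one_model = LogisticRegression(max_iter=1000).fit(X, df["target"])

# Option 2: if the segments behave very differently, fit a separate model per segment.
per_segment = {
    seg: LogisticRegression(max_iter=1000).fit(grp[predictors], grp["target"])
    for seg, grp in df.groupby("segment")
}

print(one_model.coef_)
print({seg: m.coef_ for seg, m in per_segment.items()})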
An Effective Analyst Thought:
 Anticipating the needs of your client is like having a regression ongoing in your head while you work for them...once you've observed their behavior and understand what it is that increases their satisfaction and defines their needs, you can learn to predict what it is they need and produce accordingly.

Friday, May 18, 2012

BONUS topic! Try analyzing something different!

 By Kirk Harrington, SAEG Partner

As some of you know, I created and help run an online campaign which seeks to raise awareness of my favorite band, and to promote their induction to the Rock and Roll Hall of Fame in Cleveland, Ohio
(where I currently reside).  As part of this effort, we have sites dedicated to fans on Facebook, Blogger, Twitter, Youtube, and Myspace.  We've been around for about 2 years now and one thing we struggled with was the flow of new fans onto our Facebook page and their interest and 'engagement' in our posts.  Further, this is the page I see as our highest visibility point to drive fans to our other sites and especially to our petition (to see that petition, you can go to www.ddttrh.info and click 'Sign Petition').

To help address these struggles, I decided to perform analysis on this space, with particular attention to our posts.  The nice thing about being a statistical analyst is that you can take almost any topic you can think of and, if you have a decent knowledge of it, run useful analysis to drive successful results.  My pattern was simple...

Decide the purpose of the analysis-->Determine available data-->Determine the best approach to analyze the data-->Prepare and clean the data-->Perform analysis, interpretation, and provide results in a geared-to-my-audience format.

In regard to providing results in a geared-to-my-audience format, I had to have my end audience in mind.  I work with an amazing group of people who are dedicated to DDTTRH, yet their backgrounds differ.  One is an IT person, another an artist (our publicist), and another has skills in project management.  When I wrote up my analysis, I had to think of them and the best way to show them the kind of results that could lead to actionable items to improve the engagement of our fans.

Another exciting part of this analysis was determining the kind of data (and any limitations) that Facebook makes available for analyzing posts.  One limitation is that I could not go back beyond a year, for example (I was hoping to go back to inception).  Also, there were certain assumptions I had to make based on the kind of data that was available and how far it would go to explain a person's behavior.  For example, I had to consider a 'like' or 'comment' as an engagement, though it could be argued that someone could read a posted article and not like or comment at all.

Also, I was hoping to do a linear regression, but because of the data limitation I decided to do just a regular trends analysis and present results in a meaningful way that would give us a guide for shooting a little closer to the target.  And this is something I share with people I do analysis for.  Sometimes analysis (particularly for marketing) does not have to be perfect and extremely refined the first time out.  If you think of a dart board, sometimes you could be shooting off the board and scoring no points, and a fair initial analysis could get you at least on the board, closer to the target.  Then, as you start to understand the data, its limitations, etc., you can find additional data sources to 'light your way' and can plan to hold the necessary data for more refined analysis next time.

As part of the analysis as well, I did an experiment where I posted something that I guessed (based on my analysis) would generate engagement, and I'm happy to say that my experiment was a success.
"If you think of a dart board, sometimes you could be shooting off the board and scoring no points and a fair initial analysis could get you at least on the board closer to target."

So, once I compiled all my results, the end step was to share them with my staff and discuss any follow-up actions.  So far, the results have been good.  Not only were we able to discover what kind of posts would spark engagement from our fans, but also, because I added a time dimension, we were able to discover peak times when fans are most engaged.  I have seen noticeable improvements in the quality of our posts and in additional fan traffic because of using this analysis to our benefit.

If you would like to see this analysis, please email me at enduranalyst@yahoo.com and I can Google Docs it to you.

I for one really enjoyed doing it.  Analyzing something different opened up a perspective and adventure that I would not normally have found if I didn't try it.  So get out there analysts!  Analyze something you enjoy and take an adventure...you will find it healthy and meaningful, I promise!

Kirk

Tuesday, May 1, 2012

The Variance Inflation Factor (VIF)

 By Kirk Harrington, special thanks to Benjamin Nutter (my biostats friend and partner in SAEG)

To check and correct for multi-collinearity in a linear-type regression, a great tool to use is the variance inflation factor.  Here's how to do it.

First, regress each predictor against every other predictor in your model.  After you do this, you will get an R-squared for each.  Then calculate a VIF for each R-squared (which corresponds to a given predictor).  The formula is VIF = 1 / (1 - R-squared).

Then determine a suitable target VIF.  For example, a good target for your industry might be 5, or 3 (if you need more refinement).  A biostats friend of mine said you can even go up to 10, depending on the level of precision required in your field.  An engineer or pharmaceutical scientist may want 3, whereas someone in credit risk may want 5.  Someone in marketing may be willing to accept 10.

To correct for multicollinearity, start by taking out the variable with the highest VIF (that is more than your target VIF).  Then, rerun the VIF for each predictor again (you should see the VIFs declining).  Repeat this process until your variables are all within your target VIF.
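For those working in Python rather than a stats package, here is a minimal sketch of the procedure described above.  The helper names (vif_table, drop_high_vif) and the predictors file are my own illustrations:

import pandas as pd
from sklearn.linear_model import LinearRegression

def vif_table(X: pd.DataFrame) -> pd.Series:
    """Regress each predictor on all the others and convert its R-squared to a VIF."""
    vifs = {}
    for col in X.columns:
        others = X.drop(columns=[col])
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        vifs[col] = 1.0 / (1.0 - r2) if r2 < 1 else float("inf")
    return pd.Series(vifs).sort_values(ascending=False)

def drop_high_vif(X: pd.DataFrame, target_vif: float = 5.0) -> pd.DataFrame:
    """Iteratively remove the predictor with the highest VIF until all are within target."""
    X = X.copy()
    while True:
        vifs = vif_table(X)
        if vifs.iloc[0] <= target_vif or len(X.columns) == 1:
            return X
        X = X.drop(columns=[vifs.index[0]])

# Example usage with a hypothetical predictor matrix:
# X_reduced = drop_high_vif(pd.read_csv("predictors.csv"), target_vif=5.0)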

------------------

An Effective Analyst Thought

A good tool that you can build for use in your modeling adventures is an inventory of the types of predictors that you have found useful (or interesting) as you have built or validated models.  You can build a spreadsheet that lists the name of the predictor, what type of model it was used in, any formulas associated with it, and what type of effect it was trying to explain in the model.  Once you build this inventory up enough, it is great for getting ideas as you build other models, providing ideas to improve models after validating them, talking about during job interviews, or sharing with other analysts and their modeling efforts.

Thursday, April 19, 2012

The more data from different sources you get, the better you can understand the picture

 By Kirk Harrington

Let's say I give you part of a picture...

Analyzing this piece alone you could say that it has dark colors, varying between black and brown.

If I gave you more....

You would see there's more to the picture.  There is some sort of extension coming out of the darkness, forming a tubular pattern with a wrinkled texture...is this wood...is this fabric...you don't really know because you can't see beyond it.

But if I gave you more...

You would see that the tubular extension is actually attached to hands...so you know now that it's fabric and that the picture must be of a person.  But are these a woman's hands or a man's?  Where is this person?  Without more information, you wouldn't know.  You can also tell now, more than before, that this is a painting.

Let me give you more...

This picture certainly gives you more.  It looks like this could be more of a woman than a man, especially since the hair comes down below the neckline.  And it looks like this person is outside (given what look to be mountains and a path in the background)...or could they be on a balcony?  Is this person happy or sad?  You can't tell without seeing more.

Now, if I give you the full picture...
 ...you would recognize it as the famous painting, the Mona Lisa...posing happily for Leonardo da Vinci.  Or would you really know she was posing for him, since you don't see him?  You'd have to do more research to find out.  You've only heard that, and you ought to verify it with outside data.

The point of this exercise is to show that the more data you have about something, the better you know what it really looks like and the better you can predict what it is.  The more data sources you can draw from, the more powerful a view you can get of what you're trying to predict.

For example...

With credit risk data, you could look at credit scores alone...but add to that customer behavior, demographics, geography, and the economic environment, and you get a much clearer picture of what might be causing a default or delinquency.

With marketing data, you could learn something about a customer's age and income, but what about their preferences, where they typically shop, their shopping behavior, seasonal patterns in their shopping behavior, and their family situation (i.e. do they have children, are they married)?  If you're trying to predict whether they will purchase your product, the more information you have about them from various sources, the better you can predict the kinds of factors that affect that purchase.

~~~~~~~~~~~
Effective Analyst Thought....

Surprise your client or customer with a sense of humor.  A little ice-breaking can go a long way towards them accepting you as a valued co-worker AND it can go a long way towards them accepting your analysis as valuable.

Wednesday, April 11, 2012

Dependent variable normalization 'in time'


 By Kirk Harrington

The dependent variable in a model is extremely important; however, its accuracy can sometimes be overlooked (particularly where timing is involved and the variable is meant to explain behavior).  What is predicted is a key part of the model, and it drives the results and ultimately the predictions that come out.  An incorrectly constructed dependent variable can also affect the accuracy of the independent variables that are estimated against it.  Consider a dependent variable for a default model (logistic time series):



If your objective is to explain account behavior leading up to the end of the behavior cycle, the end-of-behavior-cycle point is the best timing for the dependent variable (in this example it could be the last episode of an account reaching 90 days past due, before it starts the default cycle of >90 days).  This point varies by account; however, the independent variables leading up to this end-of-behavior-cycle point will be most reflective of the account's behavior at that 'time'.

If you take the dependent variable farther out, say to when an account falls off the books, that timing is neither in the account's control nor in the organization's control (the organization doing the charge-off), and any appended independent variables meant to predict the dependent variable will have trouble coalescing around positive and negative correlations with it.

After the point of 'no control' (in this case when an account falls off the books during the default cycle), many things can happen: there can be a 'blitz' of charge-offs at the end of a given quarter, month, or year; it can take a very long time for a charge-off to finally happen because of legal matters and processing; or the organization may not have a set policy for charge-offs, so the timing of when it happens can vary widely.  'Normalizing' the dependent variable around the end-of-behavior-cycle point, rather than around when the account 'falls off the books', will improve your model's ability to capture account behavior before it reaches the point of 'no control'.
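As a rough illustration, here is a small Python/pandas sketch of anchoring the dependent variable to the end-of-behavior-cycle point rather than the charge-off date.  The column names (account_id, obs_month, days_past_due) are hypothetical:

import pandas as pd

# One row per account per month; columns are illustrative.
ts = pd.read_csv("account_months.csv", parse_dates=["obs_month"])
ts = ts.sort_values(["account_id", "obs_month"])

def end_of_behavior_cycle(group: pd.DataFrame) -> pd.Series:
    """Return the last observation before the account first exceeds 90 days past due."""
    crossed = group[group["days_past_due"] > 90]
    if crossed.empty:
        return pd.Series({"event": 0, "event_month": pd.NaT})
    first_cross = crossed["obs_month"].min()
    prior = group[group["obs_month"] < first_cross]
    event_month = prior["obs_month"].max() if not prior.empty else first_cross
    return pd.Series({"event": 1, "event_month": event_month})

dep = ts.groupby("account_id").apply(end_of_behavior_cycle)

# Independent variables would then be appended as of event_month (or the equivalent
# observation point for non-event accounts), not as of the charge-off date.
print(dep.head())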

~~~
Effective Analyst Thought:
If you are in a position to consult with a client as an analyst, don't just ask them how they want something measured and leave it unquestioned (especially if the person you are dealing with does not have a modeling background).  Ask more questions, learn their data a bit more, and work together with the client to determine the best measurement that is statistically sound.  It is better to hash this out at the beginning rather than later, after the model is built and already being used.

Monday, March 12, 2012

Segmentation code cross test technique

 By Kirk Harrington

With model validation, or in the creation of segmentation code, there is always the chance that code used to identify unique groups (or segments) is actually not mutually exclusive.  Here is a technique that I created to determine whether this is a problem in your code.

Copy and paste the code used to identify mutually exclusive groups to another section of your syntax (and when I say syntax, I mean you are working in a program like SAS or SPSS).

Prior to the copied code, create variables g1 to gn+1 (however many groups you have, plus one; the plus one covers a 'not assigned by your code' category).  Then, instead of the group='group_name' assignment in your pasted code from the previous step, set each group's variable g1 to gn (however many groups) equal to 1.  Then create another line of code that covers your 'not assigned' group.  Something like:  if g1=0 and g2=0 and g3=0 and ... gn=0 then gn+1=1.

Run all the code.  You will have created the g1 to gn+1 variables for each line in your dataset.  Now, change each g1 to gn+1 variable into a length-1 text (or string) variable.  Create a concatenated field from these variables and run out a frequency table.  You will see results like this...

Example, 7 groups, 6 assigned, 1 not assigned

0100000   n
0010000   n

0110000   n
0001000  n
0000100  n
0000010  n
0000001  n

 Can you guess which code set (based on placeholder position) has problems with non-exclusiveness?  It would be the code in placeholders 2 and 3.  Now that you know this, you can go back and troubleshoot those specific groups to determine where coding issues exist.  After you fix your code, you can re-run this test to see if your fixes worked.
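The write-up above assumes SAS or SPSS syntax; here is the same cross-test expressed as a rough Python/pandas sketch, with made-up segment rules standing in for your real segmentation code:

import pandas as pd

df = pd.read_csv("scored_accounts.csv")

# Re-express each segment's assignment rule as a 0/1 indicator (rules are illustrative).
df["g1"] = (df["score"] < 600).astype(int)
df["g2"] = ((df["score"] >= 600) & (df["balance"] > 0)).astype(int)
df["g3"] = ((df["balance"] > 0) & (df["tenure_months"] < 12)).astype(int)

flags = ["g1", "g2", "g3"]
# The 'not assigned' group: no rule fired for the record.
df["g4"] = (df[flags].sum(axis=1) == 0).astype(int)
flags.append("g4")

# Concatenate the indicators into one string and run a frequency table.
df["pattern"] = df[flags].astype(str).apply(lambda row: "".join(row), axis=1)
print(df["pattern"].value_counts())

# Any pattern with more than one '1' (e.g., '0110') means those segments overlap,
# so the segmentation code is not mutually exclusive.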

------------------
An Effective Analyst Thought

There's nothing worse than a manager and an analyst at odds.  The analyst is usually in the inferior position.  The temptation for the analyst may be to think that they are better, smarter, and that they could run things better than their manager.  It's really not that simple.  If you are at odds with your manager, find ways to improve your lines of communication.  Otherwise, you will have a one-way ticket to looking bad, and even if you could do a better job than your manager (which is not wrong to think), you may never get the chance to show it because of your behavior.

Saturday, February 18, 2012

An example of a Nested SQL statement


Seven of Nine, from the Star Trek Series, Voyager

 Article by Kirk Harrington

In follow-up to the recent SQL discussion I posted, here's a Nested SQL example that actually works but that's sure to confound you!  Can you guess what it's doing?

select distinct j.index_nbr_b, j.index_b_stardt, j.index_b_amt, j.index_nbr_h as cube_nbr_h, j.index_h_stardt as cube_h_stardt, j.index_h_code as cube_trcd, j.index_h_amt as ic_tran_amt, j.hflag_stardt, j.Attrib, j.index_nbr_h as seven_of_nine_nbr, j.seven_of_nine_stardt, SUM(latinum_bars) as sum_latinum_bars, SUM(credit_chips) as sum_credit_chips, SUM(borg_dollars) as sum_borg_dollars from
 (
select g.*,'latinum_bars'=case when seven_trcd='LB' and seven_attrib=1 then rif_amt else 0 end,
'credit_chips'=case when seven_trcd='CC' and seven_attrib=1 then rif_amt else 0 end,
'borg_dollars'=case when seven_trcd='BD' and seven_attrib=1 then rif_amt else 0 end

from
(
(select *, 'seven_attrib'=case when pmt_dt between index_h_stardt and seven_of_nine_stardt then 1 else 0 end from
(
((select v.*, index_b_stardt+7 as seven_of_nine_stardt from
(select z.*, 'Attrib'=case when z.index_h_stardt between z.hflag_stardt and z.index_b_stardt and index_b_amt=index_h_amt then 1 else 0 end
from

(select *, index_b_stardt - 30 as hflag_stardt from

((select index_nbr_b,index_b_stardt,index_b_amt
from watb.borg_queen where index_nbr_b between 7777777 and 9999999 and index_b_stardt between 7411.4 and 9521.6) a

FULL OUTER JOIN

(select index_nbr_h, index_h_stardt, index_h_code, index_h_amt from watb.seven_of_nine where index_nbr_h between 7777777 and 9999999 and substring(index_h_code,1,3)='IC@' and index_h_stardt between 6411.4 and 9521.6) b
on a.index_nbr_b=b.index_nbr_h))z) v where v.Attrib=1)) o

LEFT OUTER JOIN

(select index_nbr_h as index_nbr_h, SUBSTRING(index_h_code,1,2) as seven_trcd, index_h_amt as rif_amt, index_h_stardt as pmt_dt from watb.seven_of_nine where substring(index_h_code,1,2) in ('LB','CC', 'BD') and
index_h_stardt between 6411.4 and 9521.6 and index_h_amt>0 and index_nbr_h between 7777777 and 9999999) m
on o.index_nbr_b=m.index_nbr_h)

)) g
)) j

GROUP BY index_nbr_b, index_b_stardt, index_b_amt, index_nbr_h, index_h_stardt, index_h_code, index_h_amt, hflag_stardt, Attrib, seven_of_nine_stardt

ORDER BY index_nbr_b, index_b_stardt

-----------------------------------------------

An Effective Analyst Tip
When a business line asks you for A,B, and C, also provide them D,E, and F.  This means that you provide them more than what they're asking for so they see you are engaged in wanting to help them succeed.

Saturday, February 4, 2012

A Valuable Lesson I Learned doing a Credit Default Model

 By Kirk Harrington

As some of you know, I have worked on Marketing and Credit Risk Models.  This past week, I was finishing up work on a credit default model (to come up with predictions for Allowance for Loan Loss provision) and learned a valuable lesson.

The lesson was this...I had prepared a report with average probabilities by risk group, less than a given percent and more than a given percent (basically low and high risk).  I also split these into delinquency groups (one group was <30 days delinquent and the other 30 or 60 days; I was modeling to find out who reached >=90 days in a 12-month time window).

One thing that was odd about the average probabilities was that the 30 or 60 delinquency group's probability seemed very low.  In fact, when I ran out the actual losses by account and dollar amount, something seemed off.  I was predicting a very low number of accounts for a group that had much higher actuals.

This led me to create a decision tree (I was using CHAID for this exercise) on the two delinquency groups (one tree for each group).  What I found was that 5 of the 8 predictors were being used in the <30 delinquency group and 0 of 8 predictors in the 30 and 60 group.  This was a clue that these two populations were VERY different.  Further, it was obvious that the average probabilities across nodes were sharply different (the probabilities of going bad were MUCH higher for the 30-60 delinquency group).

What was happening (and I verified this with other numbers) is that these special delinquency groups' probabilities were being driven by the 'majority', node by node.  In most nodes, the population proportions were about 99 to 1 (99% <30 and 1% in 30 to 60).

This was clearly an example of a population 'within' the majority that was fouling up the probabilities in my end report.  To rectify this, I simply created a 'forced-split' tree, with the delinquency split at the top.  The probabilities came out much more stable, and all my model diagnostic statistics improved as well (i.e. KS, Gini).
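Scikit-learn's trees don't have a literal 'forced split' option, so as a rough sketch of the same remedy in Python, you can segment on delinquency first and grow a separate tree within each segment (the predictor and column names here are made up):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("default_model_data.csv")
predictors = ["credit_score", "utilization", "months_on_book"]  # illustrative predictors

# Emulate a 'forced split' at the top of the tree by segmenting on delinquency first
# and growing a separate tree within each segment.
df["delq_segment"] = (df["days_past_due"] >= 30).astype(int)   # 0 = <30, 1 = 30-60

segment_trees = {}
for seg, grp in df.groupby("delq_segment"):
    tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=100)
    segment_trees[seg] = tree.fit(grp[predictors], grp["target_90dpd"])

# Score each segment with its own tree so the minority delinquency group's
# probabilities are no longer swamped by the majority (assumes both outcome
# classes are present within each segment).
df["p_default"] = 0.0
for seg, tree in segment_trees.items():
    mask = df["delq_segment"] == seg
    df.loc[mask, "p_default"] = tree.predict_proba(df.loc[mask, predictors])[:, 1]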




Saturday, January 14, 2012

The Anatomy of a SQL Query

 By Kirk Harrington, Partner, Statistical Analyst Effectiveness Group (SAEG)

A useful programming language to understand as an analyst is SQL.  Many interfaces to data ask you to write SQL to query and extract data.  There are different types of SQL which are used on different platforms (a topic not discussed in this blog); however, the general format is similar across all of them.

Below are very simple diagrams that represent the 'anatomy' of a SQL query (three basic structures are shown).  If you understand the basic structure, you know how to build a query and how to break it down and troubleshoot it for problems.

In addition to this blog, SAEG can make available to you Basic, Intermediate, and Intermediate II SQL lessons that anyone can use to teach themselves.  Just email SAEG at enduranalyst@yahoo.com and we can send those via Google Docs.  If you would like a human-taught lesson, SAEG can arrange that for a small fee.  In my mind, you don't need a special class or expensive lessons from a vendor to learn SQL.  It's pretty easy to learn and can give you an edge as an analyst if you know it.

The Anatomy of a SQL Query, 3 typical structures

1) Just a simple select, one table

SELECT
<field(s)> or<*>

FROM
<table(s)>


WHERE
<filter statement if desired>

GROUP BY <used in only certain cases, i.e. when doing a sum or just grouping>
ORDER BY <used when you want to 'sort' by a given field>
 
2)  Structure using an 'explicit' join, two tables

SELECT
<field(s)> or <*>

FROM
<table(s)>

<name your joins, i.e. LEFT OUTER JOIN, INNER JOIN, FULL JOIN>


ON <field on first table = field on join to table>

WHERE
<filter statement if desired>

GROUP BY <used in only certain cases, i.e. when doing a sum or just grouping>
ORDER BY <used when you want to 'sort' by a given field>

3)  Structure using an 'implicit' join (no naming of joins required), two tables

SELECT
<field(s)> or <*>

FROM
<table(s)>

WHERE
<field on first table = field on join to table>

AND
<filter statement if desired>

GROUP BY <used in only certain cases, i.e. when doing a sum or just grouping>

ORDER BY <used when you want to 'sort' by a given field>

~~~~

Other notes:

The * tells the query to bring back every field that is on the given table you select

When joining two or more tables, you use 'aliases' that identify a given field and the table it comes from.  For more on that, please ask for our SQL lessons to learn more.  Many thanks!

Examples:

Explicit join

Select
id_table.id_code,
id_table.user_name,
volumes_table.volume_produced,
volumes_table.geo_location

from
id_table

left outer join

volumes_table

on id_table.id_code=volumes_table.id_code
where geo_location='Cleveland'

Implicit join

Select
id_table.id_code,
id_table.user_name,
volumes_table.volume_produced,
volumes_table.geo_location

from

id_table, volumes_table
where id_table.id_code=volumes_table.id_code and volumes_table.geo_location='Cleveland'

Notes:
In the above examples, it is assumed that there is an id_code on the volumes table.  Also, notice how the table names are used as a prefix to the fields called for in the joins.  I can use aliases instead of the table names as well (if I want to use less code), for example:

Explicit with aliases

Select
a.id_code,
a.user_name,
b.volume_produced,
b.geo_location

from
id_table a

left outer join

volumes_table b

on a.id_code=b.id_code
where geo_location='Cleveland'

Implicit join with aliases

Select
a.id_code,
a.user_name,
b.volume_produced,
b.geo_location

from

id_table a, volumes_table b
where a.id_code=b.id_code and b.geo_location='Cleveland'

An Effective Analyst thought...

When you're talking to someone about your analysis, how you ask whether they are understanding something can make you come off with either an 'I-know-more-than-you' or an 'I-am-equal-to-you' attitude.  Here's a SAEG suggestion to help with this....

Instead of "Do you understand?", try "Am I making sense?"