Saturday, February 18, 2012

An Example of a Nested SQL Statement


Seven of Nine, from the Star Trek series Voyager

 Article by Kirk Harrington

In follow-up to the recent SQL discussion I posted, here's a nested SQL example that actually works but that's sure to confound you!  Can you guess what it's doing?

select distinct
    j.index_nbr_b,
    j.index_b_stardt,
    j.index_b_amt,
    j.index_nbr_h as cube_nbr_h,
    j.index_h_stardt as cube_h_stardt,
    j.index_h_code as cube_trcd,
    j.index_h_amt as ic_tran_amt,
    j.hflag_stardt,
    j.Attrib,
    j.index_nbr_h as seven_of_nine_nbr,
    j.seven_of_nine_stardt,
    SUM(latinum_bars) as sum_latinum_bars,
    SUM(credit_chips) as sum_credit_chips,
    SUM(borg_dollars) as sum_borg_dollars
from
(
    select g.*,
        case when seven_trcd='LB' and seven_attrib=1 then rif_amt else 0 end as latinum_bars,
        case when seven_trcd='CC' and seven_attrib=1 then rif_amt else 0 end as credit_chips,
        case when seven_trcd='BD' and seven_attrib=1 then rif_amt else 0 end as borg_dollars
    from
    (
        select *,
            case when pmt_dt between index_h_stardt and seven_of_nine_stardt then 1 else 0 end as seven_attrib
        from
        (
            select v.*, index_b_stardt + 7 as seven_of_nine_stardt
            from
            (
                select x.*,
                    case when x.index_h_stardt between x.hflag_stardt and x.index_b_stardt
                              and index_b_amt = index_h_amt
                         then 1 else 0 end as Attrib
                from
                (
                    select *, index_b_stardt - 30 as hflag_stardt
                    from
                    (
                        select index_nbr_b, index_b_stardt, index_b_amt
                        from watb.borg_queen
                        where index_nbr_b between 7777777 and 9999999
                          and index_b_stardt between 7411.4 and 9521.6
                    ) a

                    FULL OUTER JOIN

                    (
                        select index_nbr_h, index_h_stardt, index_h_code, index_h_amt
                        from watb.seven_of_nine
                        where index_nbr_h between 7777777 and 9999999
                          and substring(index_h_code,1,3) = 'IC@'
                          and index_h_stardt between 6411.4 and 9521.6
                    ) b
                    on a.index_nbr_b = b.index_nbr_h
                ) x
            ) v
            where v.Attrib = 1
        ) o

        LEFT OUTER JOIN

        (
            select index_nbr_h,
                SUBSTRING(index_h_code,1,2) as seven_trcd,
                index_h_amt as rif_amt,
                index_h_stardt as pmt_dt
            from watb.seven_of_nine
            where substring(index_h_code,1,2) in ('LB','CC','BD')
              and index_h_stardt between 6411.4 and 9521.6
              and index_h_amt > 0
              and index_nbr_h between 7777777 and 9999999
        ) m
        on o.index_nbr_b = m.index_nbr_h
    ) g
) j

GROUP BY index_nbr_b, index_b_stardt, index_b_amt, index_nbr_h, index_h_stardt, index_h_code, index_h_amt, hflag_stardt, Attrib, seven_of_nine_stardt

ORDER BY index_nbr_b, index_b_stardt
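Stripped of all the nesting, the heart of the query is conditional aggregation: a CASE expression tags each transaction row, and SUM rolls the tags up per group. Here's a minimal sketch of that pattern using SQLite from Python (the table and column names here are made up for illustration; they are not the watb tables above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tran (acct INT, code TEXT, amt REAL)")
conn.executemany("INSERT INTO tran VALUES (?,?,?)", [
    (1, "LB", 10.0), (1, "CC", 5.0), (1, "LB", 2.5),
    (2, "BD", 7.0),  (2, "CC", 1.0),
])

# Conditional aggregation: CASE tags each row, SUM rolls the tags up per account.
rows = conn.execute("""
    SELECT acct,
           SUM(CASE WHEN code = 'LB' THEN amt ELSE 0.0 END) AS sum_latinum_bars,
           SUM(CASE WHEN code = 'CC' THEN amt ELSE 0.0 END) AS sum_credit_chips,
           SUM(CASE WHEN code = 'BD' THEN amt ELSE 0.0 END) AS sum_borg_dollars
    FROM tran
    GROUP BY acct
    ORDER BY acct
""").fetchall()
print(rows)  # [(1, 12.5, 5.0, 0.0), (2, 0.0, 1.0, 7.0)]
```

The big query does the same thing, except the rows being tagged come out of several layers of derived tables (the FULL OUTER JOIN of borg_queen to seven_of_nine, the flag columns, and the LEFT OUTER JOIN back to the transaction detail).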

-----------------------------------------------

An Effective Analyst Tip
When a business line asks you for A, B, and C, also provide D, E, and F.  In other words, give them more than what they're asking for, so they see you are engaged and want to help them succeed.

Saturday, February 4, 2012

A Valuable Lesson I Learned Doing a Credit Default Model

 By Kirk Harrington

As some of you know, I have worked on Marketing and Credit Risk Models.  This past week, I was finishing up work on a credit default model (to come up with predictions for Allowance for Loan Loss provision) and learned a valuable lesson.

The lesson was this: I had prepared a report with average probabilities by risk group, below a given percent and above a given percent (basically low and high risk).  I also split these into delinquency groups (one group was <30 days delinquent and the other 30 or 60 days--I was modeling to find out who reached >=90 days within a 12-month window).

One thing that was odd about the average probabilities was that the 30-or-60 delinquency group's probability seemed very low.  In fact, when I ran the actual losses by account and dollar amount, something seemed off: I was predicting a very low number of accounts for a group that had much higher actuals.

This led me to create a decision tree (I was using CHAID for this exercise) on each of the two delinquency groups.  What I found was that 5 of the 8 predictors were being used in the <30 delinquency group and 0 of 8 predictors in the 30-and-60 group.  This was a clue that these two populations were VERY different.  Further, it was obvious that the average probabilities across nodes were sharply different (the probabilities of going bad were MUCH higher for the 30-60 delinquency group).

What was happening (and I verified this with other numbers) is that these special delinquency groups' probabilities were being driven by the 'majority' node by node.  In most nodes, the population proportions were about 99 to 1 (or 99% < 30, and 1% in 30 to 60).
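To see how a 99-to-1 mix hides the minority group, here's a toy calculation (the rates are made up for illustration, not taken from the actual model): a node's average probability is just a weighted mean, so the 1% slice barely moves it.

```python
# Toy node: 99% of accounts are <30 days delinquent, 1% are 30-60 days.
# Assumed illustrative bad rates: 2% for the majority, 40% for the minority.
n_lt30, p_lt30 = 990, 0.02
n_30to60, p_30to60 = 10, 0.40

# The node average is a weighted mean, dominated by the 99% majority.
node_avg = (n_lt30 * p_lt30 + n_30to60 * p_30to60) / (n_lt30 + n_30to60)
print(round(node_avg, 4))  # 0.0238 -- barely above the majority's 0.02
```

Any 30-60 account landing in this node gets scored at roughly 2.4% when its true rate in this toy setup is 40%--exactly the kind of understated prediction I was seeing in the report.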

This was clearly an example of a population 'within' the majority that was fouling up the probabilities in my end report.  To rectify this, I simply created a 'forced-split' tree, with the delinquency split at the top.  The probabilities came out much more stable, and all my model diagnostic statistics improved as well (e.g., KS, Gini).
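The forced-split idea can be sketched in a few lines: partition accounts on the delinquency flag before estimating probabilities, so each segment gets its own rate instead of inheriting the pooled one.  This is a hypothetical miniature with invented counts, not the CHAID implementation used in the actual project:

```python
# Hypothetical accounts: (delinquency_bucket, went_bad_within_12m)
accounts = ([("lt30", 0)] * 97 + [("lt30", 1)] * 3 +
            [("30to60", 0)] * 6 + [("30to60", 1)] * 4)

def bad_rate(rows):
    """Fraction of accounts in `rows` that went bad."""
    return sum(bad for _, bad in rows) / len(rows)

# Pooled estimate: one probability for everyone.
pooled = bad_rate(accounts)

# Forced split: estimate within each delinquency segment separately.
segmented = {}
for bucket in ("lt30", "30to60"):
    seg = [r for r in accounts if r[0] == bucket]
    segmented[bucket] = bad_rate(seg)

print(round(pooled, 3))               # 0.064
print(round(segmented["lt30"], 3))    # 0.03
print(round(segmented["30to60"], 3))  # 0.4
```

With the split forced at the top, the small 30-60 segment keeps its own (much higher) bad rate rather than being averaged away, which is why the downstream probabilities, KS, and Gini all came out more sensible.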