Scorecard Development with Data from Multiple Sources
This article is originally published at https://statcompute.wordpress.com
This week, one of my friends asked me a very interesting and practical question in the scorecard development. The model development data were collected from multiple independent sources with various data sizes, heterogeneous risk profiles and different bad rates. While the performance statistics seem satisfactory on the model training dataset, the model doesn’t generalize well with new accounts that might come from a unknown source. The situation is very common in a consulting company where a risk or marketing model is sometimes developed with the data from multiple organizations.
To better understand the issue, I simulated a dataset consisting of two groups. In the dataset, while x0 and x1 govern the group segmentation, x2 and x3 define the bad definition. It is important to point out that the group information “grp” is only known in the model development sample but is unknown in the production population. Therefore, the variable “grp”, albeit predictive, can not be explicitly used in the model estimation.
data one; do i = 1 to 100000; x0 = ranuni(0); x1 = ranuni(1); x2 = ranuni(2); x3 = ranuni(3); if 1 + x0 * 2 + x1 * 4 + rannor(1) > 5 then do; grp = 1; if x2 * 2 + x3 * 4 + rannor(2) > 5 then bad = 1; else bad = 0; end; else do; grp = 0; if x2 * 4 + x3 * 2 + rannor(3) > 4 then bad = 1; else bad = 0; end; output; end; run;
Our first approach is to use all variables x0 – x3 to build a logistic regression and then evaluate the model altogether and by groups.
proc logistic data = one desc noprint; model bad = x0 x1 x2 x3; score data = one out = mdl1 (rename = (p_1 = score1)); run; GOOD BAD SEPARATION REPORT FOR SCORE1 IN DATA MDL1 MAXIMUM KS = 59.5763 AT SCORE POINT 0.2281 ( AUC STATISTICS = 0.8800, GINI COEFFICIENT = 0.7599, DIVERGENCE = 2.6802 ) MIN MAX GOOD BAD TOTAL BAD CUMULATIVE BAD CUMU. BAD SCORE SCORE # # # RATE BAD RATE PERCENT PERCENT -------------------------------------------------------------------------------------------------------- BAD 0.6800 0.9699 2,057 7,943 10,000 79.43% 79.43% 33.81% 33.81% | 0.4679 0.6799 4,444 5,556 10,000 55.56% 67.50% 23.65% 57.46% | 0.3094 0.4679 6,133 3,867 10,000 38.67% 57.89% 16.46% 73.92% | 0.1947 0.3094 7,319 2,681 10,000 26.81% 50.12% 11.41% 85.33% | 0.1181 0.1946 8,364 1,636 10,000 16.36% 43.37% 6.96% 92.29% | 0.0690 0.1181 9,044 956 10,000 9.56% 37.73% 4.07% 96.36% | 0.0389 0.0690 9,477 523 10,000 5.23% 33.09% 2.23% 98.59% | 0.0201 0.0389 9,752 248 10,000 2.48% 29.26% 1.06% 99.64% V 0.0085 0.0201 9,925 75 10,000 0.75% 26.09% 0.32% 99.96% GOOD 0.0005 0.0085 9,991 9 10,000 0.09% 23.49% 0.04% 100.00% ========== ========== ========== ========== ========== 0.0005 0.9699 76,506 23,494 100,000 GOOD BAD SEPARATION REPORT FOR SCORE1 IN DATA MDL1(WHERE = (GRP = 0)) MAXIMUM KS = 61.0327 AT SCORE POINT 0.2457 ( AUC STATISTICS = 0.8872, GINI COEFFICIENT = 0.7744, DIVERGENCE = 2.8605 ) MIN MAX GOOD BAD TOTAL BAD CUMULATIVE BAD CUMU. BAD SCORE SCORE # # # RATE BAD RATE PERCENT PERCENT -------------------------------------------------------------------------------------------------------- BAD 0.7086 0.9699 1,051 6,162 7,213 85.43% 85.43% 30.51% 30.51% | 0.5019 0.7086 2,452 4,762 7,214 66.01% 75.72% 23.58% 54.10% | 0.3407 0.5019 3,710 3,504 7,214 48.57% 66.67% 17.35% 71.45% | 0.2195 0.3406 4,696 2,517 7,213 34.90% 58.73% 12.46% 83.91% | 0.1347 0.2195 5,650 1,564 7,214 21.68% 51.32% 7.74% 91.66% | 0.0792 0.1347 6,295 919 7,214 12.74% 44.89% 4.55% 96.21% | 0.0452 0.0792 6,737 476 7,213 6.60% 39.42% 2.36% 98.56% | 0.0234 0.0452 7,000 214 7,214 2.97% 34.86% 1.06% 99.62% V 0.0099 0.0234 7,150 64 7,214 0.89% 31.09% 0.32% 99.94% GOOD 0.0007 0.0099 7,201 12 7,213 0.17% 27.99% 0.06% 100.00% ========== ========== ========== ========== ========== 0.0007 0.9699 51,942 20,194 72,136 GOOD BAD SEPARATION REPORT FOR SCORE1 IN DATA MDL1(WHERE = (GRP = 1)) MAXIMUM KS = 53.0942 AT SCORE POINT 0.2290 ( AUC STATISTICS = 0.8486, GINI COEFFICIENT = 0.6973, DIVERGENCE = 2.0251 ) MIN MAX GOOD BAD TOTAL BAD CUMULATIVE BAD CUMU. BAD SCORE SCORE # # # RATE BAD RATE PERCENT PERCENT -------------------------------------------------------------------------------------------------------- BAD 0.5863 0.9413 1,351 1,435 2,786 51.51% 51.51% 43.48% 43.48% | 0.3713 0.5862 2,136 651 2,787 23.36% 37.43% 19.73% 63.21% | 0.2299 0.3712 2,340 446 2,786 16.01% 30.29% 13.52% 76.73% | 0.1419 0.2298 2,525 262 2,787 9.40% 25.07% 7.94% 84.67% | 0.0832 0.1419 2,584 202 2,786 7.25% 21.50% 6.12% 90.79% | 0.0480 0.0832 2,643 144 2,787 5.17% 18.78% 4.36% 95.15% | 0.0270 0.0480 2,682 104 2,786 3.73% 16.63% 3.15% 98.30% | 0.0140 0.0270 2,741 46 2,787 1.65% 14.76% 1.39% 99.70% V 0.0058 0.0140 2,776 10 2,786 0.36% 13.16% 0.30% 100.00% GOOD 0.0005 0.0058 2,786 0 2,786 0.00% 11.84% 0.00% 100.00% ========== ========== ========== ========== ========== 0.0005 0.9413 24,564 3,300 27,864
As shown in the above output, while the overall model performance looks ok, it doesn’t generalize well in the dataset from the 2nd group with a smaller size. While the overall KS could be as high as 60, the KS for the 2nd group is merely 53. The reason is that the overall model performance is heavily influenced by the dataset from the 1st group with the larger size. Therefore, the estimated model is biased toward the risk profile reflected in the 1st group.
To alleviate the bias in the first model, we could first introduce a look-alike model driven by x0 – x1 with the purpose to profile the group and then build two separate risk models with x2 – x3 only for 1st and 2nd groups respectively. As a result, the final predicted probability should be the composite of all three sub-models, as shown below. The model evaluation is also provided to compared with the first model.
proc logistic data = one desc noprint; where grp = 0; model bad = x2 x3; score data = one out = mdl20(rename = (p_1 = p_10)); run; proc logistic data = one desc noprint; where grp = 1; model bad = x2 x3; score data = one out = mdl21(rename = (p_1 = p_11)); run; proc logistic data = one desc noprint; model grp = x0 x1; score data = one out = seg; run; data mdl2; merge seg mdl20 mdl21; by i; score2 = p_10 * (1 - p_1) + p_11 * p_1; run; GOOD BAD SEPARATION REPORT FOR SCORE2 IN DATA MDL2 MAXIMUM KS = 60.6234 AT SCORE POINT 0.2469 ( AUC STATISTICS = 0.8858, GINI COEFFICIENT = 0.7715, DIVERGENCE = 2.8434 ) MIN MAX GOOD BAD TOTAL BAD CUMULATIVE BAD CUMU. BAD SCORE SCORE # # # RATE BAD RATE PERCENT PERCENT -------------------------------------------------------------------------------------------------------- BAD 0.6877 0.9677 2,011 7,989 10,000 79.89% 79.89% 34.00% 34.00% | 0.4749 0.6876 4,300 5,700 10,000 57.00% 68.45% 24.26% 58.27% | 0.3125 0.4748 6,036 3,964 10,000 39.64% 58.84% 16.87% 75.14% | 0.1932 0.3124 7,451 2,549 10,000 25.49% 50.51% 10.85% 85.99% | 0.1142 0.1932 8,379 1,621 10,000 16.21% 43.65% 6.90% 92.89% | 0.0646 0.1142 9,055 945 10,000 9.45% 37.95% 4.02% 96.91% | 0.0345 0.0646 9,533 467 10,000 4.67% 33.19% 1.99% 98.90% | 0.0166 0.0345 9,800 200 10,000 2.00% 29.29% 0.85% 99.75% V 0.0062 0.0166 9,946 54 10,000 0.54% 26.10% 0.23% 99.98% GOOD 0.0001 0.0062 9,995 5 10,000 0.05% 23.49% 0.02% 100.00% ========== ========== ========== ========== ========== 0.0001 0.9677 76,506 23,494 100,000 GOOD BAD SEPARATION REPORT FOR SCORE2 IN DATA MDL2(WHERE = (GRP = 0)) MAXIMUM KS = 61.1591 AT SCORE POINT 0.2458 ( AUC STATISTICS = 0.8880, GINI COEFFICIENT = 0.7759, DIVERGENCE = 2.9130 ) MIN MAX GOOD BAD TOTAL BAD CUMULATIVE BAD CUMU. BAD SCORE SCORE # # # RATE BAD RATE PERCENT PERCENT -------------------------------------------------------------------------------------------------------- BAD 0.7221 0.9677 1,075 6,138 7,213 85.10% 85.10% 30.40% 30.40% | 0.5208 0.7221 2,436 4,778 7,214 66.23% 75.66% 23.66% 54.06% | 0.3533 0.5208 3,670 3,544 7,214 49.13% 66.82% 17.55% 71.61% | 0.2219 0.3532 4,726 2,487 7,213 34.48% 58.73% 12.32% 83.92% | 0.1309 0.2219 5,617 1,597 7,214 22.14% 51.41% 7.91% 91.83% | 0.0731 0.1309 6,294 920 7,214 12.75% 44.97% 4.56% 96.39% | 0.0387 0.0731 6,762 451 7,213 6.25% 39.44% 2.23% 98.62% | 0.0189 0.0387 7,009 205 7,214 2.84% 34.86% 1.02% 99.63% V 0.0074 0.0189 7,152 62 7,214 0.86% 31.09% 0.31% 99.94% GOOD 0.0002 0.0073 7,201 12 7,213 0.17% 27.99% 0.06% 100.00% ========== ========== ========== ========== ========== 0.0002 0.9677 51,942 20,194 72,136 GOOD BAD SEPARATION REPORT FOR SCORE2 IN DATA MDL2(WHERE = (GRP = 1)) MAXIMUM KS = 57.6788 AT SCORE POINT 0.1979 ( AUC STATISTICS = 0.8717, GINI COEFFICIENT = 0.7434, DIVERGENCE = 2.4317 ) MIN MAX GOOD BAD TOTAL BAD CUMULATIVE BAD CUMU. BAD SCORE SCORE # # # RATE BAD RATE PERCENT PERCENT -------------------------------------------------------------------------------------------------------- BAD 0.5559 0.9553 1,343 1,443 2,786 51.79% 51.79% 43.73% 43.73% | 0.3528 0.5559 2,001 786 2,787 28.20% 40.00% 23.82% 67.55% | 0.2213 0.3528 2,364 422 2,786 15.15% 31.71% 12.79% 80.33% | 0.1372 0.2213 2,513 274 2,787 9.83% 26.24% 8.30% 88.64% | 0.0840 0.1372 2,588 198 2,786 7.11% 22.42% 6.00% 94.64% | 0.0484 0.0840 2,683 104 2,787 3.73% 19.30% 3.15% 97.79% | 0.0256 0.0483 2,729 57 2,786 2.05% 16.84% 1.73% 99.52% | 0.0118 0.0256 2,776 11 2,787 0.39% 14.78% 0.33% 99.85% V 0.0040 0.0118 2,781 5 2,786 0.18% 13.16% 0.15% 100.00% GOOD 0.0001 0.0040 2,786 0 2,786 0.00% 11.84% 0.00% 100.00% ========== ========== ========== ========== ========== 0.0001 0.9553 24,564 3,300 27,864
After comparing KS statistics from two modeling approaches, we can see that, while the performance from the 2nd approach on the overall sample is only slightly better than the one from the 1st approach, the KS on the 2nd group with a smaller size, e.g. grp = 1, increases from 53 upto 58 by 8.6%. While the example is just for two groups, it is trivial to generalize in cases with more than two groups.
Thanks for visiting r-craft.org
This article is originally published at https://statcompute.wordpress.com
Please visit source website for post related comments.