R Package 'smbinning': Optimal Binning for Scoring Modeling

by Herman Jopia

What is Binning?

Binning is the term used in scoring modeling for what is also known in Machine Learning as Discretization: the process of transforming a continuous characteristic into a finite number of intervals (the bins), which allows for a better understanding of its distribution and its relationship with a binary variable. The bins generated by this process eventually become the attributes of a predictive characteristic, the key component of a Scorecard.

Why Binning?

Though there is some reticence toward it [1], the benefits of binning are fairly straightforward:

  • It allows missing data and other special cases (e.g. division by zero) to be included in the model.
  • It controls or mitigates the impact of outliers over the model.
  • It solves the issue of having different scales among the characteristics, making the weights of the coefficients in the final model comparable.
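
The first two benefits can be illustrated with a minimal base R sketch. The vector `x` and the cutpoints 10 and 30 are hypothetical, chosen only for illustration: the missing value receives its own bin, and the extreme value simply falls into the top bin instead of distorting the scale.

```r
# Hypothetical characteristic with a missing value and an extreme outlier
x <- c(3, 7, 12, NA, 25, 40, 9999)

# Three bins with open-ended extremes; the cutpoints are illustrative only
b <- cut(x, breaks = c(-Inf, 10, 30, Inf))

# addNA() turns NA into its own level, so missing records keep a bin
b <- addNA(b)

# The outlier (9999) shares the top bin with 40
as.integer(b)
table(b)
```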

Unsupervised Discretization

Unsupervised Discretization divides a continuous feature into groups (bins) without taking into account any other information. It is basically a partition with two options: equal length intervals and equal frequency intervals.

Equal length intervals

  • Objective: Understand the distribution of a variable.
  • Example: The classic histogram, whose bins have equal length that can be calculated using different rules (Sturges, Rice, and others).
  • Disadvantage: The number of records in a bin may be too small to allow for a valid calculation, as shown in Table 1.


Table 1. Time on Books and Credit Performance. Bin 6 has no bads, producing indeterminate metrics.
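
As a rough sketch of equal-length binning in base R, the snippet below splits a variable into intervals of equal width, with the number of bins taken from Sturges' rule. The vector `tob` is simulated and is only a hypothetical stand-in for Time on Books, not the package's data. It is deliberately right-skewed to show how the upper bins end up sparsely populated.

```r
# Simulated, right-skewed stand-in for Time on Books (assumption, not real data)
set.seed(1)
tob <- round(rexp(1000, rate = 1 / 24))

# Number of bins from Sturges' rule
k <- nclass.Sturges(tob)

# A single number passed as `breaks` gives k intervals of equal width
bins_len <- cut(tob, breaks = k, include.lowest = TRUE)

# Equal width does not mean equal counts: the upper bins are sparse
table(bins_len)
```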

Equal frequency intervals

  • Objective: Analyze the relationship with a binary target variable through metrics like bad rate.
  • Example: Quartiles or Percentiles.
  • Disadvantage: The cutpoints selected may not maximize the difference between bins when mapped to a target variable, as shown in Table 2.


Table 2. Time on Books and Credit Performance. Different cutpoints may improve the Information Value (0.4969).
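
Equal-frequency binning can be sketched in base R by using quantiles as cutpoints. As before, `tob` is simulated and only a hypothetical stand-in for Time on Books. Note that the cutpoints depend on the distribution of the variable alone, with no reference to the target.

```r
# Simulated stand-in for Time on Books (assumption, not real data)
set.seed(1)
tob <- round(rexp(1000, rate = 1 / 24))

# Quartile cutpoints; unique() guards against ties collapsing two cutpoints
cuts <- unique(quantile(tob, probs = seq(0, 1, by = 0.25)))

# Assign each record to a quartile bin
bins_freq <- cut(tob, breaks = cuts, include.lowest = TRUE)

# Counts are roughly (not exactly) equal because of tied values
table(bins_freq)
```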

Supervised Discretization

Supervised Discretization divides a continuous feature into groups (bins) mapped to a target variable. The central idea is to find the cutpoints that maximize the difference between the groups.

In the past, analysts used to move iteratively from Fine Binning to Coarse Binning, a very time-consuming process of manually and visually searching for the right cutpoints (if they were ever found). Nowadays, with algorithms like ChiMerge or Recursive Partitioning, two of the several techniques available [2], analysts can find the optimal cutpoints in seconds and evaluate the relationship with the target variable using metrics such as Weight of Evidence and Information Value.
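
For readers unfamiliar with these two metrics, here is a minimal base R sketch under one common convention: WoE is the log of the ratio between a bin's share of goods and its share of bads, and the Information Value sums (share of goods - share of bads) x WoE over the bins. The function `woe_iv` and the toy data are hypothetical and written for illustration only; they are not the package's implementation.

```r
# Weight of Evidence and Information Value for a binned characteristic.
# bins: factor of bin labels; good: 1 = good, 0 = bad (hypothetical inputs)
woe_iv <- function(bins, good) {
  n_good <- tapply(good == 1, bins, sum)
  n_bad  <- tapply(good == 0, bins, sum)
  pct_good <- n_good / sum(n_good)  # share of all goods in each bin
  pct_bad  <- n_bad / sum(n_bad)    # share of all bads in each bin
  woe <- log(pct_good / pct_bad)    # not finite when a bin has no goods or no bads
  iv  <- (pct_good - pct_bad) * woe
  list(table = data.frame(bin = names(woe), good = as.vector(n_good),
                          bad = as.vector(n_bad), woe = as.vector(woe),
                          iv = as.vector(iv)),
       iv = sum(iv))
}

# Toy example: three bins of 100 records with rising good rates
bins <- factor(rep(c("low", "mid", "high"), times = c(100, 100, 100)),
               levels = c("low", "mid", "high"))
good <- c(rep(c(1, 0), c(60, 40)), rep(c(1, 0), c(80, 20)), rep(c(1, 0), c(95, 5)))
res <- woe_iv(bins, good)
res$table
res$iv
```

A bin with zero bads (or zero goods) makes the ratio inside the logarithm infinite or zero, which is why such bins yield indeterminate metrics.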

An Example With 'smbinning'

Using the 'smbinning' package and its data (chileancredit), whose documentation can be found on its supporting website, the characteristic Time on Books is grouped into bins that take Credit Performance (Good/Bad) into account, so that the optimal cutpoints produce meaningful and statistically different groups. The R code below, Table 3, and Figure 1 show the result, which clearly surpasses the previous methods with the highest Information Value (0.5353).

# Load the package and its data
library(smbinning)
data(chileancredit)

# Training and testing samples
chileancredit.train <- subset(chileancredit, FlagSample == 1)
chileancredit.test <- subset(chileancredit, FlagSample == 0)

# Run the optimal binning and save the results
result <- smbinning(df = chileancredit.train, y = "FlagGB", x = "TOB", p = 0.05)
result$ivtable

# Relevant plots (2x2 page)
par(mfrow = c(2, 2))
boxplot(chileancredit.train$TOB ~ chileancredit.train$FlagGB,
        horizontal = TRUE, frame = FALSE, col = "lightgray", main = "Distribution")
mtext("Time on Books (Months)", 3)
smbinning.plot(result, option = "dist", sub = "Time on Books (Months)")
smbinning.plot(result, option = "badrate", sub = "Time on Books (Months)")
smbinning.plot(result, option = "WoE", sub = "Time on Books (Months)")


Table 3. Time on Books cutpoints mapped to Credit Performance.


Figure 1. Plots generated by the package.

In the middle of the "data era", it is critical to speed up the development of scoring models. Binning, and more specifically automated binning, significantly shortens the time-consuming process of generating predictive characteristics, which is why companies like SAS and FICO have developed their own proprietary algorithms to implement this functionality in their respective software. For analysts who do not have these specific tools or modules, the R package 'smbinning' offers a statistically robust alternative for running their analysis faster.

For more information about binning, the package's documentation on CRAN lists references related to the algorithm behind it, and its supporting website lists references for scoring model development.

References

[1] Dinero, T. (1996) Seven Reasons Why You Should Not Categorize Continuous Data. Journal of Health & Social Policy, Vol. 8, No. 1, pp. 63-72.
[2] Garcia, S. et al. (2013) A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning. IEEE Transactions on Knowledge and Data Engineering, Vol. 25, No. 4, April 2013.
