基于R语言的申请评分卡

1.引言

信贷行业中常见的评分卡包括:申请评分卡(Application)、行为评分卡(Behavior)、催收评分卡(Collection)以及反欺诈评分卡(Anti-Fraud),简称为A卡、B卡、C卡和F卡。
A卡,主要应用于贷前准入环节对新用户的信用评级。
B卡,主要应用于贷中管理环节对存量用户的行为预测。
C卡,主要应用于贷后催收环节对存量用户是否催收的预测管理。
F卡,主要应用于贷前准入环节对新用户可能存在的欺诈行为进行预测。
本文通过历史数据建立Logistic回归模型,预测用户出现违约的概率,从而建立申请评分卡模型。
本文数据来自“klaR”包中的German credit data。

2.数据导入与观察

加载要用到的数据,并进行初步数据观察:

> library(klaR)                     #数据集包
> library(VIM)                      #缺失值可视化
> library(party)                    #随机森林
> library(InformationValue)         #求IV值
> library(smbinning)                #最优分段
> library(ggplot2)                  #可视化
> library(gridExtra)                #可视化
> library(woe)                      #求woe值
> library(car)                      #检验多重共线性
> library(pROC)
> data(GermanCredit)
> str(GermanCredit)
略
> summary(GermanCredit)
略

该数据集包含了1000个样本,每个样本包括21个变量,变量含义如下:

变量解释

3.数据清洗

数据清洗主要工作包括缺失值和异常值处理。

3.1 缺失值处理

查看缺失值情况:

> aggr(x = GermanCredit,prop=T,numbers=T,combined=F)
图1

从以上结果可看出,本数据集不存在缺失值。

3.2 异常值处理

查看定量指标异常值情况:

> quan_index <-c("duration","amount","installment_rate","present_residence",
+        "age","number_credits","people_liable")
> quan_vars <- GermanCredit[,quan_index]
> boxplot(scale(quan_vars),col="lightgray")
图2

从图2可以看出,定量指标中存在异常值。下面,让我们具体来看一下:

> table(boxplot.stats(quan_variables$duration)$out)

45 47 48 54 60 72 
 5  1 48  2 13  1 
> table(boxplot.stats(quan_variables$amount)$out)

 7966  7980  8065  8072  8086  8133  8229  8318  8335  8358  8386  8471  8487  8588 
    1     1     1     1     1     1     1     1     1     1     1     1     1     1 
 8613  8648  8858  8947  8978  9034  9055  9157  9271  9277  9283  9398  9436  9566 
    1     1     1     1     1     1     1     1     1     1     1     1     1     1 
 9572  9629  9857  9960 10127 10144 10222 10297 10366 10477 10623 10722 10875 10961 
    1     1     1     1     1     1     1     1     1     1     1     1     1     1 
10974 11054 11328 11560 11590 11760 11816 11938 11998 12169 12204 12389 12579 12612 
    1     1     1     1     1     1     1     1     1     1     1     1     1     1 
12680 12749 12976 13756 14027 14179 14318 14421 14555 14782 14896 15653 15672 15857 
    1     1     1     1     1     1     1     1     1     1     1     1     1     1 
15945 18424 
    1     1 
> table(boxplot.stats(quan_variables$age)$out)

65 66 67 68 70 74 75 
 5  5  3  3  1  4  2 
> table(boxplot.stats(quan_variables$number_credits)$out)

4 
6 
> table(boxplot.stats(quan_variables$people_liable)$out)

  2 
155 

根据具体情况来看,定量指标中存在的异常值是基本符合实际情况的,而且数据集样本数量较少,因此不对异常值做处理。(本例比较特殊,实际工作中的情况肯定会比较复杂)

4.特征变量选择

本数据集包含了定量和定性两类指标,接下来我们用不同的方法,筛选出对违约状态影响最大的指标,作为构建模型的变量。
首先,根据简单随机抽样,将数据集划分为训练集和测试集:

> set.seed(1234)
> GermanCredit$credit_risk <- ifelse(GermanCredit$credit_risk=="good",0,1)
> sam <- sample(nrow(GermanCredit), 800, replace = F)
> train <- GermanCredit[sam, ]
> test <- GermanCredit[-sam, ]

4.1 定量指标

以下用随机森林法和Logistic回归方法,寻找对因变量影响最显著的自变量:

> # 提取定量指标
> quant_vars<-c("duration","amount","installment_rate","present_residence","age",
+               "number_credits","people_liable","credit_risk")
> quant_data<-GermanCredit[,quant_vars]   
> # 随机森林法
> fit1 <- cforest(credit_risk~.,data = quant_data,controls = cforest_unbiased(mtry = 2, ntree = 50))
> # 调整变量间的相关系数,获取自变量的重要性
> sort(varimp(fit1,conditional = T),decreasing=T)
         duration               age            amount     people_liable  installment_rate 
     0.0046574142      0.0032983348      0.0028402490      0.0009693390      0.0007194029 
   number_credits present_residence 
     0.0005876221     -0.0001082808 
> # 调整样本变量不平衡性,获取自变量的重要性
> sort(varimpAUC(fit1),decreasing = T)
         duration            amount               age  installment_rate     people_liable 
     0.0185098322      0.0118307629      0.0089927377      0.0040057809      0.0026379584 
   number_credits present_residence 
     0.0011849564     -0.0009655443
> # Logistic回归
> fit2 <- glm(credit_risk~.,data = quant_data,family = binomial())
> fit2 <- step(fit2,trace = 0)
> summary(fit2)

Call:
glm(formula = credit_risk ~ duration + installment_rate + age, 
    family = binomial(), data = quant_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4943  -0.8464  -0.6890   1.2015   2.2806  

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -1.427935   0.362767  -3.936 8.28e-05 ***
duration          0.037535   0.006553   5.728 1.02e-08 ***
installment_rate  0.158810   0.074329   2.137  0.03263 *  
age              -0.021756   0.007686  -2.831  0.00464 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
---
略

综合以上两种方法的结果,我们筛选出了对违约状态影响最显著的四个指标 :duration(3)、age(3)、amount(2)和installment_rate(2)。

4.2 定性指标

通过R中的informationvalue包,计算各指标的IV值,得到各定性指标间的重要性度量,选取其中的high predictive指标:

> # 提取定性指标
> qualt_vars<-c("status","credit_history","purpose","savings","employment_duration",
+                "personal_status_sex","other_debtors","property",
+  "other_installment_plans","housing","job","telephone","foreign_worker","credit_risk")
> qualt_data<-train[,qualt_vars]
> # 求指标iv值
> all_iv <- data.frame(vars=qualt_vars,iv=numeric(length(qualt_vars)),
+                      strength=character(length(qualt_vars)),stringsAsFactors = F)
> for (i in qualt_vars){
+   all_iv[all_iv$vars==i,]$iv <- IV(X=qualt_data[,i], Y=qualt_data$credit_risk)
+   all_iv[all_iv$vars==i,]$strength <- attr(IV(X=qualt_data[,i], Y=qualt_data$credit_risk),"howgood")
+   }
> (all_iv<-all_iv[order(-all_iv$iv),] )
                      vars         iv            strength
1                   status 0.62161414   Highly Predictive
2           credit_history 0.28847143   Highly Predictive
4                  savings 0.21358882   Highly Predictive
3                  purpose 0.21326499   Highly Predictive
8                 property 0.09828292 Somewhat Predictive
5      employment_duration 0.08935235 Somewhat Predictive
10                 housing 0.07040316 Somewhat Predictive
6      personal_status_sex 0.06871949 Somewhat Predictive
9  other_installment_plans 0.04371291 Somewhat Predictive
7            other_debtors 0.04350575 Somewhat Predictive
13          foreign_worker 0.04262905 Somewhat Predictive
12               telephone 0.01669733      Not Predictive
11                     job 0.01002765      Not Predictive
14             credit_risk 0.00000000      Not Predictive

根据以上结果,我们选择status、credit_history、savings和purpose四个high predictive指标构建模型。
综上,我们共选择了8个变量作为入模变量。

> # 入模指标
> quant_model_vars <- c("duration","amount","installment_rate","age")
> qualt_model_vars <- c("status","credit_history","savings","purpose")

5.WOE计算

5.1 变量分箱

5.1.1 定量指标

计算定量指标的WOE之前,需要先对定量指标进行分段。下面,优先采用最优分段,其原理是基于条件推理树(conditional inference trees, Ctree)的递归分割算法,核心算法用函数ctree()表示。

> # 对duration进行最优分段
> result1<-smbinning(df=train,y="credit_risk",x="duration",p=0.05)
> 查看分段效果
> smbinning.plot(result1,option="WoE",sub="Duration")
> breaks1 <- c(0,7,30,Inf)
> train$cut_duration <- cut(train$duration,breaks = breaks1)
图3

从上图可以看出,woe值相差较大,分段效果不错。以下针对amount、age采用相同分段方法。

> #对amount进行最优分段
> result2<-smbinning(df=train,y="credit_risk",x="amount")
> smbinning.plot(result2,option="WoE",sub="Amount")
> breaks2 <-  c(0,1372,3913,Inf)
> train$cut_amount <- cut(train$amount,breaks = breaks2)
> result3<-smbinning(df=train,y="credit_risk",x="age")
> smbinning.plot(result3,option="WoE",sub="Age")
> breaks3 <- c(0,25,Inf)
> train$cut_age <- cut(train$age,breaks = breaks3)
> train$cut_rate <- cut(train$installment_rate,4)
> rate_woe <- woe(Data = train,Independent ="cut_rate",Continuous = F,Dependent = "credit_risk",
+                 C_Bin = 4,Bad = 0,Good = 1)
> ggplot(rate_woe, aes(x = BIN, y = WOE)) + 
+   geom_bar(stat = "identity",fill = "blue", colour = "grey60",
+            size = 0.2, alpha = 0.2)+
+   labs(title = "等距分段")+
+   theme(plot.title = element_text(hjust = 0.5))

installment_rate只有1、2、3、4四个值,四个值对应的woe值差别较大,且具有单调性,采用等距分段。

图4
5.1.2 定性指标

接下来我们需要对定性指标做必要的降维处理,方便计算其WOE值。首先,我们查看一下入模的定性指标的概况:

> discrete_data<-train[,qual_model_vars]
> summary(discrete_data)
                                        status                                        credit_history
 ... < 100 DM                              :211   no credits taken/all credits paid back duly: 34   
 0 <= ... < 200 DM                         :214   all credits at this bank paid back duly    : 41   
 ... >= 200 DM / salary for at least 1 year: 50   existing credits paid back duly till now   :422   
 no checking account                       :325   delay in paying off in the past            : 66   
                                                  critical account/other credits existing    :237   
                                                                                                    
                                                                                                    
                       savings                   purpose   
 ... < 100 DM              :478   domestic appliances:216  
 100 <= ... < 500 DM       : 84   car (new)          :189  
 500 <= ... < 1000 DM      : 48   radio/television   :145  
 ... >= 1000 DM            : 41   car (used)         : 83  
 unknown/no savings account:149   others             : 83  
                                  retraining         : 39  
                                  (Other)            : 45 

由以上概况可知,定性指标status、credit_history、和savings的维数最高为5维,最低为4维,维数适中,可以不进行处理。
定性指标purpose的维数多于7维,明显高于其他定性指标。为了避免“维数灾难”,我们根据三条准则进行降维:1.维度间属性相似;2.合并后woe有明显变化;3.单个维度样本量不应过小。

> # 未进行降维前
> purpose_woe1 <- woe(Data = train,Independent ="purpose",Continuous = F,Dependent = "credit_risk",
+     C_Bin = 10,Bad = 0,Good = 1)
> ggplot(purpose_woe1, aes(x = BIN, y = WOE)) + 
+   geom_bar(stat = "identity",fill = "blue", colour = "grey60",size = 0.2, alpha = 0.2)+
+   labs(title = "Purpose")+
+   theme(plot.title = element_text(hjust = 0.5))
图5
> # 类似属性合并
> train <- within(train,{
+   cut_purpose <- NA
+   cut_purpose[purpose=="car (new)"] <- "car(new/used)"
+   cut_purpose[purpose=="car (used)"] <- "car(new/used)"
+   cut_purpose[purpose=="furniture/equipment"] <- "furniture/equipment/radio/television/domestic appliances"
+   cut_purpose[purpose=="radio/television"] <- "furniture/equipment/radio/television/domestic appliances"
+   cut_purpose[purpose=="domestic appliances"] <- "furniture/equipment/radio/television/domestic appliances"
+   cut_purpose[purpose=="repairs"] <- "repairs/business/others"
+   cut_purpose[purpose=="education"] <- "education/retraining"
+   cut_purpose[purpose=="retraining"] <- "education/retraining"
+   cut_purpose[purpose=="business"] <- "repairs/business/others"
+   cut_purpose[purpose=="others"] <- "repairs/business/others"})
> purpose_woe2 <- woe(Data = train,Independent ="cut_purpose",Continuous = F,Dependent = "credit_risk",
+                       C_Bin = 10,Bad = 0,Good = 1)
>   ggplot(purpose_woe2, aes(x = BIN, y = WOE)) + 
+     geom_bar(stat = "identity",fill = "blue", colour = "grey60",
+              size = 0.2, alpha = 0.2)+
+     labs(title = "Purpose")+
+     theme(plot.title = element_text(hjust = 0.5))+
+     theme(axis.text.x = element_text(vjust = 0.2, hjust = 0.2, angle = 10)
图6

5.2 WOE计算

用klaR包中的woe()函数获取入模变量的woe值。

> newtrain <- cbind(discrete_data[,-4],train[,c(21:26)])
> str(newtrain)
'data.frame':   800 obs. of  9 variables:
 $ status        : Factor w/ 4 levels "... < 100 DM",..: 4 4 4 1 4 4 2 3 3 4 ...
 $ credit_history: Factor w/ 5 levels "no credits taken/all credits paid back duly",..: 5 5 3 3 5 4 5 3 3 3 ...
 $ savings       : Factor w/ 5 levels "... < 100 DM",..: 1 1 1 1 1 1 1 1 1 2 ...
 $ credit_risk   : num  1 1 0 1 0 0 1 1 0 0 ...
 $ cut_duration  : Factor w/ 3 levels "(0,7]","(7,30]",..: 3 2 2 3 2 3 2 3 2 3 ...
 $ cut_amount    : Factor w/ 3 levels "(0,1.37e+03]",..: 3 2 2 2 2 3 3 3 1 3 ...
 $ cut_age       : Factor w/ 2 levels "(0,25]","(25,Inf]": 1 2 2 1 2 1 2 2 1 2 ...
 $ cut_rate      : Factor w/ 4 levels "(0.997,1.75]",..: 4 3 4 4 4 2 4 4 3 4 ...
 $ cut_purpose   : chr  "car(new/used)" "car(new/used)" "furniture/equipment/radio/television/domestic appliances" "furniture/equipment/radio/television/domestic appliances" ...
> newtrain$credit_risk <- as.factor(newtrain$credit_risk)
> newtrain$cut_purpose <- as.factor(newtrain$cut_purpose)
>  # 获取woe值
> woemodel<-klaR::woe(credit_risk~.,data = newtrain,zeroadj=0.5,applyontrain=TRUE)
> traindata <- predict(woemodel, newtrain, replace = TRUE)  
> str(traindata)
'data.frame':   800 obs. of  9 variables:
 $ credit_risk       : Factor w/ 2 levels "0","1": 2 2 1 2 1 1 2 2 1 1 ...
 $ woe.status        : num  1.127 1.127 1.127 -0.772 1.127 ...
 $ woe.credit_history: num  0.6989 0.6989 -0.0638 -0.0638 0.6989 ...
 $ woe.savings       : num  -0.292 -0.292 -0.292 -0.292 -0.292 ...
 $ woe.cut_duration  : num  -0.8058 0.0808 0.0808 -0.8058 0.0808 ...
 $ woe.cut_amount    : num  -0.55 0.419 0.419 0.419 0.419 ...
 $ woe.cut_age       : num  -0.612 0.161 0.161 -0.612 0.161 ...
 $ woe.cut_rate      : num  -0.15884 -0.00808 -0.15884 -0.15884 -0.15884 ...
 $ woe.cut_purpose   : num  -0.0199 -0.0199 0.1853 0.1853 0.1853 ...

至此,我们已经获得了入模变量对应的woe值。值的注意的是,我们之前将好客户设定为0,坏客户设定为1,所以woe值越大,代表客户违约的概率越大,但traindata中的woe实际是按照好客户为1,坏客户为0计算的,所以与之前变量分箱中计算的woe正好相反。
下面正式开始构建模型,并转换为标准评分卡。

6 模型构建与验证

6.1 构建逻辑回归模型

> # 用获得的woe数据进行逻辑回归
> trainmodel<-glm(credit_risk~.,data=traindata,family = binomial())
> summary(trainmodel)

Call:
glm(formula = credit_risk ~ ., family = binomial(), data = traindata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7274  -0.7140  -0.4202   0.7805   2.5005  

Coefficients:
                   Estimate Std. Error z value             Pr(>|z|)    
(Intercept)         -0.9024     0.0924  -9.766 < 0.0000000000000002 ***
woe.status          -0.7947     0.1205  -6.594      0.0000000000427 ***
woe.credit_history  -0.8322     0.1712  -4.861      0.0000011658520 ***
woe.savings         -0.8522     0.2102  -4.055      0.0000501789230 ***
woe.cut_duration    -0.7785     0.1875  -4.151      0.0000330526950 ***
woe.cut_amount      -0.7891     0.2259  -3.493             0.000478 ***
woe.cut_age         -0.9719     0.2803  -3.468             0.000525 ***
woe.cut_rate        -1.7358     0.5309  -3.269             0.001078 ** 
woe.cut_purpose     -1.0409     0.3958  -2.630             0.008541 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 963.44  on 799  degrees of freedom
Residual deviance: 754.15  on 791  degrees of freedom
AIC: 772.15

Number of Fisher Scoring iterations: 5

从以上逻辑回归的结果来看,各个变量都通过了显著性检验。同时,为防止多重共线性问题的出现,我们对模型进行VIF检验:

> vif(trainmodel)
        woe.status woe.credit_history        woe.savings   woe.cut_duration 
          1.043140           1.024758           1.039515           1.048207 
    woe.cut_amount        woe.cut_age       woe.cut_rate    woe.cut_purpose 
          1.088700           1.038714           1.055447           1.036517 

从结果可知,所有变量VIF均小于4,可以判断模型中不存在多重共线性问题。

6.2 转换为标准评分卡

根据信用评分卡模型的建立,我们可以得到:


其中,woe=ln(odds),odds为good_rate/bad_rate,beita为回归系数,altha为截距,n为变量个数,offset为偏移量(视风险偏好而定),比例因子factor。
假定odds=50,对应的评分为600,在此基础上评分值增加20分(1个PDO),可以使odds翻番,则可以得出:

6.2.1 转换为标准评分卡

获取基础分以及训练集中各变量的分数:

> # 设定alpha为比例因子factor,beta为风险偏移量offset
> alpha_beta <- function(basepoints,baseodds,pdo){
+   alpha <- pdo/log(2)
+   beta <- basepoints-alpha*log(baseodds)
+   return(list(alpha=alpha,beta=beta))
+ }
> # 指定0dds=50时,基础分为600分,比率翻番的分数为20,计算评分卡的系数alpha和beta
> (x <- alpha_beta(600,50,20))
$alpha
[1] 28.8539

$beta
[1] 487.1229
> # 获得模型系数
> (coefficients <- trainmodel$coefficients)
       (Intercept)         woe.status woe.credit_history        woe.savings 
        -0.9023751         -0.7946682         -0.8321739         -0.8522369 
  woe.cut_duration     woe.cut_amount        woe.cut_age       woe.cut_rate 
        -0.7785099         -0.7891044         -0.9718816         -1.7357656 
   woe.cut_purpose 
        -1.0409225 
> #构造计算分值函数:
> vars_score<-function(i){
+   score = -round(x$alpha*coefficients[i]*traindata[,names(coefficients[i])])
+   return(score)
+ }
> # 计算基础分值
> (basepoint <- round(x$beta-x$alpha*coefficients[1]))
(Intercept) 
        513 
> # 1.status_score
> status_score <- vars_score(2)
> colnames(status_score)<-"status_score" 
> # 2.credit_history_score
> credit_history_score <- vars_score(3)
> colnames(credit_history_score)<-"credit_history_score"
> # 3.savings_score
> savings_score <- vars_score(4)
> colnames(savings_score)<-"savings_score"
> # 4.duration_score
> duration_score <- vars_score(5)
> colnames(duration_score)<-"duration_score"
> # 5.amount_score
> amount_score <- vars_score(6)
> colnames(amount_score)<-"amount_score"
> # 6.age_score
> age_score <- vars_score(7)
> colnames(age_score)<-"age_score"
> # 7.rate_score
> rate_score <- vars_score(8)
> colnames(rate_score)<-"rate_score"
> # 8.purpose_score
> purpose_score <- vars_score(9)
> colnames(purpose_score)<-"purpose_score"
6.2.2 输出标准评分卡

输出CSV格式的标准评分卡:

> # 基础分
> a <- c("","basepoint",513)
> b <- matrix(r1,nrow = 1)
> colnames(b)<-c("Variable","Basepoint","Score")
> #2.duration的分值
> duration_Cutpoint <- as.matrix(newtrain$cut_duration,stringsAsFactors=F)
> duration_scoreCard<-cbind(as.matrix(c("Duration","",""),ncol=1),
+                           unique(cbind(duration_Cutpoint,duration_score)))
> #3.amount的分值
> amount_Cutpoint <- as.matrix(newtrain$cut_amount,stringsAsFactors=F)
> amount_scoreCard<-cbind(as.matrix(c("Amount","",""),ncol=1),
+                         unique(cbind(amount_Cutpoint,amount_score)))
> #4.age的分值
> age_Cutpoint <- as.matrix(newtrain$cut_age,stringsAsFactors=F)
> age_scoreCard<-cbind(as.matrix(c("Age",""),ncol=1),
+                      unique(cbind(age_Cutpoint,age_score)))
> #5.installment_rate的分值
> rate_Cutpoint <- as.matrix(newtrain$cut_rate,stringsAsFactors=F)
> rate_scoreCard<-cbind(as.matrix(c("Installment_rate","","",""),ncol=1),
+                       unique(cbind(rate_Cutpoint,rate_score)))
> #6.status的分值
> status <- as.matrix(newtrain$status,stringsAsFactors=F)
> status_scoreCard<-cbind(as.matrix(c("Status","","",""),ncol=1),
+                         unique(cbind(status,status_score)))
> #7.credit_history的分值
> credit_history <- as.matrix(newtrain$credit_history,stringsAsFactors=F)
> credit_history_scoreCard<-cbind(as.matrix(c("Credit_history","","","",""),ncol=1),
+                                 unique(cbind(credit_history,credit_history_score)))
> #8.savings的分值
> savings <- as.matrix(newtrain$savings,stringsAsFactors=F)
> savings_scoreCard<-cbind(as.matrix(c("Savings","","","",""),ncol=1),
+                          unique(cbind(savings,savings_score)))
> #9.purpose的分值
> purpose <- as.matrix(newtrain$cut_purpose,stringsAsFactors=F)
> purpose_scoreCard<-cbind(as.matrix(c("Purpose","","",""),ncol=1),
+                          unique(cbind(purpose,purpose_score)))
> scoreCard_CSV<-rbind(m1,duration_scoreCard,amount_scoreCard,age_scoreCard,
+                      rate_scoreCard,status_scoreCard,credit_history_scoreCard,
+                      savings_scoreCard,purpose_scoreCard)
> scoreCard_CSV<-rbind(b,duration_scoreCard,amount_scoreCard,age_scoreCard,
+                      rate_scoreCard,status_scoreCard,credit_history_scoreCard,
+                      savings_scoreCard,purpose_scoreCard)
> #输出标准评分卡到文件中
> write.csv(scoreCard_CSV,"C:/Users/Administrator/Desktop/ScoreCard.CSV")

7 模型验证

对测试集中的样本做同样的降维处理:

> # 对duration分段
> breaks1 <- c(0,7,30,Inf)
> test$cut_duration <- cut(test$duration,breaks = breaks1)
> # 对amount分段
> breaks2 <-  c(0,1372,3913,Inf)
> test$cut_amount <- cut(test$amount,breaks = breaks2)
> # 对age分段
> breaks3 <- c(0,25,Inf)
> test$cut_age <- cut(test$age,breaks = breaks3)
> # 对installment_rate分段
> test$cut_rate <- cut(test$installment_rate,4)
> # 对purpose分段
> test <- within(test,{
+   cut_purpose <- NA
+   cut_purpose[purpose=="car (new)"] <- "car(new/used)"
+   cut_purpose[purpose=="car (used)"] <- "car(new/used)"
+   cut_purpose[purpose=="furniture/equipment"] <- "furniture/equipment/radio/television/domestic appliances"
+   cut_purpose[purpose=="radio/television"] <- "furniture/equipment/radio/television/domestic appliances"
+   cut_purpose[purpose=="domestic appliances"] <- "furniture/equipment/radio/television/domestic appliances"
+   cut_purpose[purpose=="repairs"] <- "repairs/business/others"
+   cut_purpose[purpose=="education"] <- "education/retraining"
+   cut_purpose[purpose=="retraining"] <- "education/retraining"
+   cut_purpose[purpose=="business"] <- "repairs/business/others"
+   cut_purpose[purpose=="others"] <- "repairs/business/others"}) 
> newtest <- cbind(test[,qualt_model_vars][-4],test[,c(21:26)])
> newtest$credit_risk <- as.factor(newtest$credit_risk)
> newtest$cut_purpose <- as.factor(newtest$cut_purpose)
> # 将newtest中的各个变量转换为对应的woe值
> woemodel_test<-klaR::woe(credit_risk~.,data = newtest,zeroadj=0.5,applyontrain=TRUE)
> # 获得woe数据框
> testdata <- predict(woemodel_test, newtest, replace = TRUE)  
> # 测试集验证
> prob <- predict(trainmodel,testdata,type="response")
> logit.pred <- ifelse(prob>0.5,1,0)  #阈值简单设为0.5
> (Freq <- table(logit.pred,testdata$credit_risk))
          
logit.pred   0   1
         0 121  35
         1  11  33 
> # 准确率
> (ACC <- sum(diag(Freq))/sum(Freq))
[1] 0.77
> # AUC、Gini系数
> modelroc <- roc(testdata$credit_risk,prob)
> plot(modelroc, print.auc=TRUE, auc.polygon=T, grid=c(0.1, 0.2),
+      grid.col=c("green", "red"), max.auc.polygon=TRUE,
+      auc.polygon.col="skyblue", print.thres=T)
> modelauc<- auc(modelroc)
> (Gini <- 2*modelauc-1)
[1] 0.6106283
图7

从以上结果可知,模型准确率ACC为0.77,AUC为0.805,Gini系数为0.61,整体效果尚可。

8 总结

本文通过对Germancredit数据的挖掘分析,从数据清洗、变量筛选、WOE计算、建模分析到模型验证,创建了一个简单的申请评分卡。
本文用到的数据集比较简单,在实操中,数据清洗应该会占用更多时间和精力。
本文仅进行了一次样本抽样,在实操中,应进行K折交叉检验,提升模型准确度。
开发的模型是基于某一时间的特定样本的,随着时间的推移和信贷政策的变化,样本会发生变化,从而造成模型的区分能力和稳定性变差。一般需要定期对模型的使用情况进行检测并报告模型区分能力和稳定性的变化情况,必要时应采取包括修正模型或重建模型等措施。这是后面需要认真学习的地方!

参考

信用标准评分卡模型开发及实现
信用评分卡模型的建立

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 194,242评论 5 459
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 81,769评论 2 371
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 141,484评论 0 319
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 52,133评论 1 263
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 61,007评论 4 355
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 46,080评论 1 272
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 36,496评论 3 381
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 35,190评论 0 253
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 39,464评论 1 290
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 34,549评论 2 309
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 36,330评论 1 326
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 32,205评论 3 312
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 37,567评论 3 298
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 28,889评论 0 17
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 30,160评论 1 250
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 41,475评论 2 341
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 40,650评论 2 335

推荐阅读更多精彩内容