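Background:
We have a batch of default data for auto-loan customers (customer attributes + account attributes + spending behavior + repayment behavior). The marketing department wants to build a decision tree model on this data to examine the default patterns of customers who defaulted, and then adjust the business accordingly.
Data source:
data.csv (an auto-loan default dataset)
Sample size: 7193
Modeling method: decision tree (C5.0)
Modeling results:
Code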
> setwd("C:\\Users\\Administrator\\Desktop\\重新跑模型\\data\\")
> accepts<-read.csv("accepts.csv")
> accepts$bad_ind<-as.factor(accepts$bad_ind)
> names(accepts)
[1] "application_id" "account_number" "bad_ind" "vehicle_year" "vehicle_make" "bankruptcy_ind"
[7] "tot_derog" "tot_tr" "age_oldest_tr" "tot_open_tr" "tot_rev_tr" "tot_rev_debt"
[13] "tot_rev_line" "rev_util" "fico_score" "purch_price" "msrp" "down_pyt"
[19] "loan_term" "loan_amt" "ltv" "tot_income" "veh_mileage" "used_ind"
> accepts=accepts[,c(3,7:24)]
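> # Create more meaningful derived variables based on business understanding. These variables are only provisional because the data has not been cleaned yet; this is just an example.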
> #accepts$lti_temp=accepts$loan_amt/accepts$tot_income
>
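The commented-out line above hints at a loan-to-income ratio. A minimal sketch of such a derived variable follows, using a small made-up data frame: the `toy` name and its values are hypothetical, and only the column names come from the dataset. It guards against zero or missing income so the ratio does not become `Inf`.

```r
# Hypothetical toy rows; only the column names come from the dataset.
toy <- data.frame(
  loan_amt   = c(18000, 25000, 12000, 30000),
  tot_income = c(4500, 0, NA, 6000)
)

# Loan-to-income ratio; zero or missing income yields NA instead of Inf.
toy$lti_temp <- ifelse(
  is.na(toy$tot_income) | toy$tot_income <= 0,
  NA_real_,
  toy$loan_amt / toy$tot_income
)

print(toy)
```

Rows with an NA ratio would later be dropped by `na.omit(train)`, so in practice it may be worth deciding how to treat them before modeling.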
> set.seed(10)
> select<-sample(1:nrow(accepts),length(accepts$bad_ind)*0.7)
> train=accepts[select,]
> test=accepts[-select,]
> summary(train$bad_ind)
   0    1 
3233  858 
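The class counts above already say something about how hard the problem is. As a rough sketch, the snippet below computes the class shares and the accuracy of a trivial model that always predicts the majority class, using the counts printed above (taken before `na.omit` is applied).

```r
# Class counts copied from the summary(train$bad_ind) output above.
counts <- c("0" = 3233, "1" = 858)

# Share of each class in the training sample.
class_share <- counts / sum(counts)
print(round(class_share, 3))

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy <- max(counts) / sum(counts)
print(round(baseline_accuracy, 3))
```

Roughly one in five training customers defaulted, so a useful model has to do noticeably better than simply predicting "no default" for everyone.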
> ###################################
> ## Section 1: C5.0 algorithm
> ###################################
> train<-na.omit(train)
> library(C50)
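> # Note: the C50 package in R is relatively new and has some rough edges; for example, missing values or character-type variables can trigger the error "c50 code called exit with value 1"
> ## Build the model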
> tc<-C5.0Control(subset =F,CF=0.25,winnow=F,noGlobalPruning=F,minCases =20)
> model <- C5.0(bad_ind ~.,data=train,rules=F,control =tc)
> summary( model )
Call:
C5.0.formula(formula = bad_ind ~ ., data = train, rules = F, control = tc)
C5.0 [Release 2.07 GPL Edition] Mon May 22 21:35:14 2017
-------------------------------
Class specified by attribute `outcome'
Read 3001 cases (19 attributes) from undefined.data
Decision tree:

fico_score > 661: 0 (2161/262)
fico_score <= 661:
:...tot_tr > 13:
    :...ltv <= 83: 0 (49/4)
    :   ltv > 83:
    :   :...fico_score <= 588: 1 (52/20)
    :       fico_score > 588: 0 (411/125)
    tot_tr <= 13:
    :...rev_util > 116: 1 (26/5)
        rev_util <= 116:
        :...used_ind > 0: 0 (181/78)
            used_ind <= 0:
            :...purch_price <= 25000: 1 (92/40)
                purch_price > 25000: 0 (29/5)
Evaluation on training data (3001 cases):

        Decision Tree
      ----------------
      Size      Errors

         8  539(18.0%)   <<

       (a)   (b)    <-classified as
      ----  ----
      2357    65    (a): class 0
       474   105    (b): class 1
Attribute usage:
100.00% fico_score
27.99% tot_tr
17.06% ltv
10.93% rev_util
10.06% used_ind
4.03% purch_price
Time: 0.0 secs
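The 18.0% error rate alone hides how the tree treats the two classes. The sketch below recomputes accuracy, recall, and precision for the default class (class 1) from the confusion matrix printed above, together with the majority-class baseline on the same 3001 cases. The variable names are illustrative.

```r
# Confusion matrix copied from the training-set evaluation above:
# rows are actual classes, columns are predicted classes.
cm <- matrix(
  c(2357, 65,
    474, 105),
  nrow = 2, byrow = TRUE,
  dimnames = list(actual = c("0", "1"), predicted = c("0", "1"))
)

accuracy  <- sum(diag(cm)) / sum(cm)        # share of cases classified correctly
recall    <- cm["1", "1"] / sum(cm["1", ])  # share of actual defaulters flagged
precision <- cm["1", "1"] / sum(cm[, "1"])  # share of flagged cases that defaulted
majority  <- max(rowSums(cm)) / sum(cm)     # accuracy of always predicting class 0

print(round(c(accuracy = accuracy, recall = recall,
              precision = precision, majority_baseline = majority), 3))
```

On these numbers the tree reaches about 82.0% training accuracy against a majority baseline of about 80.7%, and it flags only about 18% of the actual defaulters. For a default model, recall on class 1 is probably the more informative figure.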
> # Plot the tree
> plot(model)
> C5imp(model)
              Overall
fico_score     100.00
tot_tr          27.99
ltv             17.06
rev_util        10.93
used_ind        10.06
purch_price      4.03
tot_derog        0.00
age_oldest_tr    0.00
tot_open_tr      0.00
tot_rev_tr       0.00
tot_rev_debt     0.00
tot_rev_line     0.00
msrp             0.00
down_pyt         0.00
loan_term        0.00
loan_amt         0.00
tot_income       0.00
veh_mileage      0.00
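The importance values above are the same figures as the "Attribute usage" block, since `C5imp` defaults to `metric = "usage"`. They appear to be the percentage of training cases that reach a split on each attribute. As a sanity check under that reading, the sketch below rebuilds the percentages from the leaf counts in the printed tree.

```r
# Number of training cases that reach a split on each attribute,
# read off the leaf sizes in the printed tree.
n_total <- 3001

cases_reaching <- c(
  fico_score  = n_total,             # root test, applied to every case
  tot_tr      = n_total - 2161,      # cases with fico_score <= 661
  ltv         = 49 + 52 + 411,       # cases with tot_tr > 13
  rev_util    = 26 + 181 + 92 + 29,  # cases with tot_tr <= 13
  used_ind    = 181 + 92 + 29,       # cases with rev_util <= 116
  purch_price = 92 + 29              # cases with used_ind <= 0
)

usage_pct <- round(100 * cases_reaching / n_total, 2)
print(usage_pct)
```

The result matches the printed values, which means usage measures how many cases pass through a variable rather than how much that variable improves the split. The alternative `metric = "splits"` counts splits instead.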
> # Generate rules
> rule<- C5.0(bad_ind ~.,data=train,rules=T,control =tc)
> summary( rule )
Call:
C5.0.formula(formula = bad_ind ~ ., data = train, rules = T, control = tc)
C5.0 [Release 2.07 GPL Edition] Mon May 22 21:35:15 2017
-------------------------------
Class specified by attribute `outcome'
Read 3001 cases (19 attributes) from undefined.data
Rules:

Rule 1: (2161/262, lift 1.1)
    fico_score > 661
    ->  class 0  [0.878]

Rule 2: (2015/301, lift 1.1)
    tot_tr > 13
    ->  class 0  [0.850]

Rule 3: (2879/531, lift 1.0)
    rev_util <= 116
    ->  class 0  [0.815]

Rule 4: (26/5, lift 4.1)
    tot_tr <= 13
    rev_util > 116
    fico_score <= 661
    ->  class 1  [0.786]

Default class: 0
Evaluation on training data (3001 cases):

            Rules
      ----------------
        No      Errors

         4  563(18.8%)   <<

       (a)   (b)    <-classified as
      ----  ----
      2417     5    (a): class 0
       558    21    (b): class 1
Attribute usage:
96.80% rev_util
72.88% fico_score
68.01% tot_tr
Time: 0.1 secs
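Each rule header reads (n/m, lift), where n is the number of training cases the rule covers and m the number of those not in the predicted class; the bracketed value is the rule's confidence. As far as I can tell, the confidence is the Laplace ratio (n - m + 1) / (n + 2), and lift is that confidence divided by the predicted class's share of the training data. The sketch below checks this reading against the printed numbers; the data frame and its column names are illustrative.

```r
# Rule statistics copied from the rule output above:
# n = cases covered by the rule, m = covered cases not in the predicted class.
rules <- data.frame(
  rule  = 1:4,
  n     = c(2161, 2015, 2879, 26),
  m     = c(262, 301, 531, 5),
  class = c("0", "0", "0", "1"),
  stringsAsFactors = FALSE
)

# Class shares in the 3001 training cases (row totals of the confusion matrix).
prior <- c("0" = 2422, "1" = 579) / 3001

# Laplace-corrected accuracy, then lift relative to the class share.
rules$confidence <- (rules$n - rules$m + 1) / (rules$n + 2)
rules$lift <- unname(rules$confidence / prior[rules$class])

print(round(rules[, c("confidence", "lift")], 3))
```

Under this reading, only Rule 4 carries real signal for default: it covers just 26 cases, but its lift of about 4.1 means defaults are roughly four times as frequent among them as in the training data overall. The three class-0 rules have lifts close to 1 and add little beyond the base rate.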
Reference: CDA "Credit Risk Modeling" (《信用风险建模》) micro-specialization course