1. Blending
For example, split the data into a train set and a test set. For each base model model_i (say, xgboost), first train model_i on the full training data and predict the test set, producing a prediction vector v_i. Then run 5-fold CV on the training data: for each fold, train model_i_j on the other four folds and predict the held-out validation fold, giving a vector t_i_j; concatenating the five fold vectors yields t_i, which lines up row-for-row with the training labels, just as v_i lines up with the test set. Every base model produces such a pair (t_i, v_i). Finally, a top-level model (e.g. LR or another linear model) is trained on the t vectors; this blender model then makes the final predictions on the v vectors.
In other words, you need to build a table like the one below: the training-set entries are generated by the CV splits (out-of-fold predictions), while the test-set entries come from models trained on the full training data predicting the test set. A code sketch of this construction follows the table.
| id | model_1 | model_2 | model_3 | model_4 | label |
|----|---------|---------|---------|---------|-------|
| 1  | 0.1     | 0.2     | 0.14    | 0.15    | 0     |
| 2  | 0.2     | 0.22    | 0.18    | 0.3     | 1     |
| 3  | 0.8     | 0.7     | 0.88    | 0.6     | 1     |
| 4  | 0.3     | 0.3     | 0.2     | 0.22    | 0     |
| 5  | 0.5     | 0.3     | 0.6     | 0.5     | 1     |
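A minimal sketch of how such a table is built, using plain sklearn rather than any fusion library; the toy data from make_classification and the two base models are illustrative assumptions, not part of the original example:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict

# toy stand-ins for the real train/test data
X_train, y_train = make_classification(n_samples=500, random_state=0)
X_test, _ = make_classification(n_samples=100, random_state=1)

base_models = [RandomForestClassifier(n_estimators=50, random_state=0),
               LogisticRegression(max_iter=1000)]

cv = KFold(n_splits=5, shuffle=True, random_state=0)
train_meta, test_meta = [], []
for model in base_models:
    # t_i: out-of-fold predictions on the training set (the 5 fold vectors, concatenated)
    t_i = cross_val_predict(model, X_train, y_train, cv=cv, method='predict_proba')[:, 1]
    # v_i: test-set predictions from the model refit on ALL training data
    v_i = model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    train_meta.append(t_i)
    test_meta.append(v_i)

# second stage: train the blender on the t vectors, predict on the v vectors
blender = LogisticRegression()
blender.fit(np.column_stack(train_meta), y_train)
final_pred = blender.predict_proba(np.column_stack(test_meta))[:, 1]
```

Each column of `np.column_stack(train_meta)` is one model_i column of the table above, and `y_train` is the label column.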
Advantages of blending: it is simpler than stacking and avoids data leakage, because the generalizers and the stacker are trained on different data; other models can be added to the blender at any time.
Differences from stacking:
stacking predicts the test set directly with the second-stage model built on top of base models trained on the full training data;
blending has each CV fold's sub-model predict the test set as well, and averages the n CV predictions.
Blending: train different base models on disjoint data and take a (weighted) average of their outputs.
Stacking: split the training data into two disjoint sets; train several learners on the first set; run them on the second set; then use the resulting predictions as inputs and the true labels as outputs to train a higher-level learner.
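By the second pair of definitions, blending keeps the two stages on disjoint data via a simple holdout split. A minimal sketch of that variant, again with illustrative toy data and base models:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# carve a holdout set out of the training data for the second stage
X_base, X_hold, y_base, y_hold = train_test_split(X_train, y_train, test_size=0.2, random_state=0)

base_models = [RandomForestClassifier(n_estimators=50, random_state=0),
               LogisticRegression(max_iter=1000)]

hold_meta, test_meta = [], []
for model in base_models:
    model.fit(X_base, y_base)  # first stage sees only the base split
    hold_meta.append(model.predict_proba(X_hold)[:, 1])
    test_meta.append(model.predict_proba(X_test)[:, 1])

# second stage: the blender is trained only on rows the base models never saw
blender = LogisticRegression()
blender.fit(np.column_stack(hold_meta), y_hold)
blend_pred = blender.predict_proba(np.column_stack(test_meta))[:, 1]
```

Because the base models and the blender never share training rows, there is no leakage between the two stages.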
model-fusion modules
## heamy model-fusion modules
from heamy.dataset import Dataset
from heamy.estimator import Regressor, Classifier
from heamy.pipeline import ModelsPipeline
## common sklearn modules
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor  ## random-forest regressor
from sklearn.neighbors import KNeighborsRegressor  ## k-nearest-neighbors regressor
from sklearn.linear_model import LinearRegression  ## linear regression model
from sklearn.model_selection import train_test_split  ## module for splitting train and test sets
from sklearn.metrics import mean_absolute_error  ## evaluation metric
from sklearn import metrics
import pandas as pd
import os
os.chdir('F://gbdt学习')  ## working directory for the output csv
data = load_boston()
X, y = data['data'], data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=111)
stack
df = pd.DataFrame({'y_test': y_test})  ## collect the predictions of each fusion method
# create dataset
dataset = Dataset(X_train,y_train,X_test)
# initialize RandomForest & LinearRegression
model_rf = Regressor(dataset=dataset, estimator=RandomForestRegressor, parameters={'n_estimators': 50}, name='rf')
model_lr = Regressor(dataset=dataset, estimator=LinearRegression, parameters={'normalize': True}, name='lr')
pipeline = ModelsPipeline(model_rf, model_lr)
stack_ds = pipeline.stack(k=10, seed=111)
# Train LinearRegression on the stacked data (second stage, linear stacking)
stacker = Regressor(dataset=stack_ds, estimator=LinearRegression)
results = stacker.predict()  ## predictions on the test set
df['stacks'] = results
# Validate results using 10-fold cross-validation
results = stacker.validate(k=10, scorer=mean_absolute_error)
blending
# reuse the dataset, base models, and pipeline defined in the stacking example above
# blend() holds out a proportion of the training data and returns a new dataset
# of holdout predictions for the second stage
blend_ds = pipeline.blend(proportion=0.2, seed=111)
# Train LinearRegression on the blended data (second stage)
blender = Regressor(dataset=blend_ds, estimator=LinearRegression)
results = blender.predict()  ## predictions on the test set
df['blend'] = results
# Validate results using 10-fold cross-validation
results = blender.validate(k=10, scorer=mean_absolute_error)
weights
model_rf = Regressor(dataset=dataset, estimator=RandomForestRegressor, parameters={'n_estimators': 151}, name='rf')
model_lr = Regressor(dataset=dataset, estimator=LinearRegression, parameters={'normalize': True}, name='lr')
model_knn = Regressor(dataset=dataset, estimator=KNeighborsRegressor, parameters={'n_neighbors': 15}, name='knn')
pipeline = ModelsPipeline(model_rf, model_lr, model_knn)
# search for per-model weights that minimize MAE, then take the weighted average
weights = pipeline.find_weights(mean_absolute_error)
result = pipeline.weight(weights)
results = result.execute()  ## predictions on the test set
print(metrics.mean_absolute_error(y_test, results))
df['weights'] = results
df.to_csv('results.csv',index=False)
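Conceptually, find_weights searches for the per-model weights that minimize the chosen scorer. A minimal sketch of that idea with scipy; the SLSQP setup and the convexity constraint are assumptions for illustration, not heamy's exact implementation:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import mean_absolute_error

def search_weights(preds, y_true):
    """Find non-negative weights summing to 1 that minimize MAE.

    preds: array of shape (n_samples, n_models), one prediction column per model.
    """
    n_models = preds.shape[1]

    def loss(w):
        return mean_absolute_error(y_true, preds @ w)

    w0 = np.full(n_models, 1.0 / n_models)  # start from a plain average
    res = minimize(loss, w0, method='SLSQP',
                   bounds=[(0.0, 1.0)] * n_models,
                   constraints={'type': 'eq', 'fun': lambda w: w.sum() - 1.0})
    return res.x

# usage: stack the three base-model prediction vectors column-wise, then average
# preds = np.column_stack([rf_pred, lr_pred, knn_pred])
# w = search_weights(preds, y_val)
# final = preds @ w
```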