Why Would We Want to Ensemble Learners Together?
There are two competing variables in finding a well fitting machine learning model: Bias and Variance.
Bias: When a model has high bias, this means that means it doesn't do a good job of bending to the data.
Variance: When a model has high variance, this means that it changes drastically to meet the needs of every point in our dataset.
1、机器学习算法中两个非常重要的影响因素:偏差和方差
高偏差机器学习算法会忽略训练数据,不能很好的拟合数据。
高方差的机器学习算法会对数据高度敏感,只能复现曾经见过的的东西,对于之前从未见过的情况,它的反应非常差。(因为没有适当的偏差让它泛化新的东西)
真正想要的算法是两者折中,也就是所谓的偏差--方差权衡。希望算法具有一定的泛化能力,但仍然对训练数据开放,能根据数据来调整模型。
Introducing Randomness Into Ensembles
Another method that is used to improve ensemble methods is to introduce randomness into high variance algorithms before they are ensembled together. The introduction of randomness combats the tendency of these algorithms to overfit (or fit directly to the data available). There are two main ways that randomness is introduced:
Bootstrap the data - that is, sampling the data with replacement and fitting your algorithm and fitting your algorithm to the sampled data.
Subset the features - in each split of a decision tree or with each algorithm used an ensemble only a subset of the total possible features are used.
2、随机森林算法:
随机从数据中挑选几列,并根据这些列构建决策树,然后随机选取其他的几列,再次构建决策树,然后让决策树进行选择。就只需让所有的决策树做出预测,并选取结果中显示最多的。
3、Bagging
4、Adaboost
5、Adaboost in sklearn
>>> from sklearn.ensemble import AdaBoostClassifier
>>> model = AdaBoostClassifier()
>>> model.fit(x_train, y_train)
>>> model.predict(x_test)
高参数
base_estimator:The model utilized for the weak learners (Warning: Don't forget to import the model that you decide to use for the weak learner).
n_estimators:The maximum number of weak learners used.
>>> from sklearn.tree import DecisionTreeClassifier
>>> model = AdaBoostClassifier(base_estimator = DecisionTreeClassifier(max_depth=2), n_estimators =4)
回顾:
在这节课学习了集成方法,两个权衡变量:偏差和方差。高偏差低方差的模型对数据拟合不够好,灵活性很低;低偏差高方差的模型会导致过拟合,灵活性太高了。
为了权衡偏差和方差,集成方法是一种普遍使用的方法。
有两种随机化技术来对抗过拟合:
1、Bootstrap the data - that is, sampling the data with replacement and fitting your algorithm and fitting your algorithm to the sampled data.
2、Subset the features - in each split of a decision tree or with each algorithm used an ensemble only a subset of the total possible features are used.
技术方法: