- Dimensionality reduction methods:
  - principal component analysis (PCA)
  - canonical correlation analysis (CCA)
  - singular value decomposition (SVD)
- Preprocessing raw data, in three steps:
  - data preprocessing
  - feature engineering
  - feature selection, which in turn has three methods:
    - filter: score the features and select the best subset
    - wrapper: a loop of generating a subset -> running the learning algorithm
    - embedded method: a loop of generating a subset -> running the learning algorithm + measuring performance
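The filter method can be sketched as follows. This is a minimal illustration, not a definitive implementation: it assumes Pearson correlation with the target as the scoring rule, and the names `pearson`, `filter_select`, `rows`, and `target` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def filter_select(rows, target, k):
    """Return the indices of the k features most correlated (in absolute value) with target."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, target)), j))
    # Highest score first; ties are broken by the lower feature index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return sorted(j for _, j in scores[:k])


# Toy data (made up): feature 0 tracks the target, feature 1 is constant,
# feature 2 is only weakly related.
rows = [
    [1.0, 5.0, 0.3],
    [2.0, 5.0, 0.9],
    [3.0, 5.0, 0.1],
    [4.0, 5.0, 0.7],
]
target = [2.0, 4.1, 5.9, 8.2]
selected = filter_select(rows, target, k=1)
```

No learning algorithm is involved in the scoring, which is what distinguishes a filter from the wrapper and embedded loops.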
The process of machine learning
Some classification algorithms
- nearest neighbors
- linear SVM
- RBF SVM
- Gaussian process
- decision tree
- random forest
- neural net
- AdaBoost
- naive Bayes
- QDA (quadratic discriminant analysis)
Several algorithms
A. Regression
- Ordinal regression: data in rank-ordered categories
- Poisson Regression: predicts event counts
- Fast forest quantile regression: predicts a distribution
- Linear regression: fast training, linear model
- Bayesian linear regression: linear model, small data sets
- neural network regression: accurate, long training times
- decision forest regression: accurate, fast training times
- boosted decision tree regression: accurate, fast training times, large memory footprint
B. Clustering
- K-means: unsupervised learning
C. Anomaly detection
- PCA-based anomaly detection: fast training times
- One-class SVM: more than 100 features, aggressive boundary
D. Two-class classification
- two-class SVM: more than 100 features, linear model
- two-class averaged perceptron: fast training, linear model
- two-class bayes point machine: fast training, linear model
- two-class decision forest
- two-class logistic regression
- two-class boosted decision tree
- two-class decision jungle
- two-class locally deep SVM
- two-class neural network
E. Multiclass classification
- multiclass logistic regression
- multiclass neural network
- multiclass decision forest
- multiclass decision jungle
- one-vs-all multiclass: depends on the underlying two-class classifier
Semi-supervised learning
Sits between supervised and unsupervised learning: a small part of the data is labeled and most of it is not. It can achieve high accuracy, and compared with supervised learning its training cost is much lower.
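Reinforcement learning
Learns from a sequence of actions how to maximize a reward function, where the feedback marks actions as "good" or "bad". Reinforcement learning is often used in autonomous driving, where decisions are made from a series of feedback signals from the surrounding environment.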
Taxonomy chart of machine learning algorithms
A tip
If the results are very good during training but poor during evaluation, the model is most likely overfitting.
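Three common validation methods
- hold-out validation: set aside part of the data for validation; suited to large data sets
- k-fold cross validation: split the training set into k equal parts; suited to small data sets
- leave-one-out cross validation (LOOCV): a special case of k-fold cross validation, repeated until every observation has served as validation data once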
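The three schemes differ only in how sample indices are partitioned. A minimal sketch, assuming plain index lists and hypothetical function names (`hold_out_split`, `k_fold_splits`, `leave_one_out_splits`); the 20% validation fraction is an arbitrary default, not a value from these notes.

```python
import random


def hold_out_split(n_samples, validation_fraction=0.2, seed=0):
    """Shuffle the indices once and reserve a fraction of them for validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_samples * validation_fraction))
    return indices[n_val:], indices[:n_val]  # (train, validation)


def k_fold_splits(n_samples, k):
    """Yield (train, validation) index lists; each sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def leave_one_out_splits(n_samples):
    """LOOCV is k-fold cross validation with k equal to the number of samples."""
    return k_fold_splits(n_samples, n_samples)
```

In practice the data would usually be shuffled before k-fold splitting; that step is left out here to keep the folds easy to inspect.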
Several ways to evaluate a model
A. Accuracy, precision, recall
To decide which model performs best, compare F scores, which combine precision and recall. The larger F is, the better.
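The defining formula did not survive in these notes, so the sketch below assumes the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and the F1 score as their harmonic mean, 2PR / (P + R). The function names and the toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives and true negatives."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1); undefined ratios are reported as 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1


# Toy labels (made up): 3 positives, 5 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
accuracy, precision, recall, f1 = classification_metrics(y_true, y_pred)
```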
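B. ROC curves
An advantage of the ROC curve is that it is not affected by the class distribution, including imbalanced class distributions.
C. AUC (area under the curve)
The higher the AUC, the better.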
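A minimal sketch of how an ROC curve and its AUC can be computed by sweeping the decision threshold over the predicted scores. The names `roc_points` and `auc` and the toy scores are hypothetical; labels are assumed to be 0/1.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by lowering the threshold step by step."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC needs at least one positive and one negative sample")
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is predicted positive.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve, by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy scores (made up): one positive is ranked below one negative.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
curve = roc_points(y_true, scores)
area = auc(curve)
```

Because both axes are rates within one class, repeating every negative sample ten times leaves the curve and the AUC unchanged, which is the insensitivity to class distribution noted above.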
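D. R² (coefficient of determination), in the range [0, 1]
It is a standard way of measuring how well the model fits the data.
Drawback: R² never decreases as more input variables are added, so the model with more variables always has the larger R² and would be judged the better model. In addition, when the model uses higher-order terms, noise is easily mistaken for signal, that is, the noise takes part in training.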
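A minimal sketch of the computation, assuming the usual definition R² = 1 - SS_res / SS_tot. The `adjusted_r_squared` helper is an addition that these notes do not mention: it is one common way to penalise extra predictors, shown only to make the drawback concrete. All names and numbers are hypothetical.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_predictors):
    """Adjusted R^2 (not from the notes): shrinks R^2 as predictors are added."""
    dof = n_samples - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more samples than predictors + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof


# Toy values (made up).
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]
r2 = r_squared(y_true, y_pred)
adj_one = adjusted_r_squared(r2, n_samples=4, n_predictors=1)
adj_two = adjusted_r_squared(r2, n_samples=4, n_predictors=2)
```

With the same R², the adjusted value drops as the predictor count grows, so an extra variable has to earn its place.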
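A tip
A model with very high accuracy is not necessarily useful. For example, a model that reports 99% "no cancer" and 1% "cancer" is a case of imbalanced class distribution. In that situation two models should be built: model A to detect cancer and model B to detect the absence of cancer.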
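A small worked example of why accuracy misleads here. It assumes a made-up data set of 100 patients with one cancer case and a hypothetical model that always predicts "no cancer".

```python
def accuracy_and_recall(y_true, y_pred, positive=1):
    """Return (accuracy, recall on the positive class)."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    recall = true_pos / actual_pos if actual_pos else 0.0
    return correct / len(y_true), recall


# 1 = cancer, 0 = no cancer (made-up data: 1 case in 100 patients).
y_true = [1] + [0] * 99
y_pred = [0] * 100  # hypothetical model that always answers "no cancer"
accuracy, recall = accuracy_and_recall(y_true, y_pred)
```

The accuracy is 99%, yet the recall on the cancer class is zero: the single real case is missed.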
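Bias and variance
Underfitting corresponds to high bias.
Overfitting corresponds to high variance.
When judging a model: if it does well on the training set but poorly on the validation set, the problem is high variance (overfitting); if it does poorly on both the training set and the validation set, the problem is high bias (underfitting).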
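The rule above can be written as a small check. This is only a sketch: the function name `diagnose` and both thresholds (`good_enough`, `max_gap`) are hypothetical and would need tuning for a real task; scores are assumed to be "higher is better", such as accuracy.

```python
def diagnose(train_score, validation_score, good_enough=0.9, max_gap=0.05):
    """Classify a fit from its training and validation scores.

    good_enough and max_gap are arbitrary illustrative thresholds.
    """
    if train_score < good_enough:
        # Poor even on the data it was trained on: the model is too simple.
        return "high bias (underfitting)"
    if train_score - validation_score > max_gap:
        # Good on training data but much worse on unseen data.
        return "high variance (overfitting)"
    return "ok"


overfit = diagnose(train_score=0.99, validation_score=0.70)
underfit = diagnose(train_score=0.60, validation_score=0.58)
balanced = diagnose(train_score=0.93, validation_score=0.91)
```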
Solutions: