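While studying Andrew Ng's "Machine Learning" course and the courses "Machine Learning Foundations" and "Machine Learning Techniques" by Hsuan-Tien Lin of National Taiwan University, I put together the following summary of the supervised learning models they cover.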

Classification

Classification means predicting which category an input belongs to.

PLA

Definition

PLA stands for Perceptron Learning Algorithm and is a classification method. PLA usually comes in two variants, Naive PLA and Pocket PLA. A perceptron is a binary linear classifier.

Applicable conditions

Binary linear classification.

How to use

Comparison and further notes

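The idea of Naive PLA is simple: keep correcting the weight vector W until W classifies all the data correctly. A major problem with Naive PLA is that if the data are noisy and cannot be classified perfectly (that is, they are not linearly separable), the algorithm never terminates. For noisy data we can therefore only hope to find the result with the fewest errors, but finding it is an NP-hard problem.

Pocket PLA is a greedy approximation algorithm similar to Naive PLA. It replaces sequential iteration with random iteration and corrects the weights whenever a mistake is found. During the corrections it keeps (in its "pocket") the weight vector that has made the fewest mistakes so far.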
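The pocket idea can be sketched in a few lines of plain Python. This is only a minimal illustration under my own assumptions: the function names (`pocket_pla`, `count_errors`), the toy data, the update budget `max_updates`, and the convention that `sign(0)` is -1 are all made up for the example and do not come from the courses.

```python
import random

def sign(v):
    # Assumption for this sketch: a score of exactly 0 counts as -1
    return 1 if v > 0 else -1

def predict(w, x):
    # x already contains the constant bias feature x[0] = 1
    return sign(sum(wi * xi for wi, xi in zip(w, x)))

def count_errors(w, data):
    return sum(1 for x, y in data if predict(w, x) != y)

def pocket_pla(data, max_updates=100, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    best_w, best_err = list(w), count_errors(w, data)
    for _ in range(max_updates):
        mistakes = [(x, y) for x, y in data if predict(w, x) != y]
        if not mistakes:
            break
        x, y = rng.choice(mistakes)                 # random misclassified example
        w = [wi + y * xi for wi, xi in zip(w, x)]   # PLA correction: w <- w + y * x
        err = count_errors(w, data)
        if err < best_err:                          # keep the best weights in the pocket
            best_w, best_err = list(w), err
    return best_w, best_err

# Toy data: ([1, x1, x2], label). The last point is a noisy label placed
# inside the positive cluster, so no line can classify everything correctly.
data = [
    ([1, 2.0, 2.0], 1), ([1, 3.0, 1.5], 1), ([1, 2.5, 3.0], 1),
    ([1, -1.0, -2.0], -1), ([1, -2.0, -1.0], -1), ([1, -1.5, -2.5], -1),
    ([1, 2.5, 2.2], -1),
]
w, err = pocket_pla(data)
print(w, err)
```

Because of the noisy point, Naive PLA would loop forever on this data, while the pocket version stops after a fixed budget and returns the best weights it has seen.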

Regression

Regression vs. classification:
Classification trees have dependent variables that are categorical and unordered. Regression trees have dependent variables that are continuous values or ordered whole values. Regression means to predict the output value using training data. Classification means to group the output into a class.

When it comes to how to figure out which is a classification problem and which is a regression problem, an easy way to think about it is to ask yourself if you are trying to predict which class (or category) something belongs to or are you trying to predict a value.

Predicting a class is classification (ham/spam, image of a cat/not an image of a cat, etc.). Predicting a value (a number) is regression (housing prices, tomorrow's temperature, etc.). Classification can be built on top of regression.

Linear Regression

Definition

[Figure: Linear Regression]

In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. The case of one explanatory variable is called simple linear regression.

Applicable conditions

The data in the training set follow an approximately linear relationship, and the predicted output is a number.

How to use
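As a small illustration of simple linear regression (one explanatory variable), the least-squares slope and intercept can be computed in closed form. The function name and the toy data below are assumptions made for this sketch, not part of the course material.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one explanatory variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data generated from y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_simple_linear_regression(xs, ys)
print(slope, intercept)
```

With several explanatory variables the same idea is usually solved with the normal equation or gradient descent instead of this one-variable formula.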

Logistic Regression

Definition

[Figure: Logistic Regression]

Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary). Like all regression analyses, the logistic regression is a predictive analysis. Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.

Applicable conditions

The predicted output is a class label.

How to use
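A minimal sketch of binary logistic regression with one feature, trained by batch gradient descent on the cross-entropy loss. The learning rate, the number of epochs, the function names, and the toy "hours studied vs. passed" data are all assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_regression(xs, ys, lr=0.5, epochs=2000):
    """Batch gradient descent on the average cross-entropy loss (one feature)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict_proba(w, b, x):
    # Estimated probability that the label is 1
    return sigmoid(w * x + b)

# Hypothetical data: hours studied -> passed (1) or failed (0)
xs = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = fit_logistic_regression(xs, ys)
print(predict_proba(w, b, 1.0), predict_proba(w, b, 4.0))
```

The model outputs a probability; thresholding it at 0.5 turns it into a class prediction.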

Comparison and further notes

Difference between linear regression and logistic regression:
In linear regression, the outcome (dependent variable) is continuous. It can have any one of an infinite number of possible values. In logistic regression, the outcome (dependent variable) has only a limited number of possible values. Logistic regression is used when the response variable is categorical in nature.

Generative Learning algorithms

Consider a classification problem in which we want to learn to distinguish between elephants (y = 1) and dogs (y = 0), based on some features of an animal. Given a training set, an algorithm like logistic regression or the perceptron algorithm (basically) tries to find a straight line (that is, a decision boundary) that separates the elephants and dogs. Then, to classify a new animal as either an elephant or a dog, it checks on which side of the decision boundary it falls, and makes its prediction accordingly. A generative learning algorithm takes a different approach: it builds a model of what elephants look like and a separate model of what dogs look like, then matches a new animal against each model to see which one it resembles more.

Gaussian Discriminant Analysis model (GDA)

Definition

[Figure: GDA]

GDA is a method for data classification commonly used when the data can be approximated with a Normal distribution. You will need a training set, i.e. a collection of data that has already been classified. These data are used to train your classifier and obtain a discriminant function that tells you which class a data point most probably belongs to.

Applicable conditions

Training data can be approximated with a Normal distribution.

How to use
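A rough sketch of GDA for a single real-valued feature: estimate the class prior, one mean per class, and a variance shared by both classes, then classify by comparing p(x|y)p(y). Restricting to one dimension and sharing the variance are simplifying assumptions of this example, and the function names and data are hypothetical.

```python
import math

def fit_gda_1d(xs, ys):
    """Maximum-likelihood estimates for two Gaussian classes with a shared variance."""
    n = len(xs)
    phi = sum(ys) / n                                  # prior p(y = 1)
    mu = {}
    for c in (0, 1):
        points = [x for x, y in zip(xs, ys) if y == c]
        mu[c] = sum(points) / len(points)              # class mean
    var = sum((x - mu[y]) ** 2 for x, y in zip(xs, ys)) / n
    return phi, mu, var

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def predict_gda(model, x):
    phi, mu, var = model
    p1 = gaussian_pdf(x, mu[1], var) * phi             # proportional to p(y = 1 | x)
    p0 = gaussian_pdf(x, mu[0], var) * (1 - phi)       # proportional to p(y = 0 | x)
    return 1 if p1 > p0 else 0

# Toy data: class 0 clusters around 1.0, class 1 around 3.0
xs = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8]
ys = [0, 0, 0, 1, 1, 1]
model = fit_gda_1d(xs, ys)
print(predict_gda(model, 1.1), predict_gda(model, 2.9))
```

Unlike logistic regression, the model is fit per class first and the decision only comes from comparing the two class models afterwards.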

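Comparison and further notes

Difference between GDA and logistic regression:
GDA makes stronger modeling assumptions (strong assumption), while logistic regression makes weaker ones (weak assumption). See Andrew Ng's Notes 2, page 6 of 14.

A regression model such as logistic regression is a discriminative model: it estimates the probability of the outcome directly from the feature values. For example, to decide whether an animal is a goat or a sheep, the discriminative approach first learns a model from historical data, then extracts the animal's features and predicts the probability that it is a goat and the probability that it is a sheep. The alternative approach first learns a goat model from the features of goats and a sheep model from the features of sheep. It then extracts the features of the new animal, evaluates their probability under the goat model and under the sheep model, and picks whichever is larger.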

Naive Bayes

Definition

[Figure: Naive Bayes]

It is a classification technique based on Bayes' Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

Applicable conditions

Training set data xi are discrete-valued.

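How to use

Assume that keywords are independent of one another. Naive Bayes is commonly used for classification problems such as spam filtering.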
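A tiny spam/ham sketch of the keyword-independence assumption. The use of word counts with add-one (Laplace) smoothing, the function names, and the four toy messages are all assumptions made for this example.

```python
import math
from collections import Counter

def train_naive_bayes(docs):
    """docs: list of (words, label) pairs."""
    class_counts = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in class_counts}
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict_naive_bayes(model, words):
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)            # log prior
        for word in words:
            if word in vocab:                            # ignore unseen words
                # Independence assumption: per-word log-likelihoods simply add up.
                # The "+ 1" is add-one smoothing (an assumption of this sketch).
                score += math.log((word_counts[label][word] + 1)
                                  / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

docs = [
    (["win", "money", "now"], "spam"),
    (["free", "money", "offer"], "spam"),
    (["meeting", "schedule", "today"], "ham"),
    (["project", "meeting", "notes"], "ham"),
]
model = train_naive_bayes(docs)
print(predict_naive_bayes(model, ["free", "money"]))
print(predict_naive_bayes(model, ["meeting", "today"]))
```

Working in log space avoids multiplying many small probabilities together.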

Comparison and further notes

GDA vs. Naive Bayes
In GDA, the feature vectors x were continuous, real-valued vectors. Let's now talk about a different learning algorithm in which the xi's are discrete-valued.
Logistic regression vs. Naive Bayes
Logistic regression falls under the category of discriminative classifiers, which model the posterior P(class|x) directly from the data, or learn a direct map from inputs x to the class labels.
Discriminant analysis, by contrast, is a generative classifier (as is Naive Bayes): it learns a model of the joint probability P(x, class) and makes its predictions using Bayes' rule.

Support Vector Machine (SVM)

SVM is a supervised machine learning algorithm which can be used for classification or regression problems. It uses a technique called the kernel trick to transform your data and then, based on these transformations, finds an optimal boundary between the possible outputs. Simply put, it does some extremely complex data transformations, then figures out how to separate your data based on the labels or outputs you've defined.

When it comes to computing the SVM classifier, there are three approaches: primal, dual and kernel.

Linear SVM

Margin: If the training data are linearly separable, we can select two parallel hyperplanes that separate the two classes of data, so that the distance between them is as large as possible. The region bounded by these two hyperplanes is called the "margin", and the maximum-margin hyperplane is the hyperplane that lies halfway between them.

Hard and soft margin:……

Non-linear SVM

The idea is to achieve linear separability by mapping the data to a higher-dimensional space.
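To make the mapping idea concrete, here is a small sketch with a degree-2 feature map on 2-D inputs. It checks two things: the polynomial kernel (x·z)^2 equals the inner product in the mapped space (the kernel trick), and XOR-style data that no line separates in 2-D become separable on a single mapped coordinate. The particular feature map `phi` and the toy points are assumptions picked for illustration.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def phi(x):
    """Explicit degree-2 feature map for 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, z):
    """Same inner product as dot(phi(x), phi(z)), computed without mapping."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
print(poly_kernel(x, z), dot(phi(x), phi(z)))

# XOR-style data: not linearly separable in the original 2-D space
points = [((1.0, 1.0), 1), ((-1.0, -1.0), 1), ((1.0, -1.0), -1), ((-1.0, 1.0), -1)]
# In the mapped space, the sign of the middle coordinate alone separates the classes
separated = all((1 if phi(p)[1] > 0 else -1) == label for p, label in points)
print(separated)
```

The kernel lets the classifier work with inner products of the mapped points without ever building the higher-dimensional vectors.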

AdaBoost (Adaptive Boosting)

Definition

See Hsuan-Tien Lin's course, Chapters 7 - 8.

AdaBoost, short for "Adaptive Boosting", is a machine learning algorithm. It can be used in conjunction with many other types of learning algorithms to improve their performance. **The output of the other learning algorithms ('weak learners') is combined into a weighted sum that represents the final output of the boosted classifier.** AdaBoost is adaptive in the sense that **subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers.**

AdaBoost is sensitive to noisy data and outliers. In some problems it can be less susceptible to the overfitting problem than other learning algorithms. The individual learners can be weak, but as long as the performance of each one is slightly better than random guessing (e.g., their error rate is smaller than 0.5 for binary classification), the final model can be proven to converge to a strong learner.

Usage

Derivation approach and process

AdaBoost is essentially a form of aggregation.
uniform blending or linear blending -> Bagging (Bootstrap Aggregation: resampling from the given D) -> boosting (focus on key examples, i.e. wrong predictions) -> re-weighting different g -> adaptive boosting algorithm (scale up incorrect examples -> different hypotheses)

Explanation:
blending: aggregate after getting gt
learning: aggregate while getting gt

Bagging (Bootstrap Aggregation): obtain different g from the same data set.
Bootstrapping: resampling from the given D, i.e. re-sample N examples from D uniformly with replacement (draw one example after another, putting each one back).
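Bootstrapping itself is only a couple of lines. The sketch below draws N examples uniformly with replacement; the function name and the seed are assumptions for the example.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) examples from data uniformly, with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

data = list(range(10))
rng = random.Random(42)
sample = bootstrap_sample(data, rng)
print(sample)
```

Each call yields a slightly different data set of the same size, which is what lets bagging train different g on "the same" data.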

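AdaBoost:
U = sqrt((1 - e) / e), where e is the weighted error rate of gt. Examples that gt misclassifies have their weights multiplied by U, and correctly classified examples have theirs divided by U. The smaller the error e, the larger U, and the larger the weight ln(U) that gt receives in forming G.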
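One round of the re-weighting step can be sketched as follows, using the scaling factor sqrt((1 - e) / e), where e is the weighted error of the current weak learner, as I understand the formulation in Lin's course. The function name and the four-example toy input are assumptions.

```python
import math

def adaboost_round(weights, correct):
    """Re-weight examples after one weak learner.

    weights: current example weights
    correct: True where the weak learner classified the example correctly
    Assumes 0 < weighted error < 1.
    """
    total = sum(weights)
    eps = sum(w for w, c in zip(weights, correct) if not c) / total
    scale = math.sqrt((1 - eps) / eps)
    # Scale up the misclassified examples, scale down the correct ones
    new_weights = [w / scale if c else w * scale for w, c in zip(weights, correct)]
    alpha = math.log(scale)   # vote of this weak learner in the final G
    return new_weights, eps, alpha

weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]
new_weights, eps, alpha = adaboost_round(weights, correct)
# Weighted error of the same weak learner under the new weights
new_err = new_weights[3] / sum(new_weights)
print(eps, alpha, new_err)
```

After re-weighting, the same weak learner has a weighted error of exactly 0.5, so the next learner is pushed toward a different hypothesis.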

Decision Tree

Definition

Decision tree learning uses a decision tree as a predictive model to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves).

Random Forest

For more, see Hsuan-Tien Lin's "Machine Learning Techniques", Chapter 10.

Definition

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.

Random Forest = bagging + decision tree.

Usage

Derivation approach and process

Out-of-bag (OOB) error, also called the out-of-bag estimate, is a method of measuring the prediction error of random forests, boosted decision trees, and other machine learning models that use bootstrap aggregating to sub-sample the data used for training. E_oob serves as self-validation for bagging/RF.

[Figure: Out-of-bag]
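How many examples end up out-of-bag? With N draws with replacement from N examples, a given example is missed with probability (1 - 1/N)^N, which approaches 1/e (about 36.8%) for large N. The short sketch below compares that value with one simulated bootstrap round; the function name, N = 1000, and the seed are assumptions.

```python
import math
import random

def oob_indices(n, rng):
    """Indices never drawn in one bootstrap round of n draws from n examples."""
    drawn = {rng.randrange(n) for _ in range(n)}
    return [i for i in range(n) if i not in drawn]

n = 1000
rng = random.Random(0)
empirical = len(oob_indices(n, rng)) / n   # fraction left out in one simulated round
theoretical = (1 - 1 / n) ** n             # probability that a given example is left out
print(empirical, theoretical, math.exp(-1))
```

Roughly a third of the data is therefore unseen by each tree and can be used to validate it without a separate validation set.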

Feature Selection

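The permutation method shuffles the values of one feature, recombines the shuffled values with the original values of the other features, and checks whether the shuffle has a major effect on overall performance. If it does, the feature is important. See the figure below:

[Figure: Permutation]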

In fact, as the figure below shows, feature selection for RF is done with permutation + OOB.

[Figure]
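The permutation test can be sketched independently of any particular model. In this simplified example the "model" is a hypothetical fixed function that only uses feature 0, and the error is measured on the same data rather than on OOB examples, so it only illustrates the shuffling idea, not the full RF procedure.

```python
import random

def mse(model, X, y):
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)

def permutation_importance(model, X, y, feature, rng):
    """Increase in error after shuffling a single feature column."""
    baseline = mse(model, X, y)
    column = [row[feature] for row in X]
    rng.shuffle(column)
    X_perm = [row[:feature] + [v] + row[feature + 1:] for row, v in zip(X, column)]
    return mse(model, X_perm, y) - baseline

# Hypothetical fitted model that depends only on feature 0
def model(row):
    return 3.0 * row[0]

X = [[float(i), float((i * 7) % 5)] for i in range(10)]
y = [3.0 * row[0] for row in X]
rng = random.Random(0)
imp0 = permutation_importance(model, X, y, 0, rng)
imp1 = permutation_importance(model, X, y, 1, rng)
print(imp0, imp1)
```

Shuffling the feature the model relies on raises the error, while shuffling the ignored feature changes nothing.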

Gradient boosted decision tree

Definition

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.

How to use

Derivation

[Figure: Gradient boosted decision tree]

[Figure]

Gradient Boosted Decision Tree - GBDT
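Overall idea: sn is the value predicted for xn by the gt learned so far, yn is the true value, and yn - sn is the residual. We use the inputs x together with the residuals y - s as a new data set, fit a new gt (still a decision tree) to split and predict that data, and repeat until the residuals approach 0, that is, until the predictions are very close to the true values.

A is the base regression algorithm, left unspecified at first; it uses squared error, and we choose a C&RT decision tree for it to produce gt. Put simply, A = C&RT and its output is gt.
After the tree has split the data, at is the slope of a one-variable linear regression of the residuals yn - sn on gt(xn); this is where regression enters the algorithm. Here gt(xn) is the prediction of the decision tree gt for xn.
s (score) = s + at*gt(xn), where s is the current prediction for xn.

The difference between this prediction and the true yn is the residual used in the next round.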

[Figure: GBDT]
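The residual-fitting loop can be sketched with depth-1 regression trees (stumps) on a single feature. This is a simplified stand-in for a full C&RT tree; the function names, the number of rounds, and the toy data are assumptions for this example.

```python
def fit_stump(xs, residuals):
    """Depth-1 regression tree on 1-D inputs, chosen by squared error."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda x: lm if x <= thr else rm

def gbdt_fit(xs, ys, rounds=20):
    scores = [0.0] * len(xs)                       # s_n starts at 0
    ensemble = []
    for _ in range(rounds):
        residuals = [y - s for y, s in zip(ys, scores)]
        g = fit_stump(xs, residuals)               # g_t fit to (x_n, y_n - s_n)
        gx = [g(x) for x in xs]
        denom = sum(v * v for v in gx)
        if denom == 0:
            break
        # alpha_t: one-variable least squares of the residuals on g_t(x_n)
        alpha = sum(v * r for v, r in zip(gx, residuals)) / denom
        scores = [s + alpha * v for s, v in zip(scores, gx)]
        ensemble.append((alpha, g))
    return ensemble

def gbdt_predict(ensemble, x):
    return sum(alpha * g(x) for alpha, g in ensemble)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
ensemble = gbdt_fit(xs, ys)
train_mse = sum((gbdt_predict(ensemble, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(train_mse)
```

Each round fits the current residuals, so the training error can only go down or stay the same from one round to the next.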

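Neural networks are not covered in this article for now.

References

References cited in the text are given as inline links in the original post; other references and suggested further reading are listed below:

Machine Learning: Summary of Supervised Learning Models V1.1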
