WHAT - DEFINITION:
Machine learning algorithms:
• Supervised learning: teach the computer
• Unsupervised learning: let the computer learn by itself
• Others: reinforcement learning, recommender systems.
Supervised learning:
training set with given labels
regression problem: to predict a continuous valued output.
classification problem: to predict a discrete valued output.
Unsupervised learning:
data set without given label (classified by the machine itself)
symbols during study:
m = number of training examples
x's = "input" variable / features
y's = "output" variable / "target" variable
h = hypothesis, h maps from x's to y's, like x -h-> y
Linear regression with one variable / univariate linear regression:
*Used for continuous valued problems
Cost function:
*The cost function (in some places also called the loss function) matters in every machine learning algorithm, because training a model is the process of optimizing the cost function: the partial derivative of the cost function with respect to each parameter is the gradient used in gradient descent, and the regularization term added to prevent overfitting is also appended to the cost function. Broadly, any function that measures the difference between the model's prediction h(θ) and the true value y can be called a cost function C(θ); with multiple samples, the average over all samples is denoted J(θ).
Properties:
• For a given algorithm, the cost function is not unique;
• The cost function is a function of the parameters θ;
• The overall cost J(θ) can be used to evaluate the model: the smaller the cost, the better the model and its parameters fit the training samples (x, y);
• J(θ) is a scalar (not a vector);
1. Squared error cost function (mean squared error) - suitable for linear regression: J(θ_0, θ_1) = 1/(2m) * Σ_{i=1..m} (h_θ(x^(i)) - y^(i))^2
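A minimal Octave sketch of this cost function, assuming a design matrix X that already contains the leading column of ones (the names computeCost, X, y, theta are illustrative):
function J = computeCost(X, y, theta)
  % X: m x (n+1) design matrix, y: m x 1 labels, theta: (n+1) x 1 parameters
  m = length(y);                    % number of training examples
  predictions = X * theta;          % h_theta(x) for every example
  J = 1 / (2 * m) * sum((predictions - y) .^ 2);
end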
Gradient descent algorithm:
Correct approach: update θ_0 & θ_1 simultaneously (compute both new values before assigning either of them).
α: learning rate
If α is too small, gradient descent can be slow. If α is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge.
The learning rate can stay fixed: as we approach a local minimum, gradient descent automatically takes smaller steps (because the derivative term gets smaller), so there is no need to decrease α over time.
"Batch" / Batch梯度下降算法: Each step of gradient descent uses all the training examples.
其中,
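A sketch of batch gradient descent in Octave under the same assumptions as computeCost above; the vectorized update changes all θ_j from the same old theta, which is exactly the simultaneous update mentioned earlier (gradientDescent, alpha, num_iters are illustrative names):
function theta = gradientDescent(X, y, theta, alpha, num_iters)
  m = length(y);
  for iter = 1:num_iters
    err = X * theta - y;                       % h_theta(x) - y for all examples
    theta = theta - (alpha / m) * (X' * err);  % theta_j := theta_j - alpha/m * sum((h - y) * x_j)
  end
end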
Gradient descent for multiple variables:
Feature scaling:
What: make sure features are on a similar scale to speed up gradient descent.
Why: the more elongated (elliptical) the contours of the cost function, the more the descent path zigzags and the longer convergence takes.
How: Get every feature into approximately a -1 <= xi <= 1 range. (no bigger than ±3, no less than ±1/3)
Mean Normalization:
X = (value - μ) / (max - min), or X = (value - μ) / σ
E.g. X = (5184 - 6675.5) / 8836 ≈ -0.17
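A sketch of mean normalization in Octave, assuming X holds one feature per column and no column of ones (mu and sigma are the per-column mean and standard deviation):
mu = mean(X);                  % 1 x n vector of feature means
sigma = std(X);                % 1 x n vector of feature standard deviations
X_norm = (X - mu) ./ sigma;    % broadcasts; each feature ends up roughly in the [-1, 1] range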
Choosing a suitable polynomial for fitting:
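One way to sketch this in Octave: build the polynomial terms as extra features and reuse linear regression (feature scaling matters even more here because x, x^2, x^3 have very different ranges; x and X_poly are illustrative names):
X_poly = [x, x.^2, x.^3];                          % candidate polynomial features from an m x 1 feature x
X_poly = (X_poly - mean(X_poly)) ./ std(X_poly);   % scale each column
X_poly = [ones(length(x), 1), X_poly];             % add the intercept column, then fit as usual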
*Iteration: one repetition of the update step.
Normal Equation:
Solves for the optimal θ directly: θ = (X'X)^(-1) * X'y.
In Octave: pinv(X'*X)*X'*y
No feature scaling is needed.
Normal equation vs. gradient descent: gradient descent needs to choose α and takes many iterations, but works well even when the number of features n is very large; the normal equation needs no α and no iterations, but computing (X'X)^(-1) is roughly O(n^3), so it becomes slow when n is large.
If X'X is non-invertible: usually caused by redundant (linearly dependent) features or by too many features (m ≤ n); remove redundant features or use regularization. pinv returns a usable pseudo-inverse either way.
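A self-contained Octave sketch of the normal equation (the housing-style numbers are made up for illustration):
X = [1 2104; 1 1416; 1 1534; 1 852];   % design matrix with a leading column of ones
y = [460; 232; 315; 178];              % target values
theta = pinv(X' * X) * X' * y;         % pinv still works if X'X is non-invertible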
Logistic Regression:
The most widely used classification algorithm in the world.
Decision boundary:
The boundary separating the regions predicted as different classes; it is fit for each specific problem (based on the scatter plot or on experience).
E.g. with h_θ(x) = g(θ_0 + θ_1*x_1 + θ_2*x_2) and θ = [-3, 1, 1]', the model predicts y = 1 when x_1 + x_2 ≥ 3, so the decision boundary is the straight line x_1 + x_2 = 3.
Simplified cost function for logistic regression:
* Both the linear regression and logistic regression cost functions are convex, but carrying the linear-regression (squared error) cost over to logistic regression's sigmoid hypothesis would produce a non-convex function.
This leads to the logistic regression cost function below:
Cost(h_θ(x), y) = -log(h_θ(x)) if y = 1, and -log(1 - h_θ(x)) if y = 0.
To simplify further, the two cases combine into one expression:
Cost(h_θ(x), y) = -y*log(h_θ(x)) - (1 - y)*log(1 - h_θ(x)), and J(θ) = (1/m) * Σ Cost(h_θ(x^(i)), y^(i)).
So if y = 1, the cost is -log(h_θ(x)): minimizing J(θ) pushes h_θ(x) toward 1 (and toward 0 when y = 0).
The gradient descent update for logistic regression has the same form as for linear regression, but h_θ(x) is now the sigmoid g(θ'x).
Feature scaling also applies to logistic regression, to speed up convergence.
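A vectorized Octave sketch of the sigmoid hypothesis and the simplified cost and gradient above (sigmoid and logisticCost are illustrative names; X again includes the column of ones):
function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));
end

function [J, grad] = logisticCost(theta, X, y)
  m = length(y);
  h = sigmoid(X * theta);                                  % hypothesis for all examples
  J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));    % simplified cost from above
  grad = (1 / m) * (X' * (h - y));                         % same form as linear regression
end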
Neural Networks:
Well suited to problems with large amounts of data.
Logistic unit: models a neuron together with its dendrites (inputs) and axon (output).
Neural Network - a group of different neurons strung together
There are hidden layers between the input layer and the output layer.
The subscripts of theta correspond to the subscripts of a and of x.
Each layer's a is computed from the previous layer's a; the a output by the final layer is h(x).
Vectorizing x and the a's (treating x as a^(1)), we get z^(j) = Θ^(j-1) * a^(j-1) and a^(j) = g(z^(j)).
Proceeding layer by layer from the input layer -> hidden layer -> output layer, that is:
a^(2) = g(Θ^(1) * a^(1)), a^(3) = g(Θ^(2) * a^(2)), ..., and h_Θ(x) is the activation of the final layer.
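A sketch of forward propagation for a network with one hidden layer in Octave, assuming Theta1 and Theta2 are the weight matrices of the two layers and sigmoid is defined as above (x is one training example as a column vector):
a1 = [1; x];              % input layer with bias unit, i.e. a^(1)
z2 = Theta1 * a1;
a2 = [1; sigmoid(z2)];    % hidden layer activations with bias unit
z3 = Theta2 * a2;
a3 = sigmoid(z3);         % output layer = h_Theta(x)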
Example with two hidden layers:
Example: computing XNOR:
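A sketch of the XNOR network built from the AND, (NOT x1) AND (NOT x2), and OR units; the weight values are the ones used in the lecture, and x1, x2 are assumed to be 0 or 1:
AND_w = [-30; 20; 20];    % g(-30 + 20*x1 + 20*x2) ≈ x1 AND x2
NOR_w = [10; -20; -20];   % ≈ (NOT x1) AND (NOT x2)
OR_w  = [-10; 20; 20];    % ≈ x1 OR x2
a1 = [1; x1; x2];                                      % input layer with bias unit
a2 = [1; sigmoid(AND_w' * a1); sigmoid(NOR_w' * a1)];  % hidden layer
xnor = sigmoid(OR_w' * a2);                            % ≈ 1 exactly when x1 == x2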
Example: multi-class classification:
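For multi-class classification the output layer has one unit per class and each label is recoded as a one-hot vector; a small Octave sketch, assuming y is an m x 1 vector of class indices 1..num_labels and H is the m x num_labels matrix of output-layer activations (all names illustrative):
num_labels = 4;                    % e.g. pedestrian / car / motorcycle / truck
Y = eye(num_labels)(y, :);         % m x num_labels one-hot matrix; row i encodes y(i)
[~, pred] = max(H, [], 2);         % predicted class = the output unit with the largest activation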
Backpropagation algorithm for computing the gradients:
Further explanation: https://www.cnblogs.com/vipyoumay/p/9334961.html
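A minimal sketch of backpropagation for a single training example in the 3-layer network above, reusing a1, a2, a3, z2 from the forward-propagation sketch and a one-hot label vector yv (this follows the standard delta formulas; the per-example gradients still have to be summed over all m examples and divided by m):
g2prime = sigmoid(z2) .* (1 - sigmoid(z2));     % g'(z^(2))
delta3 = a3 - yv;                               % "error" of the output layer
delta2 = (Theta2' * delta3)(2:end) .* g2prime;  % backpropagate one layer, dropping the bias component
Theta2_grad = delta3 * a2';                     % this example's contribution to the gradient of Theta2
Theta1_grad = delta2 * a1';                     % and of Theta1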
Randomly initializing theta:
Why: if the initial thetas are not random, the units are highly redundant; all the thetas take the same value and the network can only ever learn one feature (the symmetry problem).
Random initialization:
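A sketch of the random initialization in Octave; INIT_EPSILON and the layer-size names are illustrative, and each weight is drawn uniformly from [-INIT_EPSILON, INIT_EPSILON] to break the symmetry described above:
INIT_EPSILON = 0.12;
Theta1 = rand(hidden_layer_size, input_layer_size + 1) * 2 * INIT_EPSILON - INIT_EPSILON;
Theta2 = rand(num_labels, hidden_layer_size + 1) * 2 * INIT_EPSILON - INIT_EPSILON;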
Algorithm: 1) randomly initialize the weights; 2) forward propagate to get h_Θ(x^(i)) for each example; 3) compute the cost J(Θ); 4) backpropagate to get the partial derivatives; 5) optionally verify them with gradient checking, then turn the check off; 6) minimize J(Θ) with gradient descent or another optimizer.
Cocktail party problem algorithm:
[W,s,v] = svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x');   % the one-line Octave implementation shown in the lecture for separating the mixed audio sources