1. Problem Description

Given a dataset $D=\left\{\left(\boldsymbol{x}_{1}, y_{1}\right),\left(\boldsymbol{x}_{2}, y_{2}\right), \ldots,\left(\boldsymbol{x}_{m}, y_{m}\right)\right\}$, where $\boldsymbol{x}_{i}=\left(x_{i 1} ; x_{i 2} ; \ldots ; x_{i d}\right)$ and $y_{i} \in \mathbb{R}$,
linear regression tries to learn $f\left(\boldsymbol{x}_{i}\right)=\boldsymbol{w}^{\mathrm{T}} \boldsymbol{x}_{i}+b$ such that $f\left(\boldsymbol{x}_{i}\right) \simeq y_{i}$.
The cost function is $J(\theta)=\frac{1}{2 m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}$, where $h_{\theta}(x)=\theta_{0}+\theta_{1} x_{1}+\theta_{2} x_{2}+\ldots$; here $\theta_{0}$ plays the role of the bias $b$ and $\theta_{1}, \ldots, \theta_{d}$ the role of $\boldsymbol{w}$. Finding the $\boldsymbol \theta$ that minimizes the cost function, i.e. the one whose fitted equation lies closest to the true values, is called parameter estimation for linear regression.
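To make the cost function concrete, here is a minimal sketch that evaluates $J(\theta)$ on three made-up points lying exactly on $y = 1 + 2x$. The function name `cost` and the toy data are assumptions for illustration only; they are not part of the original example.

```python
import numpy as np

def cost(X, y, theta):
    """Half mean squared error J(theta); X must already contain a leading column of ones."""
    m = len(y)
    residual = X @ theta - y          # h_theta(x) - y for every sample
    return float(residual @ residual) / (2 * m)

# Toy data (made up for illustration): y = 1 + 2*x exactly
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

J_zero = cost(X, y, np.array([0.0, 0.0]))   # all-zero parameters: (1 + 9 + 25) / 6
J_true = cost(X, y, np.array([1.0, 2.0]))   # the generating parameters: zero residuals
print(J_zero, J_true)
```

With all-zero parameters the residuals are -1, -3 and -5, so $J = 35/6$; with the generating parameters every residual vanishes and $J = 0$.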
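Gradient descent is used to solve for $\boldsymbol \theta$. Taking the partial derivative of the cost function with respect to each $\theta_{j}$ gives: $\frac{\partial J(\theta)}{\partial \theta_{j}}=\frac{1}{m} \sum_{i=1}^{m}\left[\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)}\right]$, with the convention $x_{0}^{(i)}=1$ for the bias term.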

The update rule for $\boldsymbol \theta$ is therefore: $\theta_{j}:=\theta_{j}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left[\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)}\right]$ where $\alpha$ is the learning rate and all $\theta_{j}$ are updated simultaneously.
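One way to gain confidence in the derivative is to compare it against a finite-difference estimate. The sketch below does this on random synthetic data and then takes a single update step; the helper names (`cost`, `gradient`, `numeric_gradient`), the seed and the step size 0.1 are all assumptions chosen for illustration.

```python
import numpy as np

def cost(X, y, theta):
    r = X @ theta - y
    return (r @ r) / (2 * len(y))

def gradient(X, y, theta):
    """Analytic gradient in vector form: (1/m) * X^T (X theta - y)."""
    return X.T @ (X @ theta - y) / len(y)

def numeric_gradient(X, y, theta, eps=1e-6):
    """Central finite differences, one coordinate of theta at a time."""
    g = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        g[j] = (cost(X, y, theta + step) - cost(X, y, theta - step)) / (2 * eps)
    return g

# Synthetic data (hypothetical): 20 samples, a column of ones plus 2 features
rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])
y = rng.normal(size=20)
theta = rng.normal(size=3)

analytic = gradient(X, y, theta)
numeric = numeric_gradient(X, y, theta)
max_diff = np.max(np.abs(analytic - numeric))
print("largest difference between the two gradients:", max_diff)

# One gradient descent step with an illustrative learning rate
alpha = 0.1
theta_next = theta - alpha * analytic
print("cost before:", cost(X, y, theta), "cost after:", cost(X, y, theta_next))
```

The vector expression $\frac{1}{m} X^{\mathrm{T}}(X\theta - y)$ stacks the partial derivatives for every $j$ into one array, which is why a single matrix product can update all parameters at once.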

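2. Implementation

This section works through a concrete example of linear regression, using the first two columns of the data to predict the last column. The data format and a scatter plot are shown in the figures below. The process is implemented in two ways: by hand, and with the scikit-learn package.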

Sample of the data

Scatter plot of the raw data

2.1 Manual implementation

The required packages and the data input.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from mpl_toolkits.mplot3d import Axes3D

data_file = 'data.csv'

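The main function, which runs the whole linear regression pipeline.

def LinearRegression(alpha=0.01,iters=500):
    # load the data; note that read_csv treats the first line as a header by default
    # (pass header=None if data.csv has no header row)
    data = pd.read_csv(data_file)
    # cast to float so that standardization is not truncated if the CSV holds integers
    data = np.array(data.values,dtype=np.float64)
    X = data[:,:-1]  # all rows, all columns except the last (features)
    Y = data[:,-1]   # all rows, the last column (target)
    plot_source(X,Y)
    row = len(Y)             # number of samples m
    col = data.shape[1]      # number of features + 1: one theta per feature plus theta_0
    Y = Y.reshape(-1,1)      # reshape to a column vector of shape (m,1)
    X,mu,sigma = Standardiza(X)   # standardize each feature (zero mean, unit variance)
    X = np.hstack((np.ones((row,1)),X))   # prepend a column of ones for theta_0

    theta = np.zeros((col,1))
    theta,J_history = GradientDescent(X,Y,theta,alpha,iters)
    plot_J(J_history,iters)

    return mu,sigma,theta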

The data standardization step.

def Standardiza(matrix):
    # work on a float copy so the caller's array is left untouched
    norm_m = np.array(matrix,dtype=np.float64)
    mu = np.mean(norm_m,0)    # mean of each column (axis 0)
    sigma = np.std(norm_m,0)  # population standard deviation of each column

    # broadcasting applies the column statistics to every row; assumes no column is constant
    norm_m = (norm_m-mu)/sigma
    return norm_m,mu,sigma

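The gradient descent computation.

def GradientDescent(X,Y,theta,alpha,iters):
    m,n = len(Y),len(theta)
    theta_history = np.zeros((n,iters))   # theta after each iteration (kept for inspection)
    J_history = np.zeros((iters,1))       # cost after each iteration

    for i in range(iters):
        h = np.dot(X,theta)               # predictions, shape (m,1)
        # simultaneous update of every theta_j; note that theta is modified in place
        theta -= (alpha/m)*np.dot(np.transpose(X),h-Y)
        theta_history[:,i] = theta.ravel()
        J_history[i] = Cost(X,Y,theta)
    return theta,J_history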

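The cost function and the prediction function.

def Cost(X,Y,theta):
    m = len(Y)
    residual = np.dot(X,theta) - Y       # h_theta(x) - y for every sample
    J = np.sum(residual**2)/(2*m)        # J(theta) = (1/2m) * sum of squared residuals
    return J

def Predict(mu,sigma,theta):
    predict = np.array([1650,3])          # test sample: the two feature values
    norm_p = (predict-mu)/sigma           # standardize with the training mean and std
    final_p = np.hstack((np.ones(1),norm_p))   # prepend 1 for theta_0
    result = np.dot(final_p,theta)
    print(result)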

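Plotting functions

def plot_source(X,Y):
    fig = plt.figure(1)
    # create the 3D axes through the figure; Axes3D(fig) no longer attaches itself
    # to the figure in recent Matplotlib releases
    ax = fig.add_subplot(projection='3d')
    ax.set_xlabel('x[0]')
    ax.set_ylabel('x[1]')
    ax.set_zlabel('y')
    ax.scatter(X[:,0],X[:,1],Y)
    plt.show()

def plot_J(J_history,iters):
    x = np.arange(1,iters+1)   # iteration numbers 1..iters
    plt.plot(x,J_history)
    plt.xlabel('iterations')
    plt.ylabel('cost value')
    plt.show()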

Calling the main function.

if __name__=='__main__':
    mu,sigma,theta = LinearRegression(0.01,500)
    Predict(mu,sigma,theta)

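The figure below shows how the cost function changes: as the iterations proceed, the cost decreases steadily and converges toward a small value. In the author's run, the prediction for the test sample [1650, 3] was 263871.04186383.

Cost function curve

Fitted line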
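A flattening cost curve does not by itself show how close the parameters are to the least-squares optimum. The following sketch, run on synthetic data rather than on data.csv, compares gradient descent at the learning rate 0.01 against the closed-form solution from `np.linalg.lstsq` for several iteration counts. The data, the seed and the function name `gradient_descent` are assumptions made for illustration.

```python
import numpy as np

def gradient_descent(X, y, alpha, iters):
    """Batch gradient descent starting from all-zero parameters."""
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(iters):
        theta -= (alpha / m) * (X.T @ (X @ theta - y))
    return theta

# Synthetic data (hypothetical): two standardized features plus a little noise
rng = np.random.default_rng(42)
features = rng.normal(size=(50, 2))
features = (features - features.mean(axis=0)) / features.std(axis=0)
X = np.hstack([np.ones((50, 1)), features])
y = 3.0 + 2.0 * features[:, 0] - 1.0 * features[:, 1] + rng.normal(scale=0.1, size=50)

# Closed-form least-squares solution, used here as the reference
theta_exact, *_ = np.linalg.lstsq(X, y, rcond=None)

gaps = {}
for iters in (100, 500, 5000):
    theta_gd = gradient_descent(X, y, alpha=0.01, iters=iters)
    gaps[iters] = np.max(np.abs(theta_gd - theta_exact))
    print(iters, gaps[iters])
```

With standardized features and a zero start, the intercept estimate after $k$ iterations falls short of its limit by a factor of $(1-\alpha)^{k}$; for $\alpha = 0.01$ and $k = 500$ that is about 0.0066, so a small gap can remain even when the cost curve looks flat.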

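2.2 Using scikit-learn's linear regression

This version uses the LinearRegression model from sklearn's linear_model module. LinearRegression solves the least-squares problem directly, so it takes no learning rate or iteration count. The output is: coef_: [110248.92165868 -6226.22670553]; intercept_: 339119.45652173914; predict result: [292195.80095132]. This prediction differs from the 263871.04 reported in Section 2.1; gradient descent run for a fixed 500 iterations at a learning rate of 0.01 is not guaranteed to have reached the least-squares solution.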

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler

data_file = 'data.csv'

def Linear_Regression():
    # load the data as floats; note that read_csv treats the first line as a header by default
    # (pass header=None if data.csv has no header row)
    data = pd.read_csv(data_file)
    data = np.array(data.values[:,:],np.float64)
    X = data[:,:-1]  # all rows, all columns except the last (features)
    Y = data[:,-1]   # all rows, the last column (target)

    # fit the scaler on the training features, then apply the same transform to the test sample
    scaler = StandardScaler()
    scaler.fit(X)
    x_train = scaler.transform(X)
    x_test = scaler.transform(np.array([[1650,3]]))

    model = linear_model.LinearRegression()
    model.fit(x_train,Y)

    result = model.predict(x_test)
    print(model.coef_)        # coefficients of the standardized features
    print(model.intercept_)   # intercept
    print(result)

if __name__=='__main__':
    Linear_Regression()
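Because the model is fitted on standardized features, its coefficients are expressed per standard deviation of each feature, not per original unit. The sketch below shows how such coefficients could be mapped back to the original units; the function name `to_original_units` and all numbers are made up for illustration and are not taken from the run above.

```python
import numpy as np

def to_original_units(coef_std, intercept_std, mu, sigma):
    """Convert a model fitted on (x - mu) / sigma into one that acts on raw x."""
    coef = coef_std / sigma
    intercept = intercept_std - np.sum(coef_std * mu / sigma)
    return coef, intercept

# Hypothetical scaler statistics and fitted parameters
mu = np.array([2000.0, 3.0])
sigma = np.array([800.0, 0.75])
coef_std = np.array([100.0, -6.0])
intercept_std = 340.0

x_new = np.array([1650.0, 3.0])

# Prediction in standardized space
pred_std = intercept_std + coef_std @ ((x_new - mu) / sigma)

# The same prediction computed directly from the raw features
coef, intercept = to_original_units(coef_std, intercept_std, mu, sigma)
pred_raw = intercept + coef @ x_new
print(pred_std, pred_raw)
```

The conversion follows from expanding $b + \sum_j w_j (x_j - \mu_j)/\sigma_j$ and collecting the terms that do not depend on $x$ into the intercept.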

The original data comes from lawlite19's repository [1]; thanks to lawlite19 for the contribution.

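3. Data Normalization and Standardization

Normalization rescales the data to the interval [0, 1]. This removes the units of measurement, speeds up convergence, and can also improve accuracy. A common formula is: $x_{\text{norm}}^{(i)}=\frac{x^{(i)}-x_{\min }}{x_{\max }-x_{\min }}$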
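As a small illustration of the formula, the sketch below applies min-max normalization column by column to a made-up feature matrix. The function name `min_max_normalize` and the values are assumptions; the code also assumes that no column is constant, since that would make the denominator zero.

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each column of X to [0, 1]; returns the scaled data and the column min/max."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min), x_min, x_max

# Made-up feature matrix: two features on very different scales
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])

X_norm, x_min, x_max = min_max_normalize(X)
print(X_norm)
```

The stored `x_min` and `x_max` would be reused to scale any new sample, in the same way the training mean and standard deviation are reused in the prediction step.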

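Standardization uses the mean and standard deviation of a feature to rescale it so that the result has mean 0 and variance 1. It does not change the shape of the distribution, so it can be applied whether or not the data are normally distributed. It is particularly suitable when the maximum and minimum of the data are unknown or when outliers are present. Standardization is a scaling transform applied to make the next processing step easier; unlike normalization, its purpose is not to make the data easier to process or compare together with other data. The formula is: $x_{\text{std}}^{(i)}=\frac{x^{(i)}-\mu_{x}}{\sigma_{x}}$ where $\mu_{x}$ is the mean of the feature and $\sigma_{x}$ is its standard deviation.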
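The claim that the rescaled feature has mean 0 and variance 1 can be checked numerically. The following sketch standardizes a made-up feature matrix and then applies the same statistics to a new sample; the values and variable names are assumptions for illustration.

```python
import numpy as np

# Made-up feature matrix: two features on very different scales
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])

mu = X.mean(axis=0)
sigma = X.std(axis=0)          # population standard deviation (ddof=0), the np.std default
X_std = (X - mu) / sigma

print(X_std.mean(axis=0))      # approximately 0 for every column
print(X_std.std(axis=0))       # approximately 1 for every column

# A new sample must be scaled with the training statistics, not with its own
x_new = np.array([1650.0, 3.0])
x_new_std = (x_new - mu) / sigma
print(x_new_std)
```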

References

[1] https://github.com/lawlite19/MachineLearning_Python
[2] Zhou Zhihua. Machine Learning. Beijing: Tsinghua University Press, 2016.
[3] https://www.jianshu.com/p/3761bad01053
[4] https://www.zhihu.com/question/20467170

