I. Supplementary background:
-
Full-rank matrix
A system of three linear equations in three unknowns has a unique solution when its coefficient matrix is full rank.
-
The equations extracted into a matrix X (full rank):
[[2,3,1],
[1,2,2],
[3,4,-1]]
-
The extracted target values y:
[10,
9,
9]
-
Non-full-rank matrix = singular matrix
Here equation 2 and equation 3 are collinear (row 3 is exactly 2 × row 2), so they amount to a single equation and the system no longer has a unique solution.
-
The equations extracted into a matrix X (non-full rank, i.e. singular):
[[2,3,1],
[1,2,2],
[2,4,4]]
-
The extracted target values y:
[10,
9,
18]
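To see the two cases concretely, here is a minimal NumPy sketch (not part of the original notebook) that checks the rank of each matrix and solves the full-rank system:
import numpy as np
# Full-rank case: rank 3, so the system has a unique solution
X_full = np.array([[2,3,1],[1,2,2],[3,4,-1]])
y_full = np.array([10,9,9])
print(np.linalg.matrix_rank(X_full))    # 3
print(np.linalg.solve(X_full, y_full))  # the unique solution
# Singular case: row 3 = 2 * row 2, so the rank is only 2
X_sing = np.array([[2,3,1],[1,2,2],[2,4,4]])
print(np.linalg.matrix_rank(X_sing))    # 2
# np.linalg.solve(X_sing, [10,9,18]) would raise LinAlgError: Singular matrix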
II. Lasso regression
Lasso regression = lasso ("套索") regression
It is an improvement built on top of linear regression,
designed to prevent overfitting.
Compared with linear regression, it adds a first-order regularization term.
Regularization: L1 regularization (Lasso, first power of the weights), L2 regularization (Ridge, squared, second power)
The absolute value splits into two cases, positive and negative:
the loss contains an absolute value |w|, which is not differentiable (at w = 0), so the absolute value must be removed case by case: for positive w drop it directly; for negative w, prepend a minus sign.
-
- sgn(w) is the sign function
- sgn(w) = +1 when w > 0
- sgn(w) = -1 when w < 0
-
Gradient descent solves for the equation coefficients; the update rule:
- Compared with linear regression, Lasso adds one extra term to the gradient:
  $w_{t+1} = w_t - \eta\,(\nabla_w \mathrm{MSE}(w_t) + \lambda\,\mathrm{sgn}(w_t))$
  - the learning rate $\eta$ is greater than zero
  - the regularization strength $\lambda$ is greater than zero
- Equivalently, each Lasso update subtracts the extra term $\eta\lambda\,\mathrm{sgn}(w_t)$:
- When w is positive, sgn(w) = +1, the term is subtracted as-is, so the positive w gets smaller.
- When w is negative, sgn(w) = -1, the minus becomes a plus, the term is added, so the negative w also moves toward zero.
Lasso regression can shrink coefficients all the way to 0; Ridge regression cannot.
When a coefficient (weight) becomes 0, it means the feature is dispensable: that attribute is unimportant.
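A minimal sketch of this update rule (plain subgradient descent; the function name and hyperparameters are illustrative, and this is not sklearn's coordinate-descent solver):
import numpy as np

def lasso_gd_step(w, X, y, eta=0.01, lam=0.1):
    # gradient of the MSE term, plus lam * sgn(w) for the L1 term
    grad_mse = 2 / len(y) * X.T.dot(X.dot(w) - y)
    return w - eta * (grad_mse + lam * np.sign(w))

# Toy run: the sgn term keeps nudging every weight toward 0
rng = np.random.RandomState(0)
X_toy = rng.randn(50, 5)
y_toy = X_toy.dot(np.array([1.5, 0.0, 0.0, -2.0, 0.0]))
w = rng.randn(5)
for _ in range(2000):
    w = lasso_gd_step(w, X_toy, y_toy)
print(np.round(w, 3))  # weights on the three irrelevant features end up near 0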
Comparing linear regression and Lasso regression (Tianchi industrial steam prediction):
The code is as follows:
Import the packages
import numpy as np
from sklearn.linear_model import LinearRegression,Ridge
# train_test_split splits the data in two
from sklearn.model_selection import train_test_split
# metrics: evaluation
# mean_squared_error: the mean squared error
from sklearn.metrics import mean_squared_error
# pandas is essential: machine learning and deep learning both rely on it to load and process data
import pandas as pd
Load the data
# anonymized (desensitized) data
train = pd.read_csv('./zhengqi_train.txt',sep = '\t')
train
         V0     V1     V2     V3     V4     V5     V6     V7     V8     V9  ...    V29    V30    V31    V32    V33    V34    V35    V36    V37  target
0     0.566  0.016 -0.143  0.407  0.452 -0.901 -1.812 -2.360 -0.436 -2.114  ...  0.136  0.109 -0.615  0.327 -4.627 -4.789 -5.101 -2.608 -3.508   0.175
1     0.968  0.437  0.066  0.566  0.194 -0.893 -1.566 -2.360  0.332 -2.114  ... -0.128  0.124  0.032  0.600 -0.843  0.160  0.364 -0.335 -0.730   0.676
2     1.013  0.568  0.235  0.370  0.112 -0.797 -1.367 -2.360  0.396 -2.114  ... -0.009  0.361  0.277 -0.116 -0.843  0.160  0.364  0.765 -0.589   0.633
3     0.733  0.368  0.283  0.165  0.599 -0.679 -1.200 -2.086  0.403 -2.114  ...  0.015  0.417  0.279  0.603 -0.843 -0.065  0.364  0.333 -0.112   0.206
4     0.684  0.638  0.260  0.209  0.337 -0.454 -1.073 -2.086  0.314 -2.114  ...  0.183  1.078  0.328  0.418 -0.843 -0.215  0.364 -0.280 -0.028   0.384
...     ...    ...    ...    ...    ...    ...    ...    ...    ...    ...  ...    ...    ...    ...    ...    ...    ...    ...    ...    ...     ...
2883  0.190 -0.025 -0.138  0.161  0.600 -0.212  0.757  0.584 -0.026  0.904  ...  0.128 -0.208  0.809 -0.173  0.247 -0.027 -0.349  0.576  0.686   0.235
2884  0.507  0.557  0.296  0.183  0.530 -0.237  0.749  0.584  0.537  0.904  ...  0.291 -0.287  0.465 -0.310  0.763  0.498 -0.349 -0.615 -0.380   1.042
2885 -0.394 -0.721 -0.485  0.084  0.136  0.034  0.655  0.614 -0.818  0.904  ...  0.291 -0.179  0.268  0.552  0.763  0.498 -0.349  0.951  0.748   0.005
2886 -0.219 -0.282 -0.344 -0.049  0.449 -0.140  0.560  0.583 -0.596  0.904  ...  0.216  1.061 -0.051  1.023  0.878  0.610 -0.230 -0.301  0.555   0.350
2887  0.368  0.380 -0.225 -0.049  0.379  0.092  0.550  0.551  0.244  0.904  ...  0.047  0.057 -0.042  0.847  0.534 -0.009 -0.190 -0.567  0.388   0.417

2888 rows × 39 columns
# The test data has no target column
# the model computes predictions for it
# the reference answers live on Alibaba's servers
# the model's predictions are submitted to Alibaba for scoring
test = pd.read_table('./zhengqi_test.txt')
test.head()
test.shape
(1925, 38)
# Slice the DataFrame to obtain the data X and the target values y
# the data: 38 features, collected by sensors in the plant
X = train.iloc[:,:-1]
# y is the target value
y = train['target']
df = pd.DataFrame(data = np.random.randint(0,151,size = (10,3)),
columns=['Python','En','Math'],
index = list('ABCDEFHIJK') )
df
   Python   En  Math
A       8   72    22
B     124   98    41
C      62   26   122
D     114   22     7
E     135   78    88
F      44  119    99
H     140   58   135
I     145   35    91
J     135  106     5
K      52   28    59
# label-based row indexing
# df.loc['H']
# df.loc[['H','D','A']]
# df.loc[0]  # there is no 0 in the row index, so the lookup fails and the program raises an error
df.loc['J','Python']
135
# positional row indexing, with default positions 0,1,2,3,…
# df.iloc[0]
# df.iloc[[0,3,5]]
df.iloc[2:6,:-1]
   Python   En
C      62   26
D     114   22
E     135   78
F      44  119
Split the training data into two parts: one for training, the other for validation, i.e. for assessing how good the model is.
# For evaluation, split the data in two: one part to train on, the other to validate with
# validation
X_train,X_validation,y_train,y_validation = train_test_split(X,y,test_size = 0.2)
a = np.array(list('ABCDEFHIJK'))
b = np.arange(10)
print(a,b)
a_train,a_test,b_train,b_test = train_test_split(a,b,test_size = 0.2)
print(a_train,a_test)
print(b_train,b_test)
['A' 'B' 'C' 'D' 'E' 'F' 'H' 'I' 'J' 'K'] [0 1 2 3 4 5 6 7 8 9]
['A' 'J' 'H' 'B' 'C' 'F' 'E' 'K'] ['D' 'I']
[0 8 6 1 2 5 4 9] [3 7]
index = np.arange(10)
np.random.shuffle(index)  # shuffle, i.e. randomize the order
a_train,a_test = a[index[:8]],a[index[8:]]
b_train,b_test = b[index[:8]],b[index[8:]]
print(a,b)
print(a_train,a_test)
print(b_train,b_test)
['A' 'B' 'C' 'D' 'E' 'F' 'H' 'I' 'J' 'K'] [0 1 2 3 4 5 6 7 8 9]
['B' 'J' 'F' 'I' 'E' 'C' 'K' 'H'] ['D' 'A']
[1 8 5 7 4 2 9 6] [3 0]
Using the plain linear model
linear = LinearRegression()
linear.fit(X_train,y_train)
# Use the trained model on the held-out validation data
# predict returns the predictions
y_ = linear.predict(X_validation)
# the reference for the validation data is y_validation
# compare y_ (the algorithm's predictions) with the reference answers y_validation
# ((y_validation - y_)**2).mean()
print('Linear model MSE:',mean_squared_error(y_validation,y_))
# Use the linear model to predict the test data; this is the file submitted to Alibaba
y_commit = linear.predict(test)
s = pd.Series(y_commit)
s.to_csv('./linear_result.txt',index = False,header = False)
Linear model MSE: 0.11382567198052725
Using the improved linear regression model: Ridge regression
ridge = Ridge(alpha=1024.0)
ridge.fit(X_train,y_train)
y_ = ridge.predict(X_validation)
print(mean_squared_error(y_validation,y_))
y_commit = ridge.predict(test)
pd.Series(y_commit).to_csv('./ridge_result1024.txt',index = False,header = False)
0.13413963054774622
ridge = Ridge(alpha=128.0)
# fit on the full, unsplit data
ridge.fit(X,y)
y_ = ridge.predict(test)
pd.Series(y_).to_csv('./ridge128.txt',index = False,header = False)
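The two alpha values above (1024.0 and 128.0) were picked by hand. A minimal sketch of how one might instead scan candidate alphas on the validation split (the grid below is an arbitrary illustrative choice; X_train etc. are reused from the cells above):
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

best_alpha, best_mse = None, np.inf
for alpha in [0.1, 1.0, 10.0, 100.0, 128.0, 1024.0]:
    r = Ridge(alpha=alpha)
    r.fit(X_train, y_train)
    mse = mean_squared_error(y_validation, r.predict(X_validation))
    if mse < best_mse:                 # keep the alpha with the lowest validation MSE
        best_alpha, best_mse = alpha, mse
print(best_alpha, best_mse)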
Using Lasso regression for prediction
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.01)
lasso.fit(X_train,y_train)
y_ = lasso.predict(X_validation)
print('Lasso regression MSE:',mean_squared_error(y_validation,y_))
y_commit = lasso.predict(test)
pd.Series(y_commit).to_csv('./lasso1.txt',index = False,header = False)
Lasso regression MSE: 0.12943928004461872
lasso.coef_
array([ 0., 0., 0., 0., 0., -0., 0., 0., 0., 0., 0., -0., 0.,
0., 0., 0., 0., 0., 0., -0., 0., -0., -0., 0., -0., -0.,
-0., 0., 0., 0., 0., 0., 0., 0., -0., 0., 0., -0.])
lasso.intercept_
0.12075281385281386
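Notice that at this alpha every printed coefficient is exactly 0, so the model predicts nothing but the intercept, which suggests the penalty is too strong for this data. A quick illustrative way to count the features Lasso kept (reusing the fitted lasso object above):
import numpy as np
print(np.sum(lasso.coef_ != 0))    # number of surviving (nonzero) coefficients; 0 here
print(np.nonzero(lasso.coef_)[0])  # indices of the kept features, useful for feature selection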
III. Comparing the three algorithms (LinearRegression, Ridge, Lasso) on synthetic data
import numpy as np
from sklearn.linear_model import LinearRegression,Ridge,Lasso
import matplotlib.pyplot as plt
Construct synthetic data
# normal (Gaussian) distribution
# 50 samples
# 200 features, each feature has its own weight
X = np.random.randn(50,200)
X
array([[ 0.78051258, 0.63267941, 1.1337401 , ..., 0.71531292,
0.20262926, 0.91934459],
[ 0.1067049 , -0.36223901, 0.45761022, ..., -0.10926697,
0.93214121, -0.92820774],
[-0.65732255, -1.64773933, 2.59432257, ..., 0.78834919,
-0.35321914, 1.14903081],
...,
[-0.41289195, -0.48831277, -0.60751417, ..., -0.28837338,
0.25960781, 1.06648307],
[-2.74200689, -0.10602526, 0.29969537, ..., -0.54048529,
-1.22146552, 0.1785988 ],
[ 0.13439121, 0.92725871, 0.29976546, ..., -0.59037103,
1.65043998, -0.68813205]])
# the ground-truth weights
w = np.random.randn(200)
index = np.arange(200)
np.random.shuffle(index)
w[index[:190]] = 0
w
array([ 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 1.07161199, 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , -0.43239108, 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , -1.44490872, 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 1.25708765, 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0.89463052,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , -0.55073074,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 1.28760209, 0. , 0. ,
0. , 0. , 0. , 0. , 0.73280569,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0.10173928, 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0.76143787, 0. ])
# matrix multiplication; no intercept added
y = X.dot(w)
y
array([-1.22133526e+00, 3.81725494e-01, -1.56595434e+00, 1.45695054e+00,
5.15477205e+00, -1.59518568e+00, -3.20930853e+00, 2.17847962e+00,
-1.87637977e+00, -4.39554926e-01, -3.84660114e+00, -1.36209334e+00,
3.19924642e+00, 9.44274282e-01, 2.69926791e+00, -7.49354499e-01,
-6.92402394e-01, -3.94947098e+00, 4.49125683e+00, 1.88452918e-01,
-7.50513128e-01, -2.40339195e+00, 6.28120503e-03, -2.82072858e+00,
2.64752903e+00, -3.51293654e-01, 4.57071679e+00, -1.02870109e+00,
-4.19601476e-01, 2.84924675e-02, -1.69104100e+00, -7.59601797e-01,
6.83983349e+00, 3.85890310e+00, 2.31726171e-01, 2.99303146e+00,
2.68688741e+00, -4.03967828e-01, -1.44785223e+00, 1.75793283e+00,
1.95071758e-01, 3.90593411e+00, -9.18510766e-01, -2.62779060e+00,
-1.45807278e+00, -7.90697637e-01, -2.54302379e+00, 4.85714957e+00,
-5.03390178e-01, 3.95259426e+00])
The relationship between the data X and y is w.
Use the algorithms to recover w.
plt.plot(w)
[<matplotlib.lines.Line2D at 0x220ac36c988>]
linear = LinearRegression(fit_intercept=False)  # no intercept
linear.fit(X,y)
plt.plot(linear.coef_)
plt.plot(w,color = 'red')
[<matplotlib.lines.Line2D at 0x220ac519ec8>]
# Compared with linear regression, Ridge shrinks the coefficients
ridge = Ridge(alpha= 100.0,fit_intercept=False)
ridge.fit(X,y)
plt.plot(ridge.coef_)
plt.plot(w,color = 'red')
[<matplotlib.lines.Line2D at 0x220ac570208>]
# Lasso suits sparse weight vectors: most coefficients are 0 and only a small number are nonzero
# in the synthetic w here, 190 entries are 0 and 10 are nonzero reals
lasso = Lasso(alpha=0.1,fit_intercept=False)
lasso.fit(X,y)
plt.plot(lasso.coef_)
plt.plot(w,color = 'red',alpha = 0.3)
[<matplotlib.lines.Line2D at 0x220ac623948>]
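As a numeric companion to the plots (an illustrative snippet reusing the linear, ridge, and lasso models fitted above), one can compare how far each model's coefficients are from the true w:
import numpy as np
for name, model in [('linear', linear), ('ridge', ridge), ('lasso', lasso)]:
    print(name, np.abs(model.coef_ - w).max())  # worst-case coefficient error vs. the true w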
The three algorithms are summarized as follows:
Algorithm summary:
- Linear regression
- Ridge regression (Ridge)
- Lasso regression (Lasso)
- The three algorithms belong to the same family
- Ridge and Lasso both guard against overfitting
  - Ridge adds a second-order (L2) regularization term to prevent overfitting
  - Lasso adds a first-order (L1) regularization term to prevent overfitting
- Batch gradient descent vs. stochastic gradient descent vs. mini-batch gradient descent
  - they differ only in how many samples are used per step
  - batch gradient descent computes the gradient over all samples (long computation time)
  - stochastic gradient descent computes the gradient from a single randomly drawn sample, greatly shortening computation time
  - mini-batch gradient descent sits in between: from the full sample set (say 500 samples) it draws a subset of, say, 10 or 30 (see the sketch after this list)
- Gradient descent is, at its core, an optimization algorithm
- The relationship between gradient descent and linear regression, Ridge regression, and Lasso regression:
  - gradient descent can be used to implement linear regression, Ridge regression, and Lasso regression
  - linear regression, Ridge regression, and Lasso regression each have their own underlying formula
  - from the corresponding derivative formulas, all three algorithms can be implemented in code with gradient descent
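To make the mini-batch point concrete, here is a minimal sketch of mini-batch gradient descent for plain linear regression (the function name, batch_size, eta, and n_steps are illustrative assumptions; batch_size = n gives batch gradient descent, batch_size = 1 gives stochastic gradient descent):
import numpy as np

def minibatch_gd(X, y, batch_size=30, eta=0.01, n_steps=2000):
    n, d = X.shape
    w = np.zeros(d)
    rng = np.random.RandomState(0)
    for _ in range(n_steps):
        idx = rng.choice(n, size=batch_size, replace=False)  # draw one mini-batch
        Xb, yb = X[idx], y[idx]
        grad = 2 / batch_size * Xb.T.dot(Xb.dot(w) - yb)     # MSE gradient on the batch
        w -= eta * grad
    return w

# On the synthetic X, y above (200 features, only 50 samples) the system is underdetermined,
# so the recovered weights need not match the true sparse w exactly
w_hat = minibatch_gd(X, y)
print(np.abs(w_hat - w).max())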