sklearn (2): Linear Regression Prediction

1. Generating samples

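Here we use sklearn's datasets module to generate samples. (To make the data easy to visualize, this section sets n_features=1; the model training in the following sections uses n_features=5.)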

from sklearn import datasets
import matplotlib.pyplot as plt

X, y = datasets.make_regression(
    n_samples=200,
    n_features=1,
    n_informative=1,
    n_targets=1,
    noise=10,
    bias=1,
    random_state=1
)
# print(X.shape)

plt.scatter(X[:, 0], y)
plt.show()

The result looks like this:

[Figure: Linear Model (scatter plot of the generated samples)]

2. Building the linear model

First, shuffle the data and split it into a train set and a test set. (There are 200 data points in total: 150 are used for training and the remaining 50 for testing.)

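import numpy as np
np.random.seed(1)
n = X.shape[0]  # total number of samples
permutation = np.random.permutation(n)  # random reordering of the indices 0..n-1
X, y = X[permutation], y[permutation]
X_train, y_train = X[:150, :], y[:150]  # first 150 shuffled samples for training
X_test, y_test = X[150:, :], y[150:]  # remaining 50 samples for testing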
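
As an alternative to the manual shuffle above, sklearn's model_selection module provides train_test_split, which shuffles and splits in one call. The sketch below assumes the n_features=5 data used for training; note that it produces a different split from the permutation-based code, so the numbers reported later in this post would not be reproduced exactly.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Same kind of data as in the post (n_features=5 variant).
X, y = datasets.make_regression(
    n_samples=200, n_features=5, n_informative=3,
    n_targets=1, noise=5, bias=1, random_state=1,
)

# An integer test_size is interpreted as an absolute number of test samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=50, shuffle=True, random_state=1
)

print(X_train.shape, X_test.shape)
```

Passing random_state makes the split repeatable, which plays the same role as np.random.seed(1) in the manual version.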

Next, create linear_regressor. Its most commonly used methods are fit and predict.

from sklearn import linear_model

# train
linear_regressor = linear_model.LinearRegression()
linear_regressor.fit(X_train, y_train)
y_predict_train = linear_regressor.predict(X_train)
# print(y_train[:10])
# print(y_predict_train[:10])
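
After fit, the learned parameters are available as the coef_ and intercept_ attributes. The following is a small sketch, not part of the original workflow: it asks make_regression to also return the ground-truth coefficients (coef=True) so that they can be compared with what the model learned. The variable names true_coef and reg are chosen here for illustration.

```python
from sklearn import datasets, linear_model

# coef=True makes make_regression also return the coefficients
# of the underlying linear model that generated y.
X, y, true_coef = datasets.make_regression(
    n_samples=200, n_features=5, n_informative=3,
    n_targets=1, noise=5, bias=1, random_state=1, coef=True,
)

reg = linear_model.LinearRegression()
reg.fit(X, y)

print("true coefficients:   ", true_coef)
print("learned coefficients:", reg.coef_)
print("learned intercept:   ", reg.intercept_)  # should be close to bias=1
```

Because only 3 of the 5 features are informative, the true coefficients of the other two are zero, and the learned values for them should be small.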

Then, predict on the test set:

y_predict_test = linear_regressor.predict(X_test)

3. Model evaluation

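First, we can compute train_loss and test_loss as a simple evaluation. (The loss used here is half the mean squared error, i.e. the sum of squared errors divided by 2n.)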

loss_train = np.sum((y_predict_train - y_train)**2, axis=0) / (2*X_train.shape[0])
print("loss on the train set: ", loss_train)
loss_test = np.sum((y_test - y_predict_test)**2, axis=0) / (2*X_test.shape[0])
print("loss on the test set: ", loss_test)

The output is:

loss on the train set:  12.7823020309
loss on the test set:  12.5362233581
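
Note that this loss divides by 2n, so it is half of what sklearn's mean_squared_error reports; this is why the test loss of about 12.54 corresponds to a mean squared error of about 25.07. A minimal sketch of the relationship, using small made-up arrays (y_true and y_pred are illustrative values, not data from this post):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical toy values, used only to show the factor of 2.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Loss as defined in this post: sum of squared errors over 2n.
half_mse = np.sum((y_pred - y_true) ** 2) / (2 * len(y_true))

# sklearn's metric: sum of squared errors over n.
mse = mean_squared_error(y_true, y_pred)

print("half MSE:", half_mse)
print("MSE:     ", mse)
```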

We can also use the metrics module provided by sklearn for a more comprehensive evaluation of the model.

import sklearn.metrics as sm

print("Mean absolute error: ", round(sm.mean_absolute_error(y_test, y_predict_test), 2))
print("Mean squared error: ", round(sm.mean_squared_error(y_test, y_predict_test), 2))
print("Median absolute error: ", round(sm.median_absolute_error(y_test, y_predict_test), 2))
print("Explained variance score: ", round(sm.explained_variance_score(y_test, y_predict_test), 2))
print("R2 score: ", round(sm.r2_score(y_test, y_predict_test), 2))

The output is:

Mean absolute error:  4.26
Mean squared error:  25.07
Median absolute error:  4.32
Explained variance score:  0.99
R2 score:  0.99

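The first three metrics need little explanation. The explained variance score measures the proportion of the variance in the target that the model accounts for; the best possible score is 1.0. The R2 score is the coefficient of determination, which indicates how well the model is likely to predict unseen samples; its best possible score is also 1.0.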
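
To make the two scores concrete, the sketch below computes them by hand with NumPy and compares the results with sklearn's functions. The arrays are made-up toy values, and the names r2_manual and ev_manual are introduced here for illustration only.

```python
import numpy as np
from sklearn.metrics import explained_variance_score, r2_score

# Hypothetical toy values, not data from this post.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

residual = y_true - y_pred

# R2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(residual ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Explained variance = 1 - Var(residual) / Var(y_true)
ev_manual = 1 - np.var(residual) / np.var(y_true)

print("R2:                ", r2_manual, r2_score(y_true, y_pred))
print("Explained variance:", ev_manual, explained_variance_score(y_true, y_pred))
```

The only difference between the two is that the explained variance subtracts the mean of the residuals before squaring, so the scores coincide when the prediction errors average to zero and diverge when the predictions are systematically biased.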

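No single metric covers every aspect of a model. For a regression model, we generally want the mean squared error to be as low as possible and the explained variance score to be as high as possible.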

4. Saving the model

Use the pickle module to save the trained model so that it can be reused later.

import pickle
# save
output_model_file = "saved_linear_model.pkl"
with open(output_model_file, "wb") as f:
    pickle.dump(linear_regressor, f)

# reuse
with open(output_model_file, "rb") as f:
    model_linear = pickle.load(f)

y_test_predict = model_linear.predict(X_test)
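
A quick way to confirm that the saved model was restored correctly is to check that the reloaded model gives the same predictions as the original. The sketch below is self-contained and writes to a temporary directory (an assumption made here to avoid leaving files behind); the names model and restored are illustrative.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn import datasets, linear_model

X, y = datasets.make_regression(
    n_samples=200, n_features=5, n_informative=3,
    n_targets=1, noise=5, bias=1, random_state=1,
)
model = linear_model.LinearRegression().fit(X[:150], y[:150])

# Save and reload the fitted model through a pickle file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "saved_linear_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions.
same_predictions = np.allclose(model.predict(X[150:]), restored.predict(X[150:]))
print("identical predictions after reload:", same_predictions)
```

Keep in mind that pickle files should only be loaded from trusted sources, and that a model is best reloaded with the same sklearn version that saved it.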

5. Supplement

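If you want to reproduce the results above, generate the data with the following code (it uses n_features=5 rather than the n_features=1 used for the plot).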

X, y = datasets.make_regression(
    n_samples=200,
    n_features=5,
    n_informative=3,
    n_targets=1,
    noise=5,
    bias=1,
    random_state=1
)

Last edited: 2018.02.02 22:42:56