Machine Learning: Linear Regression Tricks, Worked Example (V1.2)

This article walks through the code of the Kaggle kernel A study on Regression applied to the Ames dataset.

The kernel introduces itself as follows:
Introduction
The kernel uses a range of tricks to show Linear Regression at its full potential, including preprocessing and regularization (a process of introducing additional information in order to prevent overfitting).

Step-by-step walkthrough

1. Importing the data

(If you need the dataset, it can be downloaded from the linked competition page.)

  1. Import the packages
    matplotlib is the best-known Python plotting library; it can output figures in many formats and display them interactively through various GUI (Graphical User Interface) toolkits. The %matplotlib magic command either embeds matplotlib figures directly in the Notebook or displays them with a specified GUI backend; its argument selects the display mode, and inline embeds the figures in the Notebook.

Note: IPython ships with a very powerful command system known as magic commands, which makes working in the IPython environment much more convenient. Magic commands start with % or %%: those starting with % are line magics and apply only to the line they appear on, while those starting with %% are cell magics, which must appear on the first line of a cell and apply to the whole cell. MORE TO SEE...
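
The kernel's import cell is not reproduced in this article; as a minimal sketch (an assumption based on the functions used later, not the kernel's exact list), it typically looks like this:

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from scipy.stats import skew
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, make_scorer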

  2. Read in the data
    It is worth opening the .csv file in Excel first; you can then tweak the data by hand and observe how different changes affect the results of the code.
    train = pd.read_csv("../input/train.csv")
    Set the display format:
    the following sets the number of decimal places shown for floats to three.
    pd.set_option('display.float_format', lambda x: '%.3f' % x)

  3. Check for duplicates and drop the ID column
    Check whether there are duplicate rows:

     idsUnique = len(set(train.Id))
     idsTotal = train.shape[0]
     idsDupli = idsTotal - idsUnique
     print("There are " + str(idsDupli) + " duplicate IDs for " + str(idsTotal) + " total entries")
    

    Here train.shape returns (1460, 80), i.e. the numbers of rows and columns; train.shape[0] returns the number of rows.

    This dataset contains no duplicates. If another dataset does need duplicate handling, use DataFrame.drop_duplicates(subset=None, keep='first', inplace=False), which returns a DataFrame with duplicate rows removed, optionally considering only certain columns.

    The parameters are as follows (a short illustration follows the list):
    subset : column label or sequence of labels, optional. Only consider certain columns for identifying duplicates; by default all columns are used. This selects the columns to act on.
    keep : {'first', 'last', False}, default 'first'. 'first' drops duplicates except for the first occurrence; 'last' drops duplicates except for the last occurrence; False drops all duplicates.
    inplace : boolean, default False. Whether to drop duplicates in place or to return a copy. With True the original DataFrame is modified directly; otherwise a trimmed copy is returned.
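
    A small self-contained illustration (mine, not from the kernel; the toy DataFrame is made up):

     import pandas as pd

     df = pd.DataFrame({"Id": [1, 2, 2, 3], "Price": [100, 120, 120, 90]})
     df.drop_duplicates(subset = ["Id"], keep = "first", inplace = True)  # keep the first row of each duplicated Id
     print(df)  # rows with Id 1, 2 and 3 remain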

    Drop the ID column:
    train.drop("Id", axis = 1, inplace = True)

    The parameters mean:
    "Id" is the column name.
    axis = 1 refers to a column; 0 would refer to a row.
    inplace = True: methods that modify an array and return a new one usually accept an optional inplace argument; setting it to True (the default is False) replaces the original object directly.

2. Pre-processing

2.1 Removing outliers (Potential Pitfalls / Outliers)

train = train[train.GrLivArea < 4000] # drop the outlying points on the right of the plot

2.2 Taking the log to reduce error

Taking the log balances the influence of prediction errors across houses in different price ranges.

train.SalePrice = np.log1p(train.SalePrice)
y = train.SalePrice

Here the log is taken with log1p, i.e. log(1 + x).

Two points:

  1. What taking the log does (see the numeric sketch after this list):
    Small values that are close together are spread further out.
    Large values that are spread out are brought closer together.
(Figure: the natural log curve)
  2. What taking log(1 + x) rather than log(x) does:
    as with smoothing in naive Bayes, it prevents a value that has never been observed (i.e. is zero) from causing a mathematical error, since log(1 + x) is defined at x = 0 while log(0) is not.
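
A quick numeric sketch (mine, not from the kernel) of the first point: after np.log1p, small values are spread out relative to their size while large values are pulled together, and np.expm1 inverts the transform exactly.

import numpy as np

prices = np.array([0., 50000., 100000., 500000., 1000000.])
logged = np.log1p(prices)      # log(1 + x), defined even at x = 0
print(logged)                  # approximately [ 0.    10.82  11.51  13.12  13.82]
original = np.expm1(logged)    # exact inverse, handy when reporting predicted prices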

2.3 Handle missing values

Handle missing values for features that cannot be filled with the median, the mean, or the most common value.

How to choose the replacement value
Based on the label, fill in whatever the missing value of that feature is most likely to represent.

Concretely, look closely at the original data:

  • If the values form a graded scale (e.g. good/fair/poor, 2/1/0, or Y/N), the lowest grade is generally used as the fill value.

  • If the values are unordered categories, the most frequent value of that feature is used as the fill value (a sketch of this case follows the example below).

    train.loc[:, "Alley"] = train.loc[:, "Alley"].fillna("None")
    

Here train.loc[:, "Alley"] selects every row of the column "Alley", and .fillna(XX) fills missing (NA) cells with XX.
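
For the second bullet above (unordered categories), a common pattern is to fill with the mode; a small sketch under that assumption (the column MSZoning is used purely as an illustration and is not necessarily what the kernel does):

most_common = train["MSZoning"].mode()[0]                        # most frequent value of the column
train.loc[:, "MSZoning"] = train["MSZoning"].fillna(most_common)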

2.4 Handling special features

2.4.1 Converting numerical features into categories

Some numerical features are in fact categorical and have to be converted into categories. For example, month numbers carry no numeric meaning by themselves, so they are mapped to English abbreviations.

train = train.replace({"MSSubClass" : {20 : "SC20", 30 : "SC30", 40 : "SC40", 45 : "SC45",  50 : "SC50", 60 : "SC60", 70 : "SC70", 75 : "SC75",  80 : "SC80", 85 : "SC85", 90 : "SC90", 120 : "SC120",  150 : "SC150", 160 : "SC160", 180 : "SC180", 190 : "SC190"}, 
"MoSold" : {1 : "Jan", 2 : "Feb", 3 : "Mar", 4 : "Apr", 5 : "May", 6 : "Jun", 7 : "Jul", 8 : "Aug", 9 : "Sep", 10 : "Oct", 11 : "Nov", 12 : "Dec"}})
2.4.2 Converting categorical features into ordered numbers

Some categorical features are converted into ordered numbers when:

  1. the feature has an explicit grading, so the order of the values itself carries information, e.g. "BsmtQual" : {"No" : 0, "Po" : 1, "Fa" : 2, "TA": 3, "Gd" : 4, "Ex" : 5};
  2. a ranking can be separated out of the categories fairly clearly, e.g. Alley (most people prefer a paved surface to gravel) or LotShape (most people prefer a regular lot shape). A counterexample is Neighborhood: it does reflect a ranking, since most people like similar neighborhoods, but that ranking is hard to pin down.
    train = train.replace({"Alley" : {"Grvl" : 1, "Pave" : 2},
                   "BsmtCond" : {"No" : 0, "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5},
                ……)

3. Create new features

Then we will create new features, in 3 ways:

  1. Simplifications of existing features
  2. Combinations of existing features
  3. Polynomials on the top 10 existing features

The reason for processing the features further is, in my view: 1. it simplifies later computation and keeps the focus on the core features.

3.1 Simplifications of existing features

The first way is to simplify the levels of an existing feature; for example, the code below collapses a 10-level scale (1–10) into 3 levels (1–3).

train["SimplOverallQual"] = train.OverallQual.replace(
                                                  {1 : 1, 2 : 1, 3 : 1, # bad
                                                   4 : 2, 5 : 2, 6 : 2, # average
                                                   7 : 3, 8 : 3, 9 : 3, 10 : 3 # good
                                                  })
3.2 Combinations of existing features

The second way is to merge several closely related features into one.

A lecture that may help here is Multivariate Linear Regression - Features and Polynomial Regressions - Housing Prices Predicting, given by Andrew Ng (Stanford University). Note that feature scaling needs particular attention with this approach.

(Figure: Features Choice)

Example syntax:

train["OverallGrade"] = train["OverallQual"] * train["OverallCond"]
3.3 Polynomials on the top 10 existing features
3.3.1 Finding the most important features

Find the most important features relative to the target, i.e. sort the other features by their correlation with SalePrice in descending order.

corr = train.corr()
corr.sort_values(["SalePrice"], ascending = False, inplace = True)
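
To actually see the top of that ranking, print the head of the sorted SalePrice column (a small addition of mine, not part of the quoted code):

print(corr.SalePrice.head(11))  # SalePrice itself comes first with correlation 1.0, then the top 10 features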
3.3.2 Features and Polynomial Regressions

My understanding of this step: the important features are picked out first, and then, to fit the model better, a polynomial transformation is applied to them.

There are three situations in which polynomial terms are used:

  • Theoretical grounds, i.e. the author hypothesizes that the relationship is curved.
  • Inspection of the variables. Before running the regression, do univariate or bivariate checks; a simple scatter plot can reveal whether the relationship is curved.
  • Inspection of the residuals. If a linear model is fitted to data with a curved relationship, the residual scatter plot will show a band of positive residuals in the middle of the range of the predictor (the X axis) and a band of negative residuals at one end, or vice versa; this shows that a linear model is not appropriate. Personally, though, I see this pattern as only one of several symptoms of an inadequate linear model, not a necessary and sufficient condition.

Code example:

train["OverallQual-s2"] = train["OverallQual"] ** 2
train["OverallQual-s3"] = train["OverallQual"] ** 3
train["OverallQual-Sq"] = np.sqrt(train["OverallQual"])

Again, the relevant lecture is Multivariate Linear Regression - Features and Polynomial Regressions - Housing Prices Predicting, given by Andrew Ng (Stanford University), and again scaling deserves special attention.

Interestingly, after running the code above and re-ranking the features by their correlation with SalePrice, the newly created features enter the top 10 most influential features, which shows the polynomial terms are worthwhile.

(Figure: Features and Polynomial Regressions)

4. Further processing after creating the new features

4.1 Handling the remaining missing values
4.1.1 Separating numerical and categorical features

Apart from the target SalePrice, split the features into numerical and categorical ones.

categorical_features = train.select_dtypes(include = ["object"]).columns
numerical_features = train.select_dtypes(exclude = ["object"]).columns
numerical_features = numerical_features.drop("SalePrice")

Here object is the dtype of the categorical features.

4.1.2 Filling the missing values

Missing values in the numerical features are filled with the median.

train_num = train_num.fillna(train_num.median())
4.2 Taking the log

Applying a log transform to skewed numerical features weakens the influence of outliers.

Inspired by Alexandru Papiu's script.
As a general rule of thumb, a skewness with an absolute value > 0.5 is considered at least moderately skewed.

from scipy.stats import skew  # skew() comes from scipy.stats

skewness = train_num.apply(lambda x: skew(x))
skewness = skewness[abs(skewness) > 0.5]
skewed_features = skewness.index
train_num[skewed_features] = np.log1p(train_num[skewed_features])
4.3 Create dummy features for categorical values

Create dummy features for categorical values via one-hot encoding.

(Figure: one-hot encoding)

In regression analysis, a dummy variable (also known as an indicator variable, design variable, Boolean indicator, categorical variable, binary variable, or qualitative variable) takes the value 0 or 1 to indicate whether some categorical effect that may influence the outcome is present. Dummy variables are commonly used for mutually exclusive categories (e.g. smoker / non-smoker).

train_cat = pd.get_dummies(train_cat)

get_dummies acts on the categorical features (documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html). An example of what it does (a tiny runnable version follows the list):

  • Before get_dummies, each column is one feature whose cells hold the original (or preprocessed) values; the feature MSSubClass, for instance, takes values SC20 / SC60 / SC70 ... SC120 ..., and, say, row 23 holds SC120.
  • After get_dummies, each column is named after one value of an original feature, so the original MSSubClass column expands into SC20 / SC60 / SC70 ... SC120 ... columns. Taking SC120 as the example: since row 23 originally held SC120, the new SC120 column has a 1 in row 23, while rows whose original value was not SC120 get a 0.
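
A tiny self-contained example (mine, not from the kernel) of the behaviour just described:

import pandas as pd

df = pd.DataFrame({"MSSubClass": ["SC20", "SC120", "SC20"]})
print(pd.get_dummies(df))
# expected output (0/1 in older pandas, booleans in newer versions):
#    MSSubClass_SC120  MSSubClass_SC20
# 0                 0                1
# 1                 1                0
# 2                 0                1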

5. Modeling

5.1 Combining the data and validation

5.1.1 Combining the data

First combine the separately processed numerical and categorical features back into one DataFrame (see pandas.concat: http://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.concat.html).

train = pd.concat([train_num, train_cat], axis = 1)
5.1.2 Splitting the dataset

Split the data into a train set and a validation set. In fact, as the author notes afterwards, there is no need to split in advance when using cross-validation, because cross-validation does the splitting itself.

X_train, X_test, y_train, y_test = train_test_split(train, y, test_size = 0.3, random_state = 0)

train_test_split(train, y, test_size = 0.3, random_state = 0) randomly splits arrays or matrices into train and test subsets.

5.1.3 standardize numerical features

Reason why we need standardization of the numerical features: many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance of the same order. If a feature has a variance that is orders of magnitude larger than the others, it may dominate the objective function and make the estimator unable to learn from the other features as expected.

stdSc = StandardScaler()    
X_train.loc[:, numerical_features] = stdSc.fit_transform(X_train.loc[:, numerical_features])
X_test.loc[:, numerical_features] = stdSc.transform(X_test.loc[:, numerical_features])

Here StandardScaler() standardizes features by removing the mean and scaling to unit variance; the result is the data after standardization.

(Figures: StandardScaler fit, fit_transform and transform)

In short, fit_transform = fit + transform. fit computes the mean and variance of X_train, and transform then performs the standardization based on that fit. The mean and variance are stored, so the second call, on X_test, does not recompute them: the mean and variance of X_train are reused and transform is called directly.

Note that standardization (the fit call) must not happen before the data are split into training and test/validation sets, because we do not want the StandardScaler to include the test set when computing the mean and variance; the test set should be scaled with the same mean and variance as the train set.

5.1.4 Define error measure

Define error measure for official scoring : RMSE

scorer = make_scorer(mean_squared_error, greater_is_better = False)

Here make_scorer builds a scorer from a performance metric or loss function. The first argument is the metric (here a loss function); the second argument, greater_is_better, states whether the first argument is a score function (default True) or a loss function (False).

The author defines two functions, one for the train-set RMSE and one for the test-set RMSE. Personally I think this is a mistake: with cross-validation, the whole dataset serves as both train and test set across the different iterations, so there is no need to compute a train-set RMSE and a test-set RMSE separately.

def rmse_cv_train(model):
    rmse = np.sqrt(-cross_val_score(model, X_train, y_train, scoring = scorer, cv = 10))
    return(rmse)

def rmse_cv_test(model):
    rmse = np.sqrt(-cross_val_score(model, X_test, y_test, scoring = scorer, cv = 10))
    return(rmse)

Because scorer here is a loss function, cross_val_score returns negative values, so a minus sign is needed before the square root to sign-flip the outcome of the scorer.

If scorer were a score function instead, the return value would be scores : array of float, shape=(len(list(cv)),), i.e. the array of scores of the estimator for each run of the cross-validation.

(Figure: K-fold cross-validation)

BTW, Trust your CV score, and not LB score. The leaderboard score is scored only on a small percentage of the full test set. In some cases, it’s only a few hundred test cases. Your cross-validation score will be much more reliable in general.

5.2 Linear Regression without regularization

lr = LinearRegression()
lr.fit(X_train, y_train)
y_train_pred = lr.predict(X_train)
y_test_pred = lr.predict(X_test)

Here LinearRegression() is ordinary least squares linear regression.

Plotting lets us inspect the results: plot the residuals, and plot the predictions against the actual values.

# residual plot
plt.scatter(y_train_pred, y_train_pred - y_train, c = "blue", marker = "s", label = "Training data")
plt.scatter(y_test_pred, y_test_pred - y_test, c = "lightgreen", marker = "s", label = "Validation data")

# predictions vs. actual values
plt.scatter(y_train_pred, y_train, c = "blue", marker = "s", label = "Training data")
plt.scatter(y_test_pred, y_test, c = "lightgreen", marker = "s", label = "Validation data")

5.3 Linear Regression with Ridge regularization (L2 penalty)

Regularization is a very useful method to handle collinearity, filter out noise from data, and eventually prevent overfitting. The concept behind regularization is to introduce additional information (bias) to penalize extreme parameter weights. The goal of this learning problem is to find a function that fits or predicts the outcome (label) and minimizes the expected error over all possible inputs and labels.


L1 penalty: the absolute sum of the weights.

L2 penalty: Ridge regression is an L2-penalized model in which the squared sum of the weights is simply added to the cost function.

For the difference between the L1 and L2 penalties, see the "As Regularization/loss function" part of the article referenced above; lecture 14-4, "General Regularizers", of Hsuan-Tien Lin's Machine Learning Foundations also explains it in detail.
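
Written out (the standard textbook form, reconstructed here rather than taken from the kernel's screenshots), with alpha the regularization strength:

J_OLS(w)   = \sum_{i=1}^{n} (y_i - w^\top x_i)^2
J_Ridge(w) = \sum_{i=1}^{n} (y_i - w^\top x_i)^2 + \alpha \lVert w \rVert_2^2
J_Lasso(w) = \sum_{i=1}^{n} (y_i - w^\top x_i)^2 + \alpha \lVert w \rVert_1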

5.3.1 Finding a suitable Ridge regression model

First search

ridge = RidgeCV(alphas = [0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 1, 3, 6, 10, 30, 60])     
ridge.fit(X_train, y_train)
alpha = ridge.alpha_ # Estimated regularization parameter.

Here RidgeCV() is ridge regression with built-in cross-validation.

Second search
Note that after the first search returns an alpha, a second search is run using that alpha as the midpoint of a new grid.

ridge = RidgeCV(
alphas = [alpha * .6, alpha * .65, alpha * .7, alpha * .75, alpha * .8, alpha * .85,  alpha * .9, alpha * .95, alpha, alpha * 1.05, alpha * 1.1, alpha * 1.15, alpha * 1.25, alpha * 1.3, alpha * 1.35, alpha * 1.4],  cv = 10)
ridge.fit(X_train, y_train)
alpha = ridge.alpha_

Question: why does the second RidgeCV call add the parameter cv = 10 while the first one does not?

5.3.2 Computing predictions with the chosen Ridge model

First check the RMSE. Personally, I think that if the RMSE.mean() value here is unsatisfactory, you should go back to the earlier steps and adjust them. What counts as unsatisfactory depends on the range of the original data: for data ranging from 0 to 1000 an RMSE of 0.7 is small, but if the range is 0 to 1 it is not small at all, so there is no absolute threshold.

If the RMSE is unsatisfactory, I would first go back and tune alpha and the other parameters; if that is still not enough, re-examine the data, e.g. the polynomial terms or the merging and simplification of features. While tuning, change only one parameter (or one tightly coupled group of parameters) at a time.

print("Ridge RMSE on Training set :", rmse_cv_train(ridge).mean())
print("Ridge RMSE on Test set :", rmse_cv_test(ridge).mean())

If the RMSE is acceptable, compute the predictions.

y_train_rdg = ridge.predict(X_train)
y_test_rdg = ridge.predict(X_test)
5.3.3 Checking the Ridge model with plots

Plot the residuals (predictions minus actual values), for both y_train_rdg and y_test_rdg.

plt.scatter(y_train_rdg, y_train_rdg - y_train, c = "blue", marker = "s", label = "Training data")
plt.scatter(y_test_rdg, y_test_rdg - y_test, c = "lightgreen", marker = "s", label = "Validation data")

To read a residual plot properly, note that a good residual plot has the following characteristics:

(1) the residuals are fairly symmetrically distributed, tending to cluster towards the middle of the plot;
(2) they are clustered around the lower single digits of the y-axis (e.g. 0.5 or 1.5, not 30 or 150);
(3) in general there are no clear patterns.

(Figure: residual plot)

Interpreting residual plots to improve your regression describes in more detail what residual plots mean and gives suggestions on how to fix the problems they reveal.

To express how closely the data follow the chosen model, a commonly used quantity is R-squared. R-squared is a statistical measure of how close the data are to the fitted regression line; it is also known as the coefficient of determination (or, for multiple regression, the coefficient of multiple determination). 0% indicates that the model explains none of the variability of the response data around its mean. In general, the higher the R-squared (its range is [0, 1]), the better the model fits your data, although there are important caveats to this rule of thumb.
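
In symbols (the standard definition, added here for reference):

R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}

where \hat{y}_i are the fitted values and \bar{y} is the mean of the observed responses.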

Specifically:

  • If the residuals are Y-axis unbalanced,
(Figure: Y-axis unbalanced residuals)

the fix is:
The solution to this is almost always to transform your data (typically the response variable, e.g. by taking the log).
It is also possible that your model lacks a variable.

  • If the residuals show a non-linear pattern,
(Figure: non-linear residual pattern)

Note that if a slightly better model fits as below,

(Figure: suboptimal straight-line fit)

then for such a model: if you are after a quick understanding of the relationship, the straight line is a pretty decent approximation; but if the model is to be used for prediction rather than explanation, the most accurate model would probably account for that curve.

Possible fixes:
Sometimes patterns like this indicate that a variable needs to be transformed.
If the pattern is really as clear as in these examples, you probably need to create a nonlinear model (it is not as hard as it sounds).
Or, as always, the issue may be a missing variable.

  • Outliers
(Figure: outliers in the residual plot)

Possible fixes:

It’s possible that this is a measurement or data entry error, where the outlier is just wrong, in which case you should delete it.
It’s possible that what appears to be just a couple outliers is in fact a power distribution. Consider transforming the variable if one of your variables has an asymmetric distribution (that is, it’s not remotely bell-shaped).
If it is indeed a legitimate outlier, you should assess the impact of the outlier.

  • Large Y-axis data points
(Figure: large y-axis data points)

Possible fixes:
Even though this approach wouldn’t work in the specific example above, it’s almost always worth looking around to see if there’s an opportunity to usefully transform a variable.
If that doesn’t work, though, you probably need to deal with your missing variable problem.

  • X-axis unbalanced
(Figure: X-axis unbalanced residuals)

This pattern does not necessarily mean the model predicts poorly; look at the Predicted vs Actual plot. The fit may still be good (the residuals are unbalanced but the predictions are accurate), and it is also possible that prediction gets worse after some attempts at fine-tuning.

Possible fixes:
The solution to this is almost always to transform your data, typically an explanatory variable. (The example referenced above concerns transforming the response variable, but the same process is helpful here.)
It is also possible that your model lacks a variable.

Plot the predictions directly, for both y_train_rdg and y_test_rdg.

plt.scatter(y_train_rdg, y_train, c = "blue", marker = "s", label = "Training data")
plt.scatter(y_test_rdg, y_test, c = "lightgreen", marker = "s", label = "Validation data")

Plot the most important coefficients: as with other linear models, Ridge's fit method takes arrays X, y and stores the coefficients w of the linear model in its coef_ member:

coefs = pd.Series(ridge.coef_, index = X_train.columns)  # ridge.coef_ holds the weights w
imp_coefs = pd.concat([coefs.sort_values().head(10),
                 coefs.sort_values().tail(10)])
imp_coefs.plot(kind = "barh")

5.4 Linear Regression with Lasso regularization (L1 penalty)

LASSO stands for Least Absolute Shrinkage and Selection Operator. It is an alternative regularization approach in which the squared sum of the weights used in Ridge is replaced by the sum of the absolute values of the weights. Unlike L2 regularization, L1 regularization yields sparse feature vectors: most feature weights are zero. This sparsity is very useful in practice, especially for high-dimensional datasets with many irrelevant features.

5.4.1 Finding a suitable Lasso regression model

As with the Ridge model, finding a suitable Lasso regression model also takes two searches.

First search

lasso = LassoCV(alphas = [0.0001, 0.0003, 0.0006, 0.001, 0.003, 0.006, 0.01, 0.03, 0.06, 0.1, 
                      0.3, 0.6, 1], 
            max_iter = 50000, cv = 10)
lasso.fit(X_train, y_train)
alpha = lasso.alpha_

Second search

Question: as with the Ridge model, the second search passes cv = 10; in addition, the Lasso searches also set max_iter = 50000.

lasso = LassoCV(alphas = [alpha * .6, alpha * .65, alpha * .7, alpha * .75, alpha * .8,  alpha * .85, alpha * .9, alpha * .95, alpha, alpha * 1.05,  alpha * 1.1, alpha * 1.15, alpha * 1.25, alpha * 1.3, alpha * 1.35, alpha * 1.4], max_iter = 50000, cv = 10)
lasso.fit(X_train, y_train)
alpha = lasso.alpha_
5.4.2 Computing predictions with the chosen Lasso model

As with the Ridge model, first print the RMSE.

print("Lasso RMSE on Training set :", rmse_cv_train(lasso).mean())
print("Lasso RMSE on Test set :", rmse_cv_test(lasso).mean())

Compute the predictions.

y_train_las = lasso.predict(X_train)
y_test_las = lasso.predict(X_test)
5.4.3 Checking the Lasso model with plots

As with the Ridge model, there are three steps here:

  1. Plot the residuals (predictions minus actual values), for both y_train_las and y_test_las.
  2. Plot the predictions directly, for both y_train_las and y_test_las.
  3. Plot the most important coefficients.

Comparing the Lasso and Ridge models: Lasso achieves a better RMSE on both the training and the test set. It is worth noting that Lasso uses only about one third of the available features (this can be checked with the sketch below); it is also worth noting that Lasso seems to give large weights to the Neighborhood categories, which matches intuition, since the neighborhood indeed plays a key role in a house's sale price.
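
The "one third of the features" remark can be verified by counting the non-zero Lasso coefficients (a small sketch of mine, assuming the fitted lasso and X_train from above are in scope):

coefs_las = pd.Series(lasso.coef_, index = X_train.columns)
print("Lasso kept " + str((coefs_las != 0).sum()) + " features and eliminated " + str((coefs_las == 0).sum()) + " features")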

5.5 Linear Regression with ElasticNet regularization (L1 and L2 penalty)

ElasticNet is a compromise between Ridge and Lasso regression. It has an L1 penalty to generate sparsity and an L2 penalty to overcome some of Lasso's limitations, such as the limit on the number of selected variables (Lasso can't select more features than it has observations, but that is not an issue here anyway).

5.5.1 Finding a suitable ElasticNet model

First search

elasticNet = ElasticNetCV(l1_ratio = [0.1, 0.3, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95, 1],
                      alphas = [0.0001, 0.0003, 0.0006, 0.001, 0.003, 0.006, 
                                0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 1, 3, 6], 
                      max_iter = 50000, cv = 10)
elasticNet.fit(X_train, y_train)
alpha = elasticNet.alpha_
ratio = elasticNet.l1_ratio_

Here l1_ratio is a float between 0 and 1 passed to ElasticNet (scaling between the L1 and L2 penalties).
For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.
This parameter can be a list, in which case the different values are tested by cross-validation and the one giving the best prediction score is used. Note that a good choice of list of values for l1_ratio is often to put more values close to 1 (i.e. Lasso) and less close to 0 (i.e. Ridge), as in [.1, .5, .7, .9, .95, .99, 1]

Second search (over the ratio)

Keeping the alpha found previously for now, take ratio values in a range around the ratio from the first search and run a second search.

elasticNet = ElasticNetCV(l1_ratio = [ratio * .85, ratio * .9, ratio * .95, ratio, ratio * 1.05, ratio * 1.1, ratio * 1.15],
                      alphas = [0.0001, 0.0003, 0.0006, 0.001, 0.003, 0.006, 0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 1, 3, 6], 
                      max_iter = 50000, cv = 10)
elasticNet.fit(X_train, y_train)
alpha = elasticNet.alpha_
ratio = elasticNet.l1_ratio_

One detail to note: elasticNet.l1_ratio_ must lie in [0, 1], so it cannot go outside that range; if a candidate value does, fold it back to the nearest bound, i.e. set it to 0 or 1 (see the one-line sketch below).
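
A one-line guard for that (my own sketch, not in the kernel):

ratio = min(1.0, max(0.0, ratio))  # fold l1_ratio back into the valid range [0, 1]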

Third search (over alpha)

Using the ratio already found, take alpha values in a range around the alpha from the first search and run a third search.

elasticNet = ElasticNetCV(l1_ratio = ratio,
                     alphas = [alpha * .6, alpha * .65, alpha * .7, alpha * .75, alpha * .8, alpha * .85, alpha * .9, 
                                alpha * .95, alpha, alpha * 1.05, alpha * 1.1, alpha * 1.15, alpha * 1.25, alpha * 1.3, 
                                alpha * 1.35, alpha * 1.4], 
                      max_iter = 50000, cv = 10)
elasticNet.fit(X_train, y_train)
alpha = elasticNet.alpha_
ratio = elasticNet.l1_ratio_

As in the previous search, note that elasticNet.l1_ratio_ must lie in [0, 1]; values outside that range are folded back to the nearest bound, i.e. set to 0 or 1.

5.5.2 Computing predictions with the chosen ElasticNetCV model

As with the Ridge and Lasso models, first print the RMSE of the ElasticNetCV model.

print("ElasticNet RMSE on Training set :", rmse_cv_train(elasticNet).mean())
print("ElasticNet RMSE on Test set :", rmse_cv_test(elasticNet).mean())

Compute the predictions.

y_train_ela = elasticNet.predict(X_train)
y_test_ela = elasticNet.predict(X_test)
5.5.3 Checking the ElasticNetCV model with plots

As with the Ridge and Lasso models, there are three steps here:

  1. Plot the residuals (predictions minus actual values), for both y_train_ela and y_test_ela.
  2. Plot the predictions directly, for both y_train_ela and y_test_ela.
  3. Plot the most important coefficients.

Summary: the best L1 ratio chosen by ElasticNetCV is 1, i.e. a pure Lasso regressor; in fact the model needs no L2 regularization at all to overcome the drawbacks of L1.

Conclusion

With a carefully prepared dataset and well-tuned regularization, Linear Regression gives good predictions, considerably better than simply reusing algorithms that performed well in previous Kaggle competitions.

Appendix

Description of the original data:

File descriptions

train.csv - the training set
test.csv - the test set
data_description.txt - full description of each column, originally prepared by Dean De Cock but lightly edited to match the column names used here
sample_submission.csv - a benchmark submission from a linear regression on year and month of sale, lot square footage, and number of bedrooms
Data fields

Here's a brief version of what you'll find in the data description file.

SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
MSSubClass: The building class (numeric codes; converted to categories in 2.4.1)
MSZoning: The general zoning classification (categorical, no inherent order)
LotFrontage: Linear feet of street connected to property (numerical)
LotArea: Lot size in square feet
Street: Type of road access (categorical)
Alley: Type of alley access (categorical)
LotShape: General shape of property (categorical)
LandContour: Flatness of the property (categorical)
Utilities: Type of utilities available (categorical)
LotConfig: Lot configuration (categorical)
LandSlope: Slope of property (categorical; order is meaningful)
Neighborhood: Physical locations within Ames city limits
Condition1: Proximity to main road or railroad
Condition2: Proximity to main road or railroad (if a second is present)
BldgType: Type of dwelling
HouseStyle: Style of dwelling
OverallQual: Overall material and finish quality
OverallCond: Overall condition rating
YearBuilt: Original construction date
YearRemodAdd: Remodel date
RoofStyle: Type of roof
RoofMatl: Roof material
Exterior1st: Exterior covering on house
Exterior2nd: Exterior covering on house (if more than one material)
MasVnrType: Masonry veneer type
MasVnrArea: Masonry veneer area in square feet
ExterQual: Exterior material quality
ExterCond: Present condition of the material on the exterior
Foundation: Type of foundation
BsmtQual: Height of the basement
BsmtCond: General condition of the basement
BsmtExposure: Walkout or garden level basement walls
BsmtFinType1: Quality of basement finished area
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Quality of second finished area (if present)
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating
HeatingQC: Heating quality and condition
CentralAir: Central air conditioning
Electrical: Electrical system
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
Bedroom: Number of bedrooms above basement level
Kitchen: Number of kitchens
KitchenQual: Kitchen quality
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Functional: Home functionality rating
Fireplaces: Number of fireplaces
FireplaceQu: Fireplace quality
GarageType: Garage location
GarageYrBlt: Year garage was built
GarageFinish: Interior finish of the garage
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageQual: Garage quality
GarageCond: Garage condition
PavedDrive: Paved driveway
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet
PoolArea: Pool area in square feet
PoolQC: Pool quality
Fence: Fence quality
MiscFeature: Miscellaneous feature not covered in other categories
MiscVal: $Value of miscellaneous feature
MoSold: Month Sold
YrSold: Year Sold
SaleType: Type of sale
SaleCondition: Condition of sale
