However, the linear model has distinct advantages in terms of inference and, on real-world problems, is often surprisingly competitive in relation to non-linear methods.
Before moving to the non-linear world, we discuss in this chapter some ways in which the simple linear model can be improved, by replacing plain least squares fitting with some alternative fitting procedures.
Why might we want to use another fitting procedure instead of least squares? As we will see, alternative fitting procedures can yield better prediction accuracy and model interpretability.
• Prediction Accuracy: If n ≫ p, that is, if n, the number of observations, is much larger than p, the number of variables, then the least squares estimates tend to have low variance, and hence will perform well on test observations. However, if n is not much larger than p, there can be a lot of variability in the least squares fit, resulting in overfitting and consequently poor predictions on future observations not used in model training.
• Model Interpretability: Least squares is extremely unlikely to yield any coefficient estimates that are exactly zero, so variables that are in fact unrelated to the response remain in the model, adding unnecessary complexity and making the fit harder to interpret.
In this chapter, we discuss three important classes of methods.
• Subset Selection. This approach involves identifying a subset of the p predictors that we believe to be related to the response. We then fit a model using least squares on the reduced set of variables.
• Shrinkage. This approach involves fitting a model involving all p predictors. However, the estimated coefficients are shrunken towards zero relative to the least squares estimates. This shrinkage (also known as regularization) has the effect of reducing variance. Depending on what type of shrinkage is performed, some of the coefficients may be estimated to be exactly zero. Hence, shrinkage methods can also perform variable selection.
• Dimension Reduction. This approach involves projecting the p predictors into an M-dimensional subspace, where M < p. This is achieved by computing M different linear combinations, or projections, of the variables. Then these M projections are used as predictors to fit a linear regression model by least squares.
Subset Selection
Best Subset Selection
Stepwise Selection
For computational reasons, best subset selection cannot be applied with very large p. Best subset selection may also suffer from statistical problems when p is large. The larger the search space, the higher the chance of finding models that look good on the training data, even though they might not have any predictive power on future data. Thus an enormous search space can lead to overfitting and high variance of the coefficient estimates.
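To make the computational burden concrete, here is a minimal sketch of best subset selection on a small simulated data set (a hypothetical example assuming NumPy; the data, the rss helper, and the choice of p are illustrative). For each model size k it keeps the subset of predictors with the smallest training RSS; with p predictors there are 2^p candidate models to search.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 8                       # already 2**8 = 256 candidate models
X = rng.normal(size=(n, p))
beta = np.array([3.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0])  # only 3 relevant predictors
y = X @ beta + rng.normal(size=n)

def rss(cols):
    """Training RSS of the least squares fit on the given columns (plus an intercept)."""
    Xs = np.column_stack([np.ones(n), X[:, list(cols)]])
    coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return np.sum((y - Xs @ coef) ** 2)

# For each model size k, keep the subset with the smallest training RSS.
best_per_size = {
    k: min(itertools.combinations(range(p), k), key=rss)
    for k in range(1, p + 1)
}
for k, cols in best_per_size.items():
    print(k, cols, round(rss(cols), 2))
# Choosing among these p models requires a criterion such as cross-validation,
# AIC, BIC, or adjusted R^2, because training RSS always falls as variables are added.
```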
Shrinkage Methods
The subset selection methods described in Section 6.1 involve using least squares to fit a linear model that contains a subset of the predictors. As an alternative, we can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero. It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance. The two best-known techniques for shrinking the regression coefficients towards zero are ridge regression and the lasso.
Ridge Regression
Recall from Chapter 3 that the least squares fitting procedure estimates β0, β1, . . . , βp using the values that minimize

RSS = Σi ( yi − β0 − Σj βj xij )^2.
Ridge regression is very similar to least squares, except that the coefficients are estimated by minimizing a slightly different quantity. In particular, the ridge regression coefficient estimates β̂^R are the values that minimize

RSS + λ Σj βj^2,

where λ ≥ 0 is a tuning parameter, to be determined separately.
As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small. However, the second term, λ Σj βj^2, called a shrinkage penalty, is small when β1, . . . , βp are close to zero, and so it has the effect of shrinking the estimates of βj towards zero. The tuning parameter λ serves to control the relative impact of these two terms on the regression coefficient estimates. When λ = 0, the penalty term has no effect, and ridge regression will produce the least squares estimates. However, as λ → ∞, the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates will approach zero. Unlike least squares, which generates only one set of coefficient estimates, ridge regression will produce a different set of coefficient estimates, β̂^R_λ, for each value of λ.
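As a minimal illustration (assuming NumPy; the simulated data are hypothetical), the sketch below computes ridge estimates from the closed-form solution (XᵀX + λI)⁻¹Xᵀy on centred, standardized data, so the intercept is not penalized, and shows the size of the coefficient vector shrinking towards zero as λ grows.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 60, 10
X = rng.normal(size=(n, p))
beta = np.concatenate([np.array([4.0, -3.0, 2.0]), np.zeros(p - 3)])
y = X @ beta + rng.normal(size=n)

# Centre y and standardize X so the intercept need not be penalized.
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
yc = y - y.mean()

def ridge_coefficients(lam):
    """Closed-form ridge solution (X'X + lam*I)^{-1} X'y on the centred data."""
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

for lam in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    coef = ridge_coefficients(lam)
    print(f"lambda={lam:7.1f}  ||beta||_2 = {np.linalg.norm(coef):.3f}")
# The size of the coefficient vector shrinks as lambda grows;
# at lambda = 0 the estimates coincide with least squares.
```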
Selecting a good value for λ is critical; we defer this discussion to Section 6.2.3, where we use cross-validation.
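In practice the choice is usually made by comparing the cross-validated error over a grid of candidate λ values. Here is a hedged sketch assuming scikit-learn is available (in its API the penalty parameter is called alpha, and the data and grid below are illustrative).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n, p = 60, 10
X = rng.normal(size=(n, p))
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(size=n)

# Evaluate a grid of candidate lambda values by 5-fold cross-validated MSE.
lambdas = np.logspace(-3, 3, 13)
cv_mse = [
    -cross_val_score(Ridge(alpha=lam), X, y,
                     scoring="neg_mean_squared_error", cv=5).mean()
    for lam in lambdas
]
best_lambda = lambdas[int(np.argmin(cv_mse))]
print("lambda chosen by cross-validation:", best_lambda)
```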
Why Does Ridge Regression Improve Over Least Squares?
Ridge regression’s advantage over least squares is rooted in the bias-variance trade-off. As λ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias.
In particular, when the number of variables p is almost as large as the number of observations n, as in the example in Figure 6.5, the least squares estimates will be extremely variable. And if p > n, then the least squares estimates do not even have a unique solution, whereas ridge regression can still perform well by trading off a small increase in bias for a large decrease in variance. Hence, ridge regression works best in situations where the least squares estimates have high variance. (In other words, when n = p or n < p, least squares has very high variance, but ridge regression buys a large reduction in variance at the cost of a small increase in bias.)
The Lasso
Ridge regression does have one obvious disadvantage. Unlike best subset, forward stepwise, and backward stepwise selection, which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final model. The penalty will shrink all of the coefficients towards zero, but it will not set any of them exactly to zero (unless λ = ∞). This may not be a problem for prediction accuracy, but it can create a challenge in model interpretation in settings in which the number of variables p is quite large.
The lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficient estimates β̂^L_λ are the values that minimize

RSS + λ Σj |βj|.
In statistical parlance, the lasso uses an ℓ1 penalty instead of an ℓ2 penalty; the ℓ1 norm of a coefficient vector β is ‖β‖1 = Σj |βj|.
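To see the variable selection property numerically, the sketch below (assuming scikit-learn; the simulated data and penalty value are illustrative, and scikit-learn's alpha is not numerically identical to the book's λ because its objective scales the RSS by 1/(2n)) fits the lasso and ridge regression to the same sparse data: the lasso sets several coefficients exactly to zero, while ridge regression only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta = np.concatenate([np.array([3.0, -2.0, 1.5]), np.zeros(p - 3)])  # sparse truth
y = X @ beta + rng.normal(size=n)

lasso = Lasso(alpha=0.5).fit(X, y)   # alpha plays the role of the book's lambda
ridge = Ridge(alpha=0.5).fit(X, y)

print("lasso coefficients:", np.round(lasso.coef_, 2))   # several are exactly 0
print("ridge coefficients:", np.round(ridge.coef_, 2))   # small but nonzero
print("variables selected by the lasso:", np.flatnonzero(lasso.coef_ != 0))
```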
The Variable Selection Property of the Lasso
Selecting the Tuning Parameter
Dimension Reduction Methods
The methods that we have discussed so far in this chapter have controlled variance in two different ways, either by using a subset of the original variables, or by shrinking their coefficients toward zero. All of these methods are defined using the original predictors, X1, X2, . . . , Xp. We now explore a class of approaches that transform the predictors and then fit a least squares model using the transformed variables. We will refer to these techniques as dimension reduction methods.
The term dimension reduction comes from the fact that this approach reduces the problem of estimating the p + 1 coefficients β0, β1, . . . , βp to the simpler problem of estimating the M + 1 coefficients θ0, θ1, . . . , θM, where M < p. In other words, the dimension of the problem has been reduced from p + 1 to M + 1.
Principal Components Regression
Principal components analysis (PCA) is a popular approach for deriving a low-dimensional set of features from a large set of variables. PCA is discussed in greater detail as a tool for unsupervised learning in Chapter 10. Here we describe its use as a dimension reduction technique for regression.
PCA is a technique for reducing the dimension of an n × p data matrix X.
There is also another interpretation for PCA: the first principal component vector defines the line that is as close as possible to the data.
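As a concrete sketch (assuming NumPy; the simulated data are hypothetical), the principal component directions can be computed from the singular value decomposition of the centred data matrix; the first loading vector gives the direction of greatest variance, and the corresponding scores are the projections of the observations onto that direction.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 5
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # correlated predictors

Xc = X - X.mean(axis=0)              # centre each column
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

phi1 = Vt[0]                         # loading vector of the first principal component
z1 = Xc @ phi1                       # first principal component scores
print("proportion of variance explained by PC1:", (S[0] ** 2) / (S ** 2).sum())
```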
The Principal Components Regression Approach
Partial Least Squares
Consequently, PCR suffers from a drawback: there is no guarantee that the directions that best explain the predictors will also be the best directions to use for predicting the response.
We now present partial least squares (PLS), a supervised alternative to PCR.
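A minimal comparison sketch follows, assuming scikit-learn (the data, the number of components M, and the pipelines are illustrative): PCR can be assembled as PCA followed by least squares on the component scores, while PLS chooses its directions using the response as well as the predictors.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
n, p, M = 100, 20, 5
X = rng.normal(size=(n, p))
y = X[:, :3].sum(axis=1) + rng.normal(size=n)

# PCR: standardize, extract M principal components, then least squares on the scores.
pcr = make_pipeline(StandardScaler(), PCA(n_components=M), LinearRegression())
# PLS: directions chosen using both X and y.
pls = make_pipeline(StandardScaler(), PLSRegression(n_components=M))

for name, model in [("PCR", pcr), ("PLS", pls)]:
    mse = -cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5).mean()
    print(f"{name}: 5-fold CV MSE = {mse:.3f}")
```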
What Goes Wrong in High Dimensions?
In order to illustrate the need for extra care and specialized techniques for regression and classification when p > n, we begin by examining what can go wrong if we apply a statistical technique not intended for the high-dimensional setting. For this purpose, we examine least squares regression. But the same concepts apply to logistic regression, linear discriminant analysis, and other classical statistical approaches.
When the number of features p is as large as, or larger than, the number of observations n , least squares as described in Chapter 3 cannot (or rather, should not) be performed.
The reason is simple: regardless of whether or not there truly is a relationship between the features and the response, least squares will yield a set of coefficient estimates that result in a perfect fit to the data, such that the residuals are zero.
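The sketch below (assuming NumPy; the simulated features and response are pure noise, so there is no true relationship) demonstrates this: with p ≥ n, least squares reproduces the training responses exactly.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 20, 25                          # more features than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)                 # response is pure noise, unrelated to X

design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
residuals = y - design @ coef
print("training RSS:", np.sum(residuals ** 2))   # essentially zero: a perfect fit
# Despite the perfect training fit, this model has no predictive value on new data.
```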
When we perform the lasso, ridge regression, or other regression procedures in the high-dimensional setting, we must be quite cautious in the way that we report the results obtained.
In the high-dimensional setting, we can never know exactly which variables (if any) truly are predictive of the outcome, and we can never identify the best coefficients for use in the regression. At most, we can hope to assign large regression coefficients to variables that are correlated with the variables that truly are predictive of the outcome.
It is also important to be particularly careful in reporting errors and measures of model fit in the high-dimensional setting. We have seen that when p > n, it is easy to obtain a useless model that has zero residuals. Therefore, one should never use sum of squared errors, p-values, R2 statistics, or other traditional measures of model fit on the training data as evidence of a good model fit in the high-dimensional setting.
It is instead important to report results on an independent test set, or to report cross-validation errors. The MSE on an independent test set is a valid measure of model fit, but the MSE on the training set certainly is not.
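Building on the earlier p ≥ n sketch (same assumptions: NumPy and simulated noise data), comparing the training MSE with the MSE on a held-out test set makes the point numerically.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 20, 25
X_train, X_test = rng.normal(size=(n, p)), rng.normal(size=(n, p))
y_train, y_test = rng.normal(size=n), rng.normal(size=n)   # no true relationship

def design(X):
    """Add an intercept column to the feature matrix."""
    return np.column_stack([np.ones(len(X)), X])

coef, *_ = np.linalg.lstsq(design(X_train), y_train, rcond=None)

def mse(X, y):
    return np.mean((y - design(X) @ coef) ** 2)

print("training MSE:", mse(X_train, y_train))   # essentially zero
print("test MSE    :", mse(X_test, y_test))     # large: the fit does not generalize
```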