We have now entered the era of "big data". We have accumulated so much data that we cannot extract all of the information it contains. At the same time, much of this data may be noise that separates us from the truth. So to learn the underlying patterns in given data, we first need to pre-process and filter it.
Suppose we are given n samples x_1, ..., x_n in d dimensions. If d is very large, which means each x has many features, we may want to do some feature selection (dimensionality reduction) before we start learning. One way is to run principal component analysis (PCA) on these samples. For example, if all sample points in the plane almost lie on one straight line, then that line can be regarded as the 1-dimensional principal component of the data.
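To make the line example concrete, here is a minimal numpy sketch (the toy data and variable names are my own, not from the original text): it generates 2-D points scattered around the line y = 2x and recovers that line's direction as the leading eigenvector of the scatter matrix.

import numpy as np

rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t]) + 0.05 * rng.normal(size=(200, 2))  # points almost on y = 2x

m = X.mean(axis=0)                    # sample mean
S = (X - m).T @ (X - m)               # 2 x 2 scatter matrix
eigvals, eigvecs = np.linalg.eigh(S)  # eigenvalues in ascending order
e1 = eigvecs[:, -1]                   # direction with the largest eigenvalue
print(e1)                             # approximately (1, 2) / sqrt(5), up to sign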
Zero-dimensional representation by PCA
If we use only one vector to represent all sample points, then the best choice (in the sense of minimizing the total squared distance to the samples) is the average of all sample points.
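In symbols, the criterion and its minimizer are

J_0(x_0) = \sum_{k=1}^{n} \|x_0 - x_k\|^2, \qquad
\nabla_{x_0} J_0 = 2\sum_{k=1}^{n}(x_0 - x_k) = 0
\;\Longrightarrow\; x_0 = m = \frac{1}{n}\sum_{k=1}^{n} x_k,

so the optimal zero-dimensional representation is the sample mean m.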
One-dimensional representation by PCA
If we want to find a single line that is close to all sample points and use the projections onto that line to approximate the samples, then the best line must pass through the sample mean.
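Writing the points on the line as x = m + a e, where e is a unit direction vector, the criterion to minimize is

J_1(a_1, \dots, a_n, e) = \sum_{k=1}^{n} \|(m + a_k e) - x_k\|^2,

and minimizing over each coefficient gives a_k = e^T (x_k - m), i.e. every sample is approximated by its orthogonal projection onto the line.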
To find a d'-dimensional principal subspace of the sample points, we approximate each sample by m plus a linear combination of d' unit vectors e_1, ..., e_{d'}; it is equivalent to solve

\min_{\{a_{ki}\},\, e_1,\dots,e_{d'}} \; J_{d'} = \sum_{k=1}^{n} \Big\| \Big( m + \sum_{i=1}^{d'} a_{ki} e_i \Big) - x_k \Big\|^2.
The vectors e_i are all required to have length 1, so we solve the constrained problem with Lagrange multipliers. The result is that every e_i is an eigenvector of the scatter matrix S = \sum_{k=1}^{n} (x_k - m)(x_k - m)^T, that is, S e_i = \lambda_i e_i.
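For a single direction e, the Lagrangian step is short enough to spell out (a standard derivation, included here for completeness): after minimizing over the coefficients a_{ki}, the problem reduces to maximizing the retained scatter e^T S e subject to e^T e = 1, so we set

u = e^T S e - \lambda (e^T e - 1), \qquad
\frac{\partial u}{\partial e} = 2 S e - 2 \lambda e = 0
\;\Longrightarrow\; S e = \lambda e,

and since e^T S e = \lambda at any such solution, the best direction is the eigenvector with the largest eigenvalue.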
S is a d*d matrix which is real, symmetric and positive semidefinite, so its eigenvectors can be chosen orthonormal and its eigenvalues are nonnegative. The optimal e_1, ..., e_{d'} are the eigenvectors corresponding to the d' largest eigenvalues of S. The squared error above then has an explicit expression in terms of S: it is the sum of the eigenvalues other than the d' largest ones. And since the eigenvectors are orthogonal, e_1, ..., e_{d'} span a d'-dimensional subspace centered at the sample mean, which gives the d'-dimensional principal-component representation of the data.
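A minimal numpy sketch of the whole procedure (the function name pca_subspace and the toy data are my own, not from the original text); it also checks numerically that the squared error equals the sum of the discarded eigenvalues.

import numpy as np

def pca_subspace(X, d_prime):
    # Sample mean, top-d' eigenvectors of the scatter matrix,
    # and the reconstruction of X in the d'-dimensional subspace.
    m = X.mean(axis=0)
    Xc = X - m
    S = Xc.T @ Xc                           # d x d scatter matrix
    eigvals, eigvecs = np.linalg.eigh(S)    # eigenvalues in ascending order
    E = eigvecs[:, ::-1][:, :d_prime]       # top d' eigenvectors as columns
    A = Xc @ E                              # coefficients a_{ki} = e_i^T (x_k - m)
    X_hat = m + A @ E.T                     # projections of the samples
    return X_hat, eigvals

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # correlated toy data, d = 5
X_hat, eigvals = pca_subspace(X, d_prime=2)

squared_error = np.sum((X - X_hat) ** 2)
discarded = np.sum(np.sort(eigvals)[:-2])    # all eigenvalues except the 2 largest
print(np.isclose(squared_error, discarded))  # True: error = sum of discarded eigenvalues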