Author: Pan

Date: 2020/7/25

我们在(四)(五)再生核中，我们谈到了如何将低维的数据映射到高维来进行一个非线性变换使得数据可分，我们并不需要花费大量的运算去寻找这个映射本身，我们只需要关注高维内积即可。那么这个究竟有什么用呢，我们将在这里谈到它的应用之一——核主元分析(Kernel Principle Component Analysis)。

在谈及这个话题之前，我们需要先了解什么是主元分析(Principle Component Analysis;PCA)，以及它的优缺点，并从优化的角度对PCA进行进一步的解释。

1.主元分析的基本概念

1.首先谈一谈我们为什么需要PCA，这其实和我们的数据特征的相关性有关，为什么我们不希望有冗余的特征呢，其实在(一)机器学习的基本概要中我们已经讨论过这个问题，这个原因是模型的复杂度，如果模型的复杂度太高，我们的泛化误差上界就会很大，在(一)中我们知道泛化误差上界是模型复杂度与样本数量的函数，即： $R_{exp}(f)\geq R_{emp}(f)+\sqrt{\frac{1}{2N}\cdot ln(\frac{d}{\alpha })}$ ;其中d是假设空间的大小，N是样本数量。当我们的数据有冗余特征，模型假设空间内的参数向量个数与维数都会显著增加，即假设空间d增大，很有可能导致模型过拟合，模型的过拟合讲导模型过分追求训练数据上的精确度，对现实数据的泛化能力变差，这是我们不愿意看到的。于是降维就是我们必须的一个手段。而PCA就是降维的其中一种方式。

那么什么是PCA呢。

首先我们对Population PCA进行定义，这个定义并非基于样本，而是基于总体。

我们给出如下定义：

定义：如果 $\vec{X}\in R^{p}$ 是一个均值为 $\vec{\mu}$ ,方差矩阵为 $\Sigma$ 的随机向量，那么从 $X$ 到 $Y$ 的一个映射可以写作： $Y=(X-\vec{\mu})\cdot U$ ；其中， $U$ 是 $p\times k$ 的正交矩阵，且 $U^{T}\cdot \Sigma \cdot U = \Lambda ;\Lambda=diag(\lambda_{1},\lambda_{2}...\lambda_{k});\lambda_{1}\geq\lambda_{2}\geq...\geq\lambda_{k}; 0\leq k \leq p;$

这里的一些性质不禁让我们联想到(二)多元高斯分布与概率图条件独立性假设中的定理1。

因为其实PCA的隐含假设是数据要服从高斯分布，当数据服从于高斯分布时，根据(二)中的定理1， $\vec{X}\tilde{}N(\vec{\mu},\Sigma)$ ;那么 $\vec{Y}\tilde{}N(0,U^{T}\cdot \Sigma \cdot U = \Lambda)$ 我们知道U其实是方差矩阵的正交的特征向量构成的矩阵，那么实际上对于输出的降维向量Y来说，它的方差其实就是数据方差矩阵的特征值，且Y的方差矩阵是对角矩阵，非对角线的元素都等于0，即新产生的Y的特征与特征之间的维度是不相关的。其实，从过程上看，PCA主要包含两个主要组成部分，1.降维。2.使得特征之间的相关性尽可能小。虽然它有这样的优势，但是这个高斯分布的隐含假设其实成为PCA的一个弱点，因为这要求数据样本量足够大，且这样一种线性降维方式会使得降维之后不同类别的数据点之间杂糅，难以区分。

前面我们谈到的定义是针对总体的Population PCA，现在我们来谈谈具体的样本PCA

定义：令 $X=[\vec{x_{1}},\vec{x_{2}},...,\vec{x_{n}}]^{T}$ 为 $n \times p$ 的样本数据矩阵，其中每个特征上的矩阵构成一个向量 $\bar x=\frac{1}{n}\cdot \sum_{i=1}^{n}\vec{x_{i}}$ ; $S=\frac{1}{n}X^{T}HX;H=I_{n}-\frac{1}{n}1_{n}1_{n}^{T}$ ;将S进行特征值分解 $S=GLG^{T}$ ;令 $G_{[p\times p]}\to G_{q[p\times q]}$ ,使得 $\vec{y}=G_{q}^{T}(\vec{x}-\bar{x})=[y_{1},y_{2},...,y_{q}]$ ;其中， $y_{i}$ 是数据矩阵的第i个主元。

事实上，在这个定义中，S这个矩阵我们在(二)多元高斯分布与概率图条件独立性假设中有提到过，H是一个中心矩阵且是个等幂矩阵，我们知道S是个对称矩阵，对称矩阵一定可以进行特征值分解，我们的目的是找到我们的投影矩阵G，即方差矩阵正交的特征向量G，将数据从p维映射到q维。令人好奇的是，数据从高维到低维到底扔掉了哪些特征呢，其实将方差矩阵特征分解后进行对应，我们可以得出 $\sigma_{i,j}=\sum_{k=1}^{p}\lambda_{k}\vec{t_{i}}_{[k]}\vec{t_{j}}_{[k]};\vec{t_{i}}[k]代表第i个特征向量里的第k个元素$ ;也就是是说X上的每一个(协)方差，是特征值在其对应的特征向量上每个维度元素的线性组合。PCA做的其实就是将对(协)方差贡献度小的特征值给丢掉，这个过程使得降维后的数据尽可能保证其信息的完整性。

PCA得到对方差贡献最大的几个主元作为Y的方差，并保留该主元对应的特征向量作为从高维到低维的映射基，将中心化后的数据X投影到每个主元上，得到一组相互之间尽可能线性无关的特征。这表明PCA不是一个简单的从原特征空间中对特征的筛选丢弃过程，而是一个将所有特征线性重构的一个过程。

说到这，其实还是有个地方不确定，那就是我得扔掉多少特征比较合适，其实这个问题上模型留了一个接口，即将这个问题交给应用者进行判断，扔掉多少或者保留多少是个参数。即降维后L矩阵的迹(trace)与降维之后L矩阵的迹(trace)之比.例如，我们想保留80%的特征信息，即上述比率要达到80%时，降维后L的个数即为保留特征的个数。而保留多少将取决于具体问题。

这是基本的PCA，但我们不得不说它的另一个问题。假设一下我们有一组数据，每一行是一张人脸，是一个1000*1000为的像素图片，展开后有1000000维的特征，而数据个数也就1000个，如果我们要对这个数据集做PCA，我们就不得不算一个1000000*1000000的协方差矩阵，再对一个这么大的协方差矩阵进行特征分解，这个计算上是特别划不来的，那么我们可以怎么做呢。这就需要另一种处理方式--主坐标分析(Principle Coordinate Analysis；PCO)

2.主坐标分析PCO

我们首先来看样本协方差矩阵 $S=\frac{1}{n}X^{T}HX;H=I_{n}-\frac{1}{n}1_{n}1_{n}^{T}$ ，这个矩阵的好处在于H是个等幂矩阵，即 $S=\frac{1}{n}X^{T}HX=\frac{1}{n}X^{T}HHX$ ；那么我们可以构造一个新的 $T=\frac{1}{n}HXX^{T}H$ ;因为我们要求的是协方差矩阵的特征向量，我们可以发现S与T之间有个很好的对应。令 $A=X^{T}H;B=HX;S=\frac{1}{n}AB;T=\frac{1}{n}BA$ ;

那么 $S\cdot \vec{v}=\lambda\cdot \vec{v} \to \frac{1}{n}AB\cdot \vec{v}=\lambda\cdot \vec{v}\\ \to \frac{1}{n}BAB\cdot \vec{v}=\lambda\cdot B\vec{v}\\ \to \frac{1}{n}BA(B\cdot \vec{v})=\lambda\cdot (B\cdot\vec{v})\\ \to T(B\cdot \vec{v})=\lambda\cdot (B\cdot \vec{v})$

T与S具有相同的非零特征值，而且很重要的一点是，当我们对S的那些特征值较大且与之对应的特征向量使用上述关系时，有

$HX\cdot V=(I_{n}-\frac{1}{n}\vec{1}_{n}^{T}\vec{1}_{n})X\cdot V=(X-\mu)V$ ;而这个恰好是我们需要求得的PCA定义里的公式，神奇的是我们不需要通过S，直接通过T就可以直接导出我们要的公式，即T的特征分解所得到的特征向量矩阵就是我们要的结果。

这样有什么好处呢，其实可以看出来假如特征很多，我们避免了算S，对于上面那个人脸的问题，我们只需要存一个1000*1000的矩阵T而不是1000000*1000000的矩阵S。这样就省下来许多麻烦。

这就是PCO的精髓，实际上我觉得PCO是PCA的一个补充，具体用哪个看数据量和特征数量，当数据数量很大，但是特征数量很小时，采用PCA;当特征数量很大，但是数据数量并没有那么大时，我们采用PCO会比较方便。

现给出PCA的Python实现(忘记今天有代码要用Markdown了...凑合着看吧，实在不行复制到pycharm或者其他的文本编辑器里也行)：

# ** Statistical Learning **

# File Name: Dimensionality Reduction

# Author: Pan

# Description: An Algorithm to make a dimensionality reduction for our data

# Date: 2020/7/24

# ** --Start-- **

import numpy as np

class Dimensionality_Reduction:

# This class is defined for all dimensionality reduction algorithms,such as PCA,PCO...

# we don't use the numpy function like np.cov() cause we want to show the details of inner mechanics.

def __init__(self,X,retained_ratio):

# This function defines our input data,centric matrix,covariance matrix of data

self.X=X

# X is our input data

self.retained_ratio = retained_ratio

# It is a ratio determine how many features should be dropped and retained.

self.n=np.shape(self.X)[0]

# n is a variable represent the number of our data's row

self.p=np.shape(self.X)[1]

# p is a variable represent the number of our data's column.

def PCA(self):

self.H=np.identity(self.n)-np.dot(np.array([np.ones(self.n)]).T,np.array([np.ones(self.n)]))

# H is a centric matrix with shape nxn; H=[I]-[1]^T(dot)[1]

self.S=self.X.T.dot(self.H).dot(self.X)

# S is the covariance matrix of X to tell us the differentiation between every feature pairs of X.

self.EigenValues,self.EigenVector=np.linalg.eig(self.S)

# By doing this,we get eigenvalues with shape 1xp and the company eigenvectors with shape pxp

self.retained_num=np.argwhere(np.cumsum(sorted(self.EigenValues,reverse=True))/np.trace(self.S)<=self.retained_ratio)[-1]

# it is a formula to get the splitting order betweeen the dropped and the retained.

# get t by calculate ratio(t)=\sigma_{i=1}^{t}{\lambda_{i}} / \sigma_{i=1}^{n}{\lambda_{i}} if ratio==retained_ratio;

self.EigenRanking=self.p-1-np.argsort(self.EigenValues)

# np.argsort can mark the ranking order in every element of the original unsorted vector.

# Since the return value of np.argsort is a vector with accending order,we use p-1 dimension to turn the accending process to the decending.

self.index_selection=np.where(self.EigenRanking<=self.retained_num)

# index_selection is a vector contained true or false of our selected eigenvectors.

# e.g. self.index_selection=[True,False,True,True,False]

self.U=self.EigenVector[self.index_selection]

# This is our important orthogonal matrix with shape (self.retained_num)x(self.p) to accomplish our dimensionality reduction task.

self.Y=(self.X-np.array([np.mean(self.X,axis=0)]*self.n)).dot(self.U.T)

# calculate the dimensionality reduction result by Y=((X-\mu)U.T)

return self.Y

def PCO(self):

self.H=np.identity(self.n)-np.dot(np.array([np.ones(self.n)]).T,np.array([np.ones(self.n)]))

# H is a centric matrix with shape nxn; H=[I]-[1]^T(dot)[1]

self.T=self.H.dot(self.X).dot(self.X.T).dot(self.H)

# S can be replaced by T; by doing this,we can calculate the reduction output directly.

self.EigenValues, self.EigenVector = np.linalg.eig(self.T)

# By doing this,we get eigenvalues with shape 1xp and the company eigenvectors with shape pxp

print(self.EigenValues)

self.retained_num =np.argwhere(np.cumsum(sorted(self.EigenValues, reverse=True)) / np.trace(self.T) <= self.retained_ratio)[-1]

# it is a formula to get the splitting order betweeen the dropped and the retained.

# get t by calculate ratio(t)=\sigma_{i=1}^{t}{\lambda_{i}} / \sigma_{i=1}^{n}{\lambda_{i}} if ratio==retained_ratio;

self.EigenRanking = self.p - 1 - np.argsort(self.EigenValues)

# np.argsort can mark the ranking order in every element of the original unsorted vector.

# Since the return value of np.argsort is a vector with accending order,we use p-1 dimension to turn the accending process to the decending.

self.index_selection = np.where(self.EigenRanking <= self.retained_num)

# index_selection is a vector contained true or false of our selected eigenvectors.

# e.g. self.index_selection=[True,False,True,True,False]

self.U = self.EigenVector[self.index_selection]

# the eigenvector matrix is the result we want.

return self.U.T

# ** --END-- **

if __name__ == '__main__':

import seaborn as sns

import matplotlib.pyplot as plt

from sklearn.datasets import load_diabetes

def plot(dfData):

plt.subplots(figsize=(9, 9))

sns.heatmap(dfData, annot=True, vmax=1, square=True, cmap="Blues")

plt.show()

data=load_diabetes()

X=data["data"]

Algorithm_Core=Dimensionality_Reduction(X,0.8)

Y=Algorithm_Core.PCA()

plot(np.corrcoef(X.T))

plot(np.corrcoef(Y.T))

以数据集diabetes为例

降维前的相关系数图：

Fig.1 降维前各维度之间的相关系数热力图

降维后:

Fig.2 降维后各维度之间的相关系数热力图

（六）核主元分析 PART1

（六）核主元分析 PART1

1.主元分析的基本概念

2.主坐标分析PCO