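Principal component analysis (PCA) is a mathematical method for dimensionality reduction.
The PCA dimensionality-reduction procedure:
1) Standardize the data
2) Compute the covariance matrix
3) Sort the eigenvectors
4) Build the projection matrix
5) Transform the data
Compute the covariance matrix of the sample data across its dimensions (features), then solve for the eigenvalues of this covariance matrix and their corresponding eigenvectors. Sort the eigenvectors by eigenvalue from largest to smallest and assemble them into a new matrix, called the eigenvector matrix or the projection matrix. Then use this projection matrix to transform the sample data. Keeping only the first K dimensions of the transformed data achieves the dimensionality reduction.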
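The five steps above can be sketched in a few lines of base R. This is only an illustrative sketch on toy data: the names X, X.std, W, Z and the choice K = 2 are assumptions made for the example and do not come from the cases below.

```r
# Minimal PCA via eigen decomposition (illustrative sketch on toy data).
set.seed(1)                                        # assumption: fixed seed for reproducibility
X <- matrix(rnorm(200 * 5), nrow = 200, ncol = 5)  # toy data: 200 samples, 5 features

X.std <- scale(X, center = TRUE, scale = TRUE)     # 1) standardize each column
C <- cov(X.std)                                    # 2) covariance matrix (5 x 5)
eig <- eigen(C, symmetric = TRUE)                  # 3) eigenvalues come back in decreasing order
K <- 2                                             # assumption: keep the first 2 components
W <- eig$vectors[, 1:K, drop = FALSE]              # 4) projection matrix (5 x K)
Z <- X.std %*% W                                   # 5) transformed data (200 x K)

# Cross-check against prcomp; the sign of each component is arbitrary,
# so absolute values are compared.
ref <- prcomp(X, center = TRUE, scale. = TRUE)
max(abs(abs(Z) - abs(ref$x[, 1:K])))
```

The last line should print a value close to zero, which suggests that the manual steps and prcomp agree up to the sign of each component.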
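Case 1
Create the dataset
- Use R to simulate a microarray (chip) data matrix with 10000 rows (10000 genes) and 100 columns (100 samples), filled with normally distributed random values with mean 0.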
chip.data<-matrix(rnorm(10000*100,mean=0),nrow=10000,ncol=100)
Result:
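2. Among the 10000 genes, assume that 100 genes differ between the two groups: the first 50 are up-regulated and the other 50 are down-regulated.
1) Draw a random permutation of the integers 1 to 1000 to use as row indices
2) Generate 50*10 normally distributed values with mean 2 and assign them to chip.data, using the first 50 indices from the previous step as row numbers and the first 10 columns; this is the up-regulated set.
3) In the same way, use the next 50 indices and mean -2 to obtain the 50 down-regulated genes
diff.index<-sample(1:1000,1000)
chip.data[diff.index[1:50],1:10]<-rnorm(50*10,mean=2)
chip.data[diff.index[51:100],1:10]<-rnorm(50*10,mean=-2)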
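One way to sanity-check a simulation like this is to compare the block means of the modified cells with the untouched ones. The sketch below rebuilds the matrix under its own names (chip.sim, idx, up, down are illustrative, and the seed is an assumption) and uses separate index ranges, 1:50 and 51:100, so that the up-regulated and down-regulated rows do not overlap.

```r
# Rebuild a simulated matrix and verify the spiked-in differences (sketch).
set.seed(42)                                   # assumption: fixed seed for reproducibility
chip.sim <- matrix(rnorm(10000 * 100, mean = 0), nrow = 10000, ncol = 100)

idx  <- sample(1:1000, 1000)                   # random permutation of 1..1000
up   <- idx[1:50]                              # rows to up-regulate
down <- idx[51:100]                            # rows to down-regulate (disjoint from up)

chip.sim[up,   1:10] <- rnorm(50 * 10, mean = 2)
chip.sim[down, 1:10] <- rnorm(50 * 10, mean = -2)

mean(chip.sim[up,   1:10])     # should be close to 2
mean(chip.sim[down, 1:10])     # should be close to -2
mean(chip.sim[up,   11:100])   # untouched columns, should be close to 0
```

If both assignments used the same 50 rows, the second assignment would overwrite the first and only one shifted block would remain.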
- PCA plot
Usage of the princomp function
Description
princomp performs a principal components analysis on the given numeric data matrix and returns the results as an object of class princomp.
## Default S3 method:
princomp(x, cor = FALSE, scores = TRUE, covmat = NULL,
subset = rep_len(TRUE, nrow(as.matrix(x))), ...)
PCA statistics
chip.data.pca<-princomp(chip.data)
Display the data in chip.data
> chip.data
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] -8.764830e-01 -2.585436e+00 1.7486665932 0.6825088090 0.8905718598 2.2543743674
[2,] 2.756559e+00 9.191507e-01 1.7224333465 2.5164729313 0.3655551313 0.3940460436
[3,] 9.754316e-01 -9.121371e-01 -0.0534088859 0.4711108467 -0.6567994543 -0.9404594391
[4,] -1.443449e+00 6.328793e-01 0.7067575122 -2.0083705142 -0.0641474431 0.5404051953
[5,] -1.678596e+00 -4.086325e-01 -0.6946972480 0.9941794052 1.9677986393 0.4281278343
[6,] 2.318705e+00 2.574536e+00 2.4483722951 3.7352614791 0.6849518201 2.5269332706
[7,] 1.368299e+00 -6.396757e-01 -0.3016863422 -0.9881343210 0.7250075490 -1.1474935276
[8,] 4.547110e-01 -1.388434e+00 0.5724884590 1.3446862438 0.2708813623 0.0768302649
[9,] -3.320154e-01 1.015236e+00 0.0524039788 0.8327729956 1.5803932962 -1.1469311968
[10,] 1.442150e+00 -1.005228e+00 0.9377764607 1.5061633084 -0.7742683227 -1.9687078752
Display the summary statistics
> summary(chip.data.pca)
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
Standard deviation 3.240085 3.2099856 3.1956557 3.1691590 3.1505363 3.13960683 3.11757677 3.10222437 3.07273039 3.05572866
Proportion of Variance 0.105799 0.1038424 0.1029174 0.1012178 0.1000317 0.09933886 0.09794967 0.09698734 0.09515192 0.09410186
Cumulative Proportion 0.105799 0.2096414 0.3125588 0.4137765 0.5138082 0.61314710 0.71109677 0.80808411 0.90323603 0.99733790
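Standard deviation # standard deviation of each component
Proportion of Variance # proportion of variance explained (contribution)
Cumulative Proportion # cumulative proportion of variance explained
The first 10 principal components together account for 0.99733790 of the variance in the data.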
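The three rows of the summary can be recomputed from the component standard deviations, which may help when reading the table. The following is a small sketch on toy data (toy, toy.pca, prop and cumprop are illustrative names, not objects from this case): the proportion of variance is each squared standard deviation divided by their sum, and the cumulative proportion is its running total.

```r
# Recompute "Proportion of Variance" and "Cumulative Proportion" by hand (sketch).
set.seed(7)                                    # assumption: fixed seed for reproducibility
toy <- matrix(rnorm(500 * 4), nrow = 500, ncol = 4)
toy.pca <- princomp(toy)

vars    <- toy.pca$sdev^2                      # variance of each component
prop    <- vars / sum(vars)                    # proportion of variance explained
cumprop <- cumsum(prop)                        # cumulative proportion

round(rbind(sdev = toy.pca$sdev, prop, cumprop), 4)
```

The printed matrix should mirror the layout of summary(toy.pca), with the last cumulative value equal to 1.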
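- Plotting
1) Set the colours for the two groups (50 points each, 100 in total). Replace "2" and "7" with other numbers in the range 1:10 to change the colour of each group
2) plot3d(three-column data for the x, y and z axes, group colours, plot type, radius)
Below, type="s" draws the points as spheres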
colour<-c(rep(2,50),rep(7,50))
library(rgl)
plot3d(chip.data.pca$loadings[,1:3],col=colour,type="s",radius = 0.025)
The result is shown as a 3D plot; use the mouse to rotate it and zoom in or out until you find the clearest viewing angle.
plot3d(chip.data.pca$loadings[,1:3],col=colour,type="l",radius = 0.025)
Result drawn with lines:
Case 2
Load the packages and the dataset
rm(list=ls())
library(pca3d)
library(rgl)
data(metabo)
head(metabo)
About the dataset
Metabolic profiles in tuberculosis.
Description
Relative abundances of metabolites from serum samples of three groups of individuals
Details
A data frame with 136 observations on 425 metabolic variables.
Serum samples from three groups of individuals were compared: tuberculin skin test negative (NEG), positive (POS) and clinical tuberculosis (TB).
PCA computation
Usage of the prcomp function
Principal Components Analysis
Description
Performs a principal components analysis on the given data matrix and returns the results as an object of class prcomp.
## Default S3 method:
prcomp(x, retx = TRUE, center = TRUE, scale. = FALSE,
tol = NULL, rank. = NULL, ...)
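1) Drop the first column of the dataset (the group labels), use the remaining columns as the data, and standardize them (scale.=TRUE)
2) Use the first column of the dataset as the grouping factor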
metabo.pca <- prcomp(metabo[,-1], scale.=TRUE)
groups <- factor(metabo[,1])
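To see why standardizing with scale.=TRUE can matter, the sketch below compares prcomp with and without scaling. It uses the built-in USArrests dataset as a stand-in, which is an assumption made for illustration: it is not the metabo data, and the names pca.raw, pca.std, p.raw and p.std are hypothetical.

```r
# Effect of scale. = TRUE in prcomp, shown on a built-in dataset (sketch).
pca.raw <- prcomp(USArrests, scale. = FALSE)   # variables keep their original units
pca.std <- prcomp(USArrests, scale. = TRUE)    # each variable scaled to unit variance

# Share of variance captured by the first component in each case
p.raw <- summary(pca.raw)$importance["Proportion of Variance", 1]
p.std <- summary(pca.std)$importance["Proportion of Variance", 1]
c(unscaled = p.raw, scaled = p.std)

# With scaling, the total variance equals the number of variables
sum(pca.std$sdev^2)
```

Without scaling, the variable with the largest numeric range tends to dominate the first component, so the unscaled share for PC1 comes out larger here. When variables are measured on very different scales, standardizing is a common choice.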
Summary of the results
> summary(metabo.pca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14
Standard deviation 5.86992 5.38923 4.74978 4.11434 3.88969 3.81589 3.30208 3.09675 2.9872 2.9157 2.80259 2.71364 2.60341 2.56392
Proportion of Variance 0.08146 0.06866 0.05333 0.04002 0.03577 0.03442 0.02578 0.02267 0.0211 0.0201 0.01857 0.01741 0.01602 0.01554
Cumulative Proportion 0.08146 0.15012 0.20345 0.24347 0.27924 0.31366 0.33944 0.36211 0.3832 0.4033 0.42187 0.43928 0.45530 0.47084
Plotting
Usage of pca3d
pca2d {pca3d} R Documentation
Show a three- or two-dimensional plot of a prcomp object
Description
Show a three- two-dimensional plot of a prcomp object or a matrix, using different symbols and colors for groups of data
Usage
pca3d(pca, components = 1:3, col = NULL, title = NULL, new = FALSE,
axes.color = "grey", bg = "white", radius = 1, group = NULL,
shape = NULL, palette = NULL, fancy = FALSE, biplot = FALSE,
biplot.vars = 5, legend = NULL, show.scale = FALSE,
show.labels = FALSE, labels.col = "black", show.axes = TRUE,
show.axe.titles = TRUE, axe.titles = NULL, show.plane = TRUE,
show.shadows = FALSE, show.centroids = FALSE, show.group.labels = FALSE,
show.shapes = TRUE, show.ellipses = FALSE, ellipse.ci = 0.95)
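pca3d(PCA object, grouping factor, whether to draw confidence ellipses, confidence level of the ellipses (default 0.95), whether to draw the separating plane)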
pca3d(metabo.pca, group=groups, show.ellipses=TRUE, ellipse.ci=0.75, show.plane=TRUE)
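The result is shown as a 3D plot; use the mouse to rotate it and zoom in or out until you find the clearest viewing angle.
Remove the surrounding separating plane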
pca3d(metabo.pca, group=groups, show.ellipses=TRUE, ellipse.ci=0.75, show.plane=FALSE)
Result: