GWAS中使用均值、BLUP值还是BLUE值,邓飞老师有两篇帖子进行了介绍,链接如下,根据我的数据,我选择了BLUE值。在这里提醒大家,不要一上来就忽略所有warning,很容易找不到问题。
# 一些教程中,会在开头忽略所有的警告,即以下的code,建议不要这么做
options(lmerControl=list(check.nobs.vs.rankZ = "warning",
check.nobs.vs.nlev = "warning",
check.nobs.vs.nRE = "warning",
check.nlev.gtreq.5 = "warning",
check.nlev.gtr.1 = "warning"))
GWAS和GS使用BLUE值还是BLUP值作为表型值?_邓飞----育种数据分析之放飞自我-CSDN博客
科学网-混合线性模型中BLUE值 VS BLUP值-邓飞的博文 (sciencenet.cn)
不想看原理的话,就是下面这句总结:
“整体而言,BLUP值会向均值收缩(shrinkage),虽然结果是最佳预测,但是校正值的方差变小,当你做GWAS时,不容易找到显著性位点,增加了噪音(noise)。而且在GWAS中,品种是作为随机因子,如果你使用BLUP值,相当于进行了两次收缩(shrinkage)。因此, 比较好的方式是,在one-stage中,将地点,年份,区组作为随机因子,将品种作为固定因子,计算BLUE值”。
1.数据导入
数据中包含188个品种,11个性状,两年,每年分别三个区组,每个区组内5个单株重复的数据。
> setwd("D:/GWAS_phe")
> raw_data <- read.table("phe_raw.txt", header = T, check.names = F, sep = "\t")
> dat=raw_data
> library('magrittr')
> head(dat)
Cul Rep Year TL PA SA AD V NT UFW NR AL UDW MTL
1 1 1 2017 40.37 2.81 8.82 0.695 0.150 94 0.26 58 0.70 0.05 90.68
2 1 1 2017 62.99 4.33 13.62 0.688 0.230 139 0.54 42 1.50 0.04 90.68
3 1 1 2017 90.68 7.30 22.93 0.805 0.460 115 0.36 38 2.39 0.03 90.68
4 1 1 2017 42.09 4.36 13.71 0.837 0.360 89 0.38 22 1.91 0.03 90.68
5 1 1 2017 57.25 5.97 18.75 0.943 0.490 77 0.64 36 1.59 0.03 90.68
6 1 2 2017 25.30 1.76 5.54 0.697 0.097 30 0.23 24 1.05 0.02 35.30
> for(i in 1:3) dat[,i] = as.factor(dat[,i]) #前三列设置为factor
> str(dat)
'data.frame': 5640 obs. of 14 variables:
$ Cul : Factor w/ 188 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Rep : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 2 2 2 2 2 ...
$ Year: Factor w/ 2 levels "2017","2020": 1 1 1 1 1 1 1 1 1 1 ...
$ TL : num 40.4 63 90.7 42.1 57.2 ...
$ PA : num 2.81 4.33 7.3 4.36 5.97 1.76 2.53 1.77 2.72 2.47 ...
$ SA : num 8.82 13.62 22.93 13.71 18.75 ...
$ AD : num 0.695 0.688 0.805 0.837 0.943 ...
$ V : num 0.15 0.23 0.46 0.36 0.49 0.097 0.168 0.141 0.191 0.136 ...
$ NT : int 94 139 115 89 77 30 37 21 34 29 ...
$ UFW : num 0.26 0.54 0.36 0.38 0.64 0.23 0.22 0.3 0.26 0.24 ...
$ NR : int 58 42 38 22 36 24 28 33 23 18 ...
$ AL : num 0.7 1.5 2.39 1.91 1.59 1.05 1.06 0.53 1.33 1.96 ...
$ UDW : num 0.05 0.04 0.03 0.03 0.03 0.02 0.02 0.03 0.02 0.02 ...
$ MTL : num 90.7 90.7 90.7 90.7 90.7 ...
2. R包lme4实现blue值计算
2.1 单性状blue值计算 (以TL为例)
> library('lme4')
> m = lmer(TL ~ Cul + (1|Rep)+(1|Year:Rep) + (1|Year), data = dat)
#固定因子:Cul
#随机因子:Year + Rep + Year:Rep
boundary (singular) fit: see ?isSingular
此时出现了报错boundary (singular) fit: see ?isSingular,边界畸形拟合。
在这个地方我卡了一周,完全不知道是哪里出了问题,后来发现,是Rep!Rep不是单独存在的因子,所以不能出现单独的Rep。
之后参照教程,模型更改为以下写法:
> m = lmer(TL ~ Cul + (1|Year) + (1|Rep%in%Year) + (1|Cul:Year), data = dat)
错误: grouping factors must have > 1 sampled level
再一次报错grouping factors must have > 1 sampled level,只适用于因子大于一组的模型。继续查资料,Rep%in%Year,这个写法是嵌套模型,而Rep表面来看是重复,但是实际是区组效应,很大一部分是由于地块间/地块内的微环境造成的,和年份应该是交叉或互作效应,并不存在嵌套。
再次修改模型,如下:
> m1 = lmer(TL ~ Cul +(1|Year:Rep) + (1|Year), data = dat)
没有再出现报错了!!激动!!!
2.2 模型检验
解决了Rep的问题后,又想到了一个新的问题,各个因子之间的交互关系考不考虑?如果考虑的话,考虑多少?所以,引入了模型检验,根据数据,我列出了以下几种模型:
> m1 = lmer(TL ~ Cul +(1|Year:Rep) + (1|Year), data = dat)
> m2 = lmer(TL ~ Cul +(1|Year:Rep) + (1|Year) +(1|Year:Cul), data = dat)
> m3 = lmer(TL ~ Cul +(1|Year:Rep) + (1|Year) +(1|Rep:Cul), data = dat)
> m4 = lmer(TL ~ Cul +(1|Year:Rep) + (1|Year) +(1|Year:Cul) +(1|Rep:Cul), data = dat)
> m5 = lmer(TL ~ Cul +(1|Year:Rep) + (1|Year) +(1|Year:Cul) +(1|Rep:Cul) +(1|Rep:Cul:Year), data = dat)
五种模型都没有报错,那到底哪一个模型更好呢?可以用anova方法进行模型间的比较:
> anova(m1,m2,m3,m4,m5)
refitting model(s) with ML (instead of REML)
Data: dat
Models:
m1: TL ~ Cul + (1 | Year:Rep) + (1 | Year)
m2: TL ~ Cul + (1 | Year:Rep) + (1 | Year) + (1 | Year:Cul)
m3: TL ~ Cul + (1 | Year:Rep) + (1 | Year) + (1 | Rep:Cul)
m4: TL ~ Cul + (1 | Year:Rep) + (1 | Year) + (1 | Year:Cul) + (1 | Rep:Cul)
m5: TL ~ Cul + (1 | Year:Rep) + (1 | Year) + (1 | Year:Cul) + (1 | Rep:Cul) + (1 | Rep:Cul:Year)
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
m1 191 38747 39981 -19182 38365
m2 192 37937 39177 -18776 37553 811.918 1 < 2.2e-16 ***
m3 192 38749 39989 -19182 38365 0.000 0
m4 193 37866 39113 -18740 37480 885.047 1 < 2.2e-16 ***
m5 194 37768 39022 -18690 37380 99.287 1 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
上述结果看出,m2和m4、m5的效应达到极显著水平,m5的AIC值和BIC值也最低,因此,最优模型为m5。
2.3 计算
> as.data.frame(fixef(m5))
fixef(m5)
(Intercept) 46.85666666
Cul2 -13.87600506
Cul3 -9.26492182
Cul4 -22.73966667
Cul5 -14.38666667
Cul6 -19.41466667
Cul7 4.93466667
Cul8 -13.18400000
Cul9 -9.34900000
Cul10 -3.54966667
植物中,需要使用lsmeans添加截距
> library('lsmeans')
> re=lsmeans(m5,"Cul")
Note: D.f. calculations have been disabled because the number of observations exceeds 3000.
To enable adjustments, add the argument 'pbkrtest.limit = 4733' (or larger)
[or, globally, 'set emm_options(pbkrtest.limit = 4733)' or larger];
but be warned that this may result in large computation time and memory use.
Note: D.f. calculations have been disabled because the number of observations exceeds 3000.
To enable adjustments, add the argument 'lmerTest.limit = 4733' (or larger)
[or, globally, 'set emm_options(lmerTest.limit = 4733)' or larger];
but be warned that this may result in large computation time and memory use.
> re
Cul lsmean SE df asymp.LCL asymp.UCL
1 46.9 11.7 Inf 23.857 69.9
2 33.0 11.7 Inf 9.967 56.0
3 37.6 11.7 Inf 14.579 60.6
4 24.1 11.7 Inf 1.117 47.1
5 32.5 11.7 Inf 9.470 55.5
6 27.4 11.7 Inf 4.442 50.4
7 51.8 11.7 Inf 28.791 74.8
8 33.7 11.7 Inf 10.673 56.7
9 37.5 11.7 Inf 14.508 60.5
10 43.3 11.7 Inf 20.307 66.3
Degrees-of-freedom method: asymptotic
Confidence level used: 0.95
结果中的lsmeans即为各品种该性状的BLUE值。
2.4 结果输出
> blue.df<-data.frame(re)[,1:2]
> colnames(blue.df)<-c('Cul','TL')
> head(blue.df)
Cul TL
1 1 46.85667
2 2 32.98066
3 3 37.59174
4 4 24.11700
5 5 32.47000
6 6 27.44200
> write.table(blue.df, file = "TL_blue.txt", row.names = F, col.names = T, quote = F, sep = "\t")