玉米全基因关联分析 ( GWAS Analysis Pipeline )

全基因组关联分析是基于数量性状连锁不平衡理论，通过表型与基因型关联分析获得影响表型性状的显著关联标记。

基本流程

1. 软件

Tassel / Gaipt / FarmCPU / mrMLM / PLINK / Structure / SPAGeDi
以上软件都是在 GWAS 分析中可能用到的软件。

2. 数据

Genotype ( hmp || vcf format file ) + Penotype ( Traits )
关于表型的处理请见前一篇文章。

3. 分析流程

从原始数据的获得到 Call SNP 的流程请参考之前文章。
本文的分析从 hmp/vcf 格式的基因型文件开始。

3.1 基因型过滤

过滤标准如下：

1. Filter SNP Call Rate less than 90 % (max missing rate is 0.1)
2. Filter mult-allelic sites (include only bi-allelic sites)
3. Filter SNP heterozygosity rate more than 0.2 
4. Hardy-Weinberg Equilibrium using an exact test pvalue less than 0.01 
5. Minor Allele Frequency (MAF) <0.05
6. Filter Sample Call Rate less than 90 % (max sample call missing rate is 0.1)
7. Missing Rate < 0.05 (For Structure) **
8. Genetic Diversity (GD > 0.45, For Structure) **

具体操作流程如下：

## 1. 使用 Tassel 5 进行转换，可以用 GUI 操作，命令行操作如下：
# Step 1: sort hmp foramt file
/opt/bio/tassel-5-standalone/run_pipeline.pl -SortGenotypeFilePlugin -inputFile my_genotype.hmp.txt -outputFile my_genotype.sort.hmp.txt -fileType Hapmap

# Step 2 : convert hmp to vcf
/opt/bio/tassel-5-standalone/run_pipeline.pl -fork1 -h my_genotype.sort.hmp.txt -export -exportType VCF

## 切记，需要把基因型中 -- 转换为 NN
## 切记，需要把基因型中 -- 转换为 NN
## 切记，需要把基因型中 -- 转换为 NN

## 2. 使用 Plink 软件对进行基因型过滤
＃　Step3: filter genotype 
/opt/bio/plink-1.9/plink --vcf LiZL_Genotype_Sort.vcf --maf 0.05 --geno 0.05 --mind 0.7 --recode 'vcf' --biallelic-only strict --out LiZL_Genotype_Filter

## Step 3： Convert to normal format
/opt/bio/tassel-5-standalone/run_pipeline.pl  -fork1 -vcf ZXX_Genotype_Filter.vcf -export ZXX_Genotype_Filter.hmp.txt -exportType HapmapDiploid

3.2 群体结构

SNP 标记过滤完成后，使用程序在各条染色体上随机均匀提取一定数目的标记用来计算 Structure 群体结构，之所以不使用全部的标记是因为，全部标记间连锁作用明显并且计算量庞大。
一般使用 5000 个标记进行群体分层估算。同时群体所需要的标记更严格一些，过滤标准如下：

1. missing rate < 0.05,
2. Genetic diversity ( GD > 0.45 )

使用 plink 软件进行 Structure 数据格式转化：

## Step1 sort hapmap
/opt/bio/tassel-5-standalone/run_pipeline.pl -SortGenotypeFilePlugin -inputFile ForStructure.txt  -outputFile For_Q.sort.hmp.txt -fileType Hapmap

## Step2 Convert hapmap
/opt/bio/tassel-5-standalone/run_pipeline.pl -fork1 -h For_Q.sort.hmp.txt -export For_Q -exportType VCF

## Step3 Make for Q matrix
/opt/bio/plink-1.9/plink --vcf For_Q.vcf --recode structure --out Structure

数据准备好之后，使用经典的 Structure 软件进行群体 K 值估算。
Structure 软件运行参数设置（Command-line version）：
群体结构主要有两个参数文件， mainparams 和 extraparams 文件，具体设置如下：
每个参数均以 #define 开头：

KEY PARAMETERS FOR THE PROGRAM structure. YOU WILL NEED TO SET THESE IN ORDER TO RUN THE PROGRAM. VARIOUS OPTIONS CAN BE ADJUSTED IN THE FILE extraparams.

基本参数配置 ( mainparams ):

Basic Program Parameters

#define MAXPOPS    2      // (int) 预设的亚群数
#define BURNIN    10000   // (int) length of burnin period
#define NUMREPS   20000   // (int) number of MCMC reps after burnin

Input/Output files

#define INFILE   infile   // (str) 输入文件
#define OUTFILE  outfile  //(str) 输出的结果文件

Data file format

#define NUMINDS    100    // (int) 材料数目
#define NUMLOCI    100    // (int) 标记数目
#define PLOIDY       2    // (int) 几倍体
#define MISSING     -9    // (int) 基因型缺失值
#define ONEROWPERIND 0    // (B) 数据格式(单行/双行)

#define LABEL     1     // (B) 是否包含材料名称
#define POPDATA   1     // (B) Input file contains a population identifier
#define POPFLAG   0     // (B) Input file contains a flag which says whether to use popinfo when USEPOPINFO==1
#define LOCDATA   0     // (B) Input file contains a location identifier

#define PHENOTYPE 0     // (B) Input file contains phenotype information
#define EXTRACOLS 0     // (int) Number of additional columns of data before the genotype data start.

#define MARKERNAMES      1  // (B) 是否包含标记名称（首行）
#define RECESSIVEALLELES 0  // (B) data file contains dominant markers (eg AFLPs)
                            // and a row to indicate which alleles are recessive
#define MAPDISTANCES     1  // (B) data file contains row of map distances 
                            // between loci　（plink 软件出来格式有该行 ）

Advanced data file options

#define PHASED           0 // (B) Data are in correct phase (relevant for linkage model only)
#define PHASEINFO        0 // (B) the data for each individual contains a line indicating phase (linkage model)
#define MARKOVPHASE      0 // (B) the phase info follows a Markov model.
#define NOTAMBIGUOUS  -999 // (int) for use in some analyses of polyploid data

控制参数配置 ( extraparams )：

EXTRA PARAMS FOR THE PROGRAM structure. THESE PARAMETERS CONTROL HOW THE PROGRAM RUNS. ATTRIBUTES OF THE DATAFILE AS WELL AS K AND RUNLENGTH ARE SPECIFIED IN mainparams.


PROGRAM OPTIONS

#define NOADMIX     0 // (B) Use no admixture model (0=admixture model, 1=no-admix)
#define LINKAGE     0 // (B) Use the linkage model model 
#define USEPOPINFO  0 // (B) Use prior population information to pre-assign individuals to clusters
#define LOCPRIOR    0 //(B)  Use location information to improve weak data

#define FREQSCORR   1 // (B) allele frequencies are correlated among pops
#define ONEFST      0 // (B) assume same value of Fst for all subpopulations.

#define INFERALPHA  1 // (B) Infer ALPHA (the admixture parameter)
#define POPALPHAS   0 // (B) Individual alpha for each population
#define ALPHA     1.0 // (d) Dirichlet parameter for degree of admixture  (this is the initial value if INFERALPHA==1).

#define INFERLAMBDA 0 // (B) Infer LAMBDA (the allele frequencies parameter)
#define POPSPECIFICLAMBDA 0 //(B) infer a separate lambda for each pop (only if INFERLAMBDA=1).
#define LAMBDA    1.0 // (d) Dirichlet parameter for allele frequencies 

PRIORS

#define FPRIORMEAN 0.01 // (d) Prior mean and SD of Fst for pops.
#define FPRIORSD   0.05  // (d) The prior is a Gamma distribution with these parameters

#define UNIFPRIORALPHA 1 // (B) use a uniform prior for alpha; otherwise gamma prior
#define ALPHAMAX     10.0 // (d) max value of alpha if uniform prior
#define ALPHAPRIORA   1.0 // (only if UNIFPRIORALPHA==0): alpha has a gamma prior with mean A*B, and 
#define ALPHAPRIORB   2.0 // variance A*B^2.  

#define LOG10RMIN     -4.0   //(d) Log10 of minimum allowed value of r under linkage model
#define LOG10RMAX      1.0   //(d) Log10 of maximum allowed value of r
#define LOG10RPROPSD   0.1   //(d) standard deviation of log r in update
#define LOG10RSTART   -2.0   //(d) initial value of log10 r
                         
USING PRIOR POPULATION INFO (USEPOPINFO)

#define GENSBACK    2  //(int) For use when inferring whether an individual is an immigrant, or has an immigrant ancestor in the past GENSBACK generations.  eg, if GENSBACK==2, it tests for immigrant ancestry back to grandparents. 
#define MIGRPRIOR 0.01 //(d) prior prob that an individual is a migrant (used only when USEPOPINFO==1).  This should be small, eg 0.01 or 0.1.
#define PFROMPOPFLAGONLY 0 // (B) only use individuals with POPFLAG=1 to update P.

LOCPRIOR MODEL FOR USING LOCATION INFORMATION

#define LOCISPOP      1    //(B) use POPDATA for location information 
#define LOCPRIORINIT  1.0  //(d) initial value for r, the location prior
#define MAXLOCPRIOR  20.0  //(d) max allowed value for r

OUTPUT OPTIONS

#define PRINTNET     1 // (B) Print the "net nucleotide distance" to screen during the run
#define PRINTLAMBDA  1 // (B) Print current value(s) of lambda to screen
#define PRINTQSUM    1 // (B) Print summary of current population membership to screen

#define SITEBYSITE   0  // (B) whether or not to print site by site results. (Linkage model only) This is a large file!
#define PRINTQHAT    0  // (B) Q-hat printed to a separate file.  Turn this on before using STRAT.
#define UPDATEFREQ   100  // (int) frequency of printing update on the screen. Set automatically if this is 0.
#define PRINTLIKES   0  // (B) print current likelihood to screen every rep
#define INTERMEDSAVE 0  // (int) number of saves to file during run

#define ECHODATA     1  // (B) Print some of data file to screen to check that the data entry is correct.

(NEXT 3 ARE FOR COLLECTING DISTRIBUTION OF Q:)
#define ANCESTDIST   0  // (B) collect data about the distribution of ancestry coefficients (Q) for each individual
#define NUMBOXES   1000 // (int) the distribution of Q values is stored as a histogram with this number of boxes. 
#define ANCESTPINT 0.90 // (d) the size of the displayed probability interval on Q (values between 0.0--1.0)


MISCELLANEOUS

#define COMPUTEPROB 1     // (B) Estimate the probability of the Data under the model.  This is used when choosing the best number of subpopulations.
#define ADMBURNIN  500    // (int) [only relevant for linkage model]:  Initial period of burnin with admixture model (see Readme)
#define ALPHAPROPSD 0.025 // (d) SD of proposal for updating alpha
#define STARTATPOPINFO 0  // Use given populations as the initial condition for population origins.  (Need POPDATA==1).  It is assumed that the PopData in the input file are between 1 and k where k<=MAXPOPS.
#define RANDOMIZE      1  // (B) use new random seed for each run 
#define SEED        2245  // (int) seed value for random number generator (must set RANDOMIZE=0) 
#define METROFREQ    10   // (int) Frequency of using Metropolis step to update Q under admixture model (ie use the metr. move every i steps).  If this is set to 0, it is never used.(Proposal for each q^(i) sampled from prior.  The goal is to improve mixing for small alpha.)
#define REPORTHITRATE 0 //   (B) report hit rate if using METROFREQ

参数配置中：

"(int)" means that this takes an integer value.
"(B)"   means that this variable is Boolean ( 1 for True, and 0 for False)
"(str)" means that this is a string

Structure 命令行使用：

structure -m mainparams -e extraparams -K MAXPOPS -L NUMLOCI -N NUMINDS -i input file -o output file -D SEED

K 值计算
程序运行结束后可以将输出结果打包上传到下面这个网站，进行Ｋ值估算。
点击进入: StructureHarvester

3.3 亲缘关系

GWAS 混合线性模型中，亲缘关系作为协变量对结果有一定的影响，使用 Tassel / GAPIT / SPAGeDi 等软件可以轻松计算亲缘关系矩阵。

这里我们演示 SPAGeDi 1.4 软件运行过程。
软件下载地址：SPAGeDi-1.4b
为了计算快速，我们随机抽取在染色体上均匀分布的5000个标记进行 Kinship 计算，保证了标记的随机及代表性操作方法如下：

perl Step3.Random_fetch.pl Genotype_Final.hmp.txt > For_K.hmp.txt
perl Step5.Format_kinship.pl For_K.hmp.txt > K.in
## 得到的输入文件需要将 NN 转换为 0

SPAGeDi 软件得到 K 矩阵的后续处理:

1. 对角线加1
2. 小于0的系数用0替换
3. 所有系数加倍

在获得群体结构 Q 矩阵和亲缘关系 K 矩阵后，我们可以使用 Tassel / Gapit / FarmCPU / mrMLM 等软件进行分析，具体操作见下一篇。

3.4 关联分析

在GWAS研究中，当涉及质量性状时一般采用 Logistic 回归模型进行分析，对于数量性状的研究，主要采用线性回归模型进行关联分析。

在 Logistic 回归模型中，基因型是应变量，群体结构和表型是自变量；
而在线性回归模型中，表型是应变量，其他品种、性别、群体结构和基因型数据则是自变量。

线性模型包括两种：一般线性模型（general linear model，GLM）和混合线性模型(mixed linear model，MLM)。复杂数量性状通常受到多种因素的共同影响，而混合模型中可以加入固定效应和随机效应，因此，以研究数量性状的全基因组关联分析方法常采用混合线性模型进行分析。

在GWAS中，群体分层（population stratification）和多重假设检验（multiple testing adjusting）是引起研究结果分析误差的重要原因。处理这两种误差的一种可能的策略是采用基于家系的关联研究，该方法可以避免群体分成对关联分析结果的影响。所谓群体分层，是指群体内存在等位基因频率不同的亚群体。由于自然选择、遗传漂变、群体分层等诸多因素都会影响到群体中的连锁不平衡，因此，在进行关联分析时, 一些非原因等位基因也可以同真实 QTL 形成连锁不平衡表现为与研究性状关联，从而导致伪关联或假阳性的出现。

## Step3.Random_fetch.pl
#!/usr/bin/perl -w
use strict;
use Data::Dumper;

open IN, $ARGV[0] || die $!;
my $head = <IN>;
print $head;

my $total;
my %genotype;
while(<IN>){
    chomp;
    my @a = split /\t/, $_;
    $genotype{$a[2]}{$a[0]} = $_;
    if ($a[2] ne "0"){
        $total++;
    }
}
close IN;

for my $chr (1..10){
    my @snp = keys %{$genotype{$chr}};
    my $max = scalar (@snp);
    my $num = ($max / $total) * 5000;
    $num = sprintf ("%d",$num);
    
    my %random = &random($max, $num+1);

    foreach my $k (keys %random){
        my $id  = $snp[$k];
        my $out = $genotype{$chr}{$id};
        print "$out\n";
    }
}

sub random{
    my ($max, $num) = @_;
    my %hash;
    while ((keys %hash) < $num) {
        $hash{int(rand($max))} = 1;
    }
    return %hash;
}

__END__

## Step5.Format_kinship.pl
#!/usr/bin/perl -w
use strict;
use Data::Dumper;

open IN, $ARGV[0] || die $!;
chomp(my $head = <IN>);
my @samples = split /\t/, $head;

my (%hash, @snps);
while(<IN>){
    chomp;
    my $n = 1;
    my @a = split /\t/, $_;
    for (my $i=11;$i<@a;$i++){
        $a[$i] =~ tr/ATCG/1234/;
        my @tmp = split (//, $a[$i]);
        my $allen = join ",",@tmp; 
        $hash{$samples[$i]}{$a[0]} = $allen;
        
    }
    push @snps, $a[0];
}
close IN;

print "307\t0\t0\t5005\t1\t2\n0\n";

print join ("\t", "Name", sort @snps)."\n";
foreach my $sample (sort keys %hash){
    my @out;
    foreach my $snp (sort keys %{$hash{$sample}}){
        push @out,$hash{$sample}{$snp};
    }
    print join ("\t",$sample,@out)."\n";
}


__END__

#!/usr/bin/perl -w
use strict;

open IN, $ARGV[0] || die $!;
my $head = <IN>;
print $head;
while(<IN>){
    chomp;
    my @a = split /\t/, $_;
    for (my $i=1; $i<@a; $i++){
        if ($a[$i] le "0.0"){
            $a[$i] = "0.0";
        }else{
            $a[$i] = 2 * $a[$i];
        }
    }

    print join("\t", @a)."\n";
    
}
close IN;

END

缩略词：

BSSS, Iowa Stiff Stalk Synthetic;
NSS, Non-Stiff Stalk; 
PA, Group A germplasm derived from modern U.S. hybrids; 
PB, Group B germplasm derived from modern U.S. hybrids; 
SPT, Sipingtou; 
SS, Stiff Stalk; 
TS, Tropical or semitropical; 
SSR, Simple sequence repeat (markers); 
PCA, Principal component analysis;
PIC, Polymorphism information content;
RFLP, Restriction fragment length polymorphism; 
SNPs, Single nucleotide polymorphisms;
CIMMYT, International Maize and Wheat Improvement Center;
DNA,Deoxyribonucleicacid; 
GAPIT, Genomic Association and Prediction Integrated Tool; 
GBS, Genotyping by sequencing; 
GD, Genetic diversity; 
LD, Linkage disequilibrium; 
LnP(D), Log likelihood of data; 
LRC, Luda Red Cob; 
Mb, Megabase; 
NJ, Neighborjoining;

最后编辑于：2018.04.02 09:52:42

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 206,378评论 6赞 481
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 88,356评论 2赞 382
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 152,702评论 0赞 342
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 55,259评论 1赞 279
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 64,263评论 5赞 371
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 49,036评论 1赞 285
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 38,349评论 3赞 400
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 36,979评论 0赞 259
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 43,469评论 1赞 300
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 35,938评论 2赞 323
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 38,059评论 1赞 333
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 33,703评论 4赞 323
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 39,257评论 3赞 307
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 30,262评论 0赞 19
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 31,485评论 1赞 262
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 45,501评论 2赞 354
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 42,792评论 2赞 345