Copy Number Variation Analysis Pipeline

简介：
This pipeline is built onto the existing TCGA level 2 data generated by Birdsuite and uses the DNAcopy R-package to perform a circular binary segmentation (CBS) analysis [1]. The GDC further transforms these copy number values into segment mean values, which are equal to log2(copy-number/ 2). Diploid regions will have a segment mean of zero, amplified regions will have positive values, and deletions will have negative values.

The GRCh38 probe-set was produced by mapping probe sequences to the GRCh38 reference genome and can be downloaded at the GDC Reference File Website.

Copy Number Estimation

Numeric focal-level Copy Number Variation (CNV) values were generated with "Masked Copy Number Segment" files from tumor aliquots using GISTIC2 [2], [3] on a project level. Only protein-coding genes were kept, and their numeric CNV values were further thresholded by a noise cutoff of 0.3:

Genes with focal CNV values smaller than -0.3 are categorized as a "loss" (-1)
Genes with focal CNV values larger than 0.3 are categorized as a "gain" (+1)
Genes with focal CNV values between and including -0.3 and 0.3 are categorized as "neutral" (0).

GISTIC2 Command Line Parameters

gistic2 \
-b <base_directory> \
-seg <segmentation_file> \
-mk <marker_file> \
-refgene <reference_gene_file> \
-ta 0.1 \
-armpeel 1 \
-brlen 0.7 \
-cap 1.5 \
-conf 0.99 \
-td 0.1 \
-genegistic 1 \
-gcm extreme \
-js 4 \
-maxseg 2000 \
-qvt 0.25 \
-rx 0 \
-savegene 1 \
(-broad 1)

[1] Olshen, Adam B., E. S. Venkatraman, Robert Lucito, and Michael Wigler. "Circular binary segmentation for the analysis of array-based DNA copy number data." Biostatistics 5, no. 4 (2004): 557-572.
[2] Mermel, Craig H., Steven E. Schumacher, Barbara Hill, Matthew L. Meyerson, Rameen Beroukhim, and Gad Getz. "GISTIC2. 0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers." Genome biology 12, no. 4 (2011): R41.
[3] Beroukhim, Rameen, Craig H. Mermel, Dale Porter, Guo Wei, Soumya Raychaudhuri, Jerry Donovan, Jordi Barretina et al. "The landscape of somatic copy-number alteration across human cancers." Nature 463, no. 7283 (2010): 899.

根据曾老师的例子下载数据

wget -c -r -np -nH -k -L --cut-dirs 6 -p  -A "*snp_6*hg19*Level_3*"  http://gdac.broadinstitute.org/runs/stddata__2016_01_28/data/BRCA/20160128/

如果要下载其它癌症种类，只需要改变url里面的BRCA即可。如果要下载其它类型的数据，只需要改变-A 后面的匹配规则即可，其实就是打开上面url看到的几十个文件的文件名的规律。
打开网站可以看到

GDC癌症列表

$ cd Downloads/
$ wget -c -r -np -nH -k -L --cut-dirs 6 -p  -A "*snp_6*hg19*Level_3*"  http://gdac.broadinstitute.org/runs/stddata__2016_01_28/data/BRCA/20160128/
evel_3.2016012800.0  28%[====>               ]   5.19M  7.15KB/s    eta 23m 51s

下载速度限制，估计需要下载半个小时，如果用加速插件估计会快点。反正是在探索，而且时间也不是太长，就继续看教程
下载完成总共耗时50分钟

FINISHED --2019-11-08 10:39:06--
Total wall clock time: 50m 13s
Downloaded: 17 files, 19M in 50m 0s (6.62 KB/s)
Converted links in 0 files in 0 seconds.
(test) Cheng-MacBook-Pro:Downloads chelsea$

ls查看下载内容如下

gdac.broadinstitute.org_BRCA-FFPE.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_hg19__seg.Level_3.2016012800.0.0.tar.gz
gdac.broadinstitute.org_BRCA-FFPE.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_hg19__seg.Level_3.2016012800.0.0.tar.gz.md5
gdac.broadinstitute.org_BRCA-FFPE.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg19__seg.Level_3.2016012800.0.0.tar.gz
gdac.broadinstitute.org_BRCA-FFPE.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg19__seg.Level_3.2016012800.0.0.tar.gz.md5
gdac.broadinstitute.org_BRCA.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_hg19__seg.Level_3.2016012800.0.0.tar.gz
gdac.broadinstitute.org_BRCA.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_hg19__seg.Level_3.2016012800.0.0.tar.gz.md5
gdac.broadinstitute.org_BRCA.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg19__seg.Level_3.2016012800.0.0.tar.gz
gdac.broadinstitute.org_BRCA.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg19__seg.Level_3.2016012800.0.0.tar.gz.md5

可以看到，下载的文件中有一个名称里面包含Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg19__seg.Level_3.2016012800.0.0.,其中minus了germline的CNV的就是需要分析的somatic CNV.

#先安装一下bedtools
(test) Cheng-MacBook-Pro:Downloads chelsea$ cd ~
(test) Cheng-MacBook-Pro:~ chelsea$ cd biosoft/
(test) Cheng-MacBook-Pro:biosoft chelsea$ conda install -y bedtools
...
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
(test) Cheng-MacBook-Pro:biosoft chelsea$

拿到CNV做什么？

首先两个segment文本文件已经可以直接载入IGV查看所有BRCA样本的CNV情况，如下所示：

还可以进行CNV深度分析

注释基因

既然是对基因组片段做基因注释，那么首先就需要拿到基因的坐标信息咯，我是在gencode数据库里面下载，然后解析成下面的bed格式的，如下：

head ~/reference/gtf/gencode/protein_coding.hg19.position 
chr1    69091   70008   OR4F5
chr1    367640  368634  OR4F29
chr1    621096  622034  OR4F16
chr1    859308  879961  SAMD11
chr1    879584  894689  NOC2L
chr1    895967  901095  KLHL17
chr1    901877  911245  PLEKHN1
chr1    910584  917473  PERM1
chr1    934342  935552  HES4
chr1    936518  949921  ISG15

然后要把下载的CNV文本文件，转为bed格式的，就是把列的顺序调换一下：

head Features.bed  
chr1    3218610 95674710    TCGA-3C-AAAU-10A-01D-A41E-01    53225   0.0055
chr1    95676511    95676518    TCGA-3C-AAAU-10A-01D-A41E-01    2   -1.6636
chr1    95680124    167057183   TCGA-3C-AAAU-10A-01D-A41E-01    24886   0.0053
chr1    167057495   167059336   TCGA-3C-AAAU-10A-01D-A41E-01    3   -1.0999
chr1    167059760   181602002   TCGA-3C-AAAU-10A-01D-A41E-01    9213    -8e-04
chr1    181603120   181609567   TCGA-3C-AAAU-10A-01D-A41E-01    6   -1.2009
chr1    181610685   201473647   TCGA-3C-AAAU-10A-01D-A41E-01    12002   0.0055
chr1    201474400   201474544   TCGA-3C-AAAU-10A-01D-A41E-01    2   -1.4235
chr1    201475220   247813706   TCGA-3C-AAAU-10A-01D-A41E-01    29781   -4e-04

避免重复造轮子，曾老师用其擅长的bedtools解决这个需求，命令如下：

bedtools intersect -a Features.bed  -b  ~/reference/gtf/gencode/protein_coding.hg19.position  -wa -wb  \
| bedtools groupby -i - -g 1-4 -c 10 -o collapse

注释结果，可以看到，每个CNV片段都注释到了对应的基因，有些特别大的片段，会被注释到非常多的基因。

chr8    42584924    42783715    TCGA-5T-A9QA-01A-11D-A41E-01    CHRNB3,CHRNA6,THAP1,RNF170,HOOK3
chr8    42789728    42793594    TCGA-5T-A9QA-01A-11D-A41E-01    HOOK3
chr8    42797957    42933372    TCGA-5T-A9QA-01A-11D-A41E-01    RP11-598P20.5,FNTA,HOOK3
chr8    70952673    70964372    TCGA-5T-A9QA-01A-11D-A41E-01    PRDM14
chr10   42947970    43833200    TCGA-5T-A9QA-01A-11D-A41E-01    BMS1,RET,RASGEF1A,ZNF33B,CSGALNACT2
chr10   106384615   106473355   TCGA-5T-A9QA-01A-11D-A41E-01    SORCS3
chr10   106478366   107298256   TCGA-5T-A9QA-01A-11D-A41E-01    SORCS3
chr10   117457285   117457859   TCGA-5T-A9QA-01A-11D-A41E-01    ATRNL1
chr11   68990173    69277078    TCGA-5T-A9QA-01A-11D-A41E-01    MYEOV
chr11   76378708    76926535    TCGA-5T-A9QA-01A-11D-A41E-01    LRRC32,B3GNT6,OMP,TSKU,MYO7A,ACER3,CAPN5

找SOMATIC CNVS

仔细看上面IGV的pattern你会发现某些染色体的某些片段经常会扩增或者缺失，这个现象就是人们想研究是recurrent CNV regions，当然不会用肉眼看咯，这时候需要用GISTIC这个软件。找到了recurrent CNV regions同样是需要进行基因注释，才能进行下游分析咯。

PICNIC软件用法：http://www.bio-info-trainee.com/1299.html
GISTIC软件用法：http://www.bio-info-trainee.com/1648.html
TCGA : http://www.biotrainee.com/thread-1696-1-1.html

从今天的教程中知道了CNV的分析还需要用到的软件：IGV，bedtools,picnic,gistic,又有得学了。

2019-11-08 记录CNV数据分析学习（四）