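UCSC Xena now does a good job of aggregating and curating TCGA data; even the expression matrices are already converted.
Look closely, though, and you will see that the RNAseq data on UCSC Xena comes with three download links. The notes below sort them out,
using the cohort TCGA Breast Cancer (BRCA) as an example: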
https://xenabrowser.net/datapages/?cohort=TCGA%20Breast%20Cancer%20(BRCA)&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443
Under gene expression RNAseq:
1: IlluminaHiSeq* (n=1,218) TCGA hub: generated on the Illumina HiSeq 2000 RNA sequencing platform. All values in this dataset are log2(x+1) transformed, where x is the RSEM value. raw_count is the number of raw reads measured for a given transcript/gene, and normalized_count is the normalized quantity. Differential analysis is done on the normalized_count values: RSEM first estimates expression from the counts, and differential expression analysis is then run on those expression estimates.
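To make the log2(x+1) convention concrete, here is a minimal sketch of the forward and inverse transforms. The RSEM numbers are made up for illustration, and the helper names `to_log2` and `from_log2` are my own, not part of Xena or RSEM.

```python
import numpy as np

def to_log2(rsem):
    """Forward transform: y = log2(x + 1), where x is the RSEM value."""
    return np.log2(np.asarray(rsem, dtype=float) + 1.0)

def from_log2(values):
    """Inverse transform: x = 2**y - 1, back on the RSEM scale."""
    return np.exp2(np.asarray(values, dtype=float)) - 1.0

# Invented RSEM values, for illustration only.
rsem = np.array([0.0, 1.0, 3.0, 1023.0])

stored = to_log2(rsem)          # the scale the downloaded matrix uses
recovered = from_log2(stored)   # undo the transform if a tool needs x itself

print(stored)
print(recovered)
```

The `+1` keeps genes with an RSEM value of 0 defined, since they map to 0 rather than negative infinity.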
2: IlluminaHiSeq pancan normalized (n=1,218) TCGA hub: if your analysis also uses data from other tumor types, this is the recommended dataset, because the values have been processed across tumor types. TCGA provides RNAseq data for some 30-40 tumor types, so it can serve as a broad background for research on all kinds of tumors.
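As a rough illustration of what "processed across tumor types" could look like, the sketch below combines two toy cohorts, centers each gene on its mean over all samples, and then pulls out one cohort's columns. Reading "mean-normalized per-gene" as plain mean subtraction is my assumption; the exact procedure Xena uses may differ, and every value and sample name here is invented.

```python
import pandas as pd

# Toy log2(x+1) matrices: rows are genes, columns are samples (all invented).
brca = pd.DataFrame(
    {"BRCA-01": [2.0, 8.0], "BRCA-02": [4.0, 6.0]},
    index=["GENE_A", "GENE_B"],
)
kirc = pd.DataFrame(
    {"KIRC-01": [6.0, 4.0], "KIRC-02": [8.0, 2.0]},
    index=["GENE_A", "GENE_B"],
)

# Step 1: put all cohorts side by side in one gene-by-sample matrix.
pancan = pd.concat([brca, kirc], axis=1)

# Step 2: subtract each gene's mean taken over ALL samples (assumed meaning
# of "mean-normalized per-gene across all TCGA samples").
gene_means = pancan.mean(axis=1)
centered = pancan.sub(gene_means, axis=0)

# Step 3: keep only the columns that belong to the cohort of interest.
brca_pancan = centered[brca.columns]

print(brca_pancan)
```

After centering, a positive value means the gene is expressed above its pan-cancer average in that sample, and a negative value means below it.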
3: IlluminaHiSeq percentile (n=1,218) TCGA hub: choose this dataset if you need to compare against data from outside TCGA and the external data is also processed as percentile ranks.
The percentile ranks range from 0 to 100, with lower values representing lower expression. You can also combine the TCGA RNAseq data with your own RNAseq data, normalize the combined dataset with whatever method you choose, and then analyze the combined dataset further.
TCGA Pan-Cancer gene expression: for comparison across multiple or all TCGA cohorts. The dataset is generated at UCSC by combining the "gene expression RNAseq (IlluminaHiSeq)" data (see above) from all TCGA cohorts. No further normalization is performed. (Exact usage still to be checked.)
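If you want to put your own data on the same footing as the percentile dataset, a per-sample percentile rank can be sketched with pandas as below. This is only an approximation: how Xena handles ties and the lower end of the scale is not stated here, so the function `percentile_rank` and its rank-divided-by-gene-count convention are assumptions, and the expression values are invented.

```python
import pandas as pd

# Toy expression matrix: rows are genes, columns are samples (all invented).
expr = pd.DataFrame(
    {"S1": [0.0, 5.0, 10.0, 20.0], "S2": [7.0, 1.0, 3.0, 9.0]},
    index=["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
)

def percentile_rank(matrix):
    """Rank genes within each sample (column) and rescale to the 0-100 range.

    rank(pct=True) returns rank / number_of_genes, so the lowest gene in a
    sample gets 100 / n rather than exactly 0 under this convention.
    """
    return matrix.rank(axis=0, pct=True) * 100.0

pct = percentile_rank(expr)
print(pct)
```

Because ranks depend only on the ordering within a sample, the result is the same whether you start from RSEM values or from their log2(x+1) transform.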
**What values do the TCGA download files contain?**

| Example filename | Values in file |
| --- | --- |
| TCGA_KIRC_exp_HiSeqV2 | Log2(x+1), x is the RSEM value |
| TCGA_KIRC_exp_HiSeqV2_PANCAN | Log2(x+1) value mean-normalized per-gene across all TCGA samples, extracted converted values only belong to this cohort. x is the RSEM value |
| TCGA_KIRC_exp_HiSeqV2_percentile | Percentile ranking of RSEM value per sample, values range from 0 to 100, lower values representing lower expression |
| TCGA_KIRC_gistic2 | Gistic2 value from Broad Firehose |
| TCGA_KIRC_gistic2thd | Gistic2 value discretized to -2,-1,0,1,2 by Broad Firehose |
| TCGA_KIRC_hMethyl27 | beta values |
| TCGA_KIRC_hMethyl450 | beta values |
| TCGA_KIRC_miRNA | Log2(x+1), x is RPKM value |
| TCGA_KIRC_mutation | PANCAN AWG somatic mutation calls |
| TCGA_KIRC_PDMRNAseq | Pathway inference score derived using RNAseq data alone (generated at Firehose) |
| TCGA_KIRC_PDMRNAseqCNV | Pathway inference score derived using RNAseq and copy number data (generated at Firehose) |
| TCGA_KIRC_RPPA | RPPA value |
| TCGA_KIRC_RPPA_RBN | RBN-normalized RPPA value |
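Once a file is downloaded, a quick look at its value range is a cheap way to confirm which row of the table it matches. The sketch below assumes the file is a tab-delimited matrix with identifiers in the first column and one column per sample. The in-memory `toy_file` stands in for a real download, and its sample IDs and values are invented.

```python
import io
import pandas as pd

# Stand-in for a downloaded file such as TCGA_KIRC_exp_HiSeqV2 (contents invented).
toy_file = io.StringIO(
    "sample\tTCGA-XX-0001-01\tTCGA-XX-0002-01\n"
    "GENE_A\t0.0\t3.5\n"
    "GENE_B\t10.2\t9.8\n"
)

# Assumed layout: tab-separated, first column holds the gene identifiers.
expr = pd.read_csv(toy_file, sep="\t", index_col=0)

# Overall minimum and maximum across every gene and sample.
lowest = expr.min().min()
highest = expr.max().max()

print(expr.shape)
print(lowest, highest)
```

For a real file, pass its path to `pd.read_csv` instead of `toy_file`. A log2(x+1) matrix should have no negative values, a percentile file should stay within 0 to 100, and a mean-centered PANCAN file is expected to contain both negative and positive values.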