RNA-seq分析简洁版

前面RNA-seq分析：从软件安装到富集分析部分已经把转录组全部流程走完了一遍，这次利用RNA-seq(2)-2:下载数据中下载的肝癌数据进行分析，不在赘述细节，所以有看细节的还是请去这里。
这部分主要是代码和部分结果图，但进行了部分修正，比如KEGG 可视化部分，用了前clusterprofiler部分的结果，所以这部分不包括Gage包。
所以从质量比对开始进行

-----------------------------------------------------------------

3 sra到fastq格式转换并进行质量控制

3.1 数据解压：用samtools中的fastq-dump将sra格式转为fastq格式

备注：用时5个小时

for ((i=2;i<=5;i++));do fastq-dump --gzip --split-3 -A SRR31621$i.sra -O .;done

3.2 用fastqc进行质量控制

#将所有的数据进行质控，得到zip的压缩文件和html文件
fastqc -o .  *.fastq.gz

3.3 3 质控结果解读(为保持紧凑，只放一张图）

SRR316214.png

4 下载参考基因组及基因注释

RNA-seq(4):下载参考基因组及基因注释部分已经下载

5 序列比对：Hisat2

5.1 开始比对：用hisat2，得到SAM文件(5个小时)

我的fastq文件在/mnt/f/rna_seq/data
我的reference在/mnt/f/rna_seq/data/reference
我的index在/mnt/f/rna_seq/data/reference/index/hg19
比对后得到的bam文件会存放在/mnt/f/rna_seq/aligned

 source ~/miniconda3/bin/activate
(base) kelly@DESKTOP-MRA1M1F:/mnt/f/rna_seq/data$ cd ..
(base) kelly@DESKTOP-MRA1M1F:/mnt/f/rna_seq$
for ((i=2;i<=5;i++));do hisat2 -t -x /mnt/f/rna_seq/data/reference/index/hg19/genome -1 /mnt/f/rna_seq/data/SRR31621${i}.sra_1.fastq.gz -2 /mnt/f/rna_seq/data/SRR31621${i}.sra_2.fastq.gz -S SRR31621${i}.sam ;done

(base) kelly@DESKTOP-MRA1M1F:/mnt/f/rna_seq$ for ((i=2;i<=5;i++));do hisat2 -t -x /mnt/f/rna_seq/data/reference/index/hg19/genome -1 /mnt/f/rna_seq/data/SRR31621${i}.sra_1.fastq.gz -2 /mnt/f/rna_seq/data/SRR31621${i}.sra_2.fastq.gz -S SRR31621${i}.sam ;done
Time loading forward index: 00:00:05
Time loading reference: 00:00:01
Multiseed full-index search: 02:11:39
101339131 reads; of these:
  101339131 (100.00%) were paired; of these:
    7840870 (7.74%) aligned concordantly 0 times
    87196406 (86.04%) aligned concordantly exactly 1 time
    6301855 (6.22%) aligned concordantly >1 times
    ----
    7840870 pairs aligned concordantly 0 times; of these:
      333400 (4.25%) aligned discordantly 1 time
    ----
    7507470 pairs aligned 0 times concordantly or discordantly; of these:
      15014940 mates make up the pairs; of these:
        8711709 (58.02%) aligned 0 times
        5441734 (36.24%) aligned exactly 1 time
        861497 (5.74%) aligned >1 times
95.70% overall alignment rate

5.2 SAM文件转换为BAM文件,并对bam文件进行sort,最后建立索引：SAMtools

Sam 文件转换为bam格式（用时1.5hrs）

(base) kelly@DESKTOP-MRA1M1F:/mnt/f/rna_seq/aligned$ for ((i=2;i<=5;i++));do samtools view -S /mnt/f/rna_seq/SRR31621${i}.sam -b > SRR31621${i}.bam;done

对bam文件进行排序，默认染色体位置（用时1.5hrs）

for ((i=2;i<=5;i++));do samtools sort SRR31621${i}.bam
-o SRR31621${i}_sorted.bam;done

建立索引（用时40mins）

for ((i=2;i<=5;i++));do samtools index SRR31621${i}_sorted.bam;done

6 reads计数，合并矩阵并进行注释

6.1 bam文件按reads name排序（用时1h）

for ((i=2;i<=5;i++));do samtools sort -n SRR31621${i}.b
am -o SRR31621${i}_nsorted.bam;done

6.2 2 reads计数，得到表达矩阵`htseq-count`（用时很久很久，久到不忍写，8hrs）

注释文件如果已经解压，则不再需要重复

cd ../data/matrix
gunzip /mnt/f/rna_seq/data/reference/annotation/hg19/gencode.v19.annotation.gt.gz && rm -rf gencode.v19.annotation.gtf.gz
for ((i=2;i<=5;i++));do htseq-count -r name -f bam /mnt/f/rna_seq/aligned/SRR31621${i}_nsorted.bam /mnt/f/rna_seq/data/reference/annotation/hg19/gencode.v19.an
notation.gtf > SRR31621${i}.count; done

ls -al *.count
-rwxrwxrwx 1 root root 1197426 Aug  7 16:25 SRR316212.count
-rwxrwxrwx 1 root root 1186189 Aug  7 17:16 SRR316213.count
-rwxrwxrwx 1 root root 1200305 Aug  7 22:51 SRR316214.count
-rwxrwxrwx 1 root root 1187596 Aug  7 23:32 SRR316215.count

6.3 3 合并表达矩阵并进行基因名注释（R中进行）

Tumor:SRR316214,SRR316215
Adjacent Normal Liver:SRR316212,SRR316213

setwd("F:/rna_seq/data/matrix")
options(stringsAsFactors = FALSE)
control1<-read.table("SRR316212.count",sep = "\t",col.names = c("gene_id","control1"))
control2<-read.table("SRR316213.count",sep = "\t",col.names = c("gene_id","control2"))
treat1<-read.table("SRR316214.count",sep = "\t",col.names = c("gene_id","treat1"))
treat2<-read.table("SRR316215.count",sep = "\t",col.names = c("gene_id","treat2"))
raw_count <- merge(merge(control1, control2, by="gene_id"), merge(treat1, treat2, by="gene_id"))
head(raw_count)
raw_count_filt <- raw_count[-1:-5,]
ENSEMBL <- gsub("\\.\\d*", "", raw_count_filt$gene_id) 
row.names(raw_count_filt) <- ENSEMBL
head(raw_count_filt)

> head(raw_count_filt)
                           gene_id control1 control2 treat2.x treat2.y
ENSG00000000003 ENSG00000000003.10     4004      781      756      756
ENSG00000000005  ENSG00000000005.5        1        0        0        0
ENSG00000000419  ENSG00000000419.8      776      140      165      165
ENSG00000000457  ENSG00000000457.9      624      144      240      240
ENSG00000000460 ENSG00000000460.12      260       52      105      105
ENSG00000000938  ENSG00000000938.8      161       59       16       16

7 DEseq2筛选差异表达基因并注释

这次我换了Annotation包进行注释

7.1 载入数据（countData和colData）

# 这一步很关键，要明白condition这里是因子，不是样本名称；小鼠数据有对照组和处理组，各两个重复
condition <- factor(c(rep("control",2),rep("treat",2)), levels = c("control","treat"))
condition
colData <- data.frame(row.names=colnames(mycounts), condition)

> colData
         condition
control1   control
control2   control
treat1       treat
treat2       treat

7.2 构建dds对象,开始DESeq流程

dds <- DESeqDataSetFromMatrix(countData=mycounts, 
                              colData=colData, 
                              design= ~ condition)
dds = DESeq(dds)

7.3 总体结果查看

接下来，我们要查看treat versus control的总体结果，并根据p-value进行重新排序。利用summary命令统计显示一共多少个genes上调和下调（FDR0.1）

res = results(dds, contrast=c("condition", "control", "treat"))
res = res[order(res$pvalue),]
head(res)
summary(res)
write.csv(res,file="All_results.csv")
table(res$padj<0.01)

> summary(res)
out of 30981 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)     : 3928, 13% 
LFC < 0 (down)   : 3283, 11% 
outliers [1]     : 0, 0% 
low counts [2]   : 10317, 33% 
(mean count < 2)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
> table(res$padj<0.01)
FALSE  TRUE 
16192  4472

可见，上调基因和下调的数目都很多，padj<0.01的有4472，即使padj<0.001的也有3089个。

7.4 提取差异表达genes（DEGs）

先看下padj（p值经过多重校验校正后的值）小于0.01，表达倍数取以2为对数后大于1或者小于-1的差异表达基因。代码如下

diff_gene_deseq2 <-subset(res, padj < 0.01 & abs(log2FoldChange) > 1)
dim(diff_gene_deseq2)
head(diff_gene_deseq2)
write.csv(diff_gene_deseq2,file= "HCC_DEG_0.05_2.csv")

> dim(diff_gene_deseq2)
[1] 3780    6
> head(diff_gene_deseq2)
log2 fold change (MLE): condition control vs treat 
Wald test p-value: condition control vs treat 
DataFrame with 6 rows and 6 columns
                 baseMean log2FoldChange     lfcSE      stat        pvalue          padj
                <numeric>      <numeric> <numeric> <numeric>     <numeric>     <numeric>
ENSG00000130649 78059.342       7.190959 0.2110024  34.07998 1.460506e-254 3.017990e-250
ENSG00000268925  5666.055       7.534595 0.2573299  29.27991 1.869559e-188 1.931628e-184
ENSG00000105697  8791.483       6.611356 0.2338215  28.27523 6.970959e-176 4.801597e-172
ENSG00000167244  8988.099       6.767261 0.2582403  26.20528 2.313301e-151 1.195051e-147
ENSG00000162366  4037.960      -7.076854 0.2704122 -26.17062 5.742578e-151 2.373293e-147
ENSG00000169715  3803.597       6.238596 0.2403373  25.95767 1.489813e-148 5.130914e-145

把padj<0.001,|log2FC|>1有2938个

8 探索分析结果:Data visulization

#MA plot
plotMA(res,ylim=c(-2,2))
topGene <- rownames(res)[which.min(res$padj)]
with(res[topGene, ], {
  points(baseMean, log2FoldChange, col="dodgerblue", cex=6, lwd=2)
  text(baseMean, log2FoldChange, topGene, pos=2, col="dodgerblue")
})

#shirnked
res_order<-res[order(row.names(res)),]
res = res_order
res.shrink <- lfcShrink(dds, contrast = c("condition","treat","control"), res=res)
plotMA(res.shrink, ylim = c(-5,5))
topGene <- rownames(res)[which.min(res$padj)]
with(res[topGene, ], {
  points(baseMean, log2FoldChange, col="dodgerblue", cex=2, lwd=2)
  text(baseMean, log2FoldChange, topGene, pos=2, col="dodgerblue")
})

#identify
idx <- identify(res$baseMean, res$log2FoldChange)
rownames(res)[idx]
#plotcounts
plotCounts(dds, gene=which.min(res$padj), intgroup="condition", returnData=TRUE)
plotCounts(dds, gene="ENSG00000130649", intgroup="condition", returnData=FALSE)
#boxplot
plotCounts(dds, gene="ENSG00000130649", intgroup="condition", returnData=TRUE) %>% 
  ggplot(aes(condition, count)) + geom_boxplot(aes(fill=condition)) + scale_y_log10() + ggtitle("CYP2E1")
#pointplot
d <- plotCounts(dds, gene="ENSG00000130649", intgroup="condition", returnData=TRUE)
ggplot(d, aes(x=condition, y=count)) + 
  geom_point(aes(color= condition),size= 4, position=position_jitter(w=0.5,h=0)) + 
  scale_y_log10(breaks=c(25,100,400))+ ggtitle("CYP2E1")

##3最小padj
d <- plotCounts(dds, gene=which.min(res$padj), intgroup="condition", 
                returnData=TRUE)
ggplot(d, aes(x=condition, y=count)) + 
  geom_point(position=position_jitter(w=0.1,h=0)) + 
  scale_y_log10(breaks=c(25,100,400))

#PCA
vsdata <- vst(dds, blind=FALSE)
plotPCA(vsdata, intgroup="condition")
#beatifule pca plot
pcaData <- plotPCA(vsdata, intgroup=c("condition"), returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color=condition, shape=condition)) +
  geom_point(size=3) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  coord_fixed()

library("pheatmap")
select<-order(rowMeans(counts(dds, normalized = TRUE)),
              decreasing = TRUE)[1:20]
df <- as.data.frame(colData(dds)[,c("condition","sizeFactor")])
# this gives log2(n + 1)
ntd <- normTransform(dds)
pheatmap(assay(ntd)[select,], cluster_rows=FALSE, show_rownames=FALSE,
         cluster_cols=FALSE, annotation_col=df)
#vsd
vsd <- vst(dds, blind=FALSE)
pheatmap(assay(vsd)[select,], cluster_rows=FALSE, show_rownames=FALSE,
         cluster_cols=FALSE, annotation_col=df)
##sample to sample heatmap
sampleDists <- dist(t(assay(vsd)))
library("RColorBrewer")
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(vsd$condition, vsd$type, sep="-")
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows=sampleDists,
         clustering_distance_cols=sampleDists,
         col=colors)

这部分主要结果部分的一些可视化，具体参考详细的部分即可，只放一张图

Rplot.jpeg

9 DEGs的富集分析(功能注释）ClusterProfiler包

##enrichment analysis using clusterprofiler package created by yuguangchuang
library(clusterProfiler)
library(DOSE)
library(stringr)
library(org.Hs.eg.db)
#get the ENTREZID for the next analysis
sig.gene= diff_gene_deseq2
head(sig.gene)
gene<-rownames(sig.gene)
head(gene)
gene.df<-bitr(gene, fromType = "ENSEMBL", 
              toType = c("SYMBOL","ENTREZID"),
              OrgDb = org.Hs.eg.db)

head(gene.df)

> head(gene.df)
          ENSEMBL   SYMBOL ENTREZID
1 ENSG00000130649   CYP2E1     1571
3 ENSG00000105697     HAMP    57817
4 ENSG00000167244     IGF2     3481
5 ENSG00000162366 PDZK1IP1    10158
6 ENSG00000169715     MT1E     4493
7 ENSG00000074276    CDHR2    54825

GO enrichment

#Go enrichment
ego_cc<-enrichGO(gene       = gene.df$ENSEMBL,
                 OrgDb      = org.Hs.eg.db,
                 keyType    = 'ENSEMBL',
                 ont        = "CC",
                 pAdjustMethod = "BH",
                 pvalueCutoff = 0.01,
                 qvalueCutoff = 0.05)
ego_bp<-enrichGO(gene       = gene.df$ENSEMBL,
                 OrgDb      = org.Hs.eg.db,
                 keyType    = 'ENSEMBL',
                 ont        = "BP",
                 pAdjustMethod = "BH",
                 pvalueCutoff = 0.01,
                 qvalueCutoff = 0.05)
barplot(ego_bp,showCategory = 18,title="The GO_BP enrichment analysis of all DEGs ")+ 
  scale_size(range=c(2, 12))+
  scale_x_discrete(labels=function(ego_bp) str_wrap(ego_bp,width = 25))

KEGG enrichment

library(stringr)
kk<-enrichKEGG(gene      =gene.df$ENTREZID,
               organism = 'hsa',
               pvalueCutoff = 0.05)

> kk[1:30]
               ID                                          Description GeneRatio  BgRatio
hsa04610 hsa04610                  Complement and coagulation cascades   38/1407  79/7431
hsa04933 hsa04933 AGE-RAGE signaling pathway in diabetic complications   40/1407  99/7431
hsa00071 hsa00071                               Fatty acid degradation   23/1407  44/7431
hsa04974 hsa04974                     Protein digestion and absorption   35/1407  90/7431
hsa05146 hsa05146                                           Amoebiasis   36/1407  96/7431
hsa04978 hsa04978                                   Mineral absorption   23/1407  51/7431
hsa04115 hsa04115                                p53 signaling pathway   29/1407  72/7431
hsa05144 hsa05144                                              Malaria   22/1407  49/7431
hsa00982 hsa00982                    Drug metabolism - cytochrome P450   28/1407  72/7431
hsa00980 hsa00980         Metabolism of xenobiotics by cytochrome P450   29/1407  76/7431
hsa03320 hsa03320                               PPAR signaling pathway   28/1407  74/7431
hsa05204 hsa05204                              Chemical carcinogenesis   30/1407  82/7431

mykegg<-barplot(kk,showCategory = 20, title="The KEGG enrichment analysis of all DEGs")+
  scale_size(range=c(2, 12))+
  scale_x_discrete(labels=function(kk) str_wrap(kk,width = 25))
 dotplot(kk,showCategory = 20, title="The KEGG enrichment analysis of all DEGs")+
  scale_size(range=c(2, 12))+
  scale_x_discrete(labels=function(kk) str_wrap(kk,width = 25))

用cowplot拼起来

library(cowplot)
plot_grid(mygobp,mykegg,labels = c("A","B"),align = "h")

Rplot01.jpeg

10 KEGG信号通路的可视化（pathview包）

备注，前面详细部分用gageData重新进行了富集分析，这部分直接调用ClusterProfiler包的富集结果

选取enrichKEGG结果pvalue排名第一的hsa4610:Complement and coagulation cascades和排名第7的hsa04115:p53 signaling pathway

library("pathview")
foldchanges = sig.gene$log2FoldChange
names(foldchanges)= gene.df$ENTREZID
head(foldchanges)
pathview(gene.data = foldchanges, pathway.id = "hsa04610", species="hsa")

------------------------------------------------------------------------------

hsa4610:Complement and coagulation cascades

hsa04610.pathview.png

------------------------------------------------------------------------------

hsa04115:p53 signaling pathway

hsa04115.pathview.png

11 把counts结果，clusterprofiler部分的symbol name 和DEGs全部合并到一个表格

备注，这部分的主要问题是没用可以`merge`用的ID，先看下

> head(mycounts)
                control1 control2 treat1 treat2
ENSG00000000003     4004      781   4229    756
ENSG00000000005        1        0      0      0
ENSG00000000419      776      140   1180    165
ENSG00000000457      624      144   1271    240
ENSG00000000460      260       52    610    105
ENSG00000000938      161       59     57     16

> head(gene.df)
          ENSEMBL   SYMBOL ENTREZID
1 ENSG00000130649   CYP2E1     1571
3 ENSG00000105697     HAMP    57817
4 ENSG00000167244     IGF2     3481
5 ENSG00000162366 PDZK1IP1    10158
6 ENSG00000169715     MT1E     4493
7 ENSG00000074276    CDHR2    54825

> head(sig.gene)
log2 fold change (MLE): condition control vs treat 
Wald test p-value: condition control vs treat 
DataFrame with 6 rows and 6 columns
                  ENSEMBL log2FoldChange     lfcSE      stat        pvalue          padj
                <numeric>      <numeric> <numeric> <numeric>     <numeric>     <numeric>
ENSG00000130649 78059.342       7.190959 0.2110024  34.07998 1.460506e-254 3.017990e-250
ENSG00000268925  5666.055       7.534595 0.2573299  29.27991 1.869559e-188 1.931628e-184
ENSG00000105697  8791.483       6.611356 0.2338215  28.27523 6.970959e-176 4.801597e-172
ENSG00000167244  8988.099       6.767261 0.2582403  26.20528 2.313301e-151 1.195051e-147
ENSG00000162366  4037.960      -7.076854 0.2704122 -26.17062 5.742578e-151 2.373293e-147
ENSG00000169715  3803.597       6.238596 0.2403373  25.95767 1.489813e-148 5.130914e-145

所以执行以下代码：

head(sig.gene)
head(gene.df)
ENSEMBL<-rownames(sig.gene)
sig.gene<-cbind(ENSEMBL, sig.gene)
colnames(sig.gene)[1]<-c("ENSEMBL")

head(mycounts)
ENSEMBL<-rownames(mycounts)
mycounts<-cbind(ENSEMBL, mycounts)
colnames(mycounts)[1]<-c("ENSEMBL")

merge1<-merge(sig.gene,mycounts,by="ENSEMBL")
merge2<-merge(merge1,gene.df, by="ENSEMBL")
head(merge2)
write.csv(DEG_symbole,file="hcc_DEGs_last_results.csv")

结果如下：

> head(merge2)
DataFrame with 6 rows and 13 columns
          ENSEMBL   baseMean log2FoldChange     lfcSE      stat       pvalue         padj  control1  control2    treat1    treat2
      <character>  <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric> <integer> <integer> <integer> <integer>
1 ENSG00000000938   70.51956       2.169409 0.4610591  4.705274 2.535248e-06 2.632581e-05       161        59        57        16
2 ENSG00000001460   77.93853      -1.496365 0.4247974 -3.522537 4.274371e-04 2.517116e-03        68        18       268        68
3 ENSG00000001561  382.90350      -1.466940 0.3397462 -4.317751 1.576269e-05 1.347065e-04       416        69      1780       227
4 ENSG00000001626   70.41775       4.713865 0.5812437  8.109965 5.063455e-16 1.944819e-14       226        61        16         1
5 ENSG00000001630  165.86393      -2.283850 0.4094980 -5.577194 2.444286e-08 3.665365e-07        84        28       442       207
6 ENSG00000002549 1683.37557       1.519578 0.2216429  6.855972 7.082930e-12 1.708432e-10      4166      1104      2171       481
       SYMBOL    ENTREZID
  <character> <character>
1         FGR        2268
2       STPG1       90529
3       ENPP4       22875
4        CFTR        1080
5     CYP51A1        1595
6        LAP3       51056

至此，这部分结果可以用来做其他很多下游分析了。

最后编辑于：2018.08.08 15:45:29

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 203,456评论 5赞 477
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 85,370评论 2赞 381
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 150,337评论 0赞 337
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 54,583评论 1赞 273
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 63,596评论 5赞 365
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 48,572评论 1赞 281
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 37,936评论 3赞 395
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 36,595评论 0赞 258
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 40,850评论 1赞 297
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 35,601评论 2赞 321
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 37,685评论 1赞 329
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 33,371评论 4赞 318
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 38,951评论 3赞 307
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 29,934评论 0赞 19
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 31,167评论 1赞 259
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 43,636评论 2赞 349
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 42,411评论 2赞 342

RNA-seq分析简洁版

-----------------------------------------------------------------

3 sra到fastq格式转换并进行质量控制

3.1 数据解压：用samtools中的fastq-dump将sra格式转为fastq格式

3.2 用fastqc进行质量控制

3.3 3 质控结果解读(为保持紧凑，只放一张图）

4 下载参考基因组及基因注释

5 序列比对：Hisat2

5.1 开始比对：用hisat2，得到SAM文件(5个小时)

5.2 SAM文件转换为BAM文件,并对bam文件进行sort,最后建立索引：SAMtools

Sam 文件转换为bam格式（用时1.5hrs）

对bam文件进行排序，默认染色体位置（用时1.5hrs）

建立索引（用时40mins）

6 reads计数，合并矩阵并进行注释

6.1 bam文件按reads name排序（用时1h）

6.2 2 reads计数，得到表达矩阵htseq-count（用时很久很久，久到不忍写，8hrs）

6.3 3 合并表达矩阵并进行基因名注释（R中进行）

7 DEseq2筛选差异表达基因并注释

这次我换了Annotation包进行注释

7.1 载入数据（countData和colData）

7.2 构建dds对象,开始DESeq流程

7.3 总体结果查看

7.4 提取差异表达genes（DEGs）

8 探索分析结果:Data visulization

9 DEGs的富集分析(功能注释）ClusterProfiler包

10 KEGG信号通路的可视化（pathview包）

备注，前面详细部分用gageData重新进行了富集分析，这部分直接调用ClusterProfiler包的富集结果

------------------------------------------------------------------------------

hsa4610:Complement and coagulation cascades

------------------------------------------------------------------------------

hsa04115:p53 signaling pathway

11 把counts结果，clusterprofiler部分的symbol name 和DEGs全部合并到一个表格

备注，这部分的主要问题是没用可以merge用的ID，先看下

结果如下：

推荐阅读更多精彩内容

6.2 2 reads计数，得到表达矩阵`htseq-count`（用时很久很久，久到不忍写，8hrs）

备注，这部分的主要问题是没用可以`merge`用的ID，先看下