RNA-Seq从入门到放弃

本篇推文用于新手清晰了解并掌握植物RNAseq数据分析流程

一、测序数据的介绍

测序数据主要有两个来源 1、自测的测序数据；2、SRA数据库下载；这里介绍SRA数据库下载。

SRA数据库下载

第一步：获取SRR*******

第二步：安装SRA Toolkit

conda install sra-tools

第三步：下载数据

for i in `seq 2085 2181`;do nohup prefetch SRR650${i} &;done

prefetch程序会将数据下载存储在$home/ncbi/public/sra/目录下

第四步：sra转化成fastq格式

单端测序（single-end）

fastq-dump SRR6502085.sra

双端测序（pair-end）

fastq-dump --split-3 SRR6502085.sra

写个循环批量处理

for file in SRR*;do fastq-dump --split-3 $file;done

md5sum校验文件完整性

md5sum *.fastq > all.md5

md5sum -c all.md5

二、数据的质控与过滤

1、测序数据质量查看

下载FastQC

conda install FastQC

使用FastQC查看数据质量

fastqc -t 8 SRR*.fastq

将输出的html文件下载到本地查看

关于FastQC的内容请参考，这里我就不加赘述：https://zhuanlan.zhihu.com/p/88655260

2、数据质控和过滤

安装fastp

conda install fastp

使用fastp对数据进行质控和过滤

fastp -i SRR6502085.fastq -o SRR6502085.clean.fastq

写个循环批量处理

for file in `ls SRR*.fastq | perl -lpe 's/.fastq//'`;do nohup fastp -i ${file}.fastq -o ${file}.clean.fastq &;done

3、对比质控前后的数据

可以查看文件大小，感受质控前后变化

ls -ahl *.fastq

三、参考基因组的准备

主要用到两个文件，不同的物种有不同的基因组数据，我这是水稻的参考基因组以及注释文件。

安装gffread

conda install gffread

将gff3文件转换为gtf文件

gffread IRGSP-1.0_genome.gff -T -o IRGSP-1.0_genome.gtf

更多关于GFF与GTF格式的内容请参考：http://ccb.jhu.edu/software/stringtie/gff.shtml

四、转录组数据分析

1、为参考基因组构建索引

下载 hisat2

conda install hisat2

构建索引

hisat2-build IRGSP-1.0_genome.fasta IRGSP-1.0_genome > hisat2-build.log 2> hisat2-build.err

2、继续使用hisat2进行比对

新建hisat2文件夹

mkdir hisat2

循环批量操作

for sample in `ls SRR*.clean.fastq | perl -lpe 's/SRR*.clean.fastq//'`; do hisat2 --new-summary -p 2 -x IRGSP-1.0_genome -U $sample -S ./hisat2/${sample}.sam 2> ./hisat2/${sample}.err; done

3、比对结果格式转换及构建索引

将SAM文件转换为BAM文件

samtools view -b SRR6502085.sam > SRR6502085.bam

排序

samtools sort SRR6502085.bam > SRR2176358.sorted.bam

对BAM文件创建索引

samtools index SRR6502085.sorted.bam

bam和sam文件均可使用IGV查看

循环批量操作

for file in *.sam;do

samtools view -b $file > ${file%.sam}.bam

samtools sort ${file%.sam}.bam > ${file%.sam}.sorted.bam

samtools index ${file%.sam}.sorted.bam

done

五、基因表达量估计

1、表达定量和标准化

新建featurecounts文件夹

mkdir featurecounts

cd ~/Rna_seq/SRR650208/featurecounts

编写R脚本

vim featurecounts.R

循环批量操作

for file in ~/Rna_seq/SRR650208/sorted/*.sorted.bam; do

Rscript ./featurecounts/featurecounts.R

~/Rna_seq/SRR650208/sorted/$file

~/Rna_seq/SRR650208/references/IRGSP-1.0_genome.corrected.gtf 10 $file;

done

2、合并所有count文件

perl -lne 'if ($ARGV=~/(.*).count/){print "$1\t$_"}' *.count > merge.count

3、提取counts值及tpm值并生成矩阵

awk -F"\t" '{print $1"\t"$2"\t"$3}' merge.count > gene.count

awk -F"\t" '{print $1"\t"$2"\t"$5}' merge.count > gene.tpm

编写R脚本matrix_count.R

a=read.csv('gene.count',header=F,sep="\t")

colnames(a)=c('sample','gene','counts')

library(reshape2)

counts=dcast(a,formula=gene~sample)

write.table(counts,file="gene_count.csv",sep="\t",quote=FALSE,row.names=FALSE)

运行matrix_count.R脚本

Rscript matrix_count.R

生成gene_count.csv

第1列为基因名；后面是对应的count数

编写R脚本matrix_tpm.R

a=read.csv('tpm.count',header=F,sep="\t")

colnames(a)=c('sample','gene','tpm')

library(reshape2)

counts=dcast(a,formula=gene~sample)

write.table(counts,file="gene_tpm.csv",sep="\t",quote=FALSE,row.names=FALSE

运行matrix_count.R脚本

Rscript matrix_tpm.R

生成gene_tpm.csv

第1列为基因名；后面是对应的tpm数

有参转录组的上游分析到此为止，接下来便是差异表达、后续个性化分析及可视化作图了。

欢迎通过我的公众号（生信小破船）更及时了解更多信息