Protocal for quality control and cut adaptor of RNA sequence
1. 数据预处理
1.1 FastQC质控
FastQC - A high throughput sequence QC analysis tool
FastQC reads a set of sequence files and produces from each one a quality control report consisting of a number of different modules, each one of which will help to identify a different potential type of problem in your data.
usage: fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] [-c contaminant file] seqfile1 .. seqfileN
-o --outdir
Create all output files in the specified output directory. Please note that this directory must exist as the program will not create it. If this option is not set then the output file for each sequence file is created in the same directory as the sequence file which was processed.
mkdir fastqc_result.raw #(建立输出文件夹)
fastqc -q -t 3 -o ../fastqc_result.raw/ *.fq.gz &
fastqc -t 3 -o ./fastqc_result.raw/ *.fq.gz
#(使用fastqc质控软件,对所有raw_data进行质控检测)
1.2 MultiQC整合质控文件
MultiQC aggregates results from bioinformatics analyses across many samples into a single report.
It searches a given directory for analysis logs and compiles a HTML report. It's a general use tool, perfect for summarising the output from numerous bioinformatics tools.
usage: multiqc [OPTIONS] <analysis directory>
-o, --outdir TEXT
Create report in the specified output directory.
For tips for MultiQC, please visit the following website: https://www.youtube.com/watch?v=qPbIlO_KWN0
multiqc -o ./MultiQC ./
#(这一步就是对 fastqc_reault 文件夹下所有文件进行整合)
1.3 FASTX-Toolkit
根据质控结果,去除不良序列
The FASTX-Toolkit is a collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing. Detailed information was provided in http://hannonlab.cshl.edu/fastx_toolkit/.
This section used the 'FASTQ/A Trimmer' function of FASTX-Toolkit/FASTX-Toolkit.
usage: fastx_trimmer [-h] [-f N] [-l N] [-z] [-v] [-i INFILE] [-o OUTFILE]
[-f N] = First base to keep. Default is 1 (=first base).
[-l N] = Last base to keep. Default is entire read.
[-z] = Compress output with GZIP.
[-i INFILE] = FASTA/Q input file. default is STDIN.
[-o OUTFILE] = FASTA/Q output file. default is STDOUT.
fastx_trimmer命令不支持压缩格式的输入文件,因此需要先zcat处理
Tips: | 为管道函数
zcat hela_ctrl_rep1_R1.fq.gz | fastx_trimmer -f 11 -l 140 -z -o out_ctrl_rep1_R1.fq.gz &
zcat hela_ctrl_rep2_R1.fq.gz | fastx_trimmer -f 11 -l 140 -z -o out_ctrl_rep2_R1.fq.gz &
zcat hela_ctrl_rep2_R2.fq.gz | fastx_trimmer -f 11 -l 140 -z -o out_ctrl_rep2_R2.fq.gz &
zcat hela_ctrl_rep1_R2.fq.gz | fastx_trimmer -f 11 -l 140 -z -o out_ctrl_rep1_R2.fq.gz &
zcat hela_treat_rep1_R1.fq.gz | fastx_trimmer -f 11 -l 140 -z -o out_treat_rep1_R1.fq.gz &
zcat hela_treat_rep1_R2.fq.gz | fastx_trimmer -f 11 -l 140 -z -o out_treat_rep1_R2.fq.gz &
zcat hela_treat_rep2_R1.fq.gz | fastx_trimmer -f 11 -l 140 -z -o out_treat_rep2_R1.fq.gz &
zcat hela_treat_rep2_R2.fq.gz | fastx_trimmer -f 11 -l 140 -z -o out_treat_rep2_R2.fq.gz &
1.4 Cutadapt
去掉read两端的adaptor
Install: python3 -m pip install --user --upgrade cutadapt
Detailed information was provided in: https://cutadapt.readthedocs.io/en/stable/guide.html#basic-usage
Usage: cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq
Tips: nohup:不挂断地运行命令
--times 1 一条序列只去一次Adaptor
-e 0.1 在匹配时可以有10%的错误率
-o 3 adaptor序列必须和测序序列有3个碱基以上的overlap才可以
-quality-cutoff 常用6
-m 50如果处理之后低于50的话就扔掉序列,短序列测序质量可能不是很好;
-a和-A是 Illumina 常用的通用引物,之所以输入两个,是因为是一个双端测序的结果,需要对两个文件内容进行分别去除,
-a对应 reads1
-A对应 reads2
-o 上一步输出 fastq1
-p 上一步输出 fastq2
nohup cutadapt --times 1 -e 0.1 -o 3 --quality-cutoff 6 -m 50 -a AGATCGGAAGAGC -A AGATCGGAAGAGC -o cut_out_ctrl_rep1_R1.fq.gz -p cut_out_ctrl_rep1_R2.fq.gz out_ctrl_rep1_R1.fq.gz out_ctrl_rep1_R2.fq.gz &
nohup cutadapt --times 1 -e 0.1 -o 3 --quality-cutoff 6 -m 50 -a AGATCGGAAGAGC -A AGATCGGAAGAGC -o cut_out_ctrl_rep2_R1.fq.gz -p cut_out_ctrl_rep2_R2.fq.gz out_ctrl_rep2_R1.fq.gz out_ctrl_rep2_R2.fq.gz &
nohup cutadapt --times 1 -e 0.1 -o 3 --quality-cutoff 6 -m 50 -a AGATCGGAAGAGC -A AGATCGGAAGAGC -o cut_out_treat_rep1_R1.fq.gz -p cut_out_treat_rep1_R2.fq.gz out_treat_rep1_R1.fq.gz out_treat_rep1_R2.fq.gz &
nohup cutadapt --times 1 -e 0.1 -o 3 --quality-cutoff 6 -m 50 -a AGATCGGAAGAGC -A AGATCGGAAGAGC -o cut_out_treat_rep2_R1.fq.gz -p cut_out_treat_rep2_R2.fq.gz out_treat_rep2_R1.fq.gz out_treat_rep2_R2.fq.gz &
2. STAR 建立 index
install: conda install -c bioconda star
Usage:
STAR [options]... --genomeDir /path/to/genome/index/ --readFilesIn R1.fq R2.fq
2.1 STAR align标准用法
STAR --runThreadN 6 --runMode genomeGenerate --genomeDir ./STARindex --genomeFastaFiles ./Homo_sapiens.GRCh38.dna.primary_assembly.fa --sjdbGTFfile ./Homo_sapiens.GRCh38.104.gtf --sjdbOverhang 100
详解:
STAR
--runThreadN 8 #线程数
--runMode genomeGenerate
--genomeDir ./STARindex #index输出的路径
--genomeFastaFiles ./Homo_sapiens.GRCh38.dna.primary_assembly.fa #参考基因组
--sjdbGTFfile ./Homo_sapiens.GRCh38.104.gtf #参考基因组注释文件
--sjdbOverhang 100 #reads长度的最大值减1,默认是100,一般为reads长度-1(150-1)
2.2 STAR align 简单版本
STAR --runThreadN 8 \
--genomeDir ./STARindex \
--readFilesCommand zcat \
--readFilesIn hela_treat_rep2_R1.fq.gz hela_treat_rep2_R2.fq.gz \
--outFileNamePrefix ./Alignment/treat_rep2_ \
--outSAMtype BAM SortedByCoordinate \
--outBAMsortingThreadN 8 \
--quantMode TranscriptomeSAM GeneCounts
详解
STAR
--runThreadN 10
--genomeDir ./STARindex #索引目录
--readFilesCommand zcat #执行解压缩
--readFilesIn cut_out_ctrl_rep1_R1.fq.gz cut_out_ctrl_rep1_R2.fq.gz #输入paired reads文件
--outFileNamePrefix ./Alignment/sample1_ #输出文件加前缀
--outSAMtype BAM SortedByCoordinate #表示输出默认排序的bam文件
--outBAMsortingThreadN 5 #设置排序中的线程数
--quantMode TranscriptomeSAM GeneCounts #将上面基因组比对的信息转化为转录本比对的信息,count
2.3 samtools
samtools (Tools for alignments in the SAM format)
查看bam 文件,每个基因上有多少个reads 的统计信息的bam文件,从原始的read信息到比对文件的过程
samtools view sample2_Aligned.sortedByCoord.out.bam |head
3. 表达定量
3.1 RSEM + featurecount
3.1.1 RSEM
转录组数据计算表达定量
RSEM is a software package for estimating gene and isoform expression levels from RNA-Seq data. The RSEM package provides an user-friendly interface, supports threads for parallel computation of the EM algorithm, single-end and paired-end read data, quality scores, variable-length reads and RSPD estimation. In addition, it provides posterior mean and 95% credibility interval estimates for expression levels. For visualization, It can generate BAM and Wiggle files in both transcript-coordinate and genomic-coordinate. Genomic-coordinate files can be visualized by both UCSC Genome browser and Broad Institute's Integrative Genomics Viewer (IGV). Transcript-coordinate files can be visualized by IGV. RSEM also has its own scripts to generate transcript read depth plots in pdf format. The unique feature of RSEM is, the read depth plots can be stacked, with read depth contributed to unique reads shown in black and contributed to multi-reads shown in red. In addition, models learned from data can also be visualized. Last but not least, RSEM contains a simulator.
Install: conda install -c bioconda rsem
Rsem prepare reference
从参考基因组中提取原始准备文件
rsem-prepare-reference --gtf \
./Homo_sapiens.GRCh38.104.gtf \ ./Homo_sapiens.GRCh38.dna.primary_assembly.fa \
./arab_RSEM/arab_rsem #保存位置
rsem-calculate-expression --paired-end --no-bam-output --alignments -p 8 -q ./Alignment/treat_rep2_Aligned.toTranscriptome.out.bam ./arab_RSEM/arab_rsem rsem_out/treat_rep2_1rsem
3.1.2 featurecount
install conda install -c bioconda subread
featureCounts
-p #双端
-a ../00ref/Araport11_GFF3_genes_transposons.201606.gtf #基因组注释文件
-o out_counts.txt #输出文件
-T 8 #使用的线程数
-t exon #通过什么来进行定量,输出什么结果 所有要定量的文件
-g gene_id sample*Aligned.sortedByCoord.out.bam
featureCounts -p -a ././Homo_sapiens.GRCh38.104.gtf -o out_counts.txt -T 8 -t exon -g gene_id ./Alignment/*Aligned.sortedByCoord.out.bam
featureCounts -p -a ../00ref/genes.gtf -o Z-13-1_out_counts.txt -T 10 -t exon -g *L1*dm6_Aligned.sortedByCoord.out.bam
featureCounts -p -a ../00ref/genes.gtf -o out_counts.txt -T 20 -t exon -g gene_id
featureCounts -p -a ../00ref/genes.gtf -o out_counts.txt -T 6 -t exon -g gene_id zgq-*_accepted_hits.bam
构建表达矩阵
mkdir 06deseq_out
rsem-generate-data-matrix *rsem.genes.results > output.matrix
#删除未检测到表达的基因
awk 'BEGIN{printf "geneid\ta1\ta2\tb1\tb2\n"}{if($2+$3+$4+$5>0)print $0}' output.matrix > deseq2_input.txt
#查看文件的行数
wc -l output.matrix
wc -l deseq2_input.txt
mv deseq2_input.txt ../06deseq_out/
mv deseq2.r ../06deseq_out/
abundance_estimates_to_matrix.pl
run_DE_analysis.pl
3.2 kallisto
install conda install -c bioconda kallisto
不比对,直接定量
先构建索引 index,/arab_kallisto目录下
kallisto index -i arab_kallisto ./arab_RSEM/arab_rsem.transcripts.fa
kallisto进行定量
kallisto quant -i Dro_kallisto/Dro_kallisto -o 05kallisto_out/sample2 02clean_data/R1.gz 02clean_data/R2.gz
kallisto quant -i Dro_kallisto/Dro_kallisto -o 05kallisto_out/Z-1 ../../Data_cy/20181018-embryo-RIP-seq\ data/Z-13-1_L1_310310.R1.clean.fastq.gz ../../Data_cy/20181018-embryo-RIP-seq\ data/Z-13-1_L1_310310.R2.clean.fastq.gz
#abundance.tsv 列:gene counts TPM