宏基因组分析:从Raw data到profile的简单流程

简介

该宏基因组流程主要有四步,分别是1.检查raw data;2.获得高质量reads;3.合并PE数据;4. reads map到参考数据库得到profile。

初步想法:

  1. 先分别撰写每一步的基础脚本,如过滤,mapping等过程的脚本,只针对单样本;与此同时,设计好输入文件的格式;
  2. 接着脚本内部每个样本生成每个步骤的脚本,如sample1.trim.sh sample1.map.sh
  3. 然后将每步的脚本放置一起形成该步骤的综合脚本,如 step1.trim.sh
  4. 最后将含有每样本的各步骤的脚本综合在一起,为Run.all.sh

文件结构:脚本和结果文件

../MetaGenomics_pipeline/
├── bin               # 脚本
│   ├── humann.pl
│   ├── kneaddata.pl
│   ├── merge.pl
│   ├── metaphlan.pl
│   └── qc.pl
├── main.pl           # 主程序 
├── result            # 结果 
│   ├── 00.quality
│   │   ├── fastqc
│   │   └── multiqc
│   ├── 01.kneaddata
│   ├── 02.merge
│   ├── 03.humann
│   │   ├── genefamilies
│   │   ├── log
│   │   ├── metaphlan
│   │   ├── pathabundance
│   │   └── pathcoverage
│   ├── 04.metaphlan
│   ├── Run.s1.qc.sh
│   ├── Run.s2.kneaddata.sh
│   ├── Run.s3.merge.sh
│   ├── Run.s4.humann.sh
│   ├── Run.s5.metaphlan.sh
│   └── script      # 每个样本的每一步脚本
│       ├── 00.quality
│       ├── 01.kneaddata
│       ├── 02.merge
│       ├── 03.humann
│       └── 04.metaphlan
├── Run.all.sh
├── test.tsv
├── TruSeq2-PE.fa -> /data/share/anaconda3/share/trimmomatic/adapters/TruSeq2-PE.fa
└── work.sh        # 启动脚本

步骤

先准备输入数据

find /RawData/ -name "*fq.gz" |  sort | perl -e 'print "SampleID\tLaneID\tPath\n"; while(<>){chomp; $fq=(split("\/", $_))[-1]; $sampleid=$fq; $laneid=$fq; $sampleid=~s/\_R[1|2]\.fq.gz//g; $laneid=~s/\.fq.gz//g;print "$sampleid\t$laneid\t$_\n";}' > samples.fqpath.tsv
SampleID LaneID Path
ND2 ND2_R1 RawData/ND2_R1.fq.gz
ND2 ND2_R2 RawData/ND2_R2.fq.gz
XL10 XL10_R1 RawData/XL10_R1.fq.gz
XL10 XL10_R2 RawData/XL10_R2.fq.gz
XL11 XL11_R1 RawData/XL11_R1.fq.gz
XL11 XL11_R2 RawData/XL11_R2.fq.gz
XL1 XL1_R1 RawData/XL1_R1.fq.gz
XL1 XL1_R2 RawData/XL1_R2.fq.gz
XL2 XL2_R1 RawData/XL2_R1.fq.gz

Scan raw data

qc.pl: 使用fastqc和multiqc软件对raw data进行扫描,输入数据是 samples.fqpath.tsv,使用perl编程。

#!/usr/bin/perl 

use warnings;
use strict;
use Getopt::Long;

my ($file, $real_dir, $out, $help, $version);
GetOptions(
    "f|file:s"  =>  \$file,
    "d|real_dir:s"  => \$real_dir,
    "o|out:s"   =>  \$out,
    "h|help:s"  =>  \$help,
    "v|version" =>  \$version
);
&usage if(!defined $out);

# output
my $dir_qc = "$real_dir/result/00.quality/fastqc"; 
system "mkdir -p $dir_qc" unless(-d $dir_qc);

# script
my $dir_script = "$real_dir/result/script/00.quality/"; 
system "mkdir -p $dir_script" unless(-d $dir_script);

my @array_name;
open(IN, $file) or die "can't open $file\n";
open(OT, "> $out") or die "can't open $out\n";
<IN>;
while(<IN>){
    chomp;
    my @tmp = split("\t", $_);
    if(-e $tmp[2]){
        my $bash = join("", $dir_script, $tmp[1], ".fastqc.sh");
        open(OT2, "> $bash") or die "can't open $bash\n";
        print OT2 "fastqc -o $dir_qc --noextract $tmp[2]\n";
        close(OT2);

        print OT "sh $bash\n";
    }
}
close(IN);

my $dir_mc = "$real_dir/result/00.quality/multiqc"; 
system "mkdir -p $dir_mc" unless(-d $dir_mc);
print OT "multiqc $dir_qc --outdir $dir_mc\n";
close(OT);


sub usage{
    print <<USAGE;
usage:
    perl $0 -f <file> -d <real_dir> -o <out> 
options:
    -f|file     :[essential].
    -d|real_dir :[essential].    
    -o|out      :[essential].
USAGE
    exit;
};

sub version {
    print <<VERSION;
    version:    v1.1
    update:     20201223 - 20201224
    author:     zouhua1\@outlook.com
VERSION
};

trim low quality reads and remove host DNA

kneaddata.pl:kneaddata内部自带过滤和比对软件

#!/usr/bin/perl 

use warnings;
use strict;
use Getopt::Long;

my ($file, $real_dir, $out, $adapter, $help, $version);
GetOptions(
    "f|file:s"   =>  \$file,
    "d|real_dir:s"  => \$real_dir,
    "a|adapt:s"  =>  \$adapter,  
    "o|out:s"    =>  \$out,
    "h|help:s"   =>  \$help,
    "v|version"  =>  \$version
);
&usage if(!defined $out);

my $dir = "$real_dir/result/01.kneaddata"; 
system "mkdir -p $dir" unless(-d $dir);

# script
my $dir_script = "$real_dir/result/script/01.kneaddata/"; 
system "mkdir -p $dir_script" unless(-d $dir_script);

open(IN, $file) or die "can't open $file";
my %file_name;
<IN>;
while(<IN>){
    chomp;
    my @tmp = split("\t", $_);
    push (@{$file_name{$tmp[0]}}, $tmp[2]);
}
close(IN);

my ($fq1, $fq2);
open(OT, "> $out") or die "can't open $out\n";
foreach my $key (keys %file_name){
    if (${$file_name{$key}}[0] =~ /R1/){
        $fq1 = ${$file_name{$key}}[0];
    }else{
        $fq2 = ${$file_name{$key}}[0];
    }

    if (${$file_name{$key}}[1] =~ /R2/){
        $fq2 = ${$file_name{$key}}[1];
    }else{
        $fq1 = ${$file_name{$key}}[1];
    }
    my $trim_opt = "ILLUMINACLIP:$adapter:2:40:15 SLIDINGWINDOW:4:20 MINLEN:50";
    #if(-e $fq1 && -e $fq2){      
        my $bash = join("", $dir_script, $key, ".kneaddata.sh");
        open(OT2, "> $bash") or die "can't open $bash\n";
        print OT2 "kneaddata -i $fq1 -i $fq2 --output-prefix $key -o $dir -v -t 5 --remove-intermediate-output  --trimmomatic /data/share/anaconda3/share/trimmomatic/ --trimmomatic-options \'$trim_opt\'  --bowtie2-options \'--very-sensitive --dovetail\' -db /data/share/database/kneaddata_database/Homo_sapiens_Bowtie2_v0.1/Homo_sapiens\n";       
        close(OT2);

        print OT "sh $bash\n";        
    #}
}

print OT "kneaddata_read_count_table --input $dir --output $dir/01kneaddata_sum.tsv\n";
close(OT);

sub usage{
    print <<USAGE;
usage:
    perl $0 -f <file> -d <real_dir> -o <out> -a <adapter>
options:
    -f|file     :[essential].
    -d|real_dir :[essential].  
    -o|out      :[essential].
    -a|adapt    :[essential].
USAGE
    exit;
};

sub version {
    print <<VERSION;
    version:    v1.1
    update:     20201223 - 20201224
    author:     zouhua1\@outlook.com
VERSION
};

merge PE data

merge.pl:合并PE数据

#!/usr/bin/perl 

use warnings;
use strict;
use Getopt::Long;

my ($file, $real_dir,  $out, $help, $version);
GetOptions(
    "f|file:s"  =>  \$file, 
    "d|real_dir:s"  => \$real_dir,  
    "o|out:s"   =>  \$out,
    "h|help:s"  =>  \$help,
    "v|version" =>  \$version
);
&usage if(!defined $out);

my $dir = "$real_dir/result/02.merge"; 
system "mkdir -p $dir" unless(-d $dir);

# script
my $dir_script = "$real_dir/result/script/02.merge/"; 
system "mkdir -p $dir_script" unless(-d $dir_script);

open(IN, $file) or die "can't open $file";
my %file_name;
<IN>;
while(<IN>){
    chomp;
    my @tmp = split("\t", $_);
    push (@{$file_name{$tmp[0]}}, $tmp[2]);
}
close(IN);

my ($fq1, $fq2);
open(OT, "> $out") or die "can't open $out\n";
foreach my $key (keys %file_name){
    $fq1 = join("", "./result/01.kneaddata/", $key, "_paired_1.fastq");
    $fq2 = join("", "./result/01.kneaddata/", $key, "_paired_2.fastq");
    #if(-e $fq1 && -e $fq2){
        my $bash = join("", $dir_script, $key, ".merge.sh");
        open(OT2, "> $bash") or die "can't open $bash\n";        
        print OT2 "fastp -i $fq1 -I $fq2 -h $dir/$key\_merge.html -j $dir/$key\_merge.json -m --merged_out $dir/$key\_merge.fastq.gz --failed_out  $dir/$key\_failed.fastq.gz --include_unmerged --overlap_len_require 6 --overlap_diff_percent_limit 20 --detect_adapter_for_pe -5 -r -l 20 -y --thread 5\n";
        close(OT2);

        print OT "sh $bash\n";         
    #}
}
close(OT);

sub usage{
    print <<USAGE;
usage:
    perl $0 -f <file> -d <real_dir> -o <out> 
options:
    -f|file     :[essential].
    -d|real_dir :[essential].    
    -o|out      :[essential].
USAGE
    exit;
};

sub version {
    print <<VERSION;
    version:    v1.1
    update:     20201223 - 20201224
    author:     zouhua1\@outlook.com
VERSION
};

get profile matrix

humann.pl:获取功能等profile数据

#!/usr/bin/perl 

use warnings;
use strict;
use Getopt::Long;

my ($file, $real_dir, $out, $help, $version);
GetOptions(
    "f|file:s"  =>  \$file,
    "d|real_dir:s"  => \$real_dir,  
    "o|out:s"   =>  \$out,
    "h|help:s"  =>  \$help,
    "v|version" =>  \$version
);
&usage if(!defined $out);

my $dir = "$real_dir/result/03.humann"; 
system "mkdir -p $dir" unless(-d $dir);

my $dir_log = "$real_dir/result/03.humann/log"; 
system "mkdir -p $dir_log" unless(-d $dir_log);

my $genefamilies = "$real_dir/result/03.humann/genefamilies"; 
system "mkdir -p $genefamilies" unless(-d $genefamilies);
my $pathabundance = "$real_dir/result/03.humann/pathabundance"; 
system "mkdir -p $pathabundance" unless(-d $pathabundance);
my $pathcoverage = "$real_dir/result/03.humann/pathcoverage"; 
system "mkdir -p $pathcoverage" unless(-d $pathcoverage);

my $dir_metaphlan = "$real_dir/result/03.humann/metaphlan"; 
system "mkdir -p $dir_metaphlan" unless(-d $dir_metaphlan);

# script
my $dir_script = "$real_dir/result/script/03.humann/"; 
system "mkdir -p $dir_script" unless(-d $dir_script);

open(IN, $file) or die "can't open $file";
my %file_name;
<IN>;
while(<IN>){
    chomp;
    my @tmp = split("\t", $_);
    push (@{$file_name{$tmp[0]}}, $tmp[2]);
}
close(IN);

my ($fq);
open(OT, "> $out") or die "can't open $out\n";
foreach my $key (keys %file_name){
    $fq = join("", "./result/02.merge/", $key, "_merge.fastq.gz");
    #if($fq){

        my $bash = join("", $dir_script, $key, ".humann.sh");
        open(OT2, "> $bash") or die "can't open $bash\n";         
        print OT2 "humann --input $fq --output $dir --threads 10\n";
        print OT2 "mv $dir/$key\_merge_humann_temp/$key\_merge_metaphlan_bugs_list.tsv $dir_metaphlan/$key\_metaphlan.tsv\n";
        print OT2 "mv $dir/$key\_merge_humann_temp/$key\_merge.log $dir_log\n";
        print OT2 "mv $dir/$key\_merge_genefamilies.tsv $genefamilies/$key\_genefamilies.tsv\n";
        print OT2 "mv $dir/$key\_merge_pathabundance.tsv $pathabundance/$key\_pathabundance.tsv\n";
        print OT2 "mv $dir/$key\_merge_pathcoverage.tsv $pathcoverage/$key\_pathcoverage.tsv\n";
        print OT2 "rm -r $dir/$key\_merge_humann_temp/\n";
        close(OT2);

        print OT "sh $bash\n";         
    #}
}
close(OT);

sub usage{
    print <<USAGE;
usage:
    perl $0 -f <file> -d <real_dir> -o <out> 
options:
    -f|file     :[essential].
    -d|real_dir :[essential].    
    -o|out      :[essential].
USAGE
    exit;
};

sub version {
    print <<VERSION;
    version:    v1.1
    update:     20201223 - 20201225
    author:     zouhua1\@outlook.com
VERSION
};

metaphlan.pl:获取物种组成谱

#!/usr/bin/perl 

use warnings;
use strict;
use Getopt::Long;

my ($file, $real_dir, $out, $help, $version);
GetOptions(
    "f|file:s"  =>  \$file,
    "d|real_dir:s"  => \$real_dir,  
    "o|out:s"   =>  \$out,
    "h|help:s"  =>  \$help,
    "v|version" =>  \$version
);
&usage if(!defined $out);

my $dir = "$real_dir/result/04.metaphlan"; 
system "mkdir -p $dir" unless(-d $dir);

# script
my $dir_script = "$real_dir/result/script/04.metaphlan/"; 
system "mkdir -p $dir_script" unless(-d $dir_script);

open(IN, $file) or die "can't open $file";
my %file_name;
<IN>;
while(<IN>){
    chomp;
    my @tmp = split("\t", $_);
    push (@{$file_name{$tmp[0]}}, $tmp[2]);
}
close(IN);

my ($fq);
open(OT, "> $out") or die "can't open $out\n";
foreach my $key (keys %file_name){
    $fq = join("", "./result/02.merge/", $key, "_merge.fastq.gz");
    #if($fq){

        my $bash = join("", $dir_script, $key, ".metaphlan.sh");
        open(OT2, "> $bash") or die "can't open $bash\n";         
        print OT2 "metaphlan $fq --bowtie2out $dir/$key\_metagenome.bowtie2.bz2 --nproc 10 --input_type fastq -o $dir/$key\_metagenome.tsv --unknown_estimation -t rel_ab_w_read_stats\n";
        close(OT2);

        print OT "sh $bash\n";         
    #}
}

print OT "merge_metaphlan_tables.py $dir/*metagenome.tsv > $dir/merge_metaphlan.tsv\n";
print OT "Rscript /data/share/database/metaphlan_databases/calculate_unifrac.R $dir/merge_metaphlan.tsv /data/share/database/metaphlan_databases/mpa_v30_CHOCOPhlAn_201901_species_tree.nwk $dir/unifrac_merged_mpa3_profiles.tsv";
close(OT);

sub usage{
    print <<USAGE;
usage:
    perl $0 -f <file> -d <real_dir> -o <out> 
options:
    -f|file     :[essential].
    -d|real_dir :[essential].    
    -o|out      :[essential].
USAGE
    exit;
};

sub version {
    print <<VERSION;
    version:    v1.0
    update:     20201225 - 20201225
    author:     zouhua1\@outlook.com
VERSION
};

主程序

main.pl:生成所有的准备文件

#!/usr/bin/perl 

use warnings;
use strict;
use Getopt::Long;
use FindBin qw($RealBin);
use Cwd 'abs_path';

my ($file, $adapter, $out, $help, $version);
GetOptions(
    "f|file:s"  =>  \$file,
    "o|out:s"   =>  \$out,
    "a|adapter:s"   =>  \$adapter,
    "h|help:s"  =>  \$help,
    "v|version" =>  \$version
);
&usage if(!defined $out);

my $Bin = $RealBin;
my $cwd = abs_path;

# output
my $dir = "$cwd/result/"; 
system "mkdir -p $dir" unless(-d $dir);

########## output #########################################
# bash script per step
# combine all steps in one script
my $qc        = join("", $dir, "Run.s1.qc.sh");
my $kneaddata = join("", $dir, "Run.s2.kneaddata.sh");
my $merge     = join("", $dir, "Run.s3.merge.sh");
my $humann    = join("", $dir, "Run.s4.humann.sh");
my $metaphlan = join("", $dir, "Run.s5.metaphlan.sh");


# scripts in bin
my $bin         = "$Bin/bin";
my $s_qc        = "$bin/qc.pl";
my $s_kneaddata = "$bin/kneaddata.pl";
my $s_merge     = "$bin/merge.pl";
my $s_humann    = "$bin/humann.pl";
my $s_metaphlan = "$bin/metaphlan.pl";

########## Steps in metagenomics pipeline #################
##################################################
#  step1 reads quality scan
`perl $s_qc -f $file -d $cwd -o $qc`;

##################################################
#  step2 filter and trim low quality reads;
#        remove host sequence
`perl $s_kneaddata -f $file -d $cwd -a $adapter -o $kneaddata`;

##################################################
#  step3 merge PE reads
`perl $s_merge -f $file -d $cwd -o $merge`;

##################################################
#  step4 get function profile
`perl $s_humann -f $file -d $cwd -o $humann`;

##################################################
#  step5 get taxonomy profile
`perl $s_metaphlan -f $file -d $cwd -o $metaphlan`;


open(OT, "> $out") or die "can't open $out\n";
print OT "sh $qc\nsh $kneaddata\nsh $merge\nsh $humann\nsh $metaphlan\n";
close(OT);


sub usage{
    print <<USAGE;
usage:
    perl $0 -f <file> -o <out> -a <adapter>
options:
    -f|file     :[essential].
    -o|out      :[essential].
    -a|adapter  :[essential].
USAGE
    exit;
};

sub version {
    print <<VERSION;
    version:    v1.1
    update:     20201224 - 20201225
    author:     zouhua1\@outlook.com
VERSION
};

运行主程序:准备samples.fqpath.tsv和adapter.fa文件即可生成所有文件

perl main.pl -f samples.fqpath.tsv -a TruSeq2-PE.fa -o Run.all.sh

Run.all.sh 文件包含如下命令

sh /data/user/zouhua/pipeline/MetaGenomics_v2/result/Run.s1.qc.sh
sh /data/user/zouhua/pipeline/MetaGenomics_v2/result/Run.s2.kneaddata.sh
sh /data/user/zouhua/pipeline/MetaGenomics_v2/result/Run.s3.merge.sh
sh /data/user/zouhua/pipeline/MetaGenomics_v2/result/Run.s4.humann.sh
sh /data/user/zouhua/pipeline/MetaGenomics_v2/result/Run.s5.metaphlan.sh

后续

这只是一个简单的perl流程,我更想将其做成nextflow语言的流程或者其他标准生信语言的流程,后面有时间再搞,先学好数学和python。

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
禁止转载,如需转载请通过简信或评论联系作者。
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 204,732评论 6 478
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 87,496评论 2 381
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 151,264评论 0 338
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 54,807评论 1 277
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 63,806评论 5 368
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 48,675评论 1 281
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 38,029评论 3 399
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 36,683评论 0 258
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 41,704评论 1 299
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 35,666评论 2 321
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 37,773评论 1 332
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 33,413评论 4 321
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 39,016评论 3 307
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 29,978评论 0 19
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 31,204评论 1 260
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 45,083评论 2 350
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 42,503评论 2 343

推荐阅读更多精彩内容