在工作中,遇见了很多需要使用的hmmer的内容,但是发现中文内容中没有太多良好完整的教程,这里考虑按照官方的Userguide的顺序简要的写一下hmmer的主要功能:
hmmer是什么
HMMER is used for searching sequence databases for sequence homologs, and for making sequence alignments. It implements methods using probabilistic models called profile hidden Markov models (profile HMMs).
HMMER is often used together with a profile database, such as Pfam or many of the databases that participate in Interpro. But HMMER can also work with query sequences, not just profiles, just like BLAST. For example, you can search a protein query sequence against a database with phmmer, or do an iterative search with jackhmmer.
HMMER is designed to detect remote homologs as sensitively as possible, relying on the strength of its underlying probability models. In the past, this strength came at significant computational expense, but as of the new HMMER3 project, HMMER is now essentially as fast as BLAST.
HMMER can be downloaded and installed as a command line tool on your own hardware, and now it is also more widely accessible to the scientific community via new search servers at the European Bioinformatics Institute.
以上是官网的介绍,简单来说,hmmer是使用隐马尔可夫模型,在序列数据库中搜索同源序列,并进行序列比对的工具箱,拥有比blast更高的准确性和相同的速度。常用的Pfam数据库和Resfam数据库都提供hmmer的操作格式。
hmmer工具箱的内容
以下包含了hmmer最新版本的全部工具
hmmbuild: build profile from input multiple alignment
hmmalign: make multiple sequence alignment using a profile
hmmsearch: search profile against sequence database
hmmscan: search sequence against profile database
hmmpress: prepare profile database for hmmscan
phmmer: search single sequence against sequence database
jackhmmer: iteratively search single sequence against database
nhmmer: search DNA query against DNA sequence database
nhmmscan: search DNA sequence against a DNA profile database
hmmfetch: retrieve profile(s) from a profile file
hmmstat: show summary statistics for a profile file
hmmemit: generate (sample) sequences from a profile
hmmlogo: produce a conservation logo graphic from a profile
hmmconvert: convert between different profile file formats
hmmpgmd: search daemon for the hmmer.org website
hmmpgmd_shard: sharded search daemon for the hmmer.org website
makehmmerdb: prepare an nhmmer binary database
hmmsim: collect score distributions on random sequences
alimask: add column mask to a multiple sequence alignment
其中我们常用的也只有hmmbuild、hmmsearch、hmmscan、hmmalign
安装 hmmer
推荐使用conda:
conda install -c bioconda hmmer
#or
conda install -c bioconda/label/cf201901 hmmer
Searching a sequence database with a profile
步骤1:用hmmbuild建立一个profile
使用文件为http://eddylab.org/software/hmmer/hmmer.tar.gz解压缩后hmmer-3.3.2/tutorial/文件夹
文件globins4.sto是一种对齐序列的格式,文件内容包括:
使用hmmbuild:
hmmbuild globins4.hmm globins4.sto
输出:
# hmmbuild :: profile HMM construction from multiple sequence alignments
# HMMER 3.3.2 (Nov 2020); http://hmmer.org/
# Copyright (C) 2020 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# input alignment file: globins4.sto
# output HMM file: globins4.hmm
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# idx name nseq alen mlen eff_nseq re/pos description
#---- -------------------- ----- ----- ----- -------- ------ -----------
1 globins4 4 171 149 0.96 0.589
# CPU time: 0.05u 0.00s 00:00:00.05 Elapsed: 00:00:00.05
globins4.hmm内容为:步骤2:使用hmmsearch搜索序列数据库
hmmsearch globins4.hmm uniprot_sprot.fasta > globins4.out
hmmsearch接受FASTA格式,同时也接受EMBL/UniProtKB文本格式和Genbank格式,能够自动检索文件的格式
globins4.out文件解释:
-第一部分是头文件,它告诉您运行什么程序,在什么上运行,使用什么选项
-第二部分是hit的列表,按照e值排序类似blast的m8格式
最后两列每个目标序列的名称和可选描述。描述行通常会被截短,以保持行长度在合理范围内。如果您想要完整的描述行,并且不介意输出行太长,可以使用
--notextw
选项。
最重要的指标是e-value,表明hit的假阳性预期,e值是衡量统计显著性的指标e值越低,说明命中越明显。通常考虑具有e值的序列< 10−3左右为显著命中。
其次score是 bit score,有些人喜欢看到bit score而不是e值,因为bit score不取决于序列数据库的大小,只取决于.hmm和目标序列。e值则取决于您搜索的数据库的大小:如果您搜索的数据库大小是原来的10倍,那么您得到的e-value是原来的10倍。
最后一个数字是bias,也就是score的残差,是用于bit score的偏置序列组合的修正项。只有在残差特别大的时候才需要注意。
接下来三个数字也是E-value, score, 和 bias,但只针对序列中得分最高的一个结构域,而不是其所标识的所有结构域的和。这在本例中并不明显,因为这个例子中的所有珠蛋白都只包含一个单一的珠蛋白结构域。
由于在使用hmm搜索的过程中,可能出现有多个得分低的结构域凑成高得分总比对的结果,因此推荐的筛选方法是:
- 如果两个e值都是显著的(<< 1),既存在高同源的可能性
- 如果全序列e值显著,而单个最佳域e值不显著,则目标序列可能是一个多multidomain remote homolog,注意这只是重复序列的情况