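Background: Because research on human genes has advanced further than research on mouse genes, human gene function and annotation are studied in more depth, and some databases only provide human annotations. Conversely, studies often use mouse models for validation. Both situations require converting gene names or IDs between the two species.
This post introduces the common ways to do that conversion, summarized as follows:
Converting a specific set of gene IDs/names: the R package biomaRt
Genome-wide ID/name conversion: download the correspondence file directly from Ensembl and convert with it
Another interesting R package (for querying annotations of model-organism genes across the major databases): AnnotationDbi
List of orthologous gene databases: List of orthology databases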
1. Using the R package biomaRt
Install the biomaRt package:
library("BiocManager")
BiocManager::install("biomaRt")
library("biomaRt")
listMarts()
## biomart version
##1 ENSEMBL_MART_ENSEMBL Ensembl Genes 106
##2 ENSEMBL_MART_MOUSE Mouse strains 106
##3 ENSEMBL_MART_SNP Ensembl Variation 106
##4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 106
Converting mouse genes to human genes:
library("biomaRt")
human = useEnsembl(biomart="ensembl", dataset = "hsapiens_gene_ensembl")
mouse = useEnsembl(biomart="ensembl", dataset = "mmusculus_gene_ensembl")
# Basic function to convert mouse to human gene names
convertMouseGeneList <- function(x){
  # Link the mouse dataset to the human dataset and fetch the orthologous symbols
  genesV2 = getLDS(attributes = c("mgi_symbol"), filters = "mgi_symbol", values = x , mart = mouse, attributesL = c("hgnc_symbol"), martL = human, uniqueRows=TRUE)
  # Column 2 holds the human (HGNC) symbols; return them without duplicates
  humanx <- unique(genesV2[, 2])
  return(humanx)
}
## Test
musGenes <- c("Hmmr", "Tlx3", "Cpeb4")
convertMouseGeneList(musGenes)
## [1] "HMMR" "CPEB4" "TLX3"
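Note that the returned symbols do not come back in input order, and genes without a hit are silently dropped. A minimal sketch of keeping the result aligned to the input is shown below. It runs on a mocked data frame that stands in for the getLDS() result; the column names `MGI.symbol` and `HGNC.symbol`, the helper `align_to_input`, and the unmatched gene in the example are assumptions for illustration only.

```r
# Mocked stand-in for a getLDS() result; column names are an assumption.
genesV2 <- data.frame(
  MGI.symbol  = c("Hmmr", "Cpeb4", "Tlx3"),
  HGNC.symbol = c("HMMR", "CPEB4", "TLX3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return one human symbol per input gene, in input order.
# Genes without a hit become NA; for one-to-many hits the first match is kept.
align_to_input <- function(input, mapping) {
  idx <- match(input, mapping[[1]])
  setNames(mapping[[2]][idx], input)
}

# "Gm1992" is absent from the mocked table, so it maps to NA here.
musGenes <- c("Hmmr", "Tlx3", "Cpeb4", "Gm1992")
aligned <- align_to_input(musGenes, genesV2)
print(aligned)
```

Returning a named vector makes it easy to see which input genes had no match, which the `unique()` based function above hides.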
# Put the genes to be converted in a file and read it in
mmu_genes = read.table("Gene.mmu",header = TRUE,sep= "\t")
head(mmu_genes$Gene)
## [1] "Xkr4" "Gm1992" "Gm19938" "Rp1" "Sox17" "Gm37587"
Converting this full gene list fails with an error:
##Error: biomaRt has encountered an unexpected server error.
##Consider trying one of the Ensembl mirrors (for more details look at ?useEnsembl)
Converting human genes to mouse genes:
hsa = read.table("hsa.raw",header = TRUE,sep= "\t")
head(hsa)
## Gene
##1 Xkr4
##2 Gm1992
convertHumanGeneList <- function(x){
  # Link the human dataset to the mouse dataset and fetch the orthologous symbols
  genesV2 = getLDS(attributes = c("hgnc_symbol"), filters = "hgnc_symbol", values = x , mart = human, attributesL = c("mgi_symbol"), martL = mouse, uniqueRows=TRUE)
  # Column 2 holds the mouse (MGI) symbols; return them without duplicates
  mousex <- unique(genesV2[, 2])
  return(mousex)
}
humGenes <- hsa$Gene
convertHumanGeneList(humGenes)
## Error: biomaRt has encountered an unexpected server error.
##Consider trying one of the Ensembl mirrors (for more details look at ?useEnsembl)
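These attempts show that converting a short list of gene names works well, whereas converting the gene names of the whole genome still fails. For a detailed discussion of possible solutions, see: link.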
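Since short lists succeed and the full list fails, one thing worth trying is to send the genes in smaller chunks and retry a chunk when the server errors out. The following is only a sketch under that assumption, not a confirmed fix: `convert_in_batches`, its `batch_size` and `retries` parameters, and the offline stand-in `fake_convert` are all hypothetical names. The error message itself also suggests the `mirror` argument of useEnsembl(), which can be tried independently.

```r
# Hypothetical helper: run a converter over a long gene list in chunks,
# retrying each failed chunk a few times before giving up on it.
convert_in_batches <- function(genes, convert_fun, batch_size = 500, retries = 3) {
  genes <- unique(genes)
  batches <- split(genes, ceiling(seq_along(genes) / batch_size))
  results <- lapply(batches, function(batch) {
    for (attempt in seq_len(retries)) {
      out <- tryCatch(convert_fun(batch), error = function(e) NULL)
      if (!is.null(out)) return(out)
      Sys.sleep(1)  # brief pause before retrying
    }
    warning("batch failed after ", retries, " attempts; skipping")
    character(0)
  })
  unique(unlist(results, use.names = FALSE))
}

# Stand-in converter so the sketch runs offline; with a live connection,
# pass convertMouseGeneList (defined earlier) instead.
fake_convert <- function(x) toupper(x)
converted <- convert_in_batches(c("Hmmr", "Tlx3", "Cpeb4"), fake_convert, batch_size = 2)
print(converted)
```

With a working connection the call would look like `convert_in_batches(mmu_genes$Gene, convertMouseGeneList)`. Skipped batches raise a warning, so a partial result is not mistaken for a complete one.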
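2. Downloading the correspondence file directly from Ensembl and converting with it
Step 1: Ensembl website -> BioMart; choose the corresponding genome dataset: link
Step 2: under Attributes, select "Homologues": Gene stable ID, Gene name;
Step 3: select the species for the orthologs (listed by first letter)
Step 4: download: Results -> Go
Step 5: inspect the downloaded file and write your own script to do the conversion.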
Conversion result: number of original mouse genes: 24784
Number of genes after conversion: 16412
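The script in Step 5 can be sketched roughly as follows. The table is mocked inline so the example runs without the download; the column names `mouse_gene_name` and `human_gene_name`, the file name in the comment, the helper `mouse_to_human`, and the empty orthologue entry are assumptions, and the real export will have its own headers.

```r
# In practice the table would come from the downloaded file, e.g.
# orthologs <- read.delim("mart_export.txt")  # file name is an assumption
orthologs <- data.frame(
  mouse_gene_name = c("Hmmr", "Tlx3", "Cpeb4", "Gm1992"),
  human_gene_name = c("HMMR", "TLX3", "CPEB4", ""),  # "" = no orthologue (mock data)
  stringsAsFactors = FALSE
)

# Drop rows without a human orthologue, then remove duplicate pairs.
orthologs <- orthologs[orthologs$human_gene_name != "", ]
orthologs <- unique(orthologs)

# Hypothetical helper: look up human names for a vector of mouse names.
mouse_to_human <- function(mouse_genes, table) {
  hits <- table[table$mouse_gene_name %in% mouse_genes, ]
  unique(hits$human_gene_name)
}

mouse_genes <- c("Hmmr", "Gm1992", "Cpeb4")
human_genes <- mouse_to_human(mouse_genes, orthologs)
cat("input:", length(mouse_genes), "converted:", length(human_genes), "\n")
```

Genes with no listed orthologue drop out at the filtering step, which is one plausible reason the converted count is lower than the input count.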
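3. Other
In addition, I came across an interesting R package that helps with exploring gene function annotation and with enrichment analysis: AnnotationDbi, org.Hs.eg.db.
Installation: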
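library(BiocManager)
BiocManager::install(c("org.Hs.eg.db", "Orthology.eg.db"))
library(org.Hs.eg.db) # also attaches AnnotationDbi
keytypes(org.Hs.eg.db) ## list the ID types that can be used as query keys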
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GO" "GOALL" "IPI" "MAP" "OMIM"
## [16] "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" "PMID"
## [21] "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG" "UNIGENE"
## [26] "UNIPROT"
columns(org.Hs.eg.db) # list the annotation columns that can be retrieved
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GO" "GOALL" "IPI" "MAP" "OMIM"
## [16] "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" "PMID"
## [21] "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG" "UNIGENE"
## [26] "UNIPROT"
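The key types and columns listed above are what a lookup is built from: the key type says what the input IDs are, the column says what to retrieve. A minimal sketch using `AnnotationDbi::mapIds()` follows; the wrapper name `symbol_to_ensembl` is hypothetical, and the call only works if AnnotationDbi and org.Hs.eg.db are installed.

```r
# Hypothetical wrapper around AnnotationDbi::mapIds(); requires the
# Bioconductor packages AnnotationDbi and org.Hs.eg.db to be installed.
symbol_to_ensembl <- function(symbols) {
  if (!requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
    stop("org.Hs.eg.db is not installed")
  }
  AnnotationDbi::mapIds(
    org.Hs.eg.db::org.Hs.eg.db,
    keys      = symbols,
    column    = "ENSEMBL",   # what to retrieve, one of columns()
    keytype   = "SYMBOL",    # what the input is, one of keytypes()
    multiVals = "first"      # keep the first hit for one-to-many mappings
  )
}

# Example call (needs the annotation package):
# symbol_to_ensembl(c("HMMR", "TLX3", "CPEB4"))
```

Swapping `column` and `keytype` for other values from the lists above gives the other conversions, for example ENTREZID to SYMBOL.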
For concrete usage, a quick search will turn up the details.