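I. Data requirements for the exposure:
1. For the exposure GWAS data, TwoSampleMR needs a data frame of instruments with one row per SNP and at least four columns (five in practice, see the note on other_allele):
- SNP – rs ID (chromosome and position can be converted to an rs ID)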
- beta – The effect size. If the trait is binary then log(OR) should be used
- se – The standard error of the effect size
- effect_allele – The allele of the SNP which has the effect marked in beta
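- other_allele – The non-effect allele (the official documentation does not list this column as required, but in practice the analysis did not run without it)
beta and OR are interconvertible: beta = log(OR) (natural log), so OR = exp(beta).
The P-value, beta (or OR) and SE are linked by z = beta / se and P = 2 * pnorm(-|z|), so knowing any two lets you compute the third. Note that P and SE together recover only the magnitude of beta, not its sign.
2. Other columns that help with MR preprocessing or analysis (eaf and the sample size can be used to compute the F-statistic and R2):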
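The conversions above can be sketched in base R as follows. The numbers are made-up illustration values, not taken from any study, and the P-value assumes the usual two-sided Wald test under a normal approximation.

```r
# Toy summary statistics for one SNP (hypothetical values)
or <- 1.20   # odds ratio reported by the GWAS
se <- 0.03   # standard error of log(OR)

beta <- log(or)              # beta = ln(OR)
z    <- beta / se            # Wald z-statistic
pval <- 2 * pnorm(-abs(z))   # two-sided P-value

# Recover the SE from beta and P
se_from_p <- abs(beta) / qnorm(pval / 2, lower.tail = FALSE)

# Recover |beta| from SE and P; the direction of effect must come from the source data
abs_beta_from_p <- se * qnorm(pval / 2, lower.tail = FALSE)

print(c(beta = beta, z = z, pval = pval, se_from_p = se_from_p))
```

For extremely small P-values the reported value may be rounded or underflow to 0, in which case recovering SE or beta from P is unreliable and the original beta and SE should be used instead.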
- eaf – The effect allele frequency
- Phenotype – The name of the phenotype for which the SNP has an effect
3. You can also provide additional information:
- chr – Physical position of variant (chromosome)
- position – Physical position of variant (position)
- samplesize – Sample size for estimating the effect size (can be used to compute the F-statistic and R2)
- ncase – Number of cases (ncase and samplesize can be used to compute power)
- ncontrol – Number of controls
- pval – The P-value for the SNP’s association with the exposure (useful when filtering by P-value)
- units – The units in which the effects are presented
- gene – The gene or other annotation for the SNP
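As a rough illustration of how eaf and samplesize feed into R2 and the F-statistic, the sketch below builds a toy exposure data frame with the columns listed above and applies commonly used approximations. These formulas are an assumption of this example rather than something TwoSampleMR prescribes: the R2 formula assumes beta is expressed in standard-deviation units of the exposure, and the SNP IDs and values are placeholders.

```r
# Hypothetical exposure instruments using the column names listed above
exp_dat <- data.frame(
  SNP = c("rs0000001", "rs0000002"),  # placeholder IDs, not real variants
  beta = c(0.08, -0.05),
  se = c(0.0049, 0.0045),
  effect_allele = c("A", "T"),
  other_allele = c("G", "C"),
  eaf = c(0.30, 0.45),
  samplesize = c(100000, 100000),
  stringsAsFactors = FALSE
)

# Per-SNP variance explained (assumes beta is per SD of the exposure)
exp_dat$r2 <- 2 * exp_dat$eaf * (1 - exp_dat$eaf) * exp_dat$beta^2

# Per-SNP F-statistic from R2 and the sample size
exp_dat$f_stat <- exp_dat$r2 * (exp_dat$samplesize - 2) / (1 - exp_dat$r2)

# Approximation that needs only beta and se
exp_dat$f_approx <- (exp_dat$beta / exp_dat$se)^2

print(exp_dat[, c("SNP", "r2", "f_stat", "f_approx")])
```

A common rule of thumb treats instruments with F below 10 as weak, so a filter such as `subset(exp_dat, f_stat > 10)` is often applied before the analysis.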
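II. Obtaining instruments from existing databases:
- 1. Install the R packages used to import the data:
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
if (!requireNamespace("TwoSampleMR", quietly = TRUE)) remotes::install_github("MRCIEU/TwoSampleMR")
if (!requireNamespace("MRInstruments", quietly = TRUE)) remotes::install_github("MRCIEU/MRInstruments")
library(TwoSampleMR)   # provides format_data(), extract_instruments() and the format_*() helpers
library(MRInstruments) # provides the bundled instrument datasets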
- 2. GWAS catalog:
data(gwas_catalog)
head(gwas_catalog)
# For example, to obtain instruments for BMI from the Speliotes et al. 2010 study:
bmi_gwas <- subset(gwas_catalog, grepl("Speliotes", Author) & Phenotype == "Body mass index")
bmi_exp_dat <- format_data(bmi_gwas)
- 3. Metabolites:
data(metab_qtls)
head(metab_qtls)
# For example, to obtain instruments for alanine:
ala_exp_dat <- format_metab_qtls(subset(metab_qtls, phenotype == "Ala"))
- 4. Proteins:
data(proteomic_qtls)
head(proteomic_qtls)
# For example, to obtain instruments for the ApoH protein:
apoh_exp_dat <-
  format_proteomic_qtls(subset(proteomic_qtls, analyte == "ApoH"))
- 5. Gene expression levels:
data(gtex_eqtl)
head(gtex_eqtl)
# For example, to obtain instruments for IRAK1BP1 gene expression in subcutaneous adipose tissue:
irak1bp1_exp_dat <-
  format_gtex_eqtl(subset(
    gtex_eqtl,
    gene_name == "IRAK1BP1" & tissue == "Adipose Subcutaneous"
  ))
- 6. DNA methylation levels:
data(aries_mqtl)
head(aries_mqtl)
# For example, to obtain instruments for DNA methylation at CpG cg25212131 at birth:
cg25212131_exp_dat <-
  format_aries_mqtl(subset(aries_mqtl, cpg == "cg25212131" &
    age == "Birth"))
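- 7. IEU GWAS database:
# available_outcomes() returns a table of all studies in the database; each study has a unique ID
ao <- available_outcomes()
head(ao) # show the first 6 rows
head(subset(ao, select = c(trait, id))) # show only the trait name and study ID
# Extract BMI-associated SNPs from the Locke et al. 2015 GIANT study as instruments: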
bmi2014_exp_dat <- extract_instruments(outcomes = 'ieu-a-2')
Here extract_instruments retrieves the instruments from the IEU database. Its main parameters are:
● p1 = P-value threshold for keeping a SNP
● clump = Whether or not to return independent SNPs only (default is TRUE)
● r2 = The maximum LD R-square allowed between returned SNPs
● kb = The distance in which to search for LD R-square values
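In short: p1 selects the SNPs that are significantly associated with the exposure (default p1 = 5e-08). Setting clump (default TRUE) then removes SNPs that are in linkage disequilibrium (LD) with one another; put simply, SNPs in strong LD carry largely the same information, so keeping more than one adds no independent evidence. Finally, p2 (the secondary clumping threshold), r2 and kb define the criteria used for LD removal. The defaults are usually fine, or you can follow the settings used in a reference paper.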
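To make the parameters concrete, the call can be written with every argument spelled out. This is a minimal sketch: the r2 and kb values shown are assumed to match the package defaults and should be checked with ?extract_instruments for the installed version. Recent versions of the OpenGWAS API may also require an authentication token, which is not configured here. The server query is guarded so that it only runs in an interactive session with TwoSampleMR installed.

```r
# Arguments for extract_instruments(), written out explicitly.
# The r2 and kb values are assumed defaults; confirm them with ?extract_instruments.
instrument_args <- list(
  outcomes = "ieu-a-2",  # study ID taken from available_outcomes()
  p1 = 5e-08,            # P-value threshold for keeping a SNP
  clump = TRUE,          # return independent SNPs only
  p2 = 5e-08,            # secondary clumping threshold
  r2 = 0.001,            # maximum LD R-square allowed between returned SNPs
  kb = 10000             # window (in kb) in which LD is evaluated
)

# Only query the server in an interactive session with TwoSampleMR installed
if (interactive() && requireNamespace("TwoSampleMR", quietly = TRUE)) {
  bmi_exp_dat <- do.call(TwoSampleMR::extract_instruments, instrument_args)
  print(nrow(bmi_exp_dat))
}
```

Keeping the arguments in a named list makes it easy to record the exact clumping settings alongside the results, for example when reproducing the parameters reported in a published analysis.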