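Basics
- Propensity score matching (PSM) uses statistical methods to select subjects from the treatment and control groups so that the selected subjects are comparable on important clinical characteristics (potential confounders).
- A statistical model condenses each observation's multiple covariates into a single propensity score, and observations are then matched on how close their scores are.
- The most commonly used model is a logistic regression with the group variable as the dependent variable and the potential confounders that may affect the outcome as covariates.
- Compute the propensity score of each observation, then match treated and control observations by score.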
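The two steps above (fit a logistic model, then match on the fitted score) can be sketched in base R without any package. This is only an illustration under stated assumptions: the data are simulated, the variable names (`age`, `educ`, `pairs`, `matched`) are made up for the example, and the greedy loop is a simplified stand-in for nearest-neighbor matching, not a copy of what any package does internally.

```r
# Minimal base-R sketch: propensity score estimation plus greedy 1:1 matching.
# The simulated data and all names below are illustrative assumptions.
set.seed(42)
n <- 300
age  <- rnorm(n, mean = 40, sd = 10)
educ <- rnorm(n, mean = 12, sd = 2)
# Treatment assignment depends on the covariates, which creates confounding
treat <- rbinom(n, 1, plogis(-1.5 + 0.03 * (age - 40) - 0.2 * (educ - 12)))
dat <- data.frame(treat, age, educ)

# Step 1: logistic regression with the group variable as the dependent variable
ps_model <- glm(treat ~ age + educ, data = dat, family = binomial())
dat$ps <- fitted(ps_model)  # propensity score = estimated P(treat = 1 | covariates)

# Step 2: greedy nearest-neighbor matching without replacement
control_idx <- which(dat$treat == 0)
pairs <- data.frame(treated = integer(0), control = integer(0))
for (i in which(dat$treat == 1)) {
  if (length(control_idx) == 0) break          # no controls left to match
  gap <- abs(dat$ps[control_idx] - dat$ps[i])  # score distance to every free control
  j <- control_idx[which.min(gap)]             # closest remaining control
  pairs <- rbind(pairs, data.frame(treated = i, control = j))
  control_idx <- setdiff(control_idx, j)       # each control is used at most once
}

# Matched sample: every matched treated row plus its matched control row
matched <- dat[c(pairs$treated, pairs$control), ]
```

Each treated observation takes the closest control that is still free, so the result depends on the order in which treated observations are processed.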
Implementation (using the MatchIt package)
library(MatchIt)
library(tableone)
data(lalonde)
head(lalonde,4)
# treat age educ race married nodegree re74 re75 re78
# NSW1 1 37 11 black 1 1 0 0 9930.046
# NSW2 1 22 9 hispan 0 1 0 0 3595.894
# NSW3 1 30 12 black 0 0 0 0 24909.450
# NSW4 1 27 11 black 0 1 0 0 7506.146
str(lalonde)
# 'data.frame': 614 obs. of 9 variables:
# $ treat : int 1 1 1 1 1 1 1 1 1 1 ...
# $ age : int 37 22 30 27 33 22 23 32 22 33 ...
# $ educ : int 11 9 12 11 8 9 12 11 16 12 ...
# $ race : Factor w/ 3 levels "black","hispan",..: 1 2 1 1 1 1 1 1 1 3 ...
# $ married : int 1 0 0 0 0 0 0 0 0 1 ...
# $ nodegree: int 1 1 0 1 1 1 0 1 0 0 ...
# $ re74 : num 0 0 0 0 0 0 0 0 0 0 ...
# $ re75 : num 0 0 0 0 0 0 0 0 0 0 ...
# $ re78 : num 9930 3596 24909 7506 290 ...
#dput(names(lalonde))
preBL <- CreateTableOne(vars=c("treat","age","educ","race","married","nodegree","re74","re75","re78"),
strata="treat",data=lalonde,
factorVars=c("treat","race","married","nodegree"))
# treat is the variable of interest; re78 is the outcome variable
print(preBL,showAllLevels = TRUE)
f=matchit(treat~re74+re75+educ+race+age+married+nodegree,data=lalonde,method="nearest",ratio = 1)
# 1:1 nearest-neighbor matching on the propensity score; the outcome re78 is left out of the propensity model
summary(f)
# ...
# Sample Sizes:
# Control Treated
# All 429 185
# Matched 185 185
# Unmatched 244 0
# Discarded 0 0
matchdata=match.data(f)
mBL <- CreateTableOne(vars=c("treat","age","educ","race","married","nodegree","re74","re75","re78"),
strata="treat",data=matchdata,
factorVars=c("treat","race","married","nodegree"))
print(mBL,showAllLevels = TRUE)
plot(f, type = 'jitter', interactive = FALSE)
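The race variable is still imbalanced after matching, so add a caliper to restrict how far apart matched propensity scores can be.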
f1=matchit(treat~re74+re75+educ+race+age+married+nodegree,data=lalonde,method="nearest",caliper=0.05)
summary(f1)
# ...
# Sample Sizes:
# Control Treated
# All 429 185
# Matched 109 109
# Unmatched 320 76
# Discarded 0 0
matchdata1=match.data(f1)
mBL1 <- CreateTableOne(vars=c("treat","age","educ","race","married","nodegree","re74","re75","re78"),
strata="treat",data=matchdata1,
factorVars=c("treat","race","married","nodegree"))
print(mBL1,showAllLevels = TRUE)
plot(f1, type = 'jitter', interactive = FALSE)
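To make the effect of a caliper concrete, here is a base-R sketch on simulated data. It assumes the caliper is expressed in standard deviations of the propensity score; the helper names `caliper_match` and `smd` are hypothetical, and the exact standard deviation that MatchIt uses internally may differ from the plain `sd(ps)` used here.

```r
# Sketch of caliper matching and a balance check; data and helper names are assumptions.
set.seed(1)
n <- 400
x <- rnorm(n)
treat <- rbinom(n, 1, plogis(-1 + 1.2 * x))
ps <- fitted(glm(treat ~ x, family = binomial()))

# Greedy 1:1 matching; a treated unit stays unmatched when its nearest
# free control is farther away than caliper_sd * sd(ps)
caliper_match <- function(ps, treat, caliper_sd) {
  width <- caliper_sd * sd(ps)
  pool <- which(treat == 0)
  out <- data.frame(treated = integer(0), control = integer(0))
  for (i in which(treat == 1)) {
    if (length(pool) == 0) break
    d <- abs(ps[pool] - ps[i])
    k <- which.min(d)
    if (d[k] <= width) {
      out <- rbind(out, data.frame(treated = i, control = pool[k]))
      pool <- pool[-k]
    }
  }
  out
}

# Standardized mean difference of a covariate between the two groups
smd <- function(x, treat) {
  (mean(x[treat == 1]) - mean(x[treat == 0])) /
    sqrt((var(x[treat == 1]) + var(x[treat == 0])) / 2)
}

loose <- caliper_match(ps, treat, caliper_sd = Inf)   # no caliper
tight <- caliper_match(ps, treat, caliper_sd = 0.05)  # caliper of 0.05 SD

# Compare the number of pairs and the covariate balance under each setting
print(c(pairs_loose = nrow(loose), pairs_tight = nrow(tight)))
keep_loose <- c(loose$treated, loose$control)
keep_tight <- c(tight$treated, tight$control)
print(c(smd_before = smd(x, treat),
        smd_loose  = smd(x[keep_loose], treat[keep_loose]),
        smd_tight  = smd(x[keep_tight], treat[keep_tight])))
```

The trade-off matches the two summaries above: a tighter caliper discards treated observations that have no close control, so the matched sample shrinks in exchange for closer pairs.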
Exporting the matched data
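library(foreign)  # only needed for write.dta()
# Add the id to the same data frame that is exported (the caliper-matched sample)
matchdata1$id <- seq_len(nrow(matchdata1))
write.csv(matchdata1, "matchdata.csv")
# write.dta(matchdata1, "matchdata.dta")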
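- When PSM applies: the control group must be large enough, with a control-to-treatment sample size ratio of at least 5:1, so that the vast majority of treated subjects can be matched to a suitable control; ideally every treated subject gets a good match.
- PSM versus regression: any analysis that can use PSM can also use regression, but not every regression analysis can use PSM. Running both on the same data is recommended; when the two agree, the result is more credible.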
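The advice to run both analyses can be sketched as follows. Everything here is an assumption made for illustration: the data are simulated with a known treatment effect of 2, the outcome model is linear by construction, and the matching step is a simple nearest-neighbor match with replacement rather than the MatchIt workflow used earlier.

```r
# Sketch: compare a regression-adjusted estimate with a matched estimate.
# Simulated data with an assumed true treatment effect of 2.
set.seed(7)
n <- 1000
x <- rnorm(n)
treat <- rbinom(n, 1, plogis(-1.8 + x))
y <- 2 * treat + 1.5 * x + rnorm(n)

# Control-to-treatment ratio, to check against the 5:1 rule of thumb
ratio <- sum(treat == 0) / sum(treat == 1)

# Regression: adjust for the confounder directly in the outcome model
est_reg <- unname(coef(lm(y ~ treat + x))["treat"])

# PSM: match each treated unit to the control with the closest propensity score
# (with replacement, so a control may serve several treated units)
ps <- fitted(glm(treat ~ x, family = binomial()))
ctrl <- which(treat == 0)
nearest <- sapply(which(treat == 1),
                  function(i) ctrl[which.min(abs(ps[ctrl] - ps[i]))])
est_psm <- mean(y[treat == 1] - y[nearest])

print(c(ratio = ratio, regression = est_reg, psm = est_psm))
```

If the two estimates diverge noticeably, it is worth checking covariate balance in the matched sample and the functional form of the regression before trusting either number.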
References
DXY (丁香园) course, full version: "R语言进阶之机器学习" (Advanced R: Machine Learning)
How to use R for matching samples (propensity score)