How to automatically screen for significant variables

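The Stata source code below was shared by 宝器兄. The whole workflow uses Stata plus a small Python program. The Python program generates every combination of the candidate variables, and a link to it can be found below.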

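With so many explanatory variables, which ones should you choose to get a significant result?

This post briefly explains the source code as I understand it, to help readers follow the code and adapt it to their own cases.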

I. Target significant variable: one

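Suppose the requirement is: in a regression of price, find the combinations of other explanatory variables under which mpg is significant, where the other explanatory variables must include rep78.

Note: in the examples below, the candidate combinations are drawn from foreign weight length headroom turn.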
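The original Python helper is not reproduced in this post, so the following is only a sketch of how such a combination list could be generated. The function name build_var_lists is hypothetical. It joins the variables inside one combination with commas and separates combinations with spaces, which is the layout the Stata code below expects.

```python
from itertools import combinations

# Candidate explanatory variables, taken from the note above.
candidates = ["foreign", "weight", "length", "headroom", "turn"]


def build_var_lists(variables):
    """Return every non-empty combination of variables as one string.

    Variables inside a combination are joined with commas;
    combinations are separated by single spaces.
    (Hypothetical helper, not the original author's script.)
    """
    groups = []
    for size in range(1, len(variables) + 1):
        for combo in combinations(variables, size):
            groups.append(",".join(combo))
    return " ".join(groups)


var_lists = build_var_lists(candidates)
print(var_lists)
```

With five candidates this yields 2^5 - 1 = 31 combinations, and the printed string can be pasted into the local var_lists line of the Stata code.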

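* Load the example dataset
sysuse auto, clear

* Explanatory variables that must always be included (optional)
local fixed_vars "rep78"
* Combinations of explanatory variables (generated by Python)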
local var_lists = "foreign weight length headroom turn foreign,weight foreign,length foreign,headroom foreign,turn weight,length weight,headroom weight,turn length,headroom length,turn headroom,turn foreign,weight,length foreign,weight,headroom foreign,weight,turn foreign,length,headroom foreign,length,turn foreign,headroom,turn weight,length,headroom weight,length,turn weight,headroom,turn length,headroom,turn foreign,weight,length,headroom foreign,weight,length,turn foreign,weight,headroom,turn foreign,length,headroom,turn weight,length,headroom,turn foreign,weight,length,headroom,turn"

* Find the significant combinations and display them
foreach var_list of local var_lists{
    local var_list : subinstr local var_list "," " ", all
    qui regress price mpg `fixed_vars' `var_list'
    
    if (_se[mpg] != 0 & abs(_b[mpg]/_se[mpg])>2){
        display in r"`fixed_vars'" " " "`var_list'"
        local con_var_lists = "`con_var_lists'" + "," + "`var_list'"
    }
}

* Run the regression for each significant combination
local con_var_lists : subinstr local con_var_lists " " ".", all
local con_var_lists : subinstr local con_var_lists "," " ", all

foreach con_var_list of local con_var_lists{
    local con_var_list : subinstr local con_var_list "." " ", all
    regress price mpg `fixed_vars' `con_var_list'
}

Key point 1: local macros

local lclname [=exp | :extended_fcn | "[string]" | `"[string]"']

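A local macro name can be followed by:

  • an expression (=exp)
  • an extended function (:extended_fcn)
  • "string"
  • `"string"' (compound double quotes, for strings that themselves contain double quotes)

local fixed_vars "rep78"    (string)

local var_lists = "foreign weight length..."    (expression)

local var_list : subinstr local var_list "," " ", all    (extended function)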

2. The foreach loop

foreach var_list of local var_lists{
    ...
}
  • foreach lname of local lmacname {
  • foreach lname of global gmacname {

foreach lname of local list { ... } and foreach lname of global list { ... } obtain the list from the indicated place. This method of using foreach produces the fastest executing code.

Official example:

local grains "rice wheat corn rye barley oats"
foreach x of local grains {
    display "`x'"
}

global money "Franc Dollar Lira Pound"
foreach y of global money {
    display "`y'"
}

So in our code we have:

local var_lists = "...."
foreach var_list of local var_lists{
   ...
}

Note: when foreach iterates over the contents of a macro, the elements are delimited by spaces.

3. The subinstr function

subinstr(s1,s2,s3,n)
Description: s1, where the first n occurrences in s1 of s2 have been replaced with s3

In other words: replace the first n occurrences of string s2 in string s1 with string s3.
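For readers more familiar with Python, the same behavior can be sketched with str.replace. The wrapper below is an illustration only and is not part of Stata. Using n=None to stand in for Stata's missing value (.), which means "replace all occurrences", is an assumption of this sketch.

```python
def subinstr(s1, s2, s3, n=None):
    """Mimic Stata's subinstr(s1, s2, s3, n) for illustration.

    Replaces the first n occurrences of s2 in s1 with s3.
    n=None plays the role of Stata's missing value (.): replace all.
    """
    if n is None:
        return s1.replace(s2, s3)
    return s1.replace(s2, s3, n)


first_only = subinstr("foreign,weight,length", ",", " ", 1)
every_one = subinstr("foreign,weight,length", ",", " ")
print(first_only)  # foreign weight,length
print(every_one)   # foreign weight length
```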

In our code:

local var_list : subinstr local var_list "," " ", all

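This is the macro extended-function form of the same operation: it replaces every comma with a space (the all option applies the replacement to all occurrences). The reason is that the later regress command expects variable names separated by spaces. We can pull this step out and look at it on its own: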

local var_lists = "foreign foreign,weight foreign,weight,length foreign,weight,length,headroom foreign,weight,length,headroom,turn"
foreach var_list of local var_lists {
    local var_list : subinstr local var_list "," " ", all
    display "`var_list'"
}

foreign
foreign weight
foreign weight length
foreign weight length headroom
foreign weight length headroom turn

By contrast, without the replacement each element keeps its commas, and passing foreign,weight,length,headroom,turn to regress would obviously raise an error:

local var_lists = "foreign foreign,weight foreign,weight,length foreign,weight,length,headroom foreign,weight,length,headroom,turn"
foreach var_list of local var_lists {
    display "`var_list'"
}
foreign
foreign,weight
foreign,weight,length
foreign,weight,length,headroom
foreign,weight,length,headroom,turn

4. Significance test

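The ratio of a regression coefficient to its standard error is the t-statistic. When the absolute value of t exceeds roughly 1.96 (the large-sample critical value), the coefficient is significant at the 5% level (p<0.05), which is usually marked with two stars. The code uses the slightly stricter threshold of 2: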

_se[mpg] != 0 & abs(_b[mpg]/_se[mpg])>2

The key to understanding this statement is the _se[mpg] notation, which Stata calls _variables.

Expressions may also contain _variables (pronounced "underscore variables"), which are built-in system variables that are created and updated by Stata. They are called variables because their names all begin with the underscore character, "_".

[eqno]_b[varname] (synonym: [eqno]_coef[varname]) contains the value (to machine precision) of the coefficient on varname from the most recently fitted model (such as ANOVA, regression, Cox, logit, probit, and multinomial logit).

[eqno]_se[varname] contains the value (to machine precision) of the standard error of the coefficient on varname from the most recently fit model (such as ANOVA, regression, Cox, logit, probit, and multinomial logit).
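To see how the 1.96 rule of thumb relates to the threshold of 2 used in the code, here is a small Python sketch. It uses a standard normal approximation, which is an assumption: regress reports p-values from the t distribution with the residual degrees of freedom, so in small samples the exact critical value is somewhat larger than 1.96. The function names are hypothetical.

```python
import math


def two_sided_p_normal(t):
    """Two-sided p-value for statistic t under a standard normal approximation."""
    return math.erfc(abs(t) / math.sqrt(2))


def is_significant(b, se, threshold=2.0):
    """Same rule as the Stata condition: nonzero se and |b/se| > threshold."""
    return se != 0 and abs(b / se) > threshold


print(round(two_sided_p_normal(1.96), 4))  # 0.05
print(round(two_sided_p_normal(2.0), 4))   # 0.0455
print(is_significant(-0.5, 0.2))           # True: |t| = 2.5
print(is_significant(-0.3, 0.2))           # False: |t| = 1.5
```

The check on se != 0 mirrors the Stata condition and avoids dividing by zero when a coefficient has no estimated standard error.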

5. The display command

display in r"`fixed_vars'" " " "`var_list'"

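The r in this statement is short for red, so the matching combinations are printed in red. Output styling of this kind is handled by SMCL, Stata's markup language.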
SMCL, which stands for Stata Markup and Control Language and is pronounced “smickle”, is Stata’s output language.
SMCL is markup language of Stata and mastering it helps you create nicer outputs for your packages and also, write better help files.


In brief, markup languages are computer languages in a sense that they include syntax and are interpreted by computers. They are mainly used for annotating a document. For example, LaTeX, HTML, XHTML, and XML all are markup languages that are used for annotating documents. SMCL is designed based on the same concept and it includes syntax that can be interpreted by Stata for creating electronic documents such as help files, log files, and results' window outputs.
SMCL Markup Language

6. Understanding the syntax

local con_var_lists : subinstr local con_var_lists " " ".", all
local con_var_lists : subinstr local con_var_lists "," " ", all
foreach con_var_list of local con_var_lists{
    local con_var_list : subinstr local con_var_list "." " ", all
    ...
}

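There are three local statements here. To understand them, first look at what con_var_lists contains after the "find the significant combinations and display them" block has run.

* Load the data
sysuse auto, clear
* Explanatory variables that must always be included (optional)
local fixed_vars "rep78"
* Combinations of explanatory variables (generated by Python)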
local var_lists = "foreign weight length headroom turn foreign,weight foreign,length foreign,headroom foreign,turn weight,length weight,headroom weight,turn length,headroom length,turn headroom,turn foreign,weight,length foreign,weight,headroom foreign,weight,turn foreign,length,headroom foreign,length,turn foreign,headroom,turn weight,length,headroom weight,length,turn weight,headroom,turn length,headroom,turn foreign,weight,length,headroom foreign,weight,length,turn foreign,weight,headroom,turn foreign,length,headroom,turn weight,length,headroom,turn foreign,weight,length,headroom,turn"

* Find the significant combinations and display them
foreach var_list of local var_lists{
    local var_list : subinstr local var_list "," " ", all
    qui regress price mpg `fixed_vars' `var_list'
    
    if (_se[mpg] != 0 & abs(_b[mpg]/_se[mpg])>2){
//      display in r"`fixed_vars'" " " "`var_list'"
        local con_var_lists = "`con_var_lists'" + "," + "`var_list'"
    }
}

dis in r"`con_var_lists'"

The output is:

,foreign,headroom,turn,foreign headroom,foreign turn,headroom turn,foreign headroom turn

That is, mpg is significant under the following 7 combinations:

  • foreign
  • headroom
  • turn
  • foreign headroom
  • foreign turn
  • headroom turn
  • foreign headroom turn

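The first local statement:
local con_var_lists : subinstr local con_var_lists " " ".", all
replaces spaces with ., so that foreign headroom becomes foreign.headroom.
Otherwise the foreach loop would split foreign headroom into two separate elements and give wrong results.

Result after this statement:
,foreign,headroom,turn,foreign.headroom,foreign.turn,headroom.turn,foreign.headroom.turn

The second local statement:
local con_var_lists : subinstr local con_var_lists "," " ", all
replaces , with a space, again so that the foreach loop can iterate over the combinations.

Result after this statement:
foreign headroom turn foreign.headroom foreign.turn headroom.turn foreign.headroom.turn

The third local statement:
local con_var_list : subinstr local con_var_list "." " ", all
replaces . with a space so that the list can be passed to the regress command. Passing foreign.headroom.turn to regress would obviously raise an error.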
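The three replacements above can be traced step by step outside Stata. The Python sketch below only simulates the string handling (it runs no regressions), and the starting string is copied from the output shown earlier in this section.

```python
# String accumulated by the screening loop (copied from the output above).
con_var_lists = (",foreign,headroom,turn,foreign headroom,"
                 "foreign turn,headroom turn,foreign headroom turn")

# Step 1: protect the spaces inside each combination by turning them into dots.
step1 = con_var_lists.replace(" ", ".")

# Step 2: turn the commas between combinations into spaces.
step2 = step1.replace(",", " ")

# The loop splits on whitespace; step 3 restores the spaces inside each item.
combos = [item.replace(".", " ") for item in step2.split()]

print(step1)
print(step2)
for combo in combos:
    print("regress price mpg rep78", combo)
```

The dot works as a placeholder only because none of the variable names contains a dot. Any character that cannot appear in the names would serve the same purpose.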

II. Target significant variables: several

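Suppose the requirement is now: find the combinations of other explanatory variables under which mpg and rep78 are both significant.

Note: in the examples below, the candidate combinations are drawn from foreign weight length headroom turn.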

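* Load the example dataset
sysuse auto, clear

* Explanatory variables that must always be included (optional)
local fixed_vars ""
* Combinations of explanatory variables (generated by Python)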
local var_lists = "foreign weight length headroom turn foreign,weight foreign,length foreign,headroom foreign,turn weight,length weight,headroom weight,turn length,headroom length,turn headroom,turn foreign,weight,length foreign,weight,headroom foreign,weight,turn foreign,length,headroom foreign,length,turn foreign,headroom,turn weight,length,headroom weight,length,turn weight,headroom,turn length,headroom,turn foreign,weight,length,headroom foreign,weight,length,turn foreign,weight,headroom,turn foreign,length,headroom,turn weight,length,headroom,turn foreign,weight,length,headroom,turn"

* Find the significant combinations and display them
foreach var_list of local var_lists{
    local var_list : subinstr local var_list "," " ", all
    qui regress price mpg rep78  `fixed_vars' `var_list'
    
    if (_se[mpg] != 0 & abs(_b[mpg]/_se[mpg])>2) &(_se[rep78] != 0 & abs(_b[rep78]/_se[rep78])>2) {
        display in r"`fixed_vars'" " " "`var_list'"
        local con_var_lists = "`con_var_lists'" + "," + "`var_list'"
    }
}

* Run the regression for each significant combination
local con_var_lists : subinstr local con_var_lists " " ".", all
local con_var_lists : subinstr local con_var_lists "," " ", all
foreach con_var_list of local con_var_lists{
    local con_var_list : subinstr local con_var_list "." " ", all
    regress price mpg rep78 `fixed_vars' `con_var_list'
}
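The only change from the single-variable case is that the condition must hold for every target variable at once. That logic can be sketched in Python as follows. The function name all_significant is hypothetical, and the coefficients and standard errors are made-up numbers for illustration, not results from the auto dataset.

```python
def all_significant(b, se, targets, threshold=2.0):
    """True only if every target has a nonzero se and |b/se| > threshold."""
    return all(se[v] != 0 and abs(b[v] / se[v]) > threshold for v in targets)


# Made-up coefficients and standard errors, for illustration only.
b = {"mpg": -300.0, "rep78": 900.0}
se = {"mpg": 100.0, "rep78": 600.0}

print(all_significant(b, se, ["mpg"]))           # True: |t| = 3.0
print(all_significant(b, se, ["mpg", "rep78"]))  # False: rep78 has |t| = 1.5
```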