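[Overview] Product operations depend on data analysis, and data analysis depends on statistics. We may not care what statistics is as a discipline, but knowing which statistical ideas show up in everyday business analysis can help. "The Seven Pillars of Statistical Wisdom" covers aggregation, the law of diminishing information, likelihood, intercomparison, regression and multivariate analysis, design, and models and residuals. What use are these in business analysis?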
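In product operations we may have all sorts of ideas for improving the product, but how do we verify that an idea actually works? We need to do three things:
1) State a hypothesis: the idea may have a positive effect on the product;
2) Experiment: run an A/B test in which the treatment group gets the idea and the control group does not;
3) Statistical analysis: compute the key metrics, compare the difference, and estimate how confident we can be in it.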
How is each group doing overall, and is the treatment group actually better? (Design)
To answer these questions we might do the following:
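1) Compute the mean $\bar{X}$ of each group to summarize its overall level (aggregation; this assumes the sample mean is approximately normally distributed, which is where likelihood comes in).
2) Compute the sample variance $S^2$, which measures how much the observations within a group differ from one another (intercomparison).
3) Standardize the difference in means to a standard normal statistic and run a z-test: compare $|z|$ with the critical value $z_{\alpha/2}$ (1.96 when $\alpha = 0.05$). For the same difference in means, the smaller the variation within the two groups, the higher the confidence; for the same variation, the larger the samples, the larger $\sqrt{n}$ and the higher the confidence. The sample sizes $n_1$ and $n_2$ enter the calculation under a square root, which is very common in statistics and matches the law of diminishing information.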
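The three steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a production test. The function name `two_sample_z` and the metric values are made up here, and the statistic is assumed to be the usual unpooled two-sample form, z = (mean1 - mean2) / sqrt(S1^2/n1 + S2^2/n2). With samples as small as these, a t-test would normally be the safer choice.

```python
import math
from statistics import NormalDist, mean, variance


def two_sample_z(group_a, group_b, alpha=0.05):
    """Return (z, critical value, significant?) for the difference in means."""
    n1, n2 = len(group_a), len(group_b)
    # Aggregation: one mean per group.
    mean1, mean2 = mean(group_a), mean(group_b)
    # Intercomparison: sample variances S^2 measure within-group spread.
    var1, var2 = variance(group_a), variance(group_b)
    # Root-n rule: n1 and n2 enter only under the square root.
    standard_error = math.sqrt(var1 / n1 + var2 / n2)
    z = (mean1 - mean2) / standard_error
    # Two-sided critical value, about 1.96 when alpha = 0.05.
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    return z, critical, abs(z) > critical


# Hypothetical per-user metric values, invented for illustration.
treatment = [12.1, 11.8, 12.6, 12.9, 12.3, 12.7, 12.0, 12.5]
control = [11.2, 11.9, 11.5, 11.1, 11.7, 11.4, 11.8, 11.3]

z, critical, significant = two_sample_z(treatment, control)
print(f"z = {z:.2f}, critical = {critical:.2f}, significant = {significant}")
```

Shrinking the within-group spread or enlarging the samples both reduce `standard_error`, which is how the same difference in means becomes more convincing.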
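Beyond comparing to see which is better, we also make predictions, inferring the unknown from the known; people tend to be more curious about what they do not yet know. "The Seven Pillars of Statistical Wisdom" traces regression back to the example of height: if both parents are taller than average, their children tend to be shorter than the parents, and if both parents are shorter than average, their children tend to be taller than the parents. This is the origin of the phrase "regression toward mediocrity". The example can be written as $Y = aX + b$, where $X$ is the parents' height, $Y$ is the child's height, $a$ is the weight of the parents' influence on the child's height, and $b$ is the intercept that anchors the line to the average height (if parents and children share the same mean height $\mu$, then $b = (1 - a)\mu$). Business analysis may involve many more factors, which is where regression and multivariate analysis come in. Once we have a pair $a$ and $b$, how well do these parameters describe the data? Answering that takes models and residuals: to make the model fit the data as closely as possible, we can estimate its parameters by least squares.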
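As a rough sketch of the least-squares step, the closed-form estimates for a single predictor can be computed directly. The helper name `fit_line` and the height values below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares estimates (a, b) for the line y = a * x + b."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    a = sxy / sxx
    b = y_bar - a * x_bar  # the fitted line passes through (x_bar, y_bar)
    return a, b


# Hypothetical heights in cm, invented for illustration only.
parent_heights = [160.0, 165.0, 170.0, 175.0, 180.0]
child_heights = [166.0, 167.0, 171.0, 172.0, 174.0]

a, b = fit_line(parent_heights, child_heights)
residuals = [y - (a * x + b) for x, y in zip(parent_heights, child_heights)]
print(f"a = {a:.2f}, b = {b:.2f}")
print("sum of squared residuals:", round(sum(r * r for r in residuals), 3))
```

With these made-up numbers the slope comes out below 1, which is the regression effect in miniature: children of tall parents are predicted to be closer to the average than their parents are.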
Reference: summary by Rick Wicklin, PhD, of Stephen M. Stigler's "The Seven Pillars of Statistical Wisdom":
Aggregation: It sounds like an oxymoron that you can gain knowledge by discarding information, yet that is what happens when you replace a long list of numbers by a sum or mean. Every day the news media reports a summary of billions of stock market transactions by reporting a single weighted average of stock prices: the Dow Jones Industrial Average. Statisticians aggregate, and policy makers and business leaders use these aggregated values to make complex decisions.
The law of diminishing information: If 10 pieces of data are good, are 20 pieces twice as good? No, the value of additional information diminishes like the square root of the number of observations, which is why Stigler nicknamed this pillar the "root n rule." The square root appears in formulas such as the standard error of the mean, which describes the probability that the mean of a sample will be close to the mean of a population.
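The root-n rule is easy to check numerically. The snippet below is only a sketch: `standard_error` is a helper defined here, and the population standard deviation of 10 is an arbitrary assumption.

```python
import math


def standard_error(sigma, n):
    """Standard error of the mean for n independent observations."""
    return sigma / math.sqrt(n)


sigma = 10.0  # assumed population standard deviation, chosen arbitrarily
for n in (10, 20, 40, 160):
    print(n, round(standard_error(sigma, n), 3))
```

Doubling the data from 10 to 20 observations shrinks the standard error by a factor of the square root of 2 (about 1.41), not 2; halving it takes four times as many observations.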
Likelihood: Some people say that statistics is "the science of uncertainty." One of the pillars of statistics is being able to confidently state how good a statistical estimate is. Hypothesis tests and p-values are examples of how statisticians use probability to carry out statistical inference.
Intercomparisons: When analyzing data, statisticians usually make comparisons that are based on differences among the data. This is different than in some fields, where comparisons are made against some ideal "gold standard." Well-known analyses such as ANOVA and t-tests utilize this pillar.
Regression and multivariate analysis: Children that are born to two extraordinarily tall parents tend to be shorter than their parents. Similarly, if both parents are shorter than average, the children tend to be taller than the parents. This is known as regression to the mean. Regression is the best known example of multivariate analysis, which also includes dimension-reduction techniques and latent factor models.
Design: R. A. Fisher, in an address to the Indian Statistical Congress (1938) said "To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of." A pillar of statistics is the design of experiments, and—by extension—all data collection and planning that leads to good data. Included in this pillar is the idea that random assignment of subjects to design cells improves the analysis. This pillar is the basis for agricultural experiments and clinical trials, just to name two examples.
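For an A/B test, random assignment can be as simple as shuffling the users and splitting the list. This is a minimal sketch under stated assumptions: `assign_groups` and the user IDs are hypothetical, and real systems often assign by hashing a user ID instead so that the assignment is stable across sessions.

```python
import random


def assign_groups(user_ids, seed=None):
    """Randomly split users into (treatment, control) of near-equal size."""
    rng = random.Random(seed)  # a seed makes the split reproducible
    shuffled = list(user_ids)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


users = [f"user_{i}" for i in range(10)]  # hypothetical user IDs
treatment, control = assign_groups(users, seed=42)
print("treatment:", treatment)
print("control:", control)
```

Because chance alone decides who lands in which group, differences between the groups other than the treatment itself tend to average out.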
Models and Residuals: This pillar enables you to examine shortcomings of a model by examining the difference between the observed data and the model. If the residuals have a systematic pattern, you can revise your model to explain the data better. You can continue this process until the residuals show no pattern. This pillar is used by statistical practitioners every time that they look at a diagnostic residual plot for a regression model.
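The loop of fitting, inspecting residuals, and revising can be illustrated on synthetic data. Everything here is an assumption made for the sketch: the data are noise-free values of x squared, and `fit_line` is a small least-squares helper defined in the block.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares estimates (a, b) for the line y = a * x + b."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    a = sxy / sxx
    return a, y_bar - a * x_bar


# Synthetic, noise-free data with curvature (y = x^2), for illustration.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [x * x for x in xs]

# First attempt: a straight line in x.
a, b = fit_line(xs, ys)
residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
print("linear model residuals: ", [round(r, 2) for r in residuals])

# Revised model: regress y on x^2 instead of x.
squared = [x * x for x in xs]
a2, b2 = fit_line(squared, ys)
revised_residuals = [y - (a2 * s + b2) for s, y in zip(squared, ys)]
print("revised model residuals:", [round(r, 2) for r in revised_residuals])
```

The straight-line residuals are positive at both ends and negative in the middle, a systematic U shape that signals missing curvature; after the revision they vanish. With real, noisy data the goal is residuals that look like unstructured scatter rather than exact zeros.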