Data Science: Statistics Fundamentals (Self-Study)

Do you want to learn statistics for data science without taking a slow and expensive course? Good news: you can master the core concepts, probability, Bayesian thinking, and even statistical machine learning using only free online resources. Here are the best resources for self-starters!

By the way, you don't need a math degree to succeed with this approach. Yet if you do have a math background, you'll definitely enjoy this fun, hands-on method too.

This guide will equip you with the tools of statistical thinking needed for data science. It will arm you with a huge advantage over other aspiring data scientists who try to get by without it.

You see, it can be tempting to jump directly into using machine learning packages once you've learned how to program. And you know what? It's OK if you want to initially get the ball rolling with real projects.

But you should never, ever completely skip learning statistics and probability theory. It's essential to progressing your career as a data scientist.

Prerequisite: Basic Python Skills

To complete this guide, you'll need at least basic Python* programming skills. We'll be learning statistics in an applied, hands-on way.

Check out our guide, How to Learn Python for Data Science, The Self-Starter Way, for the fastest way to get up to speed with Python. We recommend at least completing up to Step 2 in that guide.

(Translator's note: the next post will cover how to learn Python for data science.)

*Note: other languages are fine too, but the examples will be in Python.

Statistics Needed for Data Science

Statistics is a broad field with applications in many industries.

Wikipedia defines it as the study of the collection, analysis, interpretation, presentation, and organization of data. Therefore, it shouldn't be a surprise that data scientists need to know statistics.

For example, data analysis requires descriptive statistics and probability theory, at a minimum. These concepts will help you make better business decisions from data.

Key concepts include probability distributions, statistical significance, hypothesis testing, and regression.

Furthermore, machine learning requires understanding Bayesian thinking. Bayesian thinking is the process of updating beliefs as additional data is collected, and it's the engine behind many machine learning models.

Key concepts include conditional probability, priors and posteriors, and maximum likelihood.

If those terms sound like mumbo jumbo to you, don't worry. This will all make sense once you roll up your sleeves and start learning.

The Best Way to Learn Statistics for Data Science

By now, you've probably noticed that one common theme in "the self-starter way to learning X" is to skip classroom instruction and learn by "doing shit."

Mastering statistics for data science is no exception.

In fact, we're going to tackle key statistical concepts by programming them with code! Trust us, this will be super fun.

If you do not have formal math training, you'll find this approach much more intuitive than trying to decipher complicated formulas. It allows you to think through the logical steps of each calculation.
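To give a feel for what "programming a concept" looks like, here is a minimal sketch that computes the mean, sample variance, and standard deviation step by step. The function names and the `daily_sales` numbers are made up for illustration; they do not come from any of the resources mentioned in this guide.

```python
import math

def mean(xs):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(xs) / len(xs)

def variance(xs):
    """Sample variance: summed squared deviations from the mean, divided by n - 1."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def std_dev(xs):
    """Standard deviation: the square root of the sample variance."""
    return math.sqrt(variance(xs))

# Hypothetical data, for illustration only.
daily_sales = [12, 15, 11, 19, 14, 13]

print("mean:", mean(daily_sales))
print("variance:", variance(daily_sales))
print("std dev:", std_dev(daily_sales))
```

Writing each step as a line of code makes the formula concrete: the variance is just "how far is each value from the mean, squared, then averaged."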

If you do have a formal math background, this approach will help you translate theory into practice and give you some fun programming challenges.

Here are the 3 steps to learning the statistics and probability required for data science:

1 Core Statistics Concepts

Descriptive statistics, distributions, hypothesis testing, and regression.

2 Bayesian Thinking

Conditional probability, priors, posteriors, and maximum likelihood.

3 Intro to Statistical Machine Learning

Learn basic machine learning concepts and how statistics fits in.

After completing these 3 steps, you'll be ready to attack more difficult machine learning problems and common real-world applications of data science.

Step 1: Core Statistics Concepts

To know how to learn statistics for data science, it's helpful to start by looking at how it will be used.

Let's take a look at some examples of real analyses or applications you might need to implement as a data scientist:

Experimental design: Your company is rolling out a new product line, but it sells through offline retail stores. You need to design an A/B test that controls for differences across geographies. You also need to estimate how many stores to pilot in for statistically significant results.

Regression modeling: Your company needs to better predict the demand of individual product lines in its stores. Under-stocking and over-stocking are both expensive. You consider building a series of regularized regression models.

Data transformation: You have multiple machine learning model candidates you're testing. Several of them assume specific probability distributions of input data, and you need to be able to identify them and either transform the input data appropriately or know when underlying assumptions can be relaxed.
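As a rough illustration of the "how many stores" question in the experimental design example, here is a minimal sketch of a textbook sample-size approximation for comparing two group means. It assumes store-level outcomes are roughly normal with a known standard deviation, and it ignores the geographic controls a real design would need. The function name and the numbers are hypothetical.

```python
import math
from statistics import NormalDist

def stores_per_group(effect, sigma, alpha=0.05, power=0.8):
    """Approximate stores needed in EACH group (pilot and control).

    effect: smallest difference in mean outcome worth detecting
    sigma:  assumed standard deviation of the outcome across stores
    alpha:  two-sided significance level
    power:  probability of detecting the effect if it is real
    """
    z = NormalDist()                      # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)    # critical value for the test
    z_beta = z.inv_cdf(power)             # value matching the desired power
    n = 2 * ((z_alpha + z_beta) * sigma / effect) ** 2
    return math.ceil(n)

# Hypothetical inputs: detect a 500-unit lift when store sales vary by about 1000.
print(stores_per_group(effect=500, sigma=1000))
```

Halving the effect you want to detect roughly quadruples the number of stores, which is why deciding on the smallest effect that matters is part of the design work.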

A data scientist makes hundreds of decisions every day. They range from small ones like how to tune a model all the way up to big ones like the team's R&D strategy.

Many of these decisions require a strong foundation in statistics and probability theory.

For example, data scientists often need to decide which results are believable and which are likely due to randomness. Plus, they need to know if there are pockets of interest that should be explored further.

These are central skills in analytical decision making (knowing how to calculate p-values is only scratching the surface).
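One way to build intuition for "believable versus likely due to randomness" is a permutation test: shuffle the group labels many times and see how often chance alone produces a difference as large as the one observed. The sketch below is one simple version of that idea, not the only way to compute a p-value; the function name and the data are made up for illustration.

```python
import random

def permutation_p_value(a, b, n_iter=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in group means."""
    rng = random.Random(seed)             # fixed seed so runs are repeatable
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)               # pretend the group labels are arbitrary
        new_a, new_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(new_a) / len(new_a) - sum(new_b) / len(new_b))
        if diff >= observed:              # chance did at least as well as the data
            hits += 1
    return hits / n_iter

# Hypothetical measurements from two versions of a page.
version_a = [10, 11, 12, 13, 14, 15, 16, 17]
version_b = [1, 2, 3, 4, 5, 6, 7, 8]
print(permutation_p_value(version_a, version_b))
```

A small result means shuffled labels rarely reproduce the observed gap, so randomness alone is a poor explanation for it.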

Here's one of the best resources we've found for learning basic statistics as a self-starter:

Think like a statistician...

Think Stats is an excellent book (with free PDF version) introducing all the key concepts. The premise of the book? If you know how to program, then you can use that skill to teach yourself statistics. We've found this approach to be very effective, even for those with formal math backgrounds.

Step 2: Bayesian Thinking

One of the philosophical debates in statistics is between Bayesians and frequentists. The Bayesian side is more relevant when learning statistics for data science.

In a nutshell, frequentists use probability only to model sampling processes. This means they only assign probabilities to describe data they've already collected.

On the other hand, Bayesians use probability to model sampling processes and to quantify uncertainty before collecting data. If you'd like to learn more about this divide, check out this Quora post: For a non-expert, what's the difference between Bayesian and frequentist approaches?

In Bayesian thinking, the level of uncertainty before collecting data is called the prior probability. It's then updated to a posterior probability after data is collected. This is a central concept to many machine learning models, so it's important to master.
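The prior-to-posterior update can be sketched in a few lines. This toy example assumes two competing hypotheses about a coin (the names "fair" and "biased" and their probabilities are invented for illustration) and applies Bayes' rule each time a head is observed.

```python
def update(prior, likelihood):
    """Bayes' rule: posterior is proportional to prior times likelihood."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())    # probability of the observed data
    return {h: v / total for h, v in unnormalized.items()}

# Belief before seeing any data: both hypotheses equally plausible.
prior = {"fair": 0.5, "biased": 0.5}

# Probability of seeing heads under each hypothesis (assumed values).
p_heads = {"fair": 0.5, "biased": 0.8}

posterior = update(prior, p_heads)             # after one head
posterior_2 = update(posterior, p_heads)       # after a second head

print(posterior)
print(posterior_2)
```

Notice that yesterday's posterior becomes today's prior: each new observation shifts belief a little further toward the hypothesis that explains the data better.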

Again, all of these concepts will make sense once you implement them.

Here's one of the best resources we've found for learning Bayesian thinking as a self-starter:

Think like a Bayesian...

Think Bayes is the follow-up book (with free PDF version) of Think Stats. It's all about Bayesian thinking, and it uses the same approach of using programming to teach yourself statistics. This approach is fun and intuitive, and you'll learn each concept's underlying mechanics well since you'll be implementing them.

Step 3: Intro to Statistical Machine Learning

If you want to learn statistics for data science, there's no better way than playing with statistical machine learning models after you've learned core concepts and Bayesian thinking.

The statistics and machine learning fields are closely linked, and "statistical" machine learning is the main approach to modern machine learning.

In this step, you'll be implementing a few machine learning models from scratch. This will help you unlock true understanding of their underlying mechanics.

At this stage, it's fine if you're just copying code, line-by-line.

This helps you break open the black box of machine learning while solidifying your understanding of the applied statistics required for data science.

The following models were chosen because they illustrate several of the key concepts from earlier.

Linear Regression

First, we have the poster child of predictive modeling...

Linear Regression from Scratch in Python
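As a taste of what "from scratch" means here, the sketch below fits a one-variable least-squares line using only the closed-form formulas for slope and intercept. It is our own minimal illustration, not the code from the linked tutorial, and the data points are invented.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, relative to how much x varies.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance_x
    # Intercept: forces the line through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(intercept, slope, x):
    return intercept + slope * x

# Hypothetical data that lies exactly on y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

intercept, slope = fit_line(xs, ys)
print(intercept, slope)
print(predict(intercept, slope, 10))
```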

Naive Bayes Classifier

Next, we have an embarrassingly simple model that works pretty darn well...

Intuitive Introduction, Naive Bayes from Scratch in Python
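To show how little machinery the model needs, here is a minimal sketch of a word-count Naive Bayes text classifier with add-one smoothing. It is our own toy version under simplifying assumptions (tiny made-up training set, words treated as independent given the label), not the code from the linked tutorials.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs is a list of (list_of_words, label) pairs."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)    # label -> word -> count
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def classify(model, words):
    """Return the label with the highest log posterior score."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)            # log prior
        total_words = sum(word_counts[label].values())
        for w in words:
            # Add-one smoothing keeps unseen words from zeroing out the score.
            count = word_counts[label][w] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data.
docs = [
    (["free", "prize", "now"], "spam"),
    (["win", "free", "cash"], "spam"),
    (["meeting", "at", "noon"], "ham"),
    (["lunch", "at", "noon"], "ham"),
]
model = train(docs)
print(classify(model, ["free", "cash"]))
print(classify(model, ["noon", "meeting"]))
```

The "naive" part is the inner loop: each word contributes to the score independently, which is rarely true of real text but often works well enough in practice.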

Multi-Armed Bandits

The multi-armed bandit model addresses a class of problems in which a decision-maker faces several strategic options, and the outcome of each option is known only after it has been chosen.

And finally, we have the famous "20 lines of code that beat any A/B test!"

Intuitive Introduction, Multi-Armed Bandits from Scratch in Python
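One common bandit strategy is epsilon-greedy: most of the time play the option that looks best so far, and occasionally try a random one. The sketch below simulates it against made-up conversion rates; the function name, the rates, and the parameter values are all assumptions for illustration, and this is not necessarily the exact code the quote above refers to.

```python
import random

def epsilon_greedy(true_rates, n_rounds=5000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy bandit; returns (pulls, wins) per arm."""
    rng = random.Random(seed)             # fixed seed so runs are repeatable
    n_arms = len(true_rates)
    pulls = [0] * n_arms
    wins = [0] * n_arms

    def estimate(i):
        # Observed success rate so far; untried arms start at 0.0.
        return wins[i] / pulls[i] if pulls[i] else 0.0

    for _ in range(n_rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)              # explore: random arm
        else:
            arm = max(range(n_arms), key=estimate)   # exploit: best so far
        reward = 1 if rng.random() < true_rates[arm] else 0
        pulls[arm] += 1
        wins[arm] += reward
    return pulls, wins

# Hypothetical conversion rates for two page variants.
pulls, wins = epsilon_greedy([0.05, 0.5])
print(pulls, wins)
```

Unlike a fixed A/B split, the simulation shifts traffic toward the better-performing option while it is still learning, at the cost of a small, constant amount of exploration.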

If you're hungry for more, we recommend the following resource. We'll also be coming out with a detailed guide for learning machine learning the self-starter way, so stay tuned.

For your reference...

Introduction to Statistical Machine Learning is a wonderful textbook (with free PDF version) that you can use as a reference. The examples are in R, and the book covers a much broader range of topics, making this a valuable tool as you progress into more work in machine learning.

Original article: https://elitedatascience.com/learn-statistics-for-data-science

More Resources

How to Learn Math for Data Science, The Self-Starter Way

6 Fun Machine Learning Projects for Beginners

Supercharge Your Data Science Career: 88 Free Resources
