Regression

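1. Continuous output

This lesson covers supervised learning with continuous outputs.
In our previous setup, the output variable was constrained to binary values.
Such an output is called discrete.
In many learning problems, however, the output can also be continuous.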


2. Continuous

In continuous supervised learning, "continuous" refers to the output.

3. Age: continuous or discrete?

Continuous output / discrete output

4. Weather: continuous or discrete?

Most things we treat as discrete are in fact continuous to some degree.

7. Income: continuous or discrete?

When we treat a variable as continuous, we imply that it has an ordering, i.e., its values can be compared.

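8. Continuous features

Classification usually means discrete output. In our terrain classification problem, the output variable was fast/slow. To generalize it to a continuous output, the best choice is the speed in miles per hour.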

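9. Slope and intercept (the linear regression equation)

target variable / variable we are trying to predict / output = slope * input variable + intercept
slope: defines how steeply the line rises
a larger slope makes it rise faster
with a negative slope, the line goes down
intercept: the coordinate of the point where the line crosses the vertical axis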
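
To make the equation concrete, here is a minimal sketch of evaluating a line for a given slope and intercept. The function name `predict` and all the numbers are assumptions chosen only for illustration.

```python
def predict(slope, intercept, x):
    """Return the output of the line y = slope * x + intercept."""
    return slope * x + intercept

# Hypothetical numbers, chosen only for illustration.
print(predict(6.25, 0.0, 27))   # output for input 27 with slope 6.25
print(predict(2.0, 1.0, 0))     # at x = 0 the output equals the intercept
print(predict(-1.5, 10.0, 4))   # with a negative slope the output falls as x grows
```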

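17. Coding up linear regression

from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit([[0,0],[1,1],[2,2]],[0,1,2])
clf.coef_                                 # read the coefficients (slopes)

Data set (age vs. net worth) → split into a training set and a test set
Noise is added to the data so that the relationship is not perfect

Fit a line to the training set. That line is the regression result, and it can then predict the net worth of anyone aged between 25 and 60.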
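
The data preparation described above can be sketched without any libraries. The slope of 6.25, the noise scale of 40 and the age range 20 to 65 mirror the ages_net_worths.py listing later in these notes; the helper names and the 25% test fraction are assumptions made for this sketch, not the course's own code.

```python
import random

def make_age_net_worth_data(n=100, slope=6.25, noise_scale=40.0, seed=42):
    """Generate (age, net worth) pairs that follow a noisy straight line."""
    rng = random.Random(seed)
    ages = [rng.randint(20, 65) for _ in range(n)]
    net_worths = [slope * age + rng.gauss(0.0, noise_scale) for age in ages]
    return ages, net_worths

def split_train_test(xs, ys, test_fraction=0.25, seed=42):
    """Shuffle the indices, then hold out a fraction of the points for testing."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(xs) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    xs_train = [xs[i] for i in train_idx]
    xs_test = [xs[i] for i in test_idx]
    ys_train = [ys[i] for i in train_idx]
    ys_test = [ys[i] for i in test_idx]
    return xs_train, xs_test, ys_train, ys_test

ages, net_worths = make_age_net_worth_data()
ages_train, ages_test, nw_train, nw_test = split_train_test(ages, net_worths)
print(len(ages_train), len(ages_test))  # 75 training points, 25 test points
```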

18. Age/net worth regression in sklearn

#!/usr/bin/python   studentMain.py

import numpy
import matplotlib
matplotlib.use('agg')

import matplotlib.pyplot as plt
from studentRegression import studentReg
from class_vis import prettyPicture, output_image

from ages_net_worths import ageNetWorthData

ages_train, ages_test, net_worths_train, net_worths_test = ageNetWorthData()



reg = studentReg(ages_train, net_worths_train)


plt.clf()
plt.scatter(ages_train, net_worths_train, color="b", label="train data")
plt.scatter(ages_test, net_worths_test, color="r", label="test data")
plt.plot(ages_test, reg.predict(ages_test), color="black")
plt.legend(loc=2)
plt.xlabel("ages")
plt.ylabel("net worths")


plt.savefig("test.png")
output_image("test.png", "png", open("test.png", "rb").read())

#!/usr/bin/python   studentRegression.py
def studentReg(ages_train, net_worths_train):
    ### import the sklearn regression module, create, and train your regression
    ### name your regression reg
    
    ### your code goes here!
    
    from sklearn import linear_model
    reg = linear_model.LinearRegression()
    reg.fit(ages_train,net_worths_train)
    
    return reg

19. Extracting information from sklearn

print "katie's net worth prediction: ", reg.predict([[27]])  # the prediction (input must be a 2-D list)
print "slope:", reg.coef_                    # get the slope
print "intercept:" ,reg.intercept_              # get the intercept

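20. Extracting score data from sklearn

Metrics for evaluating a regression: r² and the sum of squared errors
r²: the higher, the better the regression performs; the maximum is 1

print "\n ######## stats on test dataset ########\n"
print "r-squared score: ",reg.score(ages_test,net_worths_test)  # scoring on the test set can reveal problems such as overfitting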

print "\n ######## stats on training dataset ########\n"
print "r-squared score: ",reg.score(ages_train,net_worths_train)

21. Now you practice extracting information

#!/usr/bin/python   regressionQuiz.py
import numpy
import matplotlib.pyplot as plt

from ages_net_worths import ageNetWorthData

ages_train, ages_test, net_worths_train, net_worths_test = ageNetWorthData()



from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(ages_train, net_worths_train)

### get Katie's net worth (she's 27)
### sklearn predictions are returned in an array, so you'll want to index into
### the output to get what you want, e.g. net_worth = predict([[27]])[0][0] (not
### exact syntax, the point is the [0] at the end). In addition, make sure the
### argument to your prediction function is in the expected format - if you get
### a warning about needing a 2d array for your data, a list of lists will be
### interpreted by sklearn as such (e.g. [[27]]).
km_net_worth = reg.predict([[27]])[0][0] ### index into the 2-D result to get a plain number

### get the slope
### again, you'll get a 2-D array, so stick the [0][0] at the end
slope = reg.coef_[0][0]

### get the intercept
### here you get a 1-D array, so stick [0] on the end to access
### the info we want
intercept = reg.intercept_[0]


### get the score on test data
test_score = reg.score(ages_test,net_worths_test) ### fill in the line of code to get the right value


### get the score on the training data
training_score = reg.score(ages_train,net_worths_train) ### fill in the line of code to get the right value



def submitFit():
    # all of the values in the returned dictionary are expected to be
    # numbers for the purpose of the grader.
    return {"networth":km_net_worth,
            "slope":slope,
            "intercept":intercept,
            "stats on test":test_score,
            "stats on training": training_score}

#!/usr/bin/python   ages_net_worths.py
import numpy
import random

def ageNetWorthData():

    random.seed(42)
    numpy.random.seed(42)

    ages = []
    for ii in range(100):
        ages.append( random.randint(20,65) )
    net_worths = [ii * 6.25 + numpy.random.normal(scale=40.) for ii in ages]
    ### need to massage the lists into 2d numpy arrays to get them to work in LinearRegression
    ages       = numpy.reshape( numpy.array(ages), (len(ages), 1))
    net_worths = numpy.reshape( numpy.array(net_worths), (len(net_worths), 1))

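    # note: scikit-learn 0.18+ provides train_test_split in sklearn.model_selection;
    # sklearn.cross_validation was removed in 0.20
    from sklearn.cross_validation import train_test_split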
    ages_train, ages_test, net_worths_train, net_worths_test = train_test_split(ages, net_worths)

    return ages_train, ages_test, net_worths_train, net_worths_test

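22. Linear regression errors

Evaluating a linear regression:

Visualize: plot the regression result over the scatter plot
Look at the errors produced by the linear regression
A linear fit has errors associated with it
error = actual value - predicted value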
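
As a small illustration of the error definition above, the sketch below computes one error per data point. The function name `residuals` and the numbers are hypothetical.

```python
def residuals(actual, predicted):
    """Error for each point: actual value minus predicted value."""
    return [a - p for a, p in zip(actual, predicted)]

# Hypothetical actual net worths and the values a regression line predicted.
actual = [100.0, 150.0, 210.0]
predicted = [110.0, 150.0, 200.0]
errors = residuals(actual, predicted)
print(errors)  # [-10.0, 0.0, 10.0]
```

A negative error means the line predicted too high a value for that point, and a positive error means it predicted too low.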


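24. Errors and the quality of a fit

A good fit minimizes either of the following two quantities:

the sum of the absolute values of all errors
the sum of the squares of all errors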

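25. Minimizing the sum of squared errors

When performing a linear regression, the goal is to reduce the sum of squared errors as far as possible.
In other words, the best regression is the one that minimizes the sum of squared errors.
So what we need to do is find the m and b that make the sum of squared errors smallest.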
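
For a single input variable, the m and b that minimize the sum of squared errors have a well-known closed form: m = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and b = ȳ - m·x̄. The sketch below applies it to a tiny made-up data set; the helper names `fit_line` and `sse` are assumptions, and this is not how the course code computes the fit (it relies on sklearn).

```python
def fit_line(xs, ys):
    """Return the (m, b) that minimize the sum of squared errors for y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b

def sse(xs, ys, m, b):
    """Sum of squared errors of the line y = m*x + b on the given points."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data that roughly follows y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
m, b = fit_line(xs, ys)
print(m, b)
print(sse(xs, ys, m, b))        # the smallest achievable sum of squared errors
print(sse(xs, ys, m + 0.1, b))  # nudging the slope makes it larger
```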


26. Algorithms for minimizing the sum of squared errors

* ordinary least squares (OLS)
used in sklearn LinearRegression
* gradient descent
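
To give a feel for the second algorithm, here is a minimal gradient descent sketch for a single-feature line. It minimizes the mean of the squared errors, which has the same minimizer as the sum. The learning rate, step count and data are assumptions picked so that this toy example converges; they are not values from the course.

```python
def gradient_descent_fit(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = m*x + b by repeatedly stepping against the gradient of the mean squared error."""
    m, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of (1/n) * sum((y - (m*x + b))**2) with respect to m and b.
        grad_m = sum(-2.0 * x * (y - (m * x + b)) for x, y in zip(xs, ys)) / n
        grad_b = sum(-2.0 * (y - (m * x + b)) for x, y in zip(xs, ys)) / n
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b

# Hypothetical noise-free data on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
m, b = gradient_descent_fit(xs, ys)
print(round(m, 6), round(b, 6))  # approaches slope 2 and intercept 1
```

Too large a learning rate makes the updates diverge, which is one reason a direct method such as OLS is convenient for small problems.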

28. The problem with minimizing absolute errors

There can be multiple lines that minimize Σ|error|, but only one line minimizes Σerror².
Using the sum of squared errors also makes the implementation much easier: the regression line is easier to find.
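
A tiny constructed example shows the non-uniqueness. For the four hypothetical points below, no line can have a sum of absolute errors smaller than 2 (each pair of points sharing an x value contributes at least 1), and the three different candidate lines all reach exactly 2. Their sums of squared errors differ, so that criterion singles out one of them.

```python
# Hypothetical points: two at x = 0 and two at x = 1.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]

def sum_abs_error(m, b):
    """Sum of |actual - predicted| for the line y = m*x + b."""
    return sum(abs(y - (m * x + b)) for x, y in points)

def sum_sq_error(m, b):
    """Sum of (actual - predicted)**2 for the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in points)

# Three different lines, given as (slope, intercept).
candidates = [(0.0, 0.5), (1.0, 0.0), (0.5, 0.25)]
for m, b in candidates:
    print(m, b, sum_abs_error(m, b), sum_sq_error(m, b))
```

All three lines tie on the absolute-error criterion, while only the flat line y = 0.5 has the smallest squared-error sum among them.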

29. Evaluating a regression by eye

Which regression fits the data set better?

image.png

Both lines describe the plot and the data reasonably well.

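30. The problem with SSE

image.png

Both linear regressions in the figure above fit the data well, and there is little difference between the two fits.
However, the fit on the right produces a larger sum of squared errors.
Generally speaking, a larger sum of squared errors means a worse fit.
This is a shortcoming of the sum of squared errors: the more data you add, the larger it almost certainly becomes, even though the fit is no worse. SSE is biased by the number of data points used, even when the fit itself is fine. The next section introduces another metric for evaluating a regression: r².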
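
The dependence on data set size is easy to reproduce. In the sketch below, a fixed hypothetical line is scored on a small data set and then on the same data repeated twice: the line describes both equally well, yet the sum of squared errors doubles.

```python
def sse(xs, ys, m, b):
    """Sum of squared errors of the line y = m*x + b on the given points."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data and a fixed line y = 2x used for both comparisons.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.5, 3.5, 6.5, 7.5]
m, b = 2.0, 0.0

sse_small = sse(xs, ys, m, b)
sse_large = sse(xs + xs, ys + ys, m, b)  # the same points, included twice
print(sse_small, sse_large)  # 1.0 2.0
```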

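31. The R-squared metric for regression

r² of a regression
Describes the goodness of fit of a linear regression
-- it is a number that effectively answers the question:
how much of the change in my output (y) is explained by the change in my input (x)?
-- 0.0 ≤ r² ≤ 1.0 for a least-squares fit scored on its own training data; on other data, such as a test set, the score can be negative
-- a small number means that your regression line isn't doing a good job of capturing the trend in the data
-- a large number, close to 1, means that your regression line is doing a good job of describing the relationship between your input (x) and your output (y)
-- its advantage is that it is independent of the number of training points, i.e., it is not affected by the size of the data set
-- it is somewhat more reliable than the sum of squared errors, especially when the number of points in the data set may change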
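
The usual definition of the score is r² = 1 - (sum of squared errors) / (sum of squared deviations of y from its mean). The sketch below implements that formula by hand, with a hypothetical helper name `r_squared`, and checks that repeating the data set leaves the value unchanged.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_y) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical data and predictions from the fixed line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.5, 3.5, 6.5, 7.5]
predictions = [2.0 * x for x in xs]

r2_small = r_squared(ys, predictions)
r2_large = r_squared(ys + ys, predictions + predictions)  # same points, twice
print(r2_small, r2_large)  # identical: r² does not grow with the number of points
```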

32. r² in sklearn

If we can incorporate information from other features, we can make better predictions, i.e., achieve a higher r².

from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(ages_train,net_worths_train)
print "katie's net worth prediction:" , reg.predict([[27]])
print "r-squared score:",reg.score(ages_test,net_worths_test)
print "slope:",reg.coef_
print "intercept:",reg.intercept_

33. Visualizing the regression

plt.scatter(ages,net_worths)
plt.plot(ages,reg.predict(ages),color='blue',linewidth=3)
plt.xlabel('ages')
plt.ylabel('net_worths')
plt.show()

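34. What data is suitable for linear regression

Being able to use linear regression means that you can write y = mx + b and that this equation describes the trend in the data well.

image.png

For the parabola in the lower right, a nonlinear relationship can be fit by using a feature transformation. For example, adding a squared x term as a feature gives polynomial regression, which can be viewed as a special case of multivariate linear regression.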
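
A simplified sketch of the feature-transformation idea: the hypothetical data below follows y = 2x² + 1, so a straight line in x explains none of the variation, while a straight line in the transformed feature x² fits exactly. For brevity this uses only the squared term; full polynomial regression would keep both x and x² as features. The helper names are assumptions.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    return m, mean_y - m * mean_x

def r_squared(xs, ys, m, b):
    """r² of the line y = m*x + b on the given points."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Hypothetical parabola-shaped data.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [2.0 * x * x + 1.0 for x in xs]

m_raw, b_raw = fit_line(xs, ys)            # regress y on x
squared = [x * x for x in xs]              # transformed feature
m_sq, b_sq = fit_line(squared, ys)         # regress y on x squared

print(r_squared(xs, ys, m_raw, b_raw))     # a line in x captures nothing here
print(m_sq, b_sq, r_squared(squared, ys, m_sq, b_sq))
```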

35. Comparing classification and regression


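Output type
Supervised classification: discrete, in the form of class labels
Regression: continuous
What you are really trying to find:
Supervised classification:
-- a decision boundary
-- a point is assigned a class label according to its position relative to the decision boundary
-- a boundary that describes the data
Regression:
-- a best-fit line
-- a line that fits the data
Evaluation
Supervised classification:
-- accuracy
Regression:
-- sum of squared errors
-- r squared (r²)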

36. Multivariate regression / multi-variable regression

Many different input variables are used to predict the output.
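
With several input variables, the model becomes y = m1*x1 + m2*x2 + ... + b, with one slope per feature. The sketch below solves a two-feature case with NumPy's least-squares routine rather than sklearn, purely to keep the example small; the data is made up and follows y = 3*x1 + 2*x2 + 5 exactly.

```python
import numpy as np

# Hypothetical data: two input features per row.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 5.0

# Append a column of ones so that the intercept is learned as an extra coefficient.
A = np.column_stack([X, np.ones(len(X))])
coeffs, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
slopes, intercept = coeffs[:2], coeffs[2]

print(slopes)     # one slope per input feature
print(intercept)
```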


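38. Regression mini-project introduction

Predict bonus from salary
Outlier: a point that lies far outside the pattern
How outliers affect the result of a regression

In this project, you will use regression to predict financial data for Enron employees and associates. Once you know some financial data about an employee, such as their salary, what would you predict for the size of their bonus?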

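40. Bonus target and features

Run the starter code found in regression/finance_regression.py. This draws a scatterplot with all the data points. What target are you trying to predict? What is the input feature used to predict it?
Mentally sketch out the regression line that you roughly predict (better yet, print the scatterplot and sketch it with pen and paper).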


input: salary
output: bonus

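41. Visualizing regression data

As in classification, you need training and test data in regression. This has already been set up in the starter code. Change the value of test_color from "b" to "r" (for "red"), then rerun.
Note: for students converting the Python 2 code to Python 3, see the important compatibility note below.
You will fit the regression using only the blue (training) points. (You may have noticed that we put 50% of the data into the test set rather than the standard 10%. That is because in Part 5 we will swap the training and test sets, and an even split makes this simpler.)

Starting with Python 3.3, the ordering of dictionary keys changed: keys are ordered randomly each time the code runs. This causes compatibility problems with the grader, which works in a Python 2.7 environment. To avoid the problem, add the following argument to the featureFormat call on line 26 of finance_regression.py:
sort_keys = '../tools/python2_lesson06_keys.pkl'
This opens a data file in the tools folder that holds the Python 2 key ordering.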

42. Extracting the slope and intercept

from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(feature_train,target_train)
print reg.coef_          #5.448
print reg.intercept_    #-102360.54

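43. Regression score: training data

Suppose you are a not-so-savvy machine learner: instead of testing on the test set, you test on the same data you trained on, comparing the regression predictions with the target values (e.g., bonus) in the training data.
What score do you find? You may not have an intuition yet for what a "good" score is; this score isn't very good (but it could be a lot worse).

print reg.score(feature_train,target_train) #0.0455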

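44. Regression score: test data

Now compute the score for your regression on the test data.
What is the score on the test data? If you had mistakenly evaluated only on the training data, would you have overestimated or underestimated the performance of your regression?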

print reg.score(feature_test,target_test)   #-1.48
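
A negative score such as the one above is possible because r² on held-out data compares the regression with simply predicting the mean of that data. The sketch below uses made-up numbers (not the Enron data) and hypothetical helper names: the line fits its training points well but does worse than the mean on a test set that does not follow the same trend.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    return m, mean_y - m * mean_x

def score(xs, ys, m, b):
    """r² of the line y = m*x + b on the given points."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Hypothetical training data with a clear upward trend.
x_train, y_train = [1.0, 2.0, 3.0, 4.0], [3.1, 4.9, 7.2, 8.8]
# Hypothetical test data that does not continue that trend.
x_test, y_test = [5.0, 6.0, 7.0], [11.0, 6.0, 9.0]

m, b = fit_line(x_train, y_train)
train_score = score(x_train, y_train, m, b)
test_score = score(x_test, y_test, m, b)
print(train_score)  # close to 1
print(test_score)   # below 0: worse than predicting the test mean
```

Scoring only on the training data here would greatly overestimate how well the regression generalizes.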

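45. Regressing bonus against LTI

There are many financial features available, some of which may be more powerful than others for predicting a person's bonus. For example, suppose you have thought about the data and guess that the "long_term_incentive" feature (which is supposed to reward employees who contribute to the long-term health of the company) may be more closely related to bonus than salary is.
One way to test this hypothesis is to regress bonus against long-term incentive and see whether that regression is significantly better than regressing bonus against salary. Regress bonus against long-term incentive: what is the score on the test data?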

import sys
import pickle
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit
dictionary = pickle.load( open("../final_project/final_project_dataset_modified.pkl", "r") )

### list the features you want to look at--first item in the 
### list will be the "target" feature
features_list = ["bonus", "long_term_incentive"]
data = featureFormat( dictionary, features_list, remove_any_zeroes=True)
target, features = targetFeatureSplit( data )

### training-testing split needed in regression, just like classification
from sklearn.cross_validation import train_test_split
feature_train, feature_test, target_train, target_test = train_test_split(features, target, test_size=0.5, random_state=42)
train_color = "b"
test_color = "r"



### Your regression goes here!
### Please name it reg, so that the plotting code below picks it up and 
### plots it correctly. Don't forget to change the test_color above from "b" to
### "r" to differentiate training points from test points.
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(feature_train,target_train)
print reg.coef_
print reg.intercept_
print reg.score(feature_test,target_test)    #-0.59

### draw the scatterplot, with color-coded training and testing points
import matplotlib.pyplot as plt
for feature, target in zip(feature_test, target_test):
    plt.scatter( feature, target, color=test_color ) 
for feature, target in zip(feature_train, target_train):
    plt.scatter( feature, target, color=train_color ) 

### labels for the legend
plt.scatter(feature_test[0], target_test[0], color=test_color, label="test")
plt.scatter(feature_test[0], target_test[0], color=train_color, label="train")

### draw the regression line, once it's coded
try:
    plt.plot( feature_test, reg.predict(feature_test) )
except NameError:
    pass
plt.xlabel(features_list[1])
plt.ylabel(features_list[0])
plt.legend()
plt.show()

46. Salary vs. LTI for predicting bonus

If you had to predict someone's bonus, would you use their salary or their long-term incentive?
long_term_incentive

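47. Outliers break regressions

This is a preview of the next lesson, which covers identifying and removing outliers. Go back to the earlier setup, where you used salary to predict bonus, and rerun the code to look at the data again. You may notice that a few data points fall outside the main trend: someone with a high salary (over $1 million!) who received a relatively small bonus. This is an example of an outlier, and the next lesson focuses on them.
A point like this can have a big effect on a regression: if it falls in the training set, it can significantly change the slope and intercept. If it falls in the test set, it can make the score much lower than it would otherwise be. As things stand, this point falls in the test set (and most likely lowers the score). Let's rework things to see what happens when it falls in the training set instead. Near the bottom of finance_regression.py, just before plt.xlabel(features_list[1]), add these two lines:
reg.fit(feature_test, target_test)
plt.plot(feature_train, reg.predict(feature_train), color="b")
Now two regression lines are drawn: one fit on the test data (with the outlier) and one fit on the training data (without it). Look at the plot now: quite a difference, right? A single outlier can make a big difference.
What is the slope of the new regression line?
(You will find the difference is large, and much of it is caused by the outlier. The next lesson covers outliers in detail so that you have tools to detect and handle them.)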

reg.fit(feature_test, target_test)
print reg.coef_   #2.27
plt.plot(feature_train, reg.predict(feature_train), color="b")
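
The size of the effect can be reproduced on a toy data set. The numbers below are made up and are not the Enron data: five points lie exactly on y = 2x, and one added point with a large x and a low y plays the role of the outlier. The helper name `fit_line` is an assumption.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    return m, mean_y - m * mean_x

# Hypothetical clean data on the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]
slope_clean, _ = fit_line(xs, ys)

# The same data plus one outlier: a large input with a very small output.
slope_outlier, _ = fit_line(xs + [10.0], ys + [0.0])

print(slope_clean)    # slope of the clean fit
print(slope_outlier)  # a single outlier drags the slope below zero
```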
