Decision Trees

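One of the earliest and most widely used algorithms (it determines decision boundaries)
Very robust and very intuitive: you can look at the result and understand what it means

Recall that with SVMs the kernel parameter can turn a linear decision surface into a non-linear one
A decision tree lets you make non-linear decisions using a series of simple linear decision surfaces

A decision tree asks multiple linear questions one after another, so a simple tree data structure is enough to classify the data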


The decision tree learning algorithm uses the computer to find the decision boundaries automatically from the data.
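As a rough illustration of how a tree asks one linear question after another, here is a hand-written two-level tree. The feature names (grade, bumpiness), the thresholds and the labels are made up for this sketch; a learned tree would choose its own questions from the data.

```python
def classify_terrain(grade, bumpiness):
    """Hand-written two-level decision tree (hypothetical thresholds)."""
    # First question: a single threshold on one feature
    if grade > 0.5:
        return "slow"
    # Second question, asked only in the region that is left
    if bumpiness > 0.7:
        return "slow"
    return "fast"


print(classify_terrain(0.9, 0.1))
print(classify_terrain(0.2, 0.9))
print(classify_terrain(0.2, 0.1))
```

Each question is a straight, axis-aligned cut, but the combined "slow" region is L-shaped, which no single straight line could produce.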

Decision trees can be used for:
classification and regression

Coding a decision tree

from sklearn import tree
X=[[0,0],[1,1]]
Y=[0,1]
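clf = tree.DecisionTreeClassifier()   # create the classifier
clf = clf.fit(X,Y)               # fit the classifier to the training data
clf.predict([[2,2]])            # use the fitted classifier to predict new data points, which usually come from the test set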

def classify(features_train, labels_train):
    
    ### your code goes here--should return a trained decision tree classifier

    from sklearn import tree
    clf = tree.DecisionTreeClassifier()
    clf.fit(features_train,labels_train)
    
    return clf
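Since part of the appeal of a decision tree is that you can read the result, it may help to print the fitted rules. This is a minimal sketch on the same toy data, assuming a scikit-learn version that provides tree.export_text; the feature names x0 and x1 are placeholders chosen for the example.

```python
from sklearn import tree

X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = tree.DecisionTreeClassifier(random_state=0).fit(X, Y)

# export_text returns the learned rules as an indented plain-text tree
rules = tree.export_text(clf, feature_names=["x0", "x1"])
print(rules)
```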

Decision tree accuracy

import sys
from class_vis import prettyPicture
from prep_terrain_data import makeTerrainData

import numpy as np
import pylab as pl

features_train, labels_train, features_test, labels_test = makeTerrainData()


#### your code goes here
from sklearn import tree
from sklearn.metrics import accuracy_score

clf = tree.DecisionTreeClassifier()
clf.fit(features_train,labels_train)
pred = clf.predict(features_test)

acc = accuracy_score(labels_test, pred)   # accuracy_score expects (y_true, y_pred)
### be sure to compute the accuracy on the test set
    
def submitAccuracies():
  return {"acc":round(acc,3)}

Decision tree parameters
Tune the decision tree's parameters to improve its accuracy
Can adjusting min_samples_split (the minimum number of samples required to split a node) reduce overfitting?
Say I have a decision tree. I start out with a bunch of training examples and split them into smaller and smaller sub-samples. At some point I have to decide whether to keep splitting.
The question, for each node in the bottom layer of the tree, is whether I am allowed to keep splitting it when that looks like it might be a good idea.
That is what min_samples_split governs: whether there are enough samples available in a node for me to split it further.
Default: min_samples_split=2
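To see what this parameter does to the tree itself, one option is to compare how many nodes get built. The sketch below uses synthetic data from make_moons as a stand-in for the course's terrain data, so the numbers it prints are not the ones from the exercise.

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy two-class data (an assumption; not the terrain data)
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)

sizes = {}
for mss in (2, 50):
    clf = DecisionTreeClassifier(min_samples_split=mss, random_state=0)
    clf.fit(X, y)
    # tree_.node_count is the total number of nodes in the fitted tree
    sizes[mss] = clf.tree_.node_count
    print(mss, sizes[mss])
```

A larger min_samples_split stops small nodes from being split any further, so the tree stays smaller and its boundary is less jagged.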

Decision tree accuracy after adjusting min_samples_split

import sys
from class_vis import prettyPicture
from prep_terrain_data import makeTerrainData

import matplotlib.pyplot as plt
import numpy as np
import pylab as pl

features_train, labels_train, features_test, labels_test = makeTerrainData()



################ DECISION TREE ##############


### your code goes here--now create 2 decision tree classifiers,
### one with min_samples_split=2 and one with min_samples_split=50
### compute the accuracies on the testing data and store
### the accuracy numbers to acc_min_samples_split_2 and
### acc_min_samples_split_50, respectively


from sklearn import tree
from sklearn.metrics import accuracy_score
clf_2 = tree.DecisionTreeClassifier(min_samples_split=2)
clf_2.fit(features_train,labels_train)
pred_2 = clf_2.predict(features_test)
acc_min_samples_split_2 = accuracy_score(labels_test,pred_2)    #90.8%

clf_50 = tree.DecisionTreeClassifier(min_samples_split=50)
clf_50.fit(features_train,labels_train)
pred_50 = clf_50.predict(features_test)
acc_min_samples_split_50 = accuracy_score(labels_test,pred_50)  #91.2%




def submitAccuracies():
  return {"acc_min_samples_split_2":round(acc_min_samples_split_2,3),
          "acc_min_samples_split_50":round(acc_min_samples_split_50,3)}

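min_samples_split affects not only how the decision boundary looks but also how well the classifier performs
We can tune parameters to optimize the algorithm's performance

Impurity and entropy
entropy
-- controls how a decision tree decides where to split the data
-- definition: a measure of impurity in a bunch of examples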


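Building a decision tree comes down to finding variables, and split points along those variables, that produce subsets that are as pure as possible
The decision tree makes its decisions by repeating this process recursively

Entropy formula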

entropy = -Σ_i p_i * log2(p_i), where p_i is the fraction of examples in class i

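Some sources use other logarithm bases (for example base 10, or the natural logarithm with base about 2.72). These details change the maximum entropy value you can get. In our case (2 classes), the base-2 formula we use has a maximum value of 1.
In practice you rarely need to deal with the log base when using decision trees. The takeaway is that lower entropy means more ordered data, and the decision tree uses this as its way of classifying examples.

Entropy is basically the opposite of purity.
At one extreme, all the examples are of the same class; in that case entropy = 0.
At the other extreme, the examples are evenly split between all the available classes; in that case (with two classes and log base 2) entropy = 1.

Calculating entropy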


How many examples do I have in total in this node? The node is ssff (slow, slow, fast, fast), so 4
p_slow = fraction of slow examples = 0.5
p_fast = fraction of fast examples = 0.5

Computing entropy with Python

import math
result = -0.5*math.log(0.5,2)-0.5*math.log(0.5,2)
print(result)

The maximum entropy we can get is 1
which means this is the least pure sample we can have
If we have two class labels, the most impure situation is the one where the examples are evenly split between the two class labels
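The hand calculation can be generalized into a small helper that takes the labels in a node. This is a sketch assuming Python 3 (for math.log2 and true division); the function name entropy is our own choice.

```python
import math
from collections import Counter


def entropy(labels):
    """Base-2 entropy of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # Sum of -p * log2(p) over the classes present in the list
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


print(entropy(["slow", "slow", "fast", "fast"]))  # evenly split node
print(entropy(["fast", "fast"]))                  # pure node
```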

Information gain
How entropy actually affects where a decision tree draws its boundaries

information gain
information gain = entropy(parent)-[weighted average]entropy(children)

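The decision tree algorithm maximizes information gain
This is how it chooses which feature to split on. If a feature can take several different values, this also helps it work out where to split; it tries to increase information gain as much as possible.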

calculate information gain


All of the examples in this node belong to the same class: everything in this node is fast, so the entropy is 0


Use scipy to compute the entropy needed above

import scipy.stats
print(scipy.stats.entropy([2,1],base=2))

What's the information gain if I were to split based on grade?

information gain = 1 - (3/4)(0.9183) - (1/4)(0) ≈ 0.3113

What's the information gain if we split based on bumpiness? -- 0
We don't learn anything by splitting these particular training examples based on the bumpiness of the terrain
So if we were building a decision tree, we would most likely not choose to split our samples here

What's the information gain when we split based on speed limit? -- 1
When we split based on speed limit, the branches we get are perfectly pure. We start out with an entropy of 1 and end with an entropy of 0, so the information gain is 1. This is the best information gain we can have, so this is definitely where we want to split.
When a decision tree is trained, it is carrying out exactly this information gain calculation: it considers all the samples and all the available features, then uses the "maximize information gain" criterion to decide which variable to split on and how.
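The three splits can be checked with a short script. The four-row table below is a hypothetical reconstruction, chosen only so that it agrees with the numbers in these notes (a [2, 1] node and a pure node for grade, gain 0 for bumpiness, gain 1 for speed limit); the helper names are our own, and Python 3 is assumed.

```python
import math
from collections import Counter


def entropy(labels):
    """Base-2 entropy of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """entropy(parent) minus the weighted average entropy of the children."""
    total = len(labels)
    children = {}
    for value, label in zip(feature_values, labels):
        children.setdefault(value, []).append(label)
    weighted = sum(len(c) / total * entropy(c) for c in children.values())
    return entropy(labels) - weighted


# Hypothetical four-example table (not taken from the original figure)
speed       = ["slow", "slow", "fast", "fast"]
grade       = ["steep", "steep", "flat", "steep"]
bumpiness   = ["bumpy", "smooth", "bumpy", "smooth"]
speed_limit = ["yes", "yes", "no", "no"]

for name, column in [("grade", grade), ("bumpiness", bumpiness),
                     ("speed limit", speed_limit)]:
    print(name, round(information_gain(column, speed), 4))
```

A tree built greedily on this table would split on speed limit first, since that column has the largest gain.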

Tuning the criterion parameter
criterion='gini' is the default
gini is another, similar measure of impurity
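To have the tree score its splits with entropy, as in the calculations in these notes, the criterion can be set explicitly. This is a minimal sketch on the toy data from the first coding example; random_state is added only to make the run repeatable.

```python
from sklearn import tree

X = [[0, 0], [1, 1]]
Y = [0, 1]

# Splits are scored with entropy (information gain) instead of Gini
clf = tree.DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, Y)
pred = clf.predict([[2, 2]])
print(pred)
```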

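The bias-variance dilemma
How bias and variance relate to decision trees and other supervised classifiers
A high-bias machine learning algorithm practically ignores the data; it has almost no capacity to learn anything from it, which is why it is called biased
If you train a biased car, it will behave the same way no matter how you train it
The other extreme is an algorithm with very high variance. Here the car is extremely sensitive to the data and can only replicate things it has seen before. The problem is that it reacts very poorly in situations it has not seen, because it does not have the right bias to generalize to new things
This is why there is a bias-variance trade-off
You want an algorithm that has some authority to generalize but is still very open to listening to the data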

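Strengths and weaknesses of decision trees
Strengths: easy to use / let you interpret the data graphically, which is much easier to understand than the result of a support vector machine / can be built into bigger classifiers through ensemble methods
Weaknesses: when the data set has a large number of features, a complex decision tree is prone to overfitting, so choose the parameters carefully and tune them to prevent overfitting
It is really important to measure how well you are doing and then stop the growth of the tree at the appropriate time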
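One way to measure how well you are doing is to compare training and test accuracy as the tree is allowed to grow. The sketch below uses synthetic make_moons data and an arbitrary set of min_samples_split values, both assumptions made for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data, split into training and test sets
X, y = make_moons(n_samples=1000, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

scores = {}
for mss in (2, 10, 50, 200):
    clf = DecisionTreeClassifier(min_samples_split=mss, random_state=0)
    clf.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, clf.predict(X_train))
    test_acc = accuracy_score(y_test, clf.predict(X_test))
    scores[mss] = (train_acc, test_acc)
    print(mss, round(train_acc, 3), round(test_acc, 3))
```

A fully grown tree (min_samples_split=2) memorizes the training set, so a large gap between its training and test accuracy is a sign of overfitting.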

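Decision tree mini-project
In this project we will again try to identify the authors of emails, this time using a decision tree.

Compute the accuracy: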

#!/usr/bin/python

    
import sys
from time import time
sys.path.append("../tools/")
from email_preprocess import preprocess


### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()

### your code goes here ###

from sklearn import tree
clf = tree.DecisionTreeClassifier(min_samples_split=40)
clf.fit(features_train,labels_train)
pred = clf.predict(features_test)

from sklearn.metrics import accuracy_score
acc  = accuracy_score(labels_test,pred)
print(acc)

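Speeding up via feature selection

You saw in the SVM mini-project that parameter tuning can significantly speed up the training time of a machine learning algorithm. In general, parameters can tune the complexity of an algorithm, and more complex algorithms usually run more slowly.
Another way to control the complexity of an algorithm is through the number of features you use in training/testing. The more features the algorithm has available, the more potential there is for a complex fit. We will explore this in detail in the "Feature Selection" lesson, but you get a sneak preview here.
How many features are in your data?
(Hint: the data is organized into a numpy array where the number of rows is the number of data points and the number of columns is the number of features; to extract this number, just run len(features_train[0]).)

len(features_train)        # number of data points (rows)
len(features_train[0])     # number of features (columns)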
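For a numpy array, both numbers can also be read from the shape attribute. The array below is a small stand-in created for this sketch, not the real email features.

```python
import numpy as np

# Stand-in for features_train: 5 data points (rows) with 3 features (columns)
features = np.zeros((5, 3))

n_points, n_features = features.shape
print(n_points, n_features)
```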

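Changing the number of features

Go into ../tools/email_preprocess.py and find a line of code that looks like this:

selector = SelectPercentile(f_classif, percentile=10)

Change the percentile from 10 to 1, then run dt_author_id.py
How many features are there now?

Conclusion: all else being equal, more features make the decision tree more complex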
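To get a feel for what the percentile setting does without touching the project files, here is a sketch on synthetic data. The data set from make_classification with 100 features is an assumption and stands in for the email word features, so the counts will differ from the mini-project.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif

# Synthetic data with 100 features (not the email data)
X, y = make_classification(n_samples=200, n_features=100,
                           n_informative=10, random_state=0)

kept = {}
for pct in (10, 1):
    # Keep only the top pct percent of features, ranked by ANOVA F-score
    selector = SelectPercentile(f_classif, percentile=pct)
    X_reduced = selector.fit_transform(X, y)
    kept[pct] = X_reduced.shape[1]
    print(pct, X_reduced.shape)
```

Lowering the percentile leaves the tree fewer columns to split on, which limits how complex the fit can become.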
