Cross-validation
Purpose: to make the evaluated model more accurate and trustworthy.
The data is first split into a training set and a test set; the training set is then further divided into training and validation subsets.
- e.g. split the training data into 5 parts and use one part as the validation set. Run 5 rounds of evaluation, rotating a different part in as the validation set each round (the previous round's validation part rejoins the training data). This yields 5 model results, whose average is taken as the final result. This is called 5-fold cross-validation; every sample thus serves both as training data and as validation data.
Cross-validation is usually combined with grid search.
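The 5-fold procedure described above can be sketched with scikit-learn's `cross_val_score`. This is a minimal illustration on the iris dataset (not part of the original notes); the dataset and K value are chosen only for demonstration.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

data = load_iris()
knn = KNeighborsClassifier(n_neighbors=5)
# cv=5 splits the data into 5 folds; each fold serves as the validation set once
scores = cross_val_score(knn, data.data, data.target, cv=5)
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())  # the averaged result is the final score
```

`cross_val_score` handles the rotation of folds automatically, so each sample is used for validation exactly once.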
Grid search
Also known as hyperparameter search.
Purpose: parameter tuning (e.g. the hyperparameter K in the k-nearest-neighbors algorithm).
Many parameters usually have to be specified manually (such as the K value in k-NN); these are called hyperparameters. Because tuning them by hand is tedious, we preset several candidate hyperparameter combinations for the model, evaluate each combination with cross-validation, and finally build the model with the best combination.
If an algorithm has two hyperparameters (e.g. a and b), how does grid search proceed?
e.g. a = [2, 3, 5, 8, 10], b = [20, 70, 80]: form all pairwise combinations (5 × 3 = 15 groups) and run cross-validation on each.
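The enumeration of combinations can be sketched with scikit-learn's `ParameterGrid` (the parameter names `a` and `b` here are hypothetical, matching the example above):

```python
from sklearn.model_selection import ParameterGrid

# two hypothetical hyperparameters, 5 values x 3 values
grid = ParameterGrid({"a": [2, 3, 5, 8, 10], "b": [20, 70, 80]})
combos = list(grid)
print(len(combos))   # 15 combinations in total
print(combos[0])     # each combination is a dict, e.g. {'a': 2, 'b': 20}
```

`GridSearchCV` performs exactly this enumeration internally, then cross-validates the model once per combination.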
API: from sklearn.model_selection import GridSearchCV
Example 1:
from sklearn.model_selection import GridSearchCV
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
def knncls():
    """
    Predict a user's check-in location with k-nearest neighbors
    :return: None
    """
    # Read the data
    data = pd.read_csv("./train.csv")
    # Process the data
    # Narrow the data range
    data = data.query("x > 1.0 & x < 1.25 & y > 2.5 & y < 2.75")
    # Handle the time data: pd.to_datetime converts timestamps to datetimes
    time_value = pd.to_datetime(data["time"], unit="s")
    # Convert to a DatetimeIndex so that day, hour, etc. can be extracted
    time_value = pd.DatetimeIndex(time_value)
    # Construct some features
    data["day"] = time_value.day  # new column: note its length must match the data
    data["hour"] = time_value.hour
    data["weekday"] = time_value.weekday
    # Drop the timestamp column
    data = data.drop(["time"], axis=1)  # note: axis=1 means columns; drop returns a new DataFrame
    """Note: in pandas, every operation returns a new object"""
    # Drop target locations with fewer than n check-ins
    place_count = data.groupby("place_id").count()  # place_id becomes the index here
    tf = place_count[place_count.row_id > 3].reset_index()  # reset_index turns the index back into a column; the new index is 0, 1, 2, ...
    data = data[data["place_id"].isin(tf.place_id)]
    # Separate the feature values and the target values
    y = data["place_id"]
    x = data.drop(["place_id"], axis=1)
    # Split the data
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
    # Feature engineering (standardization)
    std = StandardScaler()
    x_train = std.fit_transform(x_train)  # fit on the training features only
    x_test = std.transform(x_test)  # reuse the training statistics on the test features
    # Run the algorithm
    knn = KNeighborsClassifier()
    # """Parameters set when instantiating the estimator are hyperparameters"""
    # knn.fit(x_train, y_train)
    # # Make predictions
    # y_predict = knn.predict(x_test)
    # print("Predicted check-in locations:", y_predict)
    # # Compute the accuracy
    # print("Prediction accuracy:", knn.score(x_test, y_test))
    # Build a set of parameter values to search over
    param = {"n_neighbors": [3, 5, 10]}
    gc = GridSearchCV(knn, param_grid=param, cv=10)  # cv: number of cross-validation folds; 10 is common
    gc.fit(x_train, y_train)
    # Accuracy on the test set (unrelated to the cross-validation scores)
    print("Accuracy on the test set:", gc.score(x_test, y_test))
    print("Best result in cross-validation:", gc.best_score_)
    print("Best model selected:", gc.best_estimator_)  # i.e. the best K value
    print("Results of each cross-validation fold for each hyperparameter:", gc.cv_results_)
    return None

if __name__ == "__main__":
    knncls()
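One detail in the standardization step deserves emphasis: the test set must be scaled with the statistics learned from the training set (`transform`), not re-fitted (`fit_transform`), or information leaks from the test set into preprocessing. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x_train = np.array([[1.0], [2.0], [3.0]])
x_test = np.array([[2.0]])

std = StandardScaler()
x_train_s = std.fit_transform(x_train)  # learn mean and std from training data only
x_test_s = std.transform(x_test)        # reuse those statistics on the test data
print(x_test_s)  # 2.0 equals the training mean, so it standardizes to 0
```

Calling `fit_transform` on the test set would instead compute a separate mean and std from the test data, putting the two sets on different scales.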
Example 2:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
def grid_searchCV():
    data = load_iris()
    x_train, x_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.25)
    print("Training set features and targets:\n", x_train, y_train)
    print("Test set features and targets:\n", x_test, y_test)
    knn = KNeighborsClassifier()
    param = {"n_neighbors": [3, 5, 10]}
    gc = GridSearchCV(knn, param_grid=param, cv=10)
    gc.fit(x_train, y_train)
    print("Accuracy on the test set:\n", gc.score(x_test, y_test))
    print("Best result in cross-validation:\n", gc.best_score_)
    print("Best model selected:\n", gc.best_estimator_)
    print("Results of each cross-validation fold for each hyperparameter:\n", gc.cv_results_)
    return None

if __name__ == "__main__":
    grid_searchCV()
Key results:
Accuracy on the test set:
E:\CodeDepot\Machine-Learning\Virtualenv\lib\site-packages\sklearn\model_selection\_search.py:814: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
  DeprecationWarning)
0.8947368421052632
Best result in cross-validation:
0.9821428571428571
Best model selected:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=10, p=2,
weights='uniform')
Results of each cross-validation fold for each hyperparameter:
{'mean_fit_time': array([0., 0., 0.]), 'std_fit_time': array([0., 0., 0.]), 'mean_score_time': array([0.00153615, 0. , 0.00156212]), 'std_score_time': array([0.00460846, 0. , 0.00468636]), 'param_n_neighbors': masked_array(data=[3, 5, 10],
mask=[False, False, False],
fill_value='?',
dtype=object), 'params': [{'n_neighbors': 3}, {'n_neighbors': 5}, {'n_neighbors': 10}], 'split0_test_score': array([0.92307692, 1. , 1. ]), 'split1_test_score': array([1., 1., 1.]), 'split2_test_score': array([1., 1., 1.]), 'split3_test_score': array([0.81818182, 0.81818182, 0.90909091]), 'split4_test_score': array([1., 1., 1.]), 'split5_test_score': array([0.90909091, 1. , 1. ]), 'split6_test_score': array([1. , 0.90909091, 0.90909091]), 'split7_test_score': array([1., 1., 1.]), 'split8_test_score': array([1. , 0.9, 1. ]), 'split9_test_score': array([1., 1., 1.]), 'mean_test_score': array([0.96428571, 0.96428571, 0.98214286]), 'std_test_score': array([0.05890454, 0.06062828, 0.03611785]), 'rank_test_score': array([2, 2, 1])}
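The raw `cv_results_` dict above is hard to read. One common trick (not in the original notes; assumes pandas is installed) is to load it into a DataFrame and keep only the interesting columns:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

data = load_iris()
gc = GridSearchCV(KNeighborsClassifier(), param_grid={"n_neighbors": [3, 5, 10]}, cv=10)
gc.fit(data.data, data.target)
# cv_results_ is a dict of equal-length arrays, so it maps directly to a DataFrame
df = pd.DataFrame(gc.cv_results_)
print(df[["param_n_neighbors", "mean_test_score", "rank_test_score"]])
print("best params:", gc.best_params_)  # the winning combination as a dict
```

Each row of the DataFrame corresponds to one hyperparameter combination, matching one entry in the `params` list of the raw dict.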