Using the handwritten digits dataset bundled with scikit-learn
from sklearn.datasets import load_digits
digits = load_digits()
digits.images.shape
(1797, 8, 8)
import matplotlib.pyplot as plt
fig, axes = plt.subplots(10, 10, figsize=(8, 8),
                         subplot_kw={'xticks': [], 'yticks': []},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap='binary', interpolation='nearest')
    ax.text(0.05, 0.05, str(digits.target[i]),
            transform=ax.transAxes, color='green')
X = digits.data
X.shape
(1797, 64)
y = digits.target
y.shape
(1797,)
(figure omitted)
Each of the 1797 original 8×8-pixel images is flattened into a 64-element one-dimensional array, giving [n_samples, n_features] = (1797, 64):
1797 samples, each with 64 features.
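As a quick sanity check (not part of the original notes), the flattened `digits.data` array can be reproduced by reshaping `digits.images` directly:

```python
from sklearn.datasets import load_digits
import numpy as np

digits = load_digits()
# Flatten each 8x8 image into a 64-element row vector
flat = digits.images.reshape(len(digits.images), -1)

print(flat.shape)                         # (1797, 64)
print(np.array_equal(flat, digits.data))  # True
```

This confirms that `digits.data` is simply the row-wise flattening of `digits.images`.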
Use Isomap, a manifold learning algorithm, to reduce the 64-dimensional data to two dimensions
from sklearn.manifold import Isomap
iso = Isomap(n_components=2)
iso.fit(digits.data)
data_projected = iso.transform(digits.data)
data_projected.shape
(1797, 2)
plt.scatter(data_projected[:, 0], data_projected[:, 1], c=digits.target,
            edgecolor='none', alpha=0.5,
            cmap=plt.cm.get_cmap('Spectral', 10))
plt.colorbar(label='digit label', ticks=range(10))
plt.clim(-0.5, 9.5)
The digits are reasonably well separated in the projected parameter space, so even a very simple supervised classification algorithm should be able to handle the task.
Split the data into a training set and a test set, then fit a Gaussian naive Bayes model; the accuracy is about 83%.
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(Xtrain, ytrain)
y_model = model.predict(Xtest)
from sklearn.metrics import accuracy_score
accuracy_score(ytest, y_model)
Compute the confusion matrix:
from sklearn.metrics import confusion_matrix
import seaborn as sns
mat = confusion_matrix(ytest, y_model)
sns.heatmap(mat, square=True, annot=True, cbar=False)
plt.xlabel('predicted value')
plt.ylabel('true value')
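A per-class precision/recall summary can complement the confusion matrix heatmap. This is a supplementary sketch (not in the original notes) that rebuilds the same split and model so it runs standalone:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

digits = load_digits()
Xtrain, Xtest, ytrain, ytest = train_test_split(
    digits.data, digits.target, random_state=0)

model = GaussianNB().fit(Xtrain, ytrain)
y_model = model.predict(Xtest)

# Precision, recall and F1 for each digit class
report = classification_report(ytest, y_model)
print(report)
```

The report makes it easy to spot which digits drag the 83% overall accuracy down.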
Mark the misclassified digits:
fig, axes = plt.subplots(10, 10, figsize=(8, 8),
                         subplot_kw={'xticks': [], 'yticks': []},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))
test_images = Xtest.reshape(-1, 8, 8)
for i, ax in enumerate(axes.flat):
    ax.imshow(test_images[i], cmap='binary', interpolation='nearest')
    ax.text(0.05, 0.05, str(y_model[i]),
            transform=ax.transAxes,
            color='green' if (ytest[i] == y_model[i]) else 'red')
Try a few other algorithms:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
svm_model = SVC(gamma='auto')
svm_model.fit(Xtrain, ytrain)
y_pred_svm = svm_model.predict(Xtest)
print("SVM Accuracy: ", accuracy_score(ytest, y_pred_svm))
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(n_estimators=100)
rf_model.fit(Xtrain, ytrain)
y_pred_rf = rf_model.predict(Xtest)
print("Random Forest Accuracy: ", accuracy_score(ytest, y_pred_rf))
from sklearn.neighbors import KNeighborsClassifier
knn_model = KNeighborsClassifier(n_neighbors=3)
knn_model.fit(Xtrain, ytrain)
y_pred_knn = knn_model.predict(Xtest)
print("KNN Accuracy: ", accuracy_score(ytest, y_pred_knn))
import xgboost as xgb
xgb_model = xgb.XGBClassifier()
xgb_model.fit(Xtrain, ytrain)
y_pred_xgb = xgb_model.predict(Xtest)
print("XGBoost Accuracy: ", accuracy_score(ytest, y_pred_xgb))
The models tried were a support vector machine (SVM), a random forest, K-nearest neighbors (KNN), and XGBoost, a gradient-boosted decision tree (GBDT) implementation.
SVM Accuracy: 0.4866666666666667
Random Forest Accuracy: 0.9777777777777777
KNN Accuracy: 0.9866666666666667
XGBoost Accuracy: 0.9555555555555556
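The low SVM score is likely a feature-scaling artifact: with `gamma='auto'` (gamma = 1/n_features), the RBF kernel is poorly matched to the raw 0-16 pixel values. A hedged sketch (my addition, not from the original notes): standardizing the features first typically recovers a competitive score.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

digits = load_digits()
Xtrain, Xtest, ytrain, ytest = train_test_split(
    digits.data, digits.target, random_state=0)

# Standardize features before the RBF kernel, then fit the same SVC
svm_scaled = make_pipeline(StandardScaler(), SVC(gamma='auto'))
svm_scaled.fit(Xtrain, ytrain)
acc = accuracy_score(ytest, svm_scaled.predict(Xtest))
print("Scaled SVM Accuracy:", acc)
```

This is one reason SVMs are almost always used behind a scaling step in practice.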
Pay special attention to the recurring steps: import the model, initialize the model, fit the data, predict:
graph LR
  A[Import model] --> B[Initialize model] --> C[Fit data] --> D[Predict]
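The four steps above are the uniform scikit-learn estimator API; every classifier in this section follows them, so swapping models is a one-line change. A minimal sketch:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB      # 1. import the model
from sklearn.metrics import accuracy_score

digits = load_digits()
Xtrain, Xtest, ytrain, ytest = train_test_split(
    digits.data, digits.target, random_state=0)

model = GaussianNB()              # 2. initialize the model
model.fit(Xtrain, ytrain)         # 3. fit the training data
y_model = model.predict(Xtest)    # 4. predict on new data
acc = accuracy_score(ytest, y_model)
print("Accuracy:", acc)
```

Replacing `GaussianNB()` with, say, `KNeighborsClassifier(n_neighbors=3)` leaves steps 3 and 4 unchanged.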
References:
[1] VanderPlas, Jake. Python Data Science Handbook [M]. Posts & Telecom Press, 2018.