Using Jupyter notebook
%matplotlib qt
import numpy as np
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
- Read the txt data; the last column is the label
data = []
labels = []
with open('data\\datingTestSet.txt') as f:
    for line in f:
        # Columns are tab-separated: numeric features first, label last
        tokens = line.strip().split('\t')
        data.append([float(tk) for tk in tokens[:-1]])
        labels.append(tokens[-1])
data[1:10]
np.unique(labels)
array(['didntLike', 'largeDoses', 'smallDoses'],
dtype='|S10')
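The parsing loop above can be wrapped in a small reusable helper. The following is a minimal sketch, assuming every line holds tab-separated numeric features followed by one string label. The function name `parse_tab_separated` and the sample rows are hypothetical, made up for illustration, and are not taken from datingTestSet.txt.

```python
import io

def parse_tab_separated(lines):
    """Split tab-separated lines into (features, labels).

    Hypothetical helper: every column except the last is parsed as a float,
    and the last column is kept as a string label.
    """
    features, labels = [], []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines, e.g. a trailing newline at end of file
            continue
        tokens = line.split('\t')
        features.append([float(tk) for tk in tokens[:-1]])
        labels.append(tokens[-1])
    return features, labels

# Made-up sample rows standing in for the real file
sample = io.StringIO("1.0\t2.5\t0.3\tdidntLike\n4.0\t0.0\t1.2\tlargeDoses\n\n")
sample_features, sample_labels = parse_tab_separated(sample)
print(sample_features)
print(sample_labels)
```

Because the helper accepts any iterable of lines, an open file object can be passed in the same way as the in-memory sample.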
- Convert the string labels to numeric labels
x = np.array(data)
labels = np.array(labels)
y = np.zeros(labels.shape)
y[labels=='didntLike'] = 1
y[labels=='smallDoses'] = 2
y[labels=='largeDoses'] = 3
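The same mapping can also be written as a dictionary lookup. This is a minimal sketch using made-up sample labels; the names `label_to_id`, `sample_labels`, and `sample_y` are hypothetical. One possible advantage is that an unexpected label raises a `KeyError`, whereas with the mask assignment above it would silently keep the initial value 0.

```python
import numpy as np

# Same label-to-number mapping as in the cells above
label_to_id = {'didntLike': 1, 'smallDoses': 2, 'largeDoses': 3}

# Made-up sample labels for illustration
sample_labels = np.array(['largeDoses', 'didntLike', 'smallDoses', 'didntLike'])

# Look up each label; a KeyError here would flag an unexpected label
sample_y = np.array([label_to_id[name] for name in sample_labels])
print(sample_y)
```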
- Before normalizing the data
model = KNeighborsClassifier(n_neighbors=3)
model.fit(x,y)
print(model)
expected = y
predicted = model.predict(x)
print(metrics.classification_report(expected, predicted, target_names=['didntLike', 'smallDoses', 'largeDoses']))
print(metrics.confusion_matrix(expected, predicted))
Result:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')
             precision    recall  f1-score   support
  didntLike       0.89      0.85      0.87       342
 smallDoses       0.93      0.98      0.96       331
 largeDoses       0.82      0.83      0.82       327
avg / total       0.88      0.88      0.88      1000
[[289   0  53]
 [  1 325   5]
 [ 33  24 270]]
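The per-class precision and recall in the report can be recomputed from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. The sketch below copies the matrix printed above; the variable names are illustrative only.

```python
import numpy as np

# Confusion matrix from the output above: rows are true classes,
# columns are predicted classes, in the order didntLike, smallDoses, largeDoses
cm = np.array([[289,   0,  53],
               [  1, 325,   5],
               [ 33,  24, 270]])

true_positives = np.diag(cm)
recall = true_positives / cm.sum(axis=1)     # divide by row sums (true counts)
precision = true_positives / cm.sum(axis=0)  # divide by column sums (predicted counts)
accuracy = true_positives.sum() / cm.sum()   # correct predictions over all samples

print(np.round(recall, 2))
print(np.round(precision, 2))
print(round(accuracy, 3))
```

For example, the recall for didntLike is 289 / 342, which rounds to 0.85, and its precision is 289 / (289 + 1 + 33), which rounds to 0.89. Both agree with the report.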
- Normalize the data to the [0, 1] range
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(x)
X_train_minmax
array([[ 0.44832535, 0.39805139, 0.56233353],
[ 0.15873259, 0.34195467, 0.98724416],
[ 0.28542943, 0.06892523, 0.47449629],
...,
[ 0.29115949, 0.50910294, 0.51079493],
[ 0.52711097, 0.43665451, 0.4290048 ],
[ 0.47940793, 0.3768091 , 0.78571804]])
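With its default settings, `MinMaxScaler` rescales each column independently as (x - min) / (max - min). The following is a minimal NumPy sketch of that formula on made-up numbers, assuming no column is constant; `sample_x` and the other names are hypothetical.

```python
import numpy as np

# Made-up feature matrix: 3 samples, 2 features on very different scales
sample_x = np.array([[1000.0, 0.5],
                     [3000.0, 1.5],
                     [2000.0, 1.0]])

col_min = sample_x.min(axis=0)
col_max = sample_x.max(axis=0)

# Per-column rescaling: (x - min) / (max - min)
# Assumes no column is constant, otherwise the denominator is zero
sample_scaled = (sample_x - col_min) / (col_max - col_min)
print(sample_scaled)
```

After rescaling, both columns span exactly [0, 1], so neither dominates a Euclidean distance purely because of its units.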
- Split the data into training and test sets
from sklearn.model_selection import train_test_split  # was sklearn.cross_validation before scikit-learn 0.18
# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)
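The split above is reshuffled randomly on every run, so the reported scores will vary from run to run. The sketch below shows one possible variation, assuming reproducibility is wanted: `random_state` fixes the shuffle and `stratify` keeps the class proportions in both parts. It also fits the scaler on the training portion only, a common alternative to scaling the full dataset before splitting. The data here is randomly generated for illustration and all variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Made-up data: 100 samples, 3 features, labels drawn from {1, 2, 3}
rng = np.random.RandomState(0)
demo_x = rng.uniform(0, 1000, size=(100, 3))
demo_y = rng.randint(1, 4, size=100)

# random_state makes the split reproducible; stratify keeps class proportions
x_tr, x_te, y_tr, y_te = train_test_split(
    demo_x, demo_y, test_size=0.2, random_state=0, stratify=demo_y)

# Fit the scaler on the training portion only, then reuse it on the test portion
scaler = MinMaxScaler().fit(x_tr)
x_tr_scaled = scaler.transform(x_tr)
x_te_scaled = scaler.transform(x_te)

print(x_tr.shape, x_te.shape)
```

With this approach the test portion may contain values slightly outside [0, 1], because its minimum and maximum were not used to fit the scaler.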
- Results after normalization
n_neighbors = 3  # K for K-nearest neighbors is set to 3
x_train, x_test, y_train, y_test = train_test_split(X_train_minmax, y, test_size = 0.2)
model = KNeighborsClassifier(n_neighbors=n_neighbors)
model.fit(x_train,y_train)
print(model)
expected = y_test
predicted = model.predict(x_test)
print(metrics.classification_report(expected, predicted, target_names=['didntLike', 'smallDoses', 'largeDoses']))
print(metrics.confusion_matrix(expected, predicted))
Result:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')
             precision    recall  f1-score   support
  didntLike       0.97      1.00      0.99        68
 smallDoses       0.93      1.00      0.96        51
 largeDoses       1.00      0.93      0.96        81
avg / total       0.97      0.97      0.97       200
[[68  0  0]
 [ 0 51  0]
 [ 2  4 75]]
Summary:
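The results after normalization differ considerably from those before it: the average F1 score rises from 0.88 to 0.97. KNN here uses Euclidean distance (metric='minkowski', p=2), so without normalization the features with the largest numeric range dominate the distance. Note that the two runs are not evaluated in the same way: the unnormalized model is scored on the same 1000 samples it was trained on, while the normalized model is scored on a held-out 20% test split of 200 samples.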
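To compare like with like, both variants can be scored on the same held-out split. The following is a minimal sketch on synthetic data, not on datingTestSet.txt: it assumes one large-scale feature that is pure noise and one small-scale feature that determines the class. All names (`toy_x`, `raw_model`, `scaled_model`, and so on) are hypothetical, and the numbers it prints say nothing about the dating dataset itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data (not datingTestSet.txt): feature 0 is large-scale noise,
# feature 1 is small-scale but determines the class
rng = np.random.RandomState(0)
toy_y = np.repeat([1, 2, 3], 200)
noise = rng.uniform(0, 10000, size=toy_y.size)
signal = toy_y + rng.normal(0, 0.1, size=toy_y.size)
toy_x = np.column_stack([noise, signal])

# One shared split, so both models see identical training and test samples
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=0, stratify=toy_y)

# KNN on raw features versus KNN behind a min-max scaler
raw_model = KNeighborsClassifier(n_neighbors=3).fit(x_tr, y_tr)
scaled_model = make_pipeline(
    MinMaxScaler(), KNeighborsClassifier(n_neighbors=3)).fit(x_tr, y_tr)

raw_acc = raw_model.score(x_te, y_te)
scaled_acc = scaled_model.score(x_te, y_te)
print('raw accuracy:   ', raw_acc)
print('scaled accuracy:', scaled_acc)
```

Putting the scaler inside the pipeline means it is fitted on the training portion only, and the same transformation is then applied automatically when scoring the test portion.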