This article covers the relevant API and a complete program implementation. For the underlying algorithm itself, see the earlier post introducing KNN.
I. Prerequisites
1. Python basics
2. NumPy basics
np.mean
Computes the arithmetic mean.
print(np.mean([1,2,3,4]))
# >> 2.5
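Besides np.mean, the normalization code in section II leans on axis-wise min/max and np.tile; a quick illustration with made-up numbers:
import numpy as np

a = np.array([[1.0, 10.0],
              [3.0, 20.0],
              [5.0, 60.0]])
print(a.min(0))                    # column-wise minimum -> [ 1. 10.]
print(a.max(0) - a.min(0))         # column-wise range   -> [ 4. 50.]
print(np.tile(a.min(0), (3, 1)))   # the minimum row repeated 3 times, one copy per sample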
3. scikit-learn basics
The main methods of sklearn.neighbors.KNeighborsClassifier:
fit(X, y): Fit the model using X as training data and y as target values.
get_params([deep]): Get the parameters of this estimator.
kneighbors([X, n_neighbors, return_distance]): Find the K nearest neighbors of a point.
kneighbors_graph([X, n_neighbors, mode]): Compute the (weighted) graph of k-neighbors for the points in X.
predict(X): Predict the class labels for the provided data.
predict_proba(X): Return probability estimates for the test data X.
score(X, y[, sample_weight]): Return the mean accuracy on the given test data and labels.
set_params(**params): Set the parameters of this estimator.
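Before the full program, here is a minimal sketch of how these methods fit together, on a tiny hand-made data set (the points and label names below are invented purely for illustration):
import numpy as np
from sklearn import neighbors

# Four 2-D training points with two made-up string labels
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
y = ['smallDoses', 'smallDoses', 'largeDoses', 'largeDoses']

clf = neighbors.KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)                                # fit(X, y)
print(clf.predict([[0.15, 0.15]]))           # -> ['smallDoses']
print(clf.predict_proba([[0.15, 0.15]]))     # one probability per class in clf.classes_
print(clf.score(X, y))                       # mean accuracy on the training data -> 1.0
print(clf.kneighbors([[0.15, 0.15]]))        # (distances, indices) of the 3 nearest points
print(clf.get_params()['n_neighbors'])       # -> 3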
II. The complete program
# -*- coding: utf-8 -*-
import numpy as np
from sklearn import neighbors, preprocessing
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
def file2Mat(testFileName, parameterNumber):
    """Read a tab-separated file into a feature matrix and a list of class labels."""
    fr = open(testFileName)
    lines = fr.readlines()
    lineNums = len(lines)
    resultMat = np.zeros((lineNums, parameterNumber))
    classLabelVector = []
    for i in range(lineNums):
        line = lines[i].strip()
        itemMat = line.split('\t')
        resultMat[i, :] = itemMat[0:parameterNumber]
        classLabelVector.append(itemMat[-1])
    fr.close()
    return resultMat, classLabelVector
# Min-max normalization keeps one feature from dominating the distance.
# For example, with feature values 10000, 4.5 and 6.8, the 10000 would essentially decide the result on its own.
def autoNorm(dataSet):
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    normMat = np.zeros(np.shape(dataSet))
    size = normMat.shape[0]
    normMat = dataSet - np.tile(minVals, (size, 1))
    normMat = normMat / np.tile(ranges, (size, 1))
    return normMat, minVals, ranges
if __name__ == '__main__':
    trainingSetFileName = 'data\\datingTrainingSet.txt'
    testFileName = 'data\\datingTestSet.txt'

    # Read the training data
    trainingMat, classLabel = file2Mat(trainingSetFileName, 3)
    # Normalize the training data
    autoNormTrainingMat, minVals, ranges = autoNorm(trainingMat)

    # Read the test data and scale it with the training set's min values and ranges
    testMat, testLabel = file2Mat(testFileName, 3)
    autoNormTestMat = []
    for i in range(len(testLabel)):
        autoNormTestMat.append((testMat[i] - minVals) / ranges)
    # testMat = preprocessing.normalize(testMat)
    print(autoNormTestMat)

    # Train the KNN classifier
    clf = neighbors.KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
    clf.fit(autoNormTrainingMat, classLabel)
    answer = clf.predict(autoNormTestMat)
    # Number of misclassified test samples
    print(np.sum(answer != testLabel))
    # Mean accuracy on the test set
    print(clf.score(autoNormTestMat, testLabel))
    print(np.mean(answer == testLabel))
    # Predict a single (already normalized) sample; note the 2-D input shape
    print(clf.predict([[0.44832535, 0.39805139, 0.56233353]]))
    print(clf.predict_proba([[0.44832535, 0.39805139, 0.56233353]]))

    # Precision and recall
    # precision, recall, thresholds = precision_recall_curve(testLabel, clf.predict(testMat))
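The commented-out last line hints at precision/recall evaluation. precision_recall_curve expects a binary target (or an explicit positive label), so with the string class labels read from the file, the per-class report from classification_report (already imported above) is the more direct fit. A possible continuation, reusing answer, testLabel, clf and autoNormTestMat from the run above ('largeDoses' is an assumed label name, adjust to the actual data):
# Per-class precision, recall and F1 score, printed as a text table
print(classification_report(testLabel, answer))

# For a true precision-recall curve, single out one class as "positive"
# ('largeDoses' is an assumed label name):
# y_true = np.array(testLabel) == 'largeDoses'
# y_score = clf.predict_proba(autoNormTestMat)[:, list(clf.classes_).index('largeDoses')]
# precision, recall, thresholds = precision_recall_curve(y_true, y_score)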