支持向量机
支持向量机
本部分练习,我们将在2D示例数据集上使用支持向量机。通过在这些数据集上使用支持向量机,将帮助我们初识支持向量机的运行原理,以及如何使用高斯核函数支持向量机。
任务一 示例数据集1
本任务要求我们修改参数C的值,观察支持向量机对数据集的判定边界。ex6.m文件中已将相关代码备好,其代码如下:
%% =============== Part 1: Loading and Visualizing Data ================
% We start the exercise by first loading and visualizing the dataset.
% The following code will load the dataset into your environment and plot
% the data.
%
fprintf('Loading and Visualizing Data ...\n')
% Load from ex6data1:
% You will have X, y in your environment
load('ex6data1.mat');
% Plot training data
plotData(X, y);
fprintf('Program paused. Press enter to continue.\n');
pause;
%% ==================== Part 2: Training Linear SVM ====================
% The following code will train a linear SVM on the dataset and plot the
% decision boundary learned.
%
% Load from ex6data1:
% You will have X, y in your environment
load('ex6data1.mat');
fprintf('\nTraining Linear SVM ...\n')
% You should try to change the C value below and see how the decision
% boundary varies (e.g., try C = 1000)
C = 1;
model = svmTrain(X, y, C, @linearKernel, 1e-3, 20);
visualizeBoundaryLinear(X, y, model);
fprintf('Program paused. Press enter to continue.\n');
pause;
当参数C = 1时,其运行结果如下:
当参数C = 100时,其运行结果如下:
任务二 高斯核函数支持向量机
本部分我们使用高斯核函数支持向量机对数据集进行非线性地划分。
- 首先我们需要使用高斯核函数构建新的特征变量fi,i = 1, 2, 3, ......因此,我们需要在gaussianKernel.m文件中,将高斯核函数补充完整。
参考代码如下:
sim = exp(-(x1 - x2)' * (x1 - x2) / (2 * sigma * sigma));
在ex6.m文件中运行如下部分代码,可得到的结果为:0.324652。
%% =============== Part 3: Implementing Gaussian Kernel ===============
% You will now implement the Gaussian kernel to use
% with the SVM. You should complete the code in gaussianKernel.m
%
fprintf('\nEvaluating the Gaussian Kernel ...\n')
x1 = [1 2 1]; x2 = [0 4 -1]; sigma = 2;
sim = gaussianKernel(x1, x2, sigma);
fprintf(['Gaussian Kernel between x1 = [1; 2; 1], x2 = [0; 4; -1], sigma = %f :' ...
'\n\t%f\n(for sigma = 2, this value should be about 0.324652)\n'], sigma, sim);
fprintf('Program paused. Press enter to continue.\n');
pause;
- 我们需要导入新的数据集-样例数据集2,通过使用高斯核函数支持向量机,对该数据集进行非线性划分。如下为ex6.m文件中已备好的相关代码。
%% =============== Part 4: Visualizing Dataset 2 ================
% The following code will load the next dataset into your environment and
% plot the data.
%
fprintf('Loading and Visualizing Data ...\n')
% Load from ex6data2:
% You will have X, y in your environment
load('ex6data2.mat');
% Plot training data
plotData(X, y);
fprintf('Program paused. Press enter to continue.\n');
pause;
%% ========== Part 5: Training SVM with RBF Kernel (Dataset 2) ==========
% After you have implemented the kernel, we can now use it to train the
% SVM classifier.
%
fprintf('\nTraining SVM with RBF Kernel (this may take 1 to 2 minutes) ...\n');
% Load from ex6data2:
% You will have X, y in your environment
load('ex6data2.mat');
% SVM Parameters
C = 1; sigma = 0.1;
% We set the tolerance and max_passes lower here so that the code will run
% faster. However, in practice, you will want to run the training to
% convergence.
model= svmTrain(X, y, C, @(x1, x2) gaussianKernel(x1, x2, sigma));
visualizeBoundary(X, y, model);
fprintf('Program paused. Press enter to continue.\n');
pause;
其运行结果为:
- 为了更进一步熟悉使用高斯核函数支持向量机,我们导入新的数据集——样例数据集3。在ex6data3.mat文件中,其将该数据集分为了两部分,即训练集和交叉验证集。我们需要将dataset3Params.m文件补充完整,即我们需要在参数C∈{0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30}和参数σ∈{0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30}中找到一个最优值,使其能够对数据集进行正确地划分。
dataset3Params.m文件的参考代码如下:
C_temp = [0.01; 0.03; 0.1; 0.3; 1; 3; 10; 30];
sigma_temp = [0.01; 0.03; 0.1; 0.3; 1; 3; 10; 30];
error_val = zeros(length(C_temp), length(sigma_temp));
for i = 1 : length(C_temp)
for j = 1 : length(sigma_temp)
model = svmTrain(X, y, C_temp(i), @(x1, x2) gaussianKernel(x1, x2, sigma_temp(j)));
predictions = svmPredict(model, Xval);
error_val(i, j) = mean(double(predictions ~= yval));
end
end
[I, J] = find(error_val == min(error_val(:))); % 找最小元素位置
C = C_temp(I) % 1
sigma = sigma_temp(J) % 0.100
ex6.m文件中本部分代码如下:
%% =============== Part 6: Visualizing Dataset 3 ================
% The following code will load the next dataset into your environment and
% plot the data.
%
fprintf('Loading and Visualizing Data ...\n')
% Load from ex6data3:
% You will have X, y in your environment
load('ex6data3.mat');
% Plot training data
plotData(X, y);
fprintf('Program paused. Press enter to continue.\n');
pause;
%% ========== Part 7: Training SVM with RBF Kernel (Dataset 3) ==========
% This is a different dataset that you can use to experiment with. Try
% different values of C and sigma here.
%
% Load from ex6data3:
% You will have X, y in your environment
load('ex6data3.mat');
% Try different SVM Parameters here
[C, sigma] = dataset3Params(X, y, Xval, yval);
% Train the SVM
model= svmTrain(X, y, C, @(x1, x2) gaussianKernel(x1, x2, sigma));
visualizeBoundary(X, y, model);
fprintf('Program paused. Press enter to continue.\n');
pause;
其运行结果为:
垃圾邮件分类器
通过使用支持向量机实现垃圾邮件过滤器。
任务一 电子邮件预处理
首先,我们导入样本数据,其内容如下:
众所周知,每封电子邮件都包含有类似于URLs和电子邮件地址等内容,但其主体内容是不同的。因此,为了提高分类效率,我们对这些类似的内容进行归一化处理,例如:将URLs用“httpaddr”文本代替等。该功能代码已在processEmail.m文件中准备好了。
我们运行ex6_spam.m文件中该部分代码,结果如下:
==== Processed Email ====
anyon know how much it cost to host a web portal well it depend on how mani
visitor you re expect thi can be anywher from less than number buck a month
to a coupl of dollarnumb you should checkout httpaddr or perhap amazon ecnumb
if your run someth big to unsubscrib yourself from thi mail list send an
email to emailaddr
任务二 词汇表
本部分我们利用已有的词汇表与样例邮件中的单词构建映射关系。因此,我们需要对processEmail.m文件进行补充,构建单词表映射关系,即样例邮件中的某一单词属于词汇表,则将其添加至word_indices变量;若不属于则跳过,验证下一单词。
processEmail.m文件的参考代码如下:
for i = 1 : length(vocabList)
if (strcmp(vocabList{i}, str))
word_indices = [word_indices; i];
end
end
任务三 提取样例邮件的特征变量
本部分将word_indices变量中的数据转换为特性变量x的向量模式,即
其中,若第i个单词在样例邮件中,则xi = 1;否则xi = 0。
emailFeatures.m文件中的参考代码如下:
x(word_indices) = 1;
任务四 使用支持向量机训练
本部分通过使用支持向量机模型对训练集进行训练,对交叉验证集进行验证。本部分代码如下:
%% =========== Part 3: Train Linear SVM for Spam Classification ========
% In this section, you will train a linear classifier to determine if an
% email is Spam or Not-Spam.
% Load the Spam Email dataset
% You will have X, y in your environment
load('spamTrain.mat');
fprintf('\nTraining Linear SVM (Spam Classification)\n')
fprintf('(this may take 1 to 2 minutes) ...\n')
C = 0.1;
model = svmTrain(X, y, C, @linearKernel);
p = svmPredict(model, X);
fprintf('Training Accuracy: %f\n', mean(double(p == y)) * 100);
%% =================== Part 4: Test Spam Classification ================
% After training the classifier, we can evaluate it on a test set. We have
% included a test set in spamTest.mat
% Load the test dataset
% You will have Xtest, ytest in your environment
load('spamTest.mat');
fprintf('\nEvaluating the trained Linear SVM on a test set ...\n')
p = svmPredict(model, Xtest);
fprintf('Test Accuracy: %f\n', mean(double(p == ytest)) * 100);
pause;
任务五 预测垃圾邮件
本部分使用支持向量机对新的电子邮件进行预测。本部分代码为:
%% =================== Part 6: Try Your Own Emails =====================
% Now that you've trained the spam classifier, you can use it on your own
% emails! In the starter code, we have included spamSample1.txt,
% spamSample2.txt, emailSample1.txt and emailSample2.txt as examples.
% The following code reads in one of these emails and then uses your
% learned SVM classifier to determine whether the email is Spam or
% Not Spam
% Set the file to be read in (change this to spamSample2.txt,
% emailSample1.txt or emailSample2.txt to see different predictions on
% different emails types). Try your own emails as well!
filename = 'spamSample1.txt';
% Read and predict
file_contents = readFile(filename);
word_indices = processEmail(file_contents);
x = emailFeatures(word_indices);
p = svmPredict(model, x);
fprintf('\nProcessed %s\n\nSpam Classification: %d\n', filename, p);
fprintf('(1 indicates spam, 0 indicates not spam)\n\n');