Theory-based Questions
Problem 1: Perceptron Convergence (15pts)
Recall the perceptron algorithm that we saw in class. The perceptron algorithm comes with strong theory, and you will explore some of this theory in this problem. We begin with some remarks related to notation which are valid throughout the homework. Unless stated otherwise, scalars are denoted by small letters in normal font, vectors are denoted by small letters in bold font, and matrices are denoted by capital letters in bold font.
Problem 1 asks you to show that when the two classes in a binary classification problem are linearly separable, the perceptron algorithm will converge. For the sake of this problem, we define convergence as predicting the labels of all training instances perfectly. The perceptron algorithm is described in Algorithm 1. It gets access to a dataset of $n$ instances $(\mathbf{x}_i, y_i)$, where $\mathbf{x}_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$. It outputs a linear classifier $\mathbf{w}$.
Algorithm 1 Perceptron
while not converged do
    Pick a data point $(\mathbf{x}_i, y_i)$ randomly
    Make a prediction $\hat{y} = \mathrm{sgn}(\mathbf{w}^T \mathbf{x}_i)$ using the current $\mathbf{w}$
    if $\hat{y} \neq y_i$ then
        $\mathbf{w} \leftarrow \mathbf{w} + y_i \mathbf{x}_i$
    end if
end while
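For intuition only, here is a minimal NumPy sketch of Algorithm 1. It is not part of the graded programming questions, and the stopping check and function signature are our own assumptions for illustration.

```python
import numpy as np

def perceptron(X, y, max_epochs=1000, rng=None):
    """Minimal sketch of Algorithm 1: X is (n, d), y is (n,) with labels in {-1, +1}."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_epochs):
        # "Converged" here means every training point is classified correctly.
        if np.all(np.sign(X @ w) == y):
            break
        i = rng.integers(n)            # pick a data point at random
        y_hat = np.sign(w @ X[i])      # predict with the current w
        if y_hat != y[i]:              # on a mistake, move w toward y_i * x_i
            w = w + y[i] * X[i]
    return w
```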
Assume there exists an optimal hyperplane $\mathbf{w}_{opt}$ with $\|\mathbf{w}_{opt}\|_2 = 1$, and some $\gamma > 0$, such that $y_i(\mathbf{w}_{opt}^T \mathbf{x}_i) \geq \gamma$ for all $i \in \{1, 2, \ldots, n\}$. Additionally, assume $\|\mathbf{x}_i\|_2 \leq R$ for all $i \in \{1, 2, \ldots, n\}$. Following the steps below, show that the perceptron algorithm makes at most $R^2/\gamma^2$ errors, and therefore the algorithm must converge.
1.1 (5pts) Show that if the algorithm makes a mistake, the update rule moves the weights towards the direction of the optimal weights $\mathbf{w}_{opt}$. Specifically, denoting the updating iteration index explicitly by $k$, the current weight vector by $\mathbf{w}_k$, and the updated weight vector by $\mathbf{w}_{k+1}$, show that if $y_i(\mathbf{w}_k^T \mathbf{x}_i) < 0$, we have
$$\mathbf{w}_{k+1}^T \mathbf{w}_{opt} \geq \mathbf{w}_k^T \mathbf{w}_{opt} + \gamma \|\mathbf{w}_{opt}\|_2. \tag{1}$$
1.2 (4pts) Show that the length of the updated weights does not increase by a large amount. In particular, show that if $y_i(\mathbf{w}_k^T \mathbf{x}_i) < 0$, then
$$\|\mathbf{w}_{k+1}\|_2^2 \leq \|\mathbf{w}_k\|_2^2 + R^2. \tag{2}$$
1.3 (5pts) Assume that the initial weight vector $\mathbf{w}_0 = \mathbf{0}$ (an all-zero vector). Using the results from the previous two parts, show that for any iteration $k+1$, with $M$ being the total number of mistakes the algorithm has made in the first $k$ iterations, we have
$$\gamma M \leq \|\mathbf{w}_{k+1}\|_2 \leq R\sqrt{M}. \tag{3}$$
Hint: use the Cauchy-Schwarz inequality: $\mathbf{a}^T \mathbf{b} \leq \|\mathbf{a}\|_2 \|\mathbf{b}\|_2$.
1.4 (1pt) Use 1.3 to conclude that $M \leq R^2/\gamma^2$. Therefore the algorithm can make at most $R^2/\gamma^2$ mistakes (note that there is no direct dependence on the dimensionality of the datapoints).
Problem 2: Logistic Regression (10pts)
This problem explores some properties of the logistic function to get you more comfortable with it. Consider the following univariate function first:
$$F(x; A, k, b) = \frac{A}{1 + e^{-k(x-b)}}. \tag{4}$$
2.1 (3pts) Describe in words how $A$, $k$ and $b$ affect or are related to the shape of $F(x; A, k, b)$ by using the plot at https://www.geogebra.org/graphing/mxw7wp8b. Note: We are trying to learn the parameters of a logistic function that can fit the dataset. For example, see the figure below and guess which of the four functions fits well.
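As an optional aid (this snippet is our own illustration, not part of the assignment code), a few lines of numpy/matplotlib reproduce the kind of plot the GeoGebra link shows, which can help build intuition for how $A$, $k$ and $b$ change the curve. The parameter settings below are chosen arbitrarily.

```python
import numpy as np
import matplotlib.pyplot as plt

def logistic(x, A, k, b):
    """Univariate logistic function F(x; A, k, b) = A / (1 + exp(-k (x - b)))."""
    return A / (1.0 + np.exp(-k * (x - b)))

x = np.linspace(-10, 10, 400)
# Example parameter settings (arbitrary, for illustration only).
for A, k, b in [(1, 1, 0), (2, 1, 0), (1, 5, 0), (1, 1, 3)]:
    plt.plot(x, logistic(x, A, k, b), label=f"A={A}, k={k}, b={b}")
plt.legend()
plt.xlabel("x")
plt.ylabel("F(x; A, k, b)")
plt.show()
```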
2.2 (1pt) For what values of $A$, $k$ and $b$ is $F(x; A, k, b)$ a cumulative distribution function (CDF)?
2.3 (2pts) Now let $F(x; A, k, b) = \Pr[y = 1 \mid x; A, k, b]$, the probability of $x$ belonging to class 1 in a binary classification problem. This is similar to the logistic model we saw in Lecture 2. You can verify that for the values of $A$, $k$, $b$ you found in the previous part, $F(x; A, k, b)$ is a valid probability as well. Suppose we know $F(x; A, k, b)$ for a datapoint $x$ and want to predict its label. We need to decide on a threshold value for $F(x; A, k, b)$, above which we will predict the label 1 and below which we will predict $-1$. Show that setting the threshold to be $F(x; A, k, b) \geq 1/2$ minimizes the classification error.
For the logistic model, the value $x = x_0$ for which $F(x_0) = 0.5$ is called the decision boundary. In the univariate case it is a point on the $x$-axis; in the multivariate case (next question) it is a hyperplane. There is a rich literature on decision theory which studies how to make decisions if we know the probabilities of the underlying events (see https://mlstory.org/pdf/prediction.pdf to learn more if you're interested).
2.4 (4pts) We now consider the multivariate version. Consider the function:
$$F(\mathbf{x}; \mathbf{w}, b) = \frac{1}{1 + e^{-\mathbf{w}^T\mathbf{x} + b}}. \tag{5}$$
You can explore a 2D version of this function at https://www.geogebra.org/3d/g9fjtdeh, where $\mathbf{w} = (w_1, w_2)$. Match the items in the left column with those on the right. It will help if you can derive the expression for the gradient of $F(\mathbf{x}; \mathbf{w}, b)$ w.r.t. $\mathbf{x}$. Provide a short explanation of your answers.
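If you prefer to experiment locally instead of using the GeoGebra link, the following sketch (our own illustration, with arbitrarily chosen $\mathbf{w}$ and $b$) evaluates the 2D version of $F$ on a grid and draws its contours:

```python
import numpy as np
import matplotlib.pyplot as plt

def multivariate_logistic(X, w, b):
    """Evaluate F(x; w, b) = 1 / (1 + exp(-(w^T x) + b)) for each row of X (shape (n, d))."""
    return 1.0 / (1.0 + np.exp(-(X @ w) + b))

# Arbitrary example parameters for a 2D input.
w = np.array([1.0, -2.0])
b = 0.5

x1, x2 = np.meshgrid(np.linspace(-5, 5, 200), np.linspace(-5, 5, 200))
grid = np.column_stack([x1.ravel(), x2.ravel()])
F = multivariate_logistic(grid, w, b).reshape(x1.shape)

plt.contourf(x1, x2, F, levels=20)
plt.colorbar(label="F(x; w, b)")
plt.xlabel("x1")
plt.ylabel("x2")
plt.show()
```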
Problem 3: Learning rectangles (15pts)
An axis-aligned rectangle classifier is a classifier that assigns the value 1 to a point if and only if it is inside a certain rectangle. Formally, given real numbers $a_1 \leq b_1$, $a_2 \leq b_2$, define the classifier $f_{(a_1,b_1,a_2,b_2)}$ on an input $\mathbf{x}$ with coordinates $(x_1, x_2)$ to output 1 if and only if $a_1 \leq x_1 \leq b_1$ and $a_2 \leq x_2 \leq b_2$.
The function class of all axis-aligned rectangles in the plane is defined as
$$\mathcal{F}^2_{rec} = \{f_{(a_1,b_1,a_2,b_2)}(x_1, x_2) : a_1 \leq b_1,\ a_2 \leq b_2\}.$$
We will assume that the true labels $y$ of the datapoints $(\mathbf{x}, y)$ are given by some axis-aligned rectangle (this is the realizability assumption discussed in class). The goal of this question is to come up with an algorithm which gets small classification error with respect to any distribution $\mathcal{D}$ over $(\mathbf{x}, y)$ with good probability.
The loss function we use throughout the question is the 0-1 loss. It will be convenient to denote the rectangle with corners $(a_1, b_1, a_2, b_2)$ by $B(a_1, b_1, a_2, b_2)$. Let $B^* = B(a_1^*, b_1^*, a_2^*, b_2^*)$ be the rectangle corresponding to the function $f_{(a_1^*, b_1^*, a_2^*, b_2^*)}$ which labels the datapoints. Let $S = \{(\mathbf{x}_i, y_i), i \in [n]\}$ be a training set of $n$ samples drawn i.i.d. from $\mathcal{D}$. Please see Fig. 2 for an example.
3.1 (3pts) We will follow the general supervised learning framework from class. Given the 0-1 loss and the function class of axis-aligned rectangles, we want to find an empirical risk minimizer. Consider the algorithm which returns the smallest rectangle enclosing all positive examples in the training set. Prove that this algorithm is an empirical risk minimizer.
3.2 (2pts) Our next task is to show that the algorithm from the previous part not only does well on the training data, but also gets small classification error with respect to the distribution $\mathcal{D}$. Let $B_S$ be the rectangle returned by the algorithm in the previous part on training set $S$, and let $f_S^{ERM}$ be the corresponding hypothesis. First, we will convince ourselves that generalization is inherently a probabilistic statement. Let a bad training set $S'$ be a training set such that $R(f_{S'}^{ERM}) \geq 0.5$. Pick some simple distribution $\mathcal{D}$ and ground-truth rectangle $B^*$, and give a short explanation for why there is a non-zero probability of seeing a bad training set.
3.3 (5pts) We will now show that with good probability over the training dataset $S$, $f_S^{ERM}$ does get small error. Show that if $n \geq \frac{4\log(4/\delta)}{\epsilon}$, then with probability at least $1 - \delta$, $R(f_S^{ERM}) \leq \epsilon$.
To prove this, follow these steps. Let $a_1 \geq a_1^*$ be such that the probability mass (with respect to $\mathcal{D}$) of the rectangle $B_1 = B(a_1^*, a_1, a_2^*, b_2^*)$ is exactly $\epsilon/4$. Similarly, let $b_1, a_2, b_2$ be numbers such that the probability masses (with respect to $\mathcal{D}$) of the rectangles $B_2 = B(b_1, b_1^*, a_2^*, b_2^*)$, $B_3 = B(a_1^*, b_1^*, a_2^*, a_2)$, and $B_4 = B(a_1^*, b_1^*, b_2, b_2^*)$ are all exactly $\epsilon/4$.
• Show that $B_S \subseteq B^*$.
• Show that if $S$ contains (positive) examples in all of the rectangles $B_1, B_2, B_3, B_4$, then $R(f_S^{ERM}) \leq \epsilon$.
• For each $i \in \{1, \ldots, 4\}$, upper bound the probability that $S$ does not contain an example from $B_i$.
• Use the union bound to conclude the argument.
3.4 (5pts) Repeat the previous question for the function class of axis-aligned rectangles in $\mathbb{R}^d$. To show this, try to generalize all the steps in the above question to higher dimensions, and find the number of training samples $n$ required to guarantee that $R(f_S^{ERM}) \leq \epsilon$ with probability at least $1 - \delta$.
Programming-based Questions
Before you start the homework problems in this part, you need to first set up the coding environment. We use python3 (version ≥ 3.7) in our programming-based questions. There are multiple ways you can install python3, for example:
• You can use conda to configure a python3 environment for all programming assignments.
• Alternatively, you can use virtualenv to configure a python3 environment for all programming assignments.
After you have a python3 environment, you will need to install the following python packages:
• numpy
• matplotlib (for plotting figures)
Note: You are not allowed to use other packages, such as tensorflow, pytorch, keras, scikit-learn, scipy, etc., to help you implement the algorithms you learned. If you have other package requests, please ask first before using them.
Problem 4: k-nearest neighbor classification and the ML pipeline (25pts)
In class, we talked about how we need to do a training/test split to make sure that our model is generalizing well. We also discussed how we should not reuse a test set too much, because otherwise the test accuracy may not be an accurate measure of the accuracy on the data distribution. In reality, an ML model often has many hyper-parameters which need to be tuned (we will see an example in this question). We don't want to use the test set over and over to see what the best value of this hyper-parameter is. The solution is to have a third split of the data, and create a validation set. The idea is to use the validation set to evaluate results from the training set and tune any hyper-parameters. Then, use the test set to double-check your evaluation after the model has "passed" the validation set. Please see this nice explanation for more discussion: https://developers.google.com/machine-learning/crash-course/validation/another-partition.
With this final piece, we are now ready to build a real ML pipeline. It usually consists of three parts: (1) load and pre-process the data; (2) train a model on the training set and use the validation set to tune hyper-parameters; (3) evaluate the final model on the test set and report the result. In this problem, you will implement the $k$-nearest neighbor ($k$-NN) algorithm to conduct classification tasks. We provide the bootstrap code and you are expected to complete the classes and functions. You can find the code and data at https://vatsalsharan.github.io/fall22/knn_hw1.zip.
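To make the three parts concrete, here is a rough sketch of how such a pipeline might be organized. The function names (load_data, train_and_tune, evaluate) are placeholders of our own and are not functions from the provided knn.py starter code.

```python
import numpy as np

def run_pipeline(load_data, train_and_tune, evaluate):
    """Illustrative pipeline skeleton; the three callables are hypothetical placeholders."""
    # (1) Load and pre-process the data, split into train/validation/test.
    X_train, y_train, X_val, y_val, X_test, y_test = load_data()

    # (2) Train on the training set; use the validation set to pick hyper-parameters (e.g., k).
    best_model = train_and_tune(X_train, y_train, X_val, y_val)

    # (3) Evaluate the final model on the test set and report the result once.
    test_error = evaluate(best_model, X_test, y_test)
    return test_error
```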
k-NN algorithm
The $k$-nearest neighbor ($k$-NN) algorithm is one of the simplest machine learning algorithms in the supervised learning paradigm. The idea behind $k$-NN is simple, and we explain it first for the case of $k = 1$. The 1-NN algorithm predicts the label of any new datapoint $\mathbf{x}$ by finding its closest neighbor $\mathbf{x}'$ in the training set, and then predicts the label of $\mathbf{x}$ to be the same as the label of $\mathbf{x}'$. For general $k$, the $k$-NN algorithm predicts the label by taking a majority vote over the $k$ nearest neighbors.
We now describe the algorithm more rigorously. Given a hyper-parameter $k$, training instances $(\mathbf{x}_i, y_i)$ (where $\mathbf{x}_i \in \mathbb{R}^d$ and $y_i$ is the label), and a test example $\mathbf{x}$, the $k$-NN algorithm can be executed based on the following steps:
1. Calculate the distances between the test example and each of the training examples.
2. Take the $k$ nearest neighbors based on the distances calculated in the previous step.
3. Among these $k$ nearest neighbors, count the number of data points in each class.
4. Predict the label $\hat{y}$ of $\mathbf{x}$ to be the most frequent class among these neighbors (we describe how to break any ties later).
You are asked to implement the missing functions in knn.py following each of the steps.
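For orientation only, here is one way the four steps could look in NumPy. This is an independent sketch and does not match the exact structure or function signatures expected in knn.py.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=4):
    """Sketch of the k-NN steps above (assumes integer class labels starting at 0)."""
    # Step 1: Euclidean distances between every test example and every training example.
    dists = np.sqrt(((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2))

    predictions = []
    for row in dists:
        # Step 2: indices of the k nearest training examples.
        nearest = np.argsort(row)[:k]
        # Step 3: count how many of the k neighbors fall in each class.
        counts = np.bincount(y_train[nearest])
        # Step 4: most frequent class; np.argmax returns the smallest label on ties.
        predictions.append(np.argmax(counts))
    return np.array(predictions)
```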
Part 4.1 Report 4-nearest neighbor accuracy (8pts)
Euclidean distance calculation
Compute the distance between data points based on the following equation (the Euclidean, or $\ell_2$, distance):
$$d(\mathbf{x}, \mathbf{x}') = \|\mathbf{x} - \mathbf{x}'\|_2 = \sqrt{\sum_{j=1}^{d} (x_j - x'_j)^2}.$$
Task: Fill in the code for the function compute_l2_distances.
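One common way to compute all pairwise Euclidean distances without explicit Python loops uses the expansion $\|\mathbf{a}-\mathbf{b}\|^2 = \|\mathbf{a}\|^2 + \|\mathbf{b}\|^2 - 2\,\mathbf{a}^T\mathbf{b}$. The sketch below only illustrates that idea; its interface may differ from what knn.py expects.

```python
import numpy as np

def pairwise_l2_distances(X_test, X_train):
    """Return an (n_test, n_train) matrix of Euclidean distances, computed without loops."""
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, applied to every (test, train) pair.
    sq = (X_test ** 2).sum(axis=1, keepdims=True) \
         + (X_train ** 2).sum(axis=1) \
         - 2.0 * X_test @ X_train.T
    # Clip tiny negative values caused by floating-point error before the square root.
    return np.sqrt(np.maximum(sq, 0.0))
```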
k-NN classifier
Implement a $k$-NN classifier based on the above steps. Your algorithm should output the predictions for the test set. Note: You do not need to worry about ties in the distance when finding the $k$ nearest neighbor set.
However, when there are ties in the majority label of the $k$ nearest neighbor set, you should return the label with the smallest index. For example, when $k = 4$, if the labels of the 4 nearest neighbors happen to be 0, 0, 1, 1, your prediction should be the label 0.
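Assuming the labels are non-negative integers (an assumption on our part), one convenient way to get this tie-breaking behaviour is that np.argmax applied to np.bincount returns the smallest label when counts are tied:

```python
import numpy as np

neighbor_labels = np.array([0, 0, 1, 1])   # the k = 4 nearest neighbors' labels
counts = np.bincount(neighbor_labels)      # counts per label: [2, 2]
prediction = np.argmax(counts)             # ties go to the smallest label -> 0
print(prediction)                          # 0
```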
Task: Fill in the code for the function predict_labels.
Report the error rate
We want you to report the error rate for the classification task. The error rate is defined as:
$$\text{error rate} = \frac{\text{number of incorrectly predicted examples}}{\text{total number of examples}}.$$
Task: Fill in the code for the function compute_error_rate.
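As a sanity check of the definition (a standalone illustration, not the required compute_error_rate implementation), the error rate is simply the fraction of mismatched predictions:

```python
import numpy as np

def error_rate(y_pred, y_true):
    """Fraction of predictions that do not match the true labels."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return float(np.mean(y_pred != y_true))

print(error_rate([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.25
```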
Report Item: Report the error rate of your $k$-nearest neighbor algorithm on the validation set when $k = 4$ using Euclidean distance.
Part 4.2 Data transformation (2+2pts)
We are going to add one more step (data transformation) in the data processing part and see how it works. Data transformations and other feature engineering steps often play a crucial role in making a machine learning model work. Here, we take two different data transformation approaches. This link might be helpful: https://en.wikipedia.org/wiki/Feature_scaling.
Normalizing the feature vector
This one is simple but sometimes may work well. Given a feature vector $\mathbf{x}$, the normalized feature vector is given by $\tilde{\mathbf{x}} = \frac{\mathbf{x}}{\|\mathbf{x}\|_2}$. If a vector is an all-zero vector, define the normalized vector to also be the all-zero vector (in practice, a useful trick is to add a very small value to the norm of $\mathbf{x}$, so that we do not need to worry about division-by-zero errors).
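A minimal sketch of this transformation (our own illustration; the assignment's data_processing_with_transformation function may organize it differently):

```python
import numpy as np

def normalize_rows(X, eps=1e-12):
    """Scale each row of X to unit L2 norm; all-zero rows stay all-zero (eps avoids 0/0)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / (norms + eps)
```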
Min-max scaling for each feature
The above normalization is independent of the rest of the data. On the other hand, min-max normalization scales each sample in a way that depends on the rest of the data. More specifically, for each feature in the training set, we normalize it linearly so that its value is between 0 and 1 across all samples, and in addition, the largest value becomes exactly 1 while the smallest becomes exactly 0. Then, we will apply the same scaling parameters we get from the training data to our testing instances.
Task: Fill in the code for the function data_processing_with_transformation.
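Here is a hedged sketch of the min-max idea, computing the per-feature minimum and maximum on the training set and reusing them on the test set. The function and variable names are our own, not the starter code's.

```python
import numpy as np

def minmax_fit_transform(X_train, X_test, eps=1e-12):
    """Scale each feature to [0, 1] using training-set statistics; apply the same to the test set."""
    x_min = X_train.min(axis=0)
    x_max = X_train.max(axis=0)
    scale = np.maximum(x_max - x_min, eps)    # eps guards against constant features
    X_train_scaled = (X_train - x_min) / scale
    X_test_scaled = (X_test - x_min) / scale  # same parameters as the training data
    return X_train_scaled, X_test_scaled
```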
Report Item: Report the error rate of your $k$-nearest neighbor algorithm on the validation set for $k = 4$ using Euclidean distance when the data is transformed using (1) the normalized feature vector, and (2) min-max feature scaling.
Part 4.3 Different distance measurement (3pts)
In this part, we will change the way that we measure distances. We will work with the original (unnormalized) data for simplicity, and continue the experiments from Part 4.1.