CSCI567: Machine Learning


Theory-based Questions

Problem 1: Perceptron Convergence (15pts)

Recall the perceptron algorithm that we saw in class. The perceptron algorithm comes with strong theory, and you will explore some of this theory in this problem. We begin with some remarks related to notation which are valid throughout the homework. Unless stated otherwise, scalars are denoted by small letters in normal font, vectors are denoted by small letters in bold font, and matrices are denoted by capital letters in bold font.

Problem 1 asks you to show that when the two classes in a binary classification problem are linearly separable, then the perceptron algorithm will converge. For the sake of this problem, we define convergence as predicting the labels of all training instances perfectly. The perceptron algorithm is described in Algorithm 1. It gets access to a dataset of $n$ instances $(x_i, y_i)$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$. It outputs a linear classifier $w$.

Algorithm 1 Perceptron

while not converged do
    Pick a data point $(x_i, y_i)$ randomly
    Make a prediction $\hat{y} = \mathrm{sgn}(w^\top x_i)$ using the current $w$
    if $\hat{y} \neq y_i$ then
        $w \leftarrow w + y_i x_i$
    end if
end while
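To make the update rule concrete, here is a minimal NumPy sketch of Algorithm 1. It assumes the training data is stored as an $n \times d$ array X and a label vector y with entries in $\{-1, +1\}$; the function name, the random-sampling details, and the max_iters cap are illustrative choices, not part of the assignment.

import numpy as np

def perceptron(X, y, max_iters=10000, seed=0):
    """Sketch of Algorithm 1: X is n x d, y has entries in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)                          # all-zero initialization, as in Problem 1.3
    for _ in range(max_iters):
        if np.all(np.sign(X @ w) == y):      # converged: every training point is predicted correctly
            break
        i = rng.integers(n)                  # pick a data point (x_i, y_i) randomly
        if np.sign(w @ X[i]) != y[i]:        # mistake: sgn(w^T x_i) disagrees with y_i
            w = w + y[i] * X[i]              # update rule: w <- w + y_i x_i
    return w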

Assume there exists an optimal hyperplane $w_{\mathrm{opt}}$ with $\|w_{\mathrm{opt}}\|_2 = 1$, and some $\gamma > 0$, such that $y_i (w_{\mathrm{opt}}^\top x_i) \geq \gamma$ for all $i \in \{1, 2, \ldots, n\}$. Additionally, assume $\|x_i\|_2 \leq R$ for all $i \in \{1, 2, \ldots, n\}$. Following the steps below, show that the perceptron algorithm makes at most $\frac{R^2}{\gamma^2}$ errors, and therefore the algorithm must converge.

1.1 (5pts) Show that if the algorithm makes a mistake, the update rule moves it towards the direction of the optimal weights $w_{\mathrm{opt}}$. Specifically, denoting the updating iteration index explicitly by $k$, the current weight vector by $w_k$, and the updated weight vector by $w_{k+1}$, show that, if $y_i (w_k^\top x_i) < 0$, we have

$$w_{k+1}^\top w_{\mathrm{opt}} \geq w_k^\top w_{\mathrm{opt}} + \gamma \|w_{\mathrm{opt}}\|_2. \tag{1}$$

1.2 (4pts) Show that the length of the updated weight vector does not increase by a large amount. In particular, show that if $y_i (w_k^\top x_i) < 0$, then

$$\|w_{k+1}\|_2^2 \leq \|w_k\|_2^2 + R^2. \tag{2}$$

1.3 (5pts) Assume that the initial weight vector $w_0 = \mathbf{0}$ (an all-zero vector). Using the results from the previous two parts, show that for any iteration $k + 1$, with $M$ being the total number of mistakes the algorithm has made in the first $k$ iterations, we have

$$\gamma M \leq \|w_{k+1}\|_2 \leq R \sqrt{M}. \tag{3}$$

Hint: use the Cauchy-Schwarz inequality: $a^\top b \leq \|a\|_2 \|b\|_2$.

1.4 (1pt) Use 1.3 to conclude that $M \leq \frac{R^2}{\gamma^2}$. Therefore the algorithm can make at most $\frac{R^2}{\gamma^2}$ mistakes (note that there is no direct dependence on the dimensionality of the datapoints).

Problem 2: Logistic Regression (10pts)

This problem explores some properties of the logistic function to get you more comfortable with it. Consider the following univariate function first:

$$F(x; A, k, b) = \frac{A}{1 + e^{-k(x - b)}} \tag{4}$$

2.1 (3pts) Describe in words how $A$, $k$ and $b$ affect or are related to the shape of $F(x; A, k, b)$ by using the plot at https://www.geogebra.org/graphing/mxw7wp8b. Note: We are trying to learn the parameters of a logistic function that can fit the dataset. For example, see the figure below and guess which of the four functions fit well.
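To build intuition before answering, one can also reproduce the GeoGebra exploration locally with matplotlib; the sketch below plots Eq. (4) for a few illustrative parameter settings (the particular values of $A$, $k$, $b$ are arbitrary choices).

import numpy as np
import matplotlib.pyplot as plt

def F(x, A, k, b):
    # Univariate logistic function of Eq. (4)
    return A / (1.0 + np.exp(-k * (x - b)))

x = np.linspace(-10, 10, 400)
for A, k, b in [(1, 1, 0), (2, 1, 0), (1, 5, 0), (1, 1, 3)]:
    plt.plot(x, F(x, A, k, b), label=f"A={A}, k={k}, b={b}")
plt.xlabel("x")
plt.ylabel("F(x; A, k, b)")
plt.legend()
plt.show()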

2.2 (1pt) For what values of $A$, $k$ and $b$ is $F(x; A, k, b)$ a cumulative distribution function (CDF)?

2.3 (2pts) Now let $F(x; A, k, b) = \Pr[y = 1 \mid x; A, k, b]$, the probability of $x$ belonging to class 1 in a binary classification problem. This is similar to the logistic model we saw in Lecture 2. You can verify that for the values of $A, k, b$ you found in the previous part, $F(x; A, k, b)$ is a valid probability as well. Suppose we know $F(x; A, k, b)$ for a datapoint $x$ and want to predict its label. We need to decide on a threshold value for $F(x; A, k, b)$, above which we will predict the label 1 and below which we will predict $-1$. Show that setting the threshold to $1/2$ (i.e., predicting 1 when $F(x; A, k, b) \geq 1/2$) minimizes the classification error.

For the logistic model, the value $x = x_0$ for which $F(x_0) = 0.5$ is called the decision boundary. In the univariate case, it is a point on the $x$-axis; in the multivariate case (next question), it is a hyperplane. There is a rich literature on decision theory which studies how to make decisions if we know the probabilities of the underlying events (see https://mlstory.org/pdf/prediction.pdf to learn more if you're interested).
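As a concrete instance of this definition (assuming $A = 1$, so that $F$ takes values in $(0, 1)$), solving $F(x_0) = 1/2$ in Eq. (4) gives

$$\frac{1}{1 + e^{-k(x_0 - b)}} = \frac{1}{2} \iff e^{-k(x_0 - b)} = 1 \iff k(x_0 - b) = 0,$$

so for any $k \neq 0$ the decision boundary of the univariate model sits at $x_0 = b$.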

2.4 (4pts) We now consider the multivariate version. Consider the function:

$$F(x; w, b) = \frac{1}{1 + e^{-w^\top x + b}}. \tag{5}$$

You can explore a 2D version of this function at https://www.geogebra.org/3d/g9fjtdeh, where $w = (w_1, w_2)$. Match the items in the left column with those on the right. It will help if you can get the expression for the gradient of $F(x; w, b)$ w.r.t. $x$. Provide a short explanation of your answers.
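A hand-derived gradient can be sanity-checked numerically with central finite differences; below is a small sketch under the reconstruction of Eq. (5) given above (the helper names and the test values of $w$, $b$, $x$ are made up for illustration).

import numpy as np

def F(x, w, b):
    # Multivariate logistic function of Eq. (5), as written above
    return 1.0 / (1.0 + np.exp(-(w @ x) + b))

def numerical_grad_x(x, w, b, eps=1e-6):
    # Central finite differences of F with respect to x
    g = np.zeros_like(x)
    for j in range(len(x)):
        e = np.zeros_like(x)
        e[j] = eps
        g[j] = (F(x + e, w, b) - F(x - e, w, b)) / (2 * eps)
    return g

w, b = np.array([1.0, -2.0]), 0.5
x = np.array([0.3, 0.7])
print(numerical_grad_x(x, w, b))   # compare against your closed-form expression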

Problem 3: Learning rectangles (15pts)

An axis-aligned rectangle classifier is a classifier that assigns the value 1 to a point if and only if it is inside a certain rectangle. Formally, given real numbers $a_1 \leq b_1$, $a_2 \leq b_2$, define the classifier $f_{(a_1, b_1, a_2, b_2)}$ on an input $x$ with coordinates $(x_1, x_2)$ by

$$f_{(a_1, b_1, a_2, b_2)}(x_1, x_2) = \begin{cases} 1 & \text{if } a_1 \leq x_1 \leq b_1 \text{ and } a_2 \leq x_2 \leq b_2, \\ 0 & \text{otherwise.} \end{cases}$$

The function class of all axis-aligned rectangles in the plane is defined as

$$\mathcal{F}^2_{\mathrm{rec}} = \{ f_{(a_1, b_1, a_2, b_2)} : a_1 \leq b_1, \ a_2 \leq b_2 \}.$$

We will assume that the true labels $y$ of the datapoints $(x, y)$ are given by some axis-aligned rectangle (this is the realizability assumption discussed in class). The goal of this question is to come up with an algorithm which gets small classification error with respect to any distribution $\mathcal{D}$ over $(x, y)$ with good probability.

The loss function we use throughout the question is the 0-1 loss. It will be convenient to denote a rectangle marked by corners $(a_1, b_1, a_2, b_2)$ as $B(a_1, b_1, a_2, b_2)$. Let $B^* = B(a_1^*, b_1^*, a_2^*, b_2^*)$ be the rectangle corresponding to the function $f_{(a_1^*, b_1^*, a_2^*, b_2^*)}$ which labels the datapoints. Let $S = \{(x_i, y_i), i \in [n]\}$ be a training set of $n$ samples drawn i.i.d. from $\mathcal{D}$. Please see Fig. 2 for an example.

3.1 (3pts) We will follow the general supervised learning framework from class. Given the 0-1 loss and the function class of axis-aligned rectangles, we want to find an empirical risk minimizer. Consider the algorithm which returns the smallest rectangle enclosing all positive examples in the training set. Prove that this algorithm is an empirical risk minimizer.
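For concreteness, the algorithm in question can be sketched in a few lines of NumPy; this is only an illustration (it assumes labels in $\{0, 1\}$ with 1 marking positive examples and at least one positive point in the training set), not the required proof.

import numpy as np

def tightest_rectangle(X, y):
    # Smallest axis-aligned rectangle enclosing all positive examples (label 1).
    P = X[y == 1]                       # assumes at least one positive example
    return P[:, 0].min(), P[:, 0].max(), P[:, 1].min(), P[:, 1].max()

def predict(rect, X):
    a1, b1, a2, b2 = rect
    inside = (X[:, 0] >= a1) & (X[:, 0] <= b1) & (X[:, 1] >= a2) & (X[:, 1] <= b2)
    return inside.astype(int)           # 1 inside the rectangle, 0 outside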

3.2 (2pts) Our next task is to show that the algorithm from the previous part not only does well on the training data, but also gets small classification error with respect to the distribution $\mathcal{D}$. Let $B_S$ be the rectangle returned by the algorithm in the previous part on training set $S$, and let $f_S^{\mathrm{ERM}}$ be the corresponding hypothesis. First, we will convince ourselves that generalization is inherently a probabilistic statement. Let a bad training set $S'$ be a training set such that $R(f_{S'}^{\mathrm{ERM}}) \geq 0.5$. Pick some simple distribution $\mathcal{D}$ and ground-truth rectangle $B^*$, and give a short explanation for why there is a non-zero probability of seeing a bad training set.

3.3 (5pts) We will now show that with good probability over the training dataset $S$, $f_S^{\mathrm{ERM}}$ does get small error. Show that if $n \geq \frac{4 \log(4/\delta)}{\epsilon}$, then with probability at least $1 - \delta$, $R(f_S^{\mathrm{ERM}}) \leq \epsilon$.

To prove this, follow these steps. Let $a_1 \geq a_1^*$ be such that the probability mass (with respect to $\mathcal{D}$) of the rectangle $B_1 = B(a_1^*, a_1, a_2^*, b_2^*)$ is exactly $\epsilon/4$. Similarly, let $b_1, a_2, b_2$ be numbers such that the probability masses (with respect to $\mathcal{D}$) of the rectangles $B_2 = B(b_1, b_1^*, a_2^*, b_2^*)$, $B_3 = B(a_1^*, b_1^*, a_2^*, a_2)$, and $B_4 = B(a_1^*, b_1^*, b_2, b_2^*)$ are all exactly $\epsilon/4$.

• Show that $B_S \subseteq B^*$.

• Show that if $S$ contains (positive) examples in all of the rectangles $B_1, B_2, B_3, B_4$, then $R(f_S^{\mathrm{ERM}}) \leq \epsilon$.

• For each $i \in \{1, \ldots, 4\}$, upper bound the probability that $S$ does not contain an example from $B_i$.

• Use the union bound to conclude the argument.

3.4 (5pts) Repeat the previous question for the function class of axis-aligned rectangles in $\mathbb{R}^d$. To show this, try to generalize all the steps in the above question to higher dimensions, and find the number of training samples $n$ required to guarantee that $R(f_S^{\mathrm{ERM}}) \leq \epsilon$ with probability at least $1 - \delta$.

Programming-based Questions

Before you start on the homework in this part, you need to first set up the coding environment. We use python3 (version 3.7) in our programming-based questions. There are multiple ways you can install python3, for example:

• You can use conda to configure a python3 environment for all programming assignments.

• Alternatively, you can also use virtualenv to configure a python3 environment for all programming assignments.

After you have a python3 environment, you will need to install the following python packages:

• numpy

• matplotlib (for plotting figures)

Note: You are not allowed to use other packages, such as tensorflow, pytorch, keras, scikit-learn, scipy, etc., to help you implement the algorithms you learned. If you have other package requests, please ask first before using them.

Problem 4: k-nearest neighbor classification and the ML pipeline (25pts)

In class, we talked about how we need to do a training/test split to make sure that our model is generalizing well. We also discussed how we should not reuse a test set too much, because otherwise the test accuracy may not be an accurate measure of the accuracy on the data distribution. In reality, an ML model often has many hyper-parameters which need to be tuned (we will see an example in this question). We don't want to use the test set over and over to see what the best value of this hyper-parameter is. The solution is to have a third split of the data, and create a validation set. The idea is to use the validation set to evaluate results from the training set and tune any hyper-parameters. Then, use the test set to double-check your evaluation after the model has "passed" the validation set. Please see this nice explanation for more discussion: https://developers.google.com/machine-learning/crash-course/validation/another-partition.

With this final piece, we are now ready to build a real ML pipeline. It usually consists of three parts. (1) Load and pre-process the data. (2) Train a model on the training set and use the validation set to tune hyper-parameters. (3) Evaluate the final model on the test set and report the result. In this problem, you will implement the k-nearest neighbor (k-NN) algorithm to conduct classification tasks. We provide the bootstrap code and you are expected to complete the classes and functions. You can find the code and data at https://vatsalsharan.github.io/fall22/knn_hw1.zip.

k-NN algorithm

The k-nearest neighbor (k-NN) algorithm is one of the simplest machine learning algorithms in the supervised learning paradigm. The idea behind k-NN is simple, and we explain it first for the case of k = 1. The 1-NN algorithm predicts the label of any new datapoint $x$ by finding its closest neighbor $x'$ in the training set, and then predicting the label of $x$ to be the same as the label of $x'$. For general k, the k-NN algorithm predicts the label by taking a majority vote over the k nearest neighbors.

We now describe the algorithm more rigorously. Given a hyper-parameter $k$, training instances $(x_i, y_i)$ (where $x_i \in \mathbb{R}^d$ and $y_i$ is the label), and a test example $x$, the k-NN algorithm can be executed based on the following steps:

1. Calculate the distances between the test example and each of the training examples.

2. Take the k nearest neighbors based on the distances calculated in the previous step.

3. Among these k nearest neighbors, count the number of data points in each class.

4. Predict the label $\hat{y}$ of $x$ to be the most frequent class among these neighbors (we describe how to break any ties later).

You are asked to implement the missing functions in knn.py following each of the steps.

Part 4.1 Report 4-nearest neighbor accuracy (8pts)

Euclidean distance calculation

Compute the distance between the data points based on the following equation:

$$d(x, x') = \|x - x'\|_2 = \sqrt{\sum_{j=1}^{d} (x_j - x'_j)^2}$$

Task: Fill in the code for the function compute_l2_distances.
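As an aside, all pairwise Euclidean distances can be computed with NumPy broadcasting; the sketch below shows one possible approach and is not necessarily how the provided compute_l2_distances stub is organized.

import numpy as np

def l2_distances(X_test, X_train):
    # Entry (i, j) is the Euclidean distance between test point i and training point j.
    diff = X_test[:, None, :] - X_train[None, :, :]   # shape (m, n, d) via broadcasting
    return np.sqrt((diff ** 2).sum(axis=2))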

k-NN classifier

Implement a k-NN classifier based on the above steps. Your algorithm should output the predictions for the test set. Note: You do not need to worry about ties in the distance when finding the k nearest neighbor set.

However, when there are ties in the majority label of the k nearest neighbor set, you should return the label with the smallest index. For example, when k = 4, if the labels of the 4 nearest neighbors happen to be 0, 0, 1, 1, your prediction should be the label 0.
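One convenient way to implement this tie-breaking rule (assuming non-negative integer class labels) is np.bincount followed by np.argmax, since argmax returns the first, i.e. smallest, index among tied counts:

import numpy as np

neighbor_labels = np.array([0, 0, 1, 1])              # labels of the k = 4 nearest neighbors
prediction = np.argmax(np.bincount(neighbor_labels))  # counts are [2, 2]; the tie goes to the smallest label, so 0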

Task: Fill in the code for the function predict_labels.

Report the error rate

We want you to report the error rate for the classification task. The error rate is defined as:

$$\text{error rate} = \frac{\text{number of misclassified points}}{\text{total number of points}}$$

Task: Fill in the code for the function compute_error_rate.
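With the predictions in hand, the error rate defined above is a one-liner in NumPy; the function name here is illustrative, not the provided stub.

import numpy as np

def error_rate(y_pred, y_true):
    # Fraction of points whose predicted label differs from the true label
    return float(np.mean(np.asarray(y_pred) != np.asarray(y_true)))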

Report Item: Report the error rate of your k-nearest neighbor algorithm on the validation set when k = 4 using Euclidean distance.

Part 4.2 Data transformation (2+2pts)

We are going to add one more step (data transformation) in the data processing part and see how it works. Data transformations and other feature engineering steps often play a crucial role in making a machine learning model work. Here, we take two different data transformation approaches. This link might be helpful: https://en.wikipedia.org/wiki/Feature_scaling.

Normalizing the feature vector

This one is simple but sometimes may work well. Given a feature vector $x$, the normalized feature vector is given by $\tilde{x} = \frac{x}{\|x\|_2}$. If a vector is an all-zero vector, define the normalized vector to also be the all-zero vector (in practice, a useful trick is to add a very small value to the norm of $x$, so that we do not need to worry about division-by-zero errors).
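A minimal sketch of this per-sample normalization, with the small additive constant mentioned above guarding against all-zero rows (the constant's value is an arbitrary illustrative choice):

import numpy as np

def normalize_rows(X, eps=1e-12):
    # Scale each feature vector (row) to unit L2 norm.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / (norms + eps)              # eps avoids division by zero for all-zero rows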

Min-max scaling for each feature

The above normalization is independent of the rest of the data. On the other hand, min-max normalization scales each sample in a way that depends on the rest of the data. More specifically, for each feature in the training set, we normalize it linearly so that its value is between 0 and 1 across all samples, and in addition, the largest value becomes exactly 1 while the smallest becomes exactly 0. Then, we apply the same scaling parameters we get from the training data to our testing instances.
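A sketch of min-max scaling that fits the per-feature parameters on the training set and reuses them on other splits (the eps guard against constant features is an extra illustrative safeguard, not something the assignment specifies):

import numpy as np

def minmax_fit(X_train):
    # Per-feature minimum and maximum, computed on the training set only.
    return X_train.min(axis=0), X_train.max(axis=0)

def minmax_apply(X, x_min, x_max, eps=1e-12):
    # Map each feature to [0, 1] using the training-set parameters.
    return (X - x_min) / (x_max - x_min + eps)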

Task: Fill in the code for the function data_processing_with_transformation.

Report Item: Report the error rate of your k-nearest neighbor algorithm on the validation set for k = 4 using Euclidean distance when the data is transformed using (1) the normalized feature vector, and (2) min-max scaling.

Part 4.3 Different distance measurement (3pts)

In this part, we will change the way that we measure distances. We will work with the original (unnormalized) data for simplicity, and continue the experiments from Part 4.1.
