3.5 「Stanford Algorithms」O(n log n) Algorithm for Closest Pair 2 [Advanced - Optional]

Alright. So the plan for this video is to prove the correctness of the divide and

conquer closest to pair algorithm that we discussed in the previous video. So just

to refresh your memory, how does the outer algorithm work? Well, we're given endpoints

in the plane. We begin by sorting them, first by x-coordinate and then by

y-coordinate. That takes n log in time. Then we enter the main recursive divide

and conquer part of the algorithm. So what do we do? We divide the point set into the

left half and the right half, Q and R, then we conquer. We recursively compute

the closest pair in the left half of the point set Q. We recursively compute the

closest pair in the right half of the point set R. There is a lucky case where

the closest pair on the entire point set lies either all on the left or all on the

right. In that case, the closest pair is handed to us on a silver platter,

by one of the two recursive calls. But there remains the unlucky case where the closest

pair is actually split with one point on the left and one point on the right. So to

get our N log N running time bound, analogous to Merge Short in our inversion

counting, we need to have a linear time implementation of a subroutine which

computes the best, the closest pair of points, which is split, one on the left

and one on the right. Well, actually, we don't need to do quite that. We need to

do something only a little bit weaker. We need a linear time algorithm, which whenever

the closest pair in the whole point set is in fact split, then computes that split

pair in linear time. So let me now remind you of how that subroutine works.

So it has two basic steps. So first, there's a filtering step. So it looks at,

first of all, a vertical strip, roughly down the middle of the point set. And it

looks at, only at points which fall into that vertical strip. That was a subset of

the points that we called S sub Y, 'cause we keep track of them sorted by Y

coordinate. And then we do essentially a linear scan through SY. So we go through

the points one at a time, and then, for each point, we look at only the almost

adjacent points. So for each index I, we look only at J's that are between one and

seven positions further to the right, than I. So among all such points, we

compare them, we look at their distances. We remember the best such pair of points.

And then that's what we return from the count split pair subroutine. So we've

already argued, in the previous video, that the overall running time of the

algorithm is N log N. And what remains to prove correctness. And we also argued, in

the previous video, that correctness boils down to the following correctness claim.

In the sense that, if we can prove this claim, then the entire algorithm is

correct. So this is what remains. Our residual work is to provide a proof of

the correctness claim. What does it say? It says consider any split pair that is

one point p from the left side Q, capital Q, and another point little q drawn from

the right side of the point set capital R. And fur, further suppose that it's an

interesting split pair meaning that the distance between them's at most delta.

Here delta is recall the parameter pass to the count split pair subroutine, which is

the smallest distance between a pair of points all on the left or all on the

right. And this is the only case we're interested in. There's two claims. First

of all, for p and q, both members of the split pair survive the filtering step.

They make it into the sorted list S sub Y, and second of all, they will be considered by

that double for loop, in the sense that the positions of p and q in this array, S

sub Y, differ by at most seven. So that's the story so far. Let's move on to the

proof. So let's start with part A which is the easy part relatively of the claim. So

remember what we start with, our assumptions. We have a point p, let's

write it out in terms of the X coordinates, X1 and Y1, which is from the

left half of the point set. And we have a point q, which we'll call X2Y2, which

comes from the right half of the point set. And furthermore, we're assuming that

these points are close to each other. And we're gonna use that hypothesis over and

over again. So the Euclidean distance between p and q is no more than this

parameter delta. So, first, something very simple, which is that if you have two

points that are close in Euclidean distance, then both of their coordinates

have to be close to each other, right? If you have two points, and they differ by a

lot in one coordinate, then the Euclidean distance is gonna be pretty big as well.

So, specifically. By our hypothesis, that p and q have Euclidean distance less than

delta, it must be that the difference between their coordinates in absolute

value is no more than delta, and as well, the difference between their y-coordinates

is at most delta. Okay, and this is easy to see if you'd just return to the

definition of Euclidean distance that we reviewed at the beginning of the

discussion of closest points. Okay? So if your distance is at most delta, then in

each coordinate, you differ by at most delta as well. Now, what does A say?

[sound]. So proof of A. So what does part A of the claim assert? It asserts that p

and q are both members of SY, are both members of that vertical strip. So another

way of saying that is that the X coordinates of p and q, that is, the

numbers X1 and X2 both are within delta of Xbar. Remember, Xbar was in some sense

the median X coordinate. So the X coordinate of the N over two'th leftmost

point. So we're gonna do a proof by picture, so consider, forget about the Y

coordinates, that's of irrelevant right now, and just focus on the X coordinates

of all of these points. So on the one hand we have X bar. This is the X coordinate of

the N over two'th point to the left. And then there are the X coordinates which define

the left and the right borders of that vertical strip. Namely Xbar-delta and

Xbar+delta. And then somewhere in here are X1 and Y1, the X coordinates of the points

we care about, p and q. So a simple observation, so because p comes from the

left half of the point set, and Xbar is the rightmost X coordinate of the left

half of the point set, the X coordinate is at most Xbar. Right? So all points of Q

have X coordinate, at most, Xbar, in particular, p does. Similarly, since Xbar

is the rightmost edge of the left half of the point set, everything in the right half

of the point set has X coordinate, at least Xbar. So in particular, little q

does as well. So what does this mean. So this means x1, wherever it is, has to be

at the left of x bar. X2 wherever it is has to be to the right of x bar. What

we're trying to prove is that they're wedged in between x bar minus delta and x bar

plus delta. And the reason why that's true is because their x coordinates

also differ by at most delta. Okay, so what you should imagine is. You can imagine x1

and x2 are sort of people tied by a rope at the waist. And this rope has

length delta. So wherever x1 and x2 move, they're at most delta apart.

Furthermore x1, we just observed, can't move any farther to the right than Xbar.

So even if X1 moves as far to the right as it can, all the way to Xbar, X2, since

it's at most delta away, tied by the waist, can't extend beyond X bar+ delta. By

the same reasoning, X2 can't move any further to the left than Xbar, X1 being

tied to the waist to X2, can never drift further to the left than Xbar minus delta.

So that's the proof that X1 and X2 both lie within this region, and that defines

the vertical strip. So that's part A. If you have any split pair whose distance between

them is less than delta, they both have to wind up, in this vertical strip. And

therefore wind up in the filtered set, the proof set, S sub Y. So that's part A of

the claim. Let's now move to Part B. Recall what part B asserts. It says that

the points p and q, this split pair that are distance only delta apart. Not only do

they wind up in this sort of filtered set SY, but in fact, they are almost adjacent

in SY, in the sense that the indices in the array differ by, at most, seven

positions. And this is a part of the claim that is a little bit shocking. Really what

this says is that we're getting away with more or less a variant of our one

dimensional algorithm. Remember when we wanted to find the closest pair of points

on the line, all we had to do was sort them by their single coordinate and then

look at consecutive pairs and return the best of those consecutive pairs. Here what

we're saying is really, once we do a suitable filtering focus on points in this

vertical strip, then we just go through the points according to their Y

coordinate. And okay, we don't just look at adjacent pairs. We look at pairs within

seven positions, but still we basically do a linear sweep through the points in SY.

According to their Y coordinate and that's sufficient to identify the closest split

pair. So why on earth will this be true. So our workhorse in this argument will be

a picture which I am going to draw on next. So I'm going to draw eight boxes,

which have a height and width delta over two. So here, delta is the same parameter

that gets passed to the closest split pair subroutine. And it's also the same

delta which we're assuming p and q are closer to each other than, right? So

that's, remember, that's one of our hypotheses in this claim. The distance

between p and q is strictly less than delta. So we're gonna draw eight delta

over two boxes. And they're gonna be centered at x bar. So, this same center of

the vertical strip that defines S Y. And the bottom is going to be the smaller of the

Y-coordinates of the points p and q. So it might be p, it might be q. It

doesn't really matter. But the bottom is going to be the smaller of the two. So

the picture then looks as follows. So the center of these collections of eight

boxes, X bar, the bottom is the minimum of Y1, Y2. We're gonna have two rows and four

columns. And needless to say, we're drawing this picture just for the sake of

this correctness proof, right? This picture is just a thought experiment in

our head. We're just trying to understand why the algorithm works. The algorithm, of

course, does not draw these boxes. The subroutine, the, closest split pair

subroutine is just that pseudo code we saw in the previous video. This is just to

reason about the behavior of that subroutine. Now looking ahead, I'll make

this precise in two lemmas that are about to come up, what's going to be true

is the following. So, either p or q is on this bottom line, right? So we define the

bottom to be the lower Y coordinate of the two. So maybe, for example, q is the one

that has the smaller Y coordinate, in which case, is gonna be, somewhere, say,

down here. P, you remember, is from the left half of the point sets. So p is maybe

gonna be here or something. And we're gonna argue that both p and q have to be

in these boxes. Moreover, we're gonna argue that these boxes are sparsely

populated. Every one contains either zero or one point of the array S sub Y. So,

what we're gonna see is that there's at most eight points in this picture, two of

which are p and q, and therefore, if you look at these points sorted by Y

coordinate, it has to be that they're within seven of each other, the difference

of indices is no more than seven. So, we're gonna make those two statements

precise one at a time by the following two lemmas. Let's start with lemma one. Lemma

one is the easy one. And it states that all of the points of S sub Y, which show up

in between the Y coordinates of the points we care about p and q have to appear in

this picture, they have to lie in one of these eight boxes. So we're going to argue

this in two steps. First, we're going to argue that all such points have to have Y

coordinates within the relevant range of this picture between the minimum of Y1 and

Y2 and delta more than that, and secondly that they have to have X coordinates in

the range of this picture, namely between X bar minus delta and X bar plus delta. So

let's start with Y coordinates. So again, remember this key hypothesis we have,

okay. We're dealing with a split pair p-q that are close to each other. The distance

between X and Y is strictly less than delta. So the very first thing we did at

the beginning of this proof is we said well, if their Euclidean distance is less

than delta then they have to differ by at most delta in both of their coordinates,

in particular in their Y coordinate. Now remember whichever is lower of p and

q, whichever one has a smaller y coordinate is precisely at the bottom of

this diagram. For example, if q is the one with the smaller y coordinate, it might be

on the black line right here. So that means in particular x has y coordinate no

more than the top part of this diagram. No more than delta bigger than q. And of

course all points with Y coordinates in between them are equally well wedged into

this picture. So that's why all points of SY with a Y coordinate between those of p

and q have to be in the range of this picture, between the minimum of the two Y

coordinates and delta more than that. Now what about horizontally? What about the X

coordinates? Well this just follows from the definition of S sub Y. So remember,

S sub Y are the points that fall into this vertical strip. How did we define the

vertical strip? Well it had center Xbar, and then we fattened it by delta on both

sides. So just by definition, if you're an SY, you've gotta have X coordinates in the

range of this picture. X delta plus minus, sorry, xbar plus minus delta. So that

completes the proof of the lemma. So this is not. This is just a lemma. So I'll use a

lower case qed. Remember this is just a step toward proving the overall

correctness claim. But this is a good step. And again, the way you think about

this is it says we draw this boxes. We know that either p or q is at the bottom.

The other one is going to be on the other side of the black line x bar but will be

in some other box so perhaps maybe p is here and the lemma is saying that all the

relevant points of SY have to be somewhere in this picture. Now remember in our

double for loop we only search seven positions away, so the concern is that

this is a sorta super highly populated collection of eight boxes. That's the

concern, but that's not going to be the case and that's exactly what lemma two is

going to say. Not only do the points between p and q in Y coordinates show up

in this diagram, but there have to be very few. In particular, every box has to be

sparse, with population either zero or one. So, let's move on to lemma two. So formally

the claim is [sound], we have at most one point of the point set in each of these

eight boxes. And this is finally where we use, in a real way, the definition of

delta. This is where we finally get the payoff from our realization long ago, that

when defining the closest split pair subroutine, we only really need to be

correct in the unlucky case. In the case we're not handed the right answer by one

of our recursive calls. We're finally gonna use that fact in a fundamental way.

So we're gonna proceed by contradiction. So we're going to think about what happens

if there are two points in a single box and from that we'll be able to derive a

contradiction. So, call the points that wind up in the same box A and B. So, to

the contrary, suppose A and B lie in the same box. So, maybe this is A here, and

this is B here, at antipodal corners of this particular box. So from this

supposition, we have two consequences. First of all. I claim that A and B lie on

the same side of the point set. They're either both in the left side, Q or they're

both in the right side, R. So why is this true? Well it's because every box lies either

entirely on the left half of the point set or on the right half of the point

set. Recall how we define x bar. X bar is the x coordinate of the right most point

amongst the left half of the point set capital Q. So therefore points with x

coordinate at most x bar have to lie inside the left half Q. Points with x

coordinates at least x bar have to lie inside the right half of the point

set capital R. So that would be like in this example. A and b both lie in a box

which is to the right of x bar. So they both have to come in the right half

of the point set capital R. This is one place we are using that there are no ties

in X coordinate, so if there's a point with X, X coordinate or X bar, we can

count it as part of the left half. So every box, by virtue of being either to

the left of xbar or to the right of xbar, can only contain points from a common half

of the point set. So that's the first consequence of assuming that you have two

points in the same box. The second consequence is, because the boxes are

small, the points gotta be close. So, if A and B co-habitate a box, how far could

they be from each other? Well, the farthest they could be is like I've drawn

in the picture, with the points A and B. Where they're at opposite corners of a

common box. And then you bust out Pythagorean's Theorem, and what do you

get? You get that the distance between them is delta over two, the side of the

box times root two. And what's relevant for us is this is strictly less than

delta. Okay? But, now, here is where we use, finally, the definition of delta.

Consequences one and two in tandem, contradict how we define delta. Remember

what delta is. It's as close as two pair of, a pair of points can get if they both

lie on the left side of the point set, or if they both lie on the right side of the

point set. That is how we defined it. As small as a pair of points on a common half

can get to each other. But what have we just done? We've exhibited a pair A and B

that lie on the same half of the point set, and are strictly closer than delta.

So that contradicts the definition of delta. So that completes the proof of lemma

two. Let me just make sure we're all clear on why having proved lemma one and lemma two

we're done with the proof part B of the claim and therefore the entire claim

because we already proved part one, now a long time ago.

So let's interpret the 2 lemmas in the context of our picture that we had all

throughout. In terms of the eight boxes of side length delta over two by

delta over two. So again, whichever is the lower of p and q, and again let's just for

the sake of concreteness say it's q, is at the bottom of the picture. The other point

is on the other half of the line Xbar, and is in one of the other boxes. So, for

example, maybe p is right here. So lemma one says that every relevant point, every

point that survives the filtering and makes it into SY, by virtue of being in

the vertical strip, has to be in one of those boxes, okay? If it has Y coordinate

in between p and q. Lemma two says that you can only have one point in each of these boxes

from the point set, so that's gonna be at most eight total. [sound] So combining

them. Lemmas one and two imply, that there are almost eight points in this picture

and that includes p and q because they also occupy two of eight boxes. So in the

worst case, if this is as densely populated as could possibly be, given

lemmas one and two, every other box might have a point and perhaps every one of

those points has a Y coordinate between p and q. But this is as bad as it gets.

Any point of the strip with Y coordinate between p and q occupies a box.

So, at most, there are these six wedged in between them. What does this mean? This

means if from q you look seven positions ahead in the array, you are

guaranteed to find this point p. So a split pair with distance less than delta

is guaranteed to be identified by our double for loop. Looking seven positions

ahead in the sorted array SY is sufficient to identify, to look at every conceivably

interesting split pair. So that completes the assertion B of the correctness

claim and we're done. That establishes that this supremely clever

divide and conquer algorithm is indeed a correct O(nlog(n)) algorithm that computes the

closest pair of a set of n points in the plane.