The answer to this quiz is the second one, and the reasoning is just by analogy with our
previous algorithms for merge sort and for counting inversions. So, what is all of
the work that we do in this algorithm? Well, we do have this preprocessing
step where we call merge sort twice; we know that's n log n, so we're not going to have
a running time better than n log n, because we sort at the beginning. And then, we
have a recursive algorithm with the following flavor, it makes two recursive
calls. Each recursive call is on a problem of
exactly half the size with half the points of the original one. And outside of the
recursive calls, by the assumption in the problem, we do a linear amount of work in
computing the closest split pair. So the exact same recursion tree which proves
an n log n bound for merge sort proves an n log n bound for how much work we do
after the preprocessing step, so that gives us an overall running time bound of
n log n. Remember, that's what we were shooting for. We were using n log n time
already to solve the one-dimensional version of closest pair and the goal of
these lectures is to have an n log n algorithm for the 2D version. So this
would be great. So in other words, the goal should be to have a correct linear
time implementation of the closest split pair subroutine. If we can do that, we're
home-free, we get the desired n log n algorithm. Now, I'm going to proceed in a
little bit to show you how to implement closest split pair, but before I do that,
I want to point out one subtle point, the key idea, which is going to allow us to get
this linear time correct implementation. So, let me just put that on the slide. So,
the key idea is that we don't actually need a full-blown correct implementation
of the closest split pair subroutine. So, I'm not actually going to show you a
linear time subroutine that always correctly computes the closest split pair
of a point set. The reason I'm not going to do that is that it's actually a strictly harder
problem than what we need to have a correct recursive algorithm. We do not
actually need a subroutine that, for every point set, always correctly computes the
closest split pair of points. Remember, there's a lucky case and there's an
unlucky case. The lucky case is where the closest pair in the whole point set P
happens to lie entirely in the left half of the points Q or in the right half of
the points, R. In that lucky case, one of our recursive calls will identify this
closest pair and hand it over to us on a silver platter. We couldn't care less about
the split pairs in that case. We get the right answer without even looking at the
split pairs. Now, there's this unlucky case where a split pair happens
to be the closest pair of points. That is when we need this linear time subroutine,
and only then, only in the unlucky case where the closest pair of points happens
to be split. Now, that's, in some sense, a fairly trivial observation, but there's a
lot of ingenuity here in figuring out how to use that observation. The fact is that we
only need to solve a strictly easier problem, and that will enable the linear
time implementation that I'm going to show you next. So now, let's rewrite the high
level recursive algorithm slightly to make use of this observation that the closest
split pair subroutine only has to operate correctly in the regime of the unlucky
case, when in fact, the closest split pair is closer than the result of either
recursive call. So I've erased the previous steps 4 and 5, but we're
going to rewrite them in a second. So, before we invoke closest split pair, what
we're going to do is see how well our recursive calls did. That
is, we're going to define a parameter, little delta, which is going to be the
distance of the closest pair found by either
recursive call: the minimum of the distance between p1 and q1, the closest
pair that lies entirely on the left, and the distance between p2 and q2, the closest pair that lies entirely
on the right. Now, we're going to pass this delta information as a parameter into
our closest split pair subroutine. We're going to have to see why on earth that
would be useful and I still owe you that information, but, for now, we're just
going to pass delta as a parameter for use in the closest split pair. And then, as
before, we just do a comparison between the three candidate closest pairs and return
the best of the trio. And so, just so we're all clear on where things
stand, what remains is to describe the implementation of closest split pair, and
before I describe it, let me just be crystal clear on what it is that we're
going to demand of the subroutine. What do we need to have a correct O(n log n)
time closest pair algorithm? Well, as you saw on the quiz, we want the running time
to be O(n) always, and for correctness, what do we need? Again, we don't need it
to always compute the closest split pair, but we need it to compute the closest
split pair in the event that there is a split pair of distance strictly less than
delta, strictly better than the outcome of either recursive call. So now that we're
clear on what we want, let's go ahead and go through the pseudocode for this closest
split pair subroutine. And I'm going to tell you upfront, it's going to be fairly
straightforward to figure out that the subroutine runs in linear time, O(n)
time. That it meets the correctness requirement will be highly
non-obvious. In fact, after I show you this pseudocode, you're not going to believe
me. You're going to look at the pseudocode and be like, what are you talking
about? But in the second video of the closest pair lecture, we will in fact show
that this is a correct subroutine. So, how does it work? Well, let's look at a
point set. So, the first thing we're going to do is a filtering step. We're going to
prune a bunch of the points away so as to zoom in on a subset of the points. And the
subset of the points we're going to look at is those that lie in a vertical strip,
which is roughly centered in the middle of the point set. So, here's what I mean by
centered: we're going to look at the middle x coordinate. So, let x bar be the
biggest x coordinate in the left half; that is, in the version of the
points sorted by x coordinate, we look at the (n over 2)th smallest x coordinate. So, in
this example where we have six points, all this means is we imagine drawing
a vertical line through the third point, so x bar is going to be the x coordinate of the
third point from the left. Now, since we're passed as input a copy of the
points sorted by x coordinate, we can figure out what x bar is in constant time,
just by accessing the relevant entry of the array Px. Now, the way we're going to
use this parameter delta that we're passed: remember what delta is. So
before we invoke the closest split pair subroutine in the recursive algorithm, we
make our two recursive calls, we find the closest pair on the left, the closest pair
on the right, and delta is whichever of those two distances is smaller. So
delta is the parameter that controls whether or not we actually care about the
closest split pair: we care if and only if there is a split pair at distance
less than delta. So, how do we use delta? Well, that's going to determine the width
of our strip, so the strip's going to have width 2 delta, and it's going to be
centered around x bar. And the first thing we're going to do is we're going to
ignore, forevermore, points which do not lie in this vertical strip.
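As a rough illustration of this filtering step, here is a minimal Python sketch. The function name `strip_sorted_by_y`, the tuple representation of points, and the sample data are all assumptions made for the example, not something from the lecture slides. It takes the points already sorted by y coordinate, so that the surviving points come out in that same order after a single pass.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y); this representation is an assumption

def strip_sorted_by_y(py: List[Point], x_bar: float, delta: float) -> List[Point]:
    """Keep the points whose x coordinate lies in [x_bar - delta, x_bar + delta].

    py is assumed to be sorted by y coordinate already; one linear pass
    preserves that order, so no sorting happens here.
    """
    return [p for p in py if x_bar - delta <= p[0] <= x_bar + delta]

# Hypothetical six-point example.
points = [(0, 0), (1, 5), (2, 1), (3, 4), (5, 2), (9, 3)]
px = sorted(points)                      # sorted by x coordinate
py = sorted(points, key=lambda p: p[1])  # sorted by y coordinate
x_bar = px[len(px) // 2 - 1][0]          # x coordinate of the third point from the left
sy = strip_sorted_by_y(py, x_bar, delta=1.5)
print(x_bar, sy)
```

With delta equal to 1.5, the strip is the interval from 0.5 to 3.5, so only three of the six points survive.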
So the rest of the algorithm will operate only on the subset of P, the subset of the
points that lie in the strip, and we're going to keep track of them sorted by y
coordinate. So the formal way to say that they lie in the strip is that they have x
coordinate in the interval with lower endpoint x bar minus delta and upper
endpoint x bar plus delta. Now, how long does it take to construct this set Sy
sorted by y coordinate? Well, fortunately, we've been passed as input a sorted
version of the points, Py. So to extract Sy from Py, all we need to do is a simple
linear scan through Py, checking for each point where its x coordinate is. So this
can be done in linear time. Now, I haven't yet shown you why it's useful to have this
sorted set Sy, but if you take it on faith that it's useful to have the points
in this vertical strip sorted by y coordinate, you now see why it was useful
that we did this merge sort all the way at the beginning of the algorithm, before we
even underwent any recursion. Remember, what is our running time goal for closest
split pair? We want this to run in linear time, that means we cannot sort inside the
closest split pair subroutine. That would take too long. We want this to be in
linear time. Fortunately, since we sorted once and for all at the beginning of the
closest pair algorithm, extracting sorted sublists from those sorted lists of points
can be done in linear time, which is within our goals here. Now, it's the rest
of the subroutine where you're never going to believe me that it does anything
useful. So, I claim that essentially with a linear scan through Sy, we're going to
be able to identify the closest split pair of points in the interesting, unlucky case
where there is such a split pair with distance less than delta. So here's what I
mean by that linear scan through Sy. So as we do the scan, we're going to keep
track of the closest pair of points of a particular type that we've seen so far.
So, let me introduce some variables to keep track of the best candidate we've
seen so far. There's going to be a variable best, which we'll initialize to be
delta. Remember, we're uninterested in split pairs unless they have distance
strictly less than delta. And then we're going to keep track of the points
themselves, so we'll initialize the best pair to be null. Now, here is the linear
scan. So we go through the points of Sy in order of y coordinate. Okay, well, not quite
all the points of Sy. We stop at the eighth to last point and you'll see why in
a second. And then, for each position i of the array Sy, we investigate the seven
subsequent points of the same array Sy. So for j going from one to seven, we look at
the ith and (i plus j)th entries of Sy. So if Sy looks something like this array here,
at any given point in this double for loop, we're generally looking at an index
i, a point in the array, and then some really quite nearby point in
the array, i plus j, because j here is going to be at most seven. Okay? So we're
constantly looking at pairs in this array, but we're not looking at all pairs at all.
We're only looking at pairs that are very close to each other, within seven
positions of each other. And what do we do for each choice of i and j? Well, we just
look at those points, we compute the distance, we see if it's better than all
of the pairs of points of this form that we've looked at in the past and if it is
better, then we remember it. So we just remember the best, i.e., closest, pair of
points of this particular type, for choices of i and j of this form. So in
more detail, if the distance between the current pair of points p and q is
better than the best we've seen so far, we reset the best pair of points to be equal
to p and q, and we reset the best distance, the closest distance seen so
far, to be the distance between p and q, and that's it. Then, once this double for loop
terminates, we just return the best pair. So one possible execution of closest
split pair is that it never finds a pair of points, p and q, at distance less than
delta. In that case, this is going to return null, and then in the outer call, in
closest pair, you interpret a null pair of points to have an infinite
distance. So if you call closest split pair, and it doesn't return any points,
then the interpretation is that there's no interesting split pair of points and you
just return the better of the results of the two recursive calls, p1, q1 or p2, q2. Now,
as far as the running time of the subroutine, what happens here? Well, we do
constant work just initializing the variables. Then notice that the number of
points in Sy, well, in the worst case, you have all of the points of P. So, it's
going to be at most n points, and so, you do a linear number of iterations in
the outer for loop. But here is the key point, in the inner for loop, right,
normally double for loops give rise to quadratic running time, but in this inner
for loop we only look at a constant number of other positions. We only look at seven
other positions, and for each of those seven positions, we only do a constant
amount of work. Right? We just compute a distance, make a couple of
comparisons, and reset some variables. So for each of the linear number of outer
iterations, we do a constant amount of work, so that gives us a running time of
O(n) for this part of the algorithm. So as I promised, analyzing the running time of
this closest split pair subroutine was not challenging. We just, in a
straightforward way, looked at all the operations. Again, because in the key
linear scan, we only do constant work per index, the overall running time is
O(n), just as we wanted. So this does mean that our overall recursive algorithm will
have running time O(n log n). What is totally not obvious, and perhaps even
unbelievable, is that this subroutine satisfies the correctness requirements that
we wanted. Remember what we needed: we needed that whenever we're in the unlucky
case, whenever, in fact, the closest pair of points in the whole point set is split,
this subroutine had better find it. And it does, and that's made precise in the
following correctness claim. So let me rephrase the claim in terms of an
arbitrary split pair, which has distance less than delta, not necessarily the
closest such pair. So suppose there exists a point p on the
left side and a point q on the right side, so that is a split pair, and suppose the
distance of this pair is less than delta. Now, there may or may not be such a pair of
points p, q. Don't forget what this parameter delta means. What delta is, by
definition, is the minimum of d of p1, q1, where p1, q1 is the closest pair of points
that lie entirely in the left half of the point set, Q, and d of p2, q2, where similarly
p2, q2 is the closest pair of points that lie entirely on the right, inside of R. So, if
there's a split pair with distance less than delta, this is exactly the unlucky
case of the algorithm. This is exactly where neither recursive call successfully
identifies the closest pair of points, instead that closest pair is a split pair.
On the other hand, if we are in the lucky case, then there will not be any split
pairs with distance less than delta, because the closest pair lies either all
on the left or on the right, and it's not split. But remember, we're interested in
the case where there is a split pair that has a distance less than delta where there
is a split pair that is the closest pair. So the claim has two parts. The first
part, part A, says the following. It says that if there's a split pair p and q
of this type, then p and q are members of Sy. And let me just sort of redraw the
cartoon. So remember what Sy is. Sy is that vertical strip. And again, the way we
got that is we drew a line through a median x coordinate and then we fattened
it by delta on either side, and then, we focused only on points that lie in the
vertical strip. Now, notice our closest split pair subroutine, if it ever returns
a pair of points, it's going to return a pair of points pq that belong to Sy.
First, it filters down to Sy, then it does a linear search through Sy. So if we want
to believe that our subroutine identifies best split pairs of points, then, in
particular, such split pairs of points better show up in Sy, they better survive
the filtering step. So that's precisely what part A of the claim is. Here's part B
of the claim and this is the more remarkable part of the claim, which is
that p and q are almost next to each other in this sorted array, Sy. So they're not
necessarily adjacent, but they're very close: they're within seven positions
of each other. So, this is really the remarkable part of the algorithm. This is
really what's surprising and what makes the whole algorithm work. So, just to make
sure that we're all clear on everything, let's show that if we prove this claim,
then we're done, then we have a correct fast implementation of a closest pair
algorithm. I certainly owe you the proof of the claim, that's what the next video
is going to be all about, but let's show that if the claim is true, then, we're
home-free. So if this claim is true, then so is the following corollary, which I'll
call corollary 1. So corollary 1 says, if we're in the unlucky case that we
discussed earlier, if we're in the case where the closest pair in the whole
point set P does not lie entirely on the left, does not lie entirely on the right, but rather
has one point on the left and one on the right, that is, it's a split pair, then in
fact, the closest split pair subroutine will correctly identify the closest split pair
and therefore the closest pair overall. Why is this true? Well, what does closest
split pair do? Okay, so it has this double for loop, and thereby explicitly examines
a bunch of pairs of points, and it remembers the closest pair of all of the
pairs of points that it examines. So what are the criteria that
are necessary for closest split pair to examine a pair of points? Well, first of all,
the points p and q both have to survive the filtering step and make it into the
array Sy. Right? So closest split pair only searches over the array Sy. Secondly, it
only searches over pairs of points that are almost adjacent in Sy, that are at most
seven positions apart. But amongst pairs of points that satisfy those two criteria,
closest split pair will certainly compute the closest such pair, right? It just
explicitly remembers the best of them. Now, what's the content of the claim?
Well, the claim is guaranteeing that every potentially interesting split pair of
points, that is, every split pair of points with distance less than delta, meets both of the
criteria which are necessary to be examined by the closest split pair
subroutine. So first of all, and this is the content of part A, if you have an
interesting split pair of points with distance less than delta, then they'll
both survive the filtering step. They'll both make it into the array Sy; part A
says that. Part B says they're almost adjacent in Sy. So if you have an
interesting split pair of points, meaning it has distance less than delta, then they
will, in fact, be at most seven positions apart. Therefore, closest split pair will
examine all such split pairs, all split pairs with distance less than delta, and
just by construction, it will compute the closest pair of all of them. So again, in
the unlucky case where the best pair of points is a split pair, this claim
guarantees that closest split pair will compute the closest pair of points.
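The scan that this argument is about can be sketched in Python as follows. This is only an illustration under stated assumptions: the name `scan_strip` and the tuple representation of points are made up for the example, and `None` stands in for the null pair. One deliberate difference from the loop as described in the lecture: rather than stopping the outer loop at the eighth to last point, this sketch runs it to the end and cuts the inner loop off at the end of the array, so that strips with fewer than eight points are still examined.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (x, y); this representation is an assumption

def scan_strip(sy: List[Point], delta: float) -> Optional[Tuple[Point, Point]]:
    """Return the closest pair among points at most 7 positions apart in sy,
    provided its distance is strictly less than delta; otherwise return None.

    sy is assumed to hold the strip's points sorted by y coordinate.
    """
    best = delta          # only distances strictly below delta are interesting
    best_pair = None      # plays the role of the null pair
    for i in range(len(sy)):
        for j in range(1, 8):          # the seven subsequent positions
            if i + j >= len(sy):       # assumption: clamp at the end of the array
                break
            p, q = sy[i], sy[i + j]
            d = math.dist(p, q)
            if d < best:
                best, best_pair = d, (p, q)
    return best_pair

# Hypothetical strip of three points, sorted by y coordinate.
sy = [(2, 1), (3, 4), (1, 5)]
print(scan_strip(sy, 1.5))   # no pair is closer than 1.5
print(scan_strip(sy, 3.0))   # (3, 4) and (1, 5) are about 2.24 apart
```

Each index does at most seven distance computations, which is the source of the linear running time discussed earlier.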
Therefore, having handled correctness, we can just combine that with our earlier
observations about running time, and corollary 2 just says, if we can prove the
claim, then we have everything we wanted. We have a correct O(n log n)
implementation for the closest pair of points. So with further work and a lot
more ingenuity, we've replicated the guarantee that we got just by sorting for
the one-dimensional case. Now again, these corollaries hold only if this claim is,
in fact, true and I have given you no justification for this claim. And even the
statement of the claim, I think, is a little bit shocking. So if I were you I
would demand an explanation for why this claim is true, and that's what I'm going
to give you in the next video.
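To see how the pieces from this video fit together, here is a compact Python sketch of the whole recursive algorithm, checked against brute force on random input. It is a sketch under assumptions, not the lecture's pseudocode: all function names are made up, points are assumed to be distinct tuples, Python's built-in `sorted` stands in for merge sort, the base case of three or fewer points is handled by brute force, and the inner loop is clamped at the end of the strip rather than stopping the outer loop early.

```python
import math
import random
from itertools import combinations

def dist(pair):
    """Distance of a pair of points; a missing pair (None) counts as infinite."""
    return math.inf if pair is None else math.dist(pair[0], pair[1])

def closest_split_pair(px, py, delta):
    """Only guaranteed to be right when some split pair is closer than delta."""
    x_bar = px[len(px) // 2 - 1][0]       # biggest x coordinate in the left half
    sy = [p for p in py if x_bar - delta <= p[0] <= x_bar + delta]
    best, best_pair = delta, None
    for i in range(len(sy)):
        for j in range(i + 1, min(i + 7, len(sy) - 1) + 1):
            d = math.dist(sy[i], sy[j])
            if d < best:
                best, best_pair = d, (sy[i], sy[j])
    return best_pair

def closest_pair(px, py):
    """px and py hold the same distinct points, sorted by x and by y."""
    n = len(px)
    if n <= 3:                            # assumption: brute-force base case
        return min(combinations(px, 2), key=dist)
    mid = n // 2
    qx, rx = px[:mid], px[mid:]           # left half Q and right half R
    left = set(qx)
    qy = [p for p in py if p in left]     # linear scans keep the y order
    ry = [p for p in py if p not in left]
    best_left = closest_pair(qx, qy)
    best_right = closest_pair(rx, ry)
    delta = min(dist(best_left), dist(best_right))
    split = closest_split_pair(px, py, delta)
    return min((best_left, best_right, split), key=dist)

random.seed(0)
points = list({(random.randint(0, 1000), random.randint(0, 1000))
               for _ in range(200)})
px = sorted(points)                       # preprocessing: sort by x ...
py = sorted(points, key=lambda p: (p[1], p[0]))   # ... and by y
found = closest_pair(px, py)
brute = min(combinations(points, 2), key=dist)
print(dist(found), dist(brute))
```

Agreement with brute force on a random input is only a sanity check; the actual justification is the claim about Sy stated above.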
3.4 「Stanford Algorithms」O(n log n) Algorithm for Closest Pair 1 [Advanced - Optional] - Part 2