Object tracking consists of locating an object in a video sequence (or in a live stream, for real-time tracking) and following its movement over time.
In this project, I will present the meanshift tracking method and its improvement, camshift, accelerated with CUDA.
1. Meanshift and Camshift
Briefly speaking, the classical meanshift method contains three steps:
- calculate meanshift vector
- move the density estimation window
- repeat the above process until it converges
As you can see, the whole process is about finding the center of the densest region. I will not go through the mathematical details behind it; you can find a good reference here: Introduction to Mean Shift.
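To make the loop concrete, here is a minimal CPU sketch of one meanshift pass over a back-projection image. The Window struct and all names are just for illustration, not the exact code of my project: compute the moments inside the window, move the window onto the centroid, and stop when the shift is small.
#include <algorithm>
#include <cmath>

struct Window { int x, y, w, h; };   // top-left corner + size

// prob: back-projection image, one value per pixel (0..255)
void meanshift(const unsigned char* prob, int img_w, int img_h,
               Window& win, int max_iter = 20, float eps = 1.0f)
{
    for (int iter = 0; iter < max_iter; ++iter)
    {
        // zeroth and first order moments inside the window
        double m00 = 0.0, m10 = 0.0, m01 = 0.0;
        for (int y = std::max(win.y, 0); y < win.y + win.h && y < img_h; ++y)
            for (int x = std::max(win.x, 0); x < win.x + win.w && x < img_w; ++x)
            {
                double p = prob[y * img_w + x];
                m00 += p;
                m10 += p * x;
                m01 += p * y;
            }
        if (m00 <= 0.0) return;                        // nothing to track

        // centroid of the density inside the window
        double cx = m10 / m00, cy = m01 / m00;

        // meanshift vector: move the window center onto the centroid
        double dx = cx - (win.x + win.w / 2.0);
        double dy = cy - (win.y + win.h / 2.0);
        win.x += static_cast<int>(std::lround(dx));
        win.y += static_cast<int>(std::lround(dy));

        if (std::sqrt(dx * dx + dy * dy) < eps)        // converged
            break;
    }
}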
Following this method, I built my first version of the object tracker (without CUDA):
It looks OK, but there was a big problem: my program lost its target very easily. The cause is the fixed search window size. When the window is too big, it takes some background into the calculation:
Or when it is too small, it contains only a subset of the object:
In both cases the tracking is not stable, because we are not really capturing the density of the object. This also causes another problem: we cannot handle changes in depth. If the object moves towards or away from the camera, its changing size no longer fits the fixed search window.
To solve this problem, I turned to camshift (continuously adaptive meanshift), an improvement of meanshift introduced by Gary Bradski in 1998.
Briefly speaking, camshift first applies meanshift. Once meanshift has converged, it recalculates the size of the search window, then runs meanshift again with the new window at the converged location. This process repeats until it reaches a certain precision.
By recalculating the window size in each iteration, we find the best-fitting search window for the target object, which ensures stable and consistent tracking:
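For reference, here is a sketch of the camshift outer loop, reusing Window and meanshift() from the sketch above. The resizing rule (width about 2 * sqrt(M00 / 256) for an 8-bit back-projection) is the common choice from Bradski's paper; the aspect ratio and the fixed iteration count are just illustrative assumptions.
Window camshift(const unsigned char* prob, int img_w, int img_h, Window win)
{
    for (int iter = 0; iter < 10; ++iter)
    {
        // 1. run meanshift until it converges in the current window
        meanshift(prob, img_w, img_h, win);

        // 2. recompute the zeroth moment at the converged position
        double m00 = 0.0;
        for (int y = std::max(win.y, 0); y < win.y + win.h && y < img_h; ++y)
            for (int x = std::max(win.x, 0); x < win.x + win.w && x < img_w; ++x)
                m00 += prob[y * img_w + x];

        // 3. adapt the window size to the tracked density
        int new_w = std::max(2, (int)(2.0 * std::sqrt(m00 / 256.0)));
        int new_h = std::max(2, (int)(1.2 * new_w));   // aspect ratio: example choice

        // keep the same center while resizing
        win.x += (win.w - new_w) / 2;
        win.y += (win.h - new_h) / 2;
        win.w = new_w;
        win.h = new_h;
    }
    return win;   // window for the next frame
}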
2. Parallel programming with CUDA
A typical CUDA program contains the following 4 steps:
- CPU allocates storage on GPU
- CPU copies input data to GPU
- CPU launches kernels on GPU to process the data
- CPU copies results back from GPU to CPU
The principle of writing a CUDA program is to first write your program as if it runs on a single thread; the GPU will then run that program on many threads.
But with parallelism we face some problems that are unusual in procedural programming. The first is how threads communicate with each other through memory.
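As a minimal sketch of these four steps (the names run_on_gpu and scale are just illustrative), the host side looks roughly like this:
#include <cuda_runtime.h>

// A trivial per-element kernel: each thread scales one value,
// written as if it were a single-threaded loop body.
__global__ void scale(float* d_data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_data[i] *= factor;
}

void run_on_gpu(float* h_data, int n)
{
    float* d_data = nullptr;
    size_t bytes = n * sizeof(float);

    cudaMalloc(&d_data, bytes);                                  // 1. allocate storage on GPU
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // 2. copy input data to GPU

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_data, 2.0f, n);                 // 3. launch the kernel

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // 4. copy results back to CPU
    cudaFree(d_data);
}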
Parallel communication models
The simplest model is called "map", which means that each task reads from and writes to a specific data element:
The second model is "gather", where each thread gathers several input elements:
We also have "scatter", where each task computes where to write its output:
Next is "stencil" where tasks read input from a fixed neighborhood in an array:
And finally we have "transpose", which re-orders the data elements:
Let's sum up:
- map: one-to-one
- gather: many-to-one
- scatter: one-to-many
- stencil: several-to-one
- transpose: one-to-one
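To make these patterns concrete, here are a few illustrative kernels (the index arrays and all names are assumptions for the example, not code from my project):
// map: one-to-one
__global__ void map_k(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// gather: many-to-one (each output averages two inputs; in has 2 * n_out elements)
__global__ void gather_k(const float* in, float* out, int n_out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_out) out[i] = 0.5f * (in[2 * i] + in[2 * i + 1]);
}

// scatter: one-to-many (each thread computes where to write; atomics handle collisions)
__global__ void scatter_k(const float* in, const int* dst, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&out[dst[i]], in[i]);
}

// stencil: several-to-one from a fixed neighborhood
__global__ void stencil_k(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}

// transpose: one-to-one, but re-ordered (in is h x w, out is w x h, row-major)
__global__ void transpose_k(const float* in, float* out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        out[x * h + y] = in[y * w + x];   // element (y, x) moves to (x, y)
}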
Memory model
CUDA has a three-level memory model. Each thread has its own local memory. A block of threads has shared memory, which can be accessed by all threads inside the block. And the whole kernel has global memory, accessible to all threads.
Synchronization
Since a block contains many threads that each compute part of the problem, we face a synchronization problem: at certain points in the program, threads must wait until the others have reached the same point.
A simple technique CUDA provides is the barrier: a point where threads stop and wait for each other (__syncthreads).
Now that we've gone through the important aspects of parallel programming, let's look at the camshift acceleration.
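A classic example that touches all three memory levels and the barrier is a block-wise sum reduction. The sketch below is only illustrative and assumes the kernel is launched with 256 threads per block:
// Block-wise sum reduction: local registers, shared memory and global memory,
// with __syncthreads() as the barrier between the phases.
__global__ void block_sum(const float* d_in, float* d_block_sums, int n)
{
    __shared__ float s_data[256];              // shared: visible to the whole block

    int tid = threadIdx.x;                     // local: private to each thread
    int i   = blockIdx.x * blockDim.x + tid;

    s_data[tid] = (i < n) ? d_in[i] : 0.0f;    // global -> shared
    __syncthreads();                           // wait until every thread has written

    // tree reduction inside the block
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (tid < stride)
            s_data[tid] += s_data[tid + stride];
        __syncthreads();                       // barrier after every step
    }

    if (tid == 0)
        d_block_sums[blockIdx.x] = s_data[0];  // one partial sum per block, back to global
}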
3. Camshift acceleration with CUDA
So the first thing we have to do is analyse which parts of the program can be parallelized. The camshift process is as follows:
- construct the histogram
- back-project it to obtain the color probability distribution
- run meanshift until it converges
- recalculate the window size for the next frame
- repeat until there are no more frames
The histogram can be built in parallel: instead of iterating pixel by pixel, we launch many threads that each handle one pixel at the same time. Here is an example of such a per-pixel kernel:
__global__ void rgb_to_xyY(
    float* d_r,
    float* d_g,
    float* d_b,
    float* d_x,
    float* d_y,
    float* d_log_Y,
    float  delta,
    int    num_pixels_y,
    int    num_pixels_x )
{
    int ny = num_pixels_y;
    int nx = num_pixels_x;

    // one thread per pixel: recover the 2D pixel coordinate and its 1D index
    int2 image_index_2d = make_int2( ( blockIdx.x * blockDim.x ) + threadIdx.x,
                                     ( blockIdx.y * blockDim.y ) + threadIdx.y );
    int image_index_1d = ( nx * image_index_2d.y ) + image_index_2d.x;

    if ( image_index_2d.x < nx && image_index_2d.y < ny )   // guard against out-of-range threads
    {
        float r = d_r[ image_index_1d ];
        float g = d_g[ image_index_1d ];
        float b = d_b[ image_index_1d ];

        // RGB -> CIE XYZ
        float X = ( r * 0.4124f ) + ( g * 0.3576f ) + ( b * 0.1805f );
        float Y = ( r * 0.2126f ) + ( g * 0.7152f ) + ( b * 0.0722f );
        float Z = ( r * 0.0193f ) + ( g * 0.1192f ) + ( b * 0.9505f );

        // chromaticity coordinates and log luminance
        float L = X + Y + Z;
        float x = X / L;
        float y = Y / L;
        float log_Y = log10f( delta + Y );

        d_x[ image_index_1d ]     = x;
        d_y[ image_index_1d ]     = y;
        d_log_Y[ image_index_1d ] = log_Y;
    }
}
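As for the histogram accumulation itself, a simple sketch (not the exact kernel of my project; NUM_BINS and the names are illustrative) is one thread per pixel with an atomic increment of its bin:
#define NUM_BINS 64

// d_hue holds one 8-bit hue value per pixel; d_hist has NUM_BINS counters.
__global__ void histogram_kernel(const unsigned char* d_hue, unsigned int* d_hist, int num_pixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_pixels)
    {
        int bin = d_hue[i] * NUM_BINS / 256;   // map hue 0..255 to a bin
        atomicAdd(&d_hist[bin], 1u);           // concurrent threads may hit the same bin
    }
}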
The moment computation could also be parallelized, but I didn't get this right in my project. Maybe I will finish it later.
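One possible way to do it, which I did not implement, would be one thread per pixel of the search window, accumulating the three moments with atomic adds (a sketch with illustrative names):
// d_prob is the back-projection; the window is given by its corner and size.
// d_m00, d_m10, d_m01 are single floats in global memory, zeroed before launch.
__global__ void window_moments(const float* d_prob, int img_w,
                               int win_x, int win_y, int win_w, int win_h,
                               float* d_m00, float* d_m10, float* d_m01)
{
    int lx = blockIdx.x * blockDim.x + threadIdx.x;   // coordinates inside the window
    int ly = blockIdx.y * blockDim.y + threadIdx.y;
    if (lx < win_w && ly < win_h)
    {
        int x = win_x + lx, y = win_y + ly;
        float p = d_prob[y * img_w + x];
        atomicAdd(d_m00, p);
        atomicAdd(d_m10, p * x);
        atomicAdd(d_m01, p * y);
    }
}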
So let's look at the result:
4. Conclusion
In this project, I applied the camshift object tracking method with (partial) CUDA acceleration. The result is convincing, but there are still problems for future work:
- Color alone is not a robust feature for extracting the object; the program can lose the target when the background has a similar color.
- Parallel programming is complex, so I didn't succeed in parallelizing every part of this program.
- Due to time limits, a performance evaluation was not carried out in this project.