Performance Benchmark of Locality-Sensitive Hashing and KD-Tree Algorithms

1 The Purposes

  • Get familiar with common ANN algorithms such as KD-Tree and LSH
  • Learn the implementation of LSH and related coding skills
  • Analyze the performance of KD-Tree and LSH under different dimensions

2 The Principles

2.1 ANN

Efficient and accurate methods for Approximate Nearest Neighbor (ANN) queries are important for large-scale dataset applications such as image feature matching, recommendation systems, DNA sequencing, and so on.
Popular ANN methods include randomized kd-trees (KD-Tree), hierarchical k-means (HKM), and locality-sensitive hashing (LSH).

2.1.1 KD-Tree

A KD-Tree partitions a vector space by recursively generating hyperplanes, cutting along the coordinate with maximal variance in the data. KD-Tree works well for searching in low-dimensional spaces, but its efficiency generally decreases as the dimensionality grows beyond about 10. Its complexity is listed below, followed by a sketch of the build step.

  • Building a kd-tree: O(N log N) time and O(N) space
  • Nearest neighbor search: O(log N)
  • M nearest neighbors: O(M log N)
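
Below is a minimal sketch of the recursive build step in Python/NumPy (names and data are illustrative, not the experiment code), splitting on the maximal-variance coordinate as described above:

    import numpy as np

    def build_kdtree(points):
        """Recursively split the points on the coordinate with maximal variance."""
        if len(points) <= 1:
            return {"leaf": points}                     # leaf holds the remaining point
        axis = int(points.var(axis=0).argmax())         # dimension with maximal variance
        points = points[points[:, axis].argsort()]      # sort along that axis
        mid = len(points) // 2                          # median defines the hyperplane
        return {"axis": axis,
                "split": float(points[mid, axis]),
                "left": build_kdtree(points[:mid]),
                "right": build_kdtree(points[mid:])}

    tree = build_kdtree(np.random.rand(1000, 128).astype(np.float32))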

2.1.2 LSH

The LSH algorithm maps similar input items to the same hash code with higher probability than dissimilar items. For a given key size K, we generate K randomized hyperplanes and check whether a point P is above or below each hyperplane, assigning 1/0 for above/below. We then perform a Hamming-distance search on the K-bit keys to find approximate nearest neighbors (see the sketch after the complexity list below).

  • Building an LSH table: O(N·K) time and O(N) space, where K is the key size
  • Nearest neighbor search: O(1) in the best case
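
A minimal sketch of the hyperplane hashing step (the key size and data here are illustrative, not the parameters used in the experiments):

    import numpy as np

    def lsh_key(point, hyperplanes):
        """Bit i is 1 if the point is above hyperplane i, 0 if below."""
        return (hyperplanes.dot(point) > 0).astype(np.uint8)

    rng = np.random.RandomState(42)
    K, dim = 16, 128                      # key size and feature dimension (assumed values)
    hyperplanes = rng.randn(K, dim)       # one random normal vector per hyperplane
    key = lsh_key(rng.randn(dim), hyperplanes)
    print("".join(map(str, key)))         # the K-bit hash key

Points whose keys lie within a small Hamming radius of the query's key form the candidate set for exact comparison.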

2.2 Related Open Source Projects

FLANN is a library for performing fast approximate nearest neighbor searches in high-dimensional spaces. It contains a collection of algorithms found to work best for nearest neighbor search and a system for automatically choosing the best algorithm and optimal parameters depending on the dataset.

LsHash is an open-source project on GitHub that implements fast locality-sensitive hashing in Python with persistence support.

We benchmark our KD-Tree performance with the FLANN and pyflann projects. The dynamic library libflann.so is compiled and linked by the pyflann Python bindings. Our LSH implementation is based on LsHash with some modifications.
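
As a minimal sketch (assuming the standard pyflann API; the parameter values are illustrative), a KD-Tree query through pyflann looks like:

    import numpy as np
    from pyflann import FLANN

    # Stand-in data; the real experiments load SIFT/GIST vectors from disk.
    dataset = np.random.rand(10000, 128).astype(np.float32)
    queries = np.random.rand(100, 128).astype(np.float32)

    flann = FLANN()
    # Build a randomized kd-tree index with 4 trees (an assumed setting).
    flann.build_index(dataset, algorithm="kdtree", trees=4)
    # Search, visiting at most 128 leaves per query (the Checks parameter).
    neighbors, dists = flann.nn_index(queries, num_neighbors=100, checks=128)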

2.3 Datasets

2.3.1 SIFT10K

The SIFT10K dataset contains 128-dimensional SIFT features. It comprises 10 thousand data vectors and one hundred query vectors. For each query vector, a ground-truth file with the top k=100 nearest neighbors is also provided.

2.3.2 SIFT1M

The SIFT1M dataset contains 128-dimensional SIFT features. It comprises 1 million data vectors and 10 thousand query vectors. For each query vector, a ground-truth file with the top k=100 nearest neighbors is also provided.

2.3.3 GIST1M

The GIST1M dataset contains 960-dimensional GIST features. It comprises 1 million data vectors and one thousand query vectors. For each query vector, a ground-truth file with the top k=100 nearest neighbors is also provided.
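
These datasets ship in the TEXMEX .fvecs/.ivecs binary layout: each record is an int32 dimension followed by that many float32 (or int32) values. A minimal reader sketch, assuming that layout:

    import numpy as np

    def read_fvecs(path):
        """Read a .fvecs file: each record is an int32 dim then dim float32 values."""
        raw = np.fromfile(path, dtype=np.int32)
        dim = raw[0]
        return raw.reshape(-1, dim + 1)[:, 1:].copy().view(np.float32)

    def read_ivecs(path):
        """Read a .ivecs ground-truth file; same layout but with int32 values."""
        raw = np.fromfile(path, dtype=np.int32)
        dim = raw[0]
        return raw.reshape(-1, dim + 1)[:, 1:].copy()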

2.4 Measure

We choose three main factors to benchmark the performance of KD-Tree and LSH. Precision can be seen as a measure of exactness or quality, and time and memory cost are also taken into account.
For each experiment, we will plot an (x, y) line for each algorithm, where y is the search time gain over linear search and x is the corresponding precision. The precision is the average precision over all query points, i.e., the fraction of each query's returned neighbors that appear in its ground-truth top-k list.
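
A minimal sketch of that measure (assuming this overlap definition of precision; the search time gain is then t_linear / t_ann):

    import numpy as np

    def avg_precision(approx_ids, true_ids):
        """Average fraction of returned neighbors found in the ground-truth lists."""
        approx_ids = np.asarray(approx_ids)
        hits = sum(len(set(a) & set(t)) for a, t in zip(approx_ids, true_ids))
        return hits / float(approx_ids.size)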
The expected experiment figure may look like Fig. 1:

Fig. 1: expected experiment figure comparing the kd-tree and LSH algorithms. THIS IS NOT ACTUAL EXPERIMENTAL DATA!

3 The Procedures

This is the machine configuration:

  • Ubuntu 16.04, 64-bit, with an Intel Core i7-4790K 4.0 GHz × 8 and 16 GB RAM
  • Main programming languages: Python 2.7 & C++11
  • FLANN 1.8.4 and pyflann bindings
  • LSH implemented in Python, based on the open-source LsHash

3.1 Data Dimensionality

The standard dataset SIFT1M contains 128-dimensional SIFT features and GIST1M comprises 960-dimensional GIST features. We compare the performance of KD-Tree and LSH at dimensions 128 and 960, respectively.

For KD-Tree, the parameters include the number of trees K and how many leaves to visit when searching for neighbors, namely Checks. For LSH, the parameters include the table number L and the key size.

For each dimension, we will calculate the distribution of (precision, search time gain over linear) and plot it using Python matplotlib, as sketched below.
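
A sketch of such a sweep for the KD-Tree side (hypothetical Checks grid, assuming the pyflann calls shown in Section 2.2; a linear-scan index provides both the timing baseline and the ground truth):

    import time
    import numpy as np
    import matplotlib.pyplot as plt
    from pyflann import FLANN

    dataset = np.random.rand(10000, 128).astype(np.float32)   # stand-in for SIFT vectors
    queries = np.random.rand(100, 128).astype(np.float32)

    linear = FLANN()
    linear.build_index(dataset, algorithm="linear")
    start = time.time()
    true_ids, _ = linear.nn_index(queries, num_neighbors=100)
    t_linear = time.time() - start

    kdt = FLANN()
    kdt.build_index(dataset, algorithm="kdtree", trees=4)
    points = []
    for checks in [32, 64, 128, 256, 512]:                    # hypothetical grid
        start = time.time()
        ids, _ = kdt.nn_index(queries, num_neighbors=100, checks=checks)
        t_ann = time.time() - start
        hits = sum(len(set(a) & set(t)) for a, t in zip(ids, true_ids))
        points.append((hits / float(ids.size), t_linear / t_ann))

    xs, ys = zip(*points)
    plt.plot(xs, ys, "o-", label="kd-tree")
    plt.xlabel("precision")
    plt.ylabel("search time gain over linear")
    plt.legend()
    plt.show()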
Fig. 2 and Fig. 3 show the expected experiment results.

Fig. 2: (precision, search time gain over linear) distribution on SIFT dataset

Fig. 3: (precision, search time gain over linear) distribution on GIST dataset

Besides, we will calculate the memory cost at both 128 and 960 dimensions, as sketched below.
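
One way to estimate the memory cost (a sketch using the standard library's resource module; on Linux, ru_maxrss is reported in kilobytes):

    import resource

    def peak_rss_mb():
        """Peak resident set size of the current process, in MB."""
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

    print("peak memory: %.1f MB" % peak_rss_mb())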

3.2 Dataset Size

We select dataset sizes of 10K and 1M from the standard datasets SIFT10K and SIFT1M. We discard the GIST1M dataset because its feature dimension differs. For a better comparison, SIFT1B should also be taken into account; however, it is too large to run on my computer at present.
In this experiment, we will vary the tree number K and search leaves Checks for the KD-Tree algorithm, and the table number L and key size for the LSH algorithm, to obtain the distribution of (precision, search time gain over linear) under different dataset sizes.
Fig. 4 and Fig. 5 show the expected experiment results.

Fig. 4: (precision, search time gain over linear) distribution on SIFT10K dataset

Fig. 5: (precision, search time gain over linear) distribution on SIFT1M dataset

Besides, we will calculate the memory cost on both the 10K and 1M datasets.

3.3 Varying k

The ground-truth dataset comprises the top k=100 neighbors for each query. To compare the performance of the KD-Tree and LSH algorithms at specific values of k, we select k = 1, 3, 30, 100. We choose SIFT1M as the dataset and discard SIFT10K and GIST1M so that everything other than k stays the same.
In this experiment, we will vary the tree number K and search leaves Checks for the KD-Tree algorithm, and the table number L and key size for the LSH algorithm, to obtain the distribution of (precision, search time gain over linear) under different k, as sketched below.
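
Varying k only changes the number of neighbors requested; continuing from the sweep sketch in Section 3.1 (reusing its kdt, queries, and true_ids names):

    for k in [1, 3, 30, 100]:
        ids, _ = kdt.nn_index(queries, num_neighbors=k, checks=128)
        ids = np.asarray(ids).reshape(len(queries), -1)   # k=1 returns a 1-D array
        hits = sum(len(set(a) & set(t[:k])) for a, t in zip(ids, true_ids))
        print("k=%d precision=%.3f" % (k, hits / float(ids.size)))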

Fig. 6: (precision, search time gain over linear) distribution on SIFT1M dataset with query k=1

Fig. 7: (precision, search time gain over linear) distribution on SIFT1M dataset with query k=3

Fig. 8: (precision, search time gain over linear) distribution on SIFT1M dataset with query k=30

Fig. 9: (precision, search time gain over linear) distribution on SIFT1M dataset with query k=100

4 The Results

To be continued...

5 The Bibliography

[1] Marius Muja and David G. Lowe. Scalable Nearest Neighbor Algorithms for High Dimensional Data. Pattern Analysis and Machine Intelligence (PAMI), Vol. 36, 2014.

[2] A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. In Proc. Int'l Conf. on Very Large Data Bases (VLDB), 1999.

[3] Qin Lv, William Josephson, Zhe Wang, Moses Charikar, and Kai Li. Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search. In Proc. 33rd Int'l Conf. on Very Large Data Bases (VLDB), Vienna, Austria, September 2007.

[4] Marius Muja and David Lowe. FLANN - Fast Library for Approximate Nearest Neighbors User Manual. January 24, 2013.

[5] C. Silpa-Anan and R. Hartley. Optimised KD-trees for Fast Image Descriptor Matching. In Proc. Computer Vision and Pattern Recognition (CVPR), 2008.

[6] Ioannis Z. Emiris and Dimitri Nicolopoulos. Randomized kd-trees for Approximate Nearest Neighbor Search. November 28, 2013.

[7] Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji. Hashing for Similarity Search: A Survey. arXiv:1408.2927v1 [cs.DS], 13 Aug 2014.

[8] Malcolm Slaney, Yury Lifshits, and Junfeng He. Optimal Parameters for Locality-Sensitive Hashing. Proceedings of the IEEE, Vol. 100, No. 9, September 2012.

[9] Nearest neighbor search. Wikipedia, https://en.wikipedia.org/wiki/Nearest_neighbor_search. Accessed 12 Oct 2016.

[10] FLANN - Fast Library for Approximate Nearest Neighbors. University of British Columbia, http://www.cs.ubc.ca/research/flann/. Accessed 14 Oct 2016.

[11] ALGLIB User Guide. ALGLIB Project, http://www.alglib.net/other/nearestneighbors.php. Accessed 14 Oct 2016.

[12] Jason. Analysis of the LSH Algorithm Framework (LSH 算法框架分析). Web, http://blog.jasonding.top/2018/01/01/MLStick/, http://www.cnblogs.com/hxsyl/p/4626092.html. Accessed 11 Oct 2016.

[13] 坏人. The LSH Locality-Sensitive Hashing Algorithm (LSH位置敏感哈希算法). CSDN, http://blog.csdn.net/u013378306/article/details/52473176, 8 Sep 2016.
