Computer Vision: Image Classification of Furniture & Home Goods
Group Members:
Li Qin, lemeoqin@gwu.edu GWID: G24923341
Meixin Wang, meixin_wang@gwu.edu GWID: G23968837
This is a class project for CSCI 4527/6527 Computer Vision at George Washington University.
1. Introduction
1.1. Problem and real-world application
The boom of e-commerce enables consumers to shop anywhere. Shoppers are moving from offline to online stores for the convenience of searching a massive catalog of goods and finding the best deal.
Online shopping presents challenges in indexing and search, which require accurate and fast recognition of product categories. In traditional e-commerce, tags are defined and added by sellers. Manual classification has two main problems: on the one hand, a great deal of effort is spent processing a mass of data; on the other, seller-defined attributes may not sufficiently characterize all the visual style elements. On platforms like eBay in particular, the same item is often listed with different tags by different users.
The problems of labor cost and label inconsistency can be addressed by automatic image recognition and classification using deep learning. However, many of today's recognition systems struggle in real-world applications that require fine-grained visual categorization. It is difficult for a machine to perceive subtle differences between attribute sub-categories, such as stock pot vs. saucepan, yet these differences can be important for shopping decisions. Tackling issues like this is also the subject of the iMaterialist Challenge (Furniture) at FGVC5 on Kaggle.
Many applications can be enhanced by fine-grained furniture classification:
More specific search. For example, Craigslist users could search for “antique Chinese dish set” rather than “plate,” because computer vision can automatically add more tags to a large number of images.
Similar-style retrieval. In Taobao, the online store operated by Alibaba, the “look for similar items” function is popular: shoppers can find items similar to those seen in popular TV shows. Automatic labeling can provide accurate results based on subtle visual differences.
Interior design recommendation. Amazon could recommend a loft bed to shoppers who have viewed images of kids' plates in its online store, and market to those target consumers.
1.2. Example Output
For example, on Craigslist, after a seller uploads an image, a fine-grained tag would be generated automatically. Images would then be listed under different sub-categories of an attribute to enable more accurate search and browsing.
2. Related Work
Image classification is the task of predicting the label of an input image from a fixed set of categories. This is one of the core problems in Computer Vision. Challenges associated with this task include viewpoint variation, scale variation, deformation, occlusion, illumination conditions, background clutter and intra-class variation.
Intra-class variation refers to relatively broad subclasses within one class of interest, which fine-grained classification addresses. In contrast to general object recognition, it is more efficient to learn the differences in critical parts within the same class to discriminate sub-classes, rather than learning from the whole image. [1]
2.1 A Review of CNN
A Convolutional Neural Network (CNN) is a kind of deep, feed-forward neural network that has attracted a great deal of attention in recent years. Jiuxiang Gu et al. [2] reported (as shown in Figure 2) that CNNs are used in many fields, such as image classification, object detection, pose estimation, visual saliency detection, and action recognition. In these applications, a CNN is mostly used to extract and identify information from images. Convolution is a mathematical operation that produces a third function from two other functions; here, the convolution kernel is a matrix that acts as an image filter, extracting specific information from a picture. Two further important CNN concepts are translation invariance and weight sharing. For example, if a picture contains a cat, a CNN can identify it whether the cat is on the left or the right of the picture; this is translation invariance. To achieve it, the weights and biases must be the same at every position within a layer's filter, which is weight sharing.
Figure 3 shows how a CNN works. For an image, the filter (also called a kernel) is the same across one feature map, following weight sharing. There can be several different filters, each extracting and identifying different information from the input image. After the convolutions there is an operation called subsampling or pooling, which combines the outputs of a cluster of neurons into a single neuron. For example, max pooling passes the maximum value of each 2×2 cluster to one neuron in the next layer. After repeating these two steps, a fully connected network produces the final result.
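To make these two building blocks concrete, here is a minimal PyTorch sketch (our own illustration, not code from [2]) that applies a convolutional layer and 2×2 max pooling to a toy image; all sizes are arbitrary example values:

```python
import torch
import torch.nn as nn

# One convolutional layer followed by 2x2 max pooling. The same 3x3
# kernels (weights shared across all spatial positions) slide over the
# image, which is what gives the translation invariance described above.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
pool = nn.MaxPool2d(kernel_size=2, stride=2)  # keep the max of each 2x2 cluster

x = torch.randn(1, 3, 32, 32)   # a batch of one 32x32 RGB image
features = conv(x)              # -> (1, 8, 32, 32): 8 feature maps
pooled = pool(features)         # -> (1, 8, 16, 16): subsampled maps
print(features.shape, pooled.shape)
```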
2.2 Related Work Based on CNN
Convolutional descriptors are mostly used to obtain localized information, because they extract more mid-level features such as object parts, compared with the features of fully connected layers (i.e., the whole image). [3]
There are several famous CNN architectures:
LeNet, the first famous CNN application, realized by Yann LeCun in the 1990s;
AlexNet, implemented by Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton, consists of 5 convolutional layers (some followed by max-pooling layers) and 3 fully connected layers with a final 1000-way softmax. [4]
GoogLeNet, implemented by Christian Szegedy et al., reduced parameters drastically compared with AlexNet (from 60M to 4M). [5]
VGGNet, realized by Karen Simonyan and Andrew Zisserman, with variants of 16 and 19 weight layers.
ResNet, realized by Kaiming He et al., utilizes skip connections (short-cuts) to jump over some layers, avoiding the vanishing-gradient problem. [6]
DenseNet, created by Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger, is a densely connected convolutional network. [7]
However, many part-based fine-grained approaches use the deep convolutional descriptors directly, regardless of whether the part descriptors are useful. Since fine-grained recognition does not benefit from most deep descriptors, Xiu-Shen Wei et al. [8] proposed a novel end-to-end Mask-CNN model to select the useful dimensions inside feature vectors.
3. The Approach
As the related work shows, fine-grained image classification is an extremely challenging branch of image classification. In both academia and industry, CNN-based methods are generally used to solve fine-grained classification problems today.
After studying computer vision in class and reading many papers on computer vision and deep learning, we decided to use two CNN architectures, AlexNet and DenseNet, to tackle the fine-grained classification of furniture in this project. This choice also reflects the development of convolutional neural networks in computer vision.
3.1 Introduction to AlexNet Approach
AlexNet is the model used by Hinton and his student Alex Krizhevsky [4] in the 2012 ImageNet challenge, where it set a new record for image-classification accuracy. Figure 4 shows AlexNet's network structure, drawn with Caffe's draw_net function.
AlexNet comprises 8 layers in total: the first 5 are convolutional and the last 3 are fully connected. The paper reports that removing any one of the convolutional layers significantly degrades the results. [4] The composition of each layer is as follows:
The first convolutional layer takes a 227×227×3 input image and applies 96 kernels of size 11×11×3, sliding right or down in steps of 4 pixels, producing 55×55 outputs per kernel. The result is then response-normalized (Local Response Normalization) and pooled. AlexNet splits computation across two GPUs, so the first convolutional layer's output consists of two halves. With pool_size=(3,3) and a sliding step of 2 pixels, this yields 96×27×27 features.
The second convolutional layer uses 256 kernels (distributed across the two GPUs, 128 kernels of 5×5×48 each) with pad_size=(2,2), moving in steps of 1 pixel, producing 27×27 convolved maps. After Local Response Normalization and pooling over 3×3 rectangles with a step of 2 pixels, this yields 256×13×13 features.
The third and fourth layers have neither Local Response Normalization nor pooling, and the fifth layer has pooling only. The third layer uses 384 kernels of 3×3×256 with pad_size=(1,1) (padding the 13×13 input to 15×15); convolving with a step of 1 pixel gives 384×13×13 features. The fourth layer uses 384 kernels (3×3, pad_size=(1,1), step 1 pixel) to get 384×13×13 features. The fifth layer uses 256 kernels (3×3, pad_size=(1,1)) to get 256×13×13 features, then pools with pool_size=(3,3) and a step of 2 pixels to get 256×6×6 features.
Fully connected layers: the first two have 4096 neurons each, and the final output is a 1000-way softmax (the 1000 ImageNet classes).
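The layer sizes above translate almost directly into code. The following is a minimal PyTorch sketch of a single-GPU version of the architecture (our own illustration; Local Response Normalization, Dropout, and the original two-GPU split are omitted for brevity):

```python
import torch.nn as nn

# Single-GPU AlexNet sketch following the layer sizes described above.
alexnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4),    # 227x227x3 -> 96x55x55
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),         # -> 96x27x27
    nn.Conv2d(96, 256, kernel_size=5, padding=2),  # -> 256x27x27
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),         # -> 256x13x13
    nn.Conv2d(256, 384, kernel_size=3, padding=1), # -> 384x13x13
    nn.ReLU(inplace=True),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), # -> 384x13x13
    nn.ReLU(inplace=True),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), # -> 256x13x13
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),         # -> 256x6x6
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),  # 1000-way output; softmax applied in the loss
)
```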
3.2 Introduction to DenseNet Approach
In 2012, the appearance of AlexNet made deep learning a new research hotspot. AlexNet introduced several new techniques: Dropout, ReLU, and training on GPUs with big data. Subsequent developments include increasing network depth, enhancing the convolution modules, moving to detection tasks, and adding new functional units.
On ImageNet, networks have grown deeper as errors have dropped. Before ResNet, networks rarely exceeded 20 layers. ResNet alleviates the vanishing-gradient problem and allows deeper networks to be trained well: the Lth layer is obtained by transforming the (L-1)th layer through a function H (comprising Conv, BN, ReLU, and Pooling) and then adding a direct connection from the previous layer, which lets gradients propagate better.
DenseNet won the CVPR 2017 Best Paper award. One obvious difference from ResNet is that ResNet sums feature maps while DenseNet concatenates them: the input of each layer includes the outputs of all preceding layers. The number of input channels of layer L equals k0 + k × (L − 1), where k is the growth rate (the number of channels each layer contributes) and k0 is the number of input channels; for example, the schematic in Figure 5 uses a growth rate of k = 4.
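In equation form (our notation, loosely following [7]), the two connection patterns and the resulting channel count are:

```latex
x_L = H_L(x_{L-1}) + x_{L-1}            % ResNet: element-wise sum
x_L = H_L([x_0, x_1, \ldots, x_{L-1}])  % DenseNet: channel concatenation
c_L = k_0 + k \times (L - 1)            % input channels of layer L
```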
DenseNet improves the flow of information and gradients through the network: each layer obtains gradients directly from the loss function and receives the input signal directly, so deeper networks can be trained. This network structure also has a regularizing effect. In contrast to networks that improve performance through depth or width, DenseNet improves performance through feature reuse.
In a DenseNet, there is a direct connection between any two layers: the input of each layer is the concatenation of the outputs of all previous layers, and the feature maps learned by each layer are passed directly to all subsequent layers as input. Figure 5 is a schematic of DenseNet from the DenseNet paper [7].
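To make the connection pattern concrete, here is a simplified dense block in PyTorch (our own sketch; the official implementation also includes bottleneck and transition layers):

```python
import torch
import torch.nn as nn

class SimpleDenseBlock(nn.Module):
    """Each layer receives the concatenation of the block input and
    the outputs of all earlier layers (concatenation, not summation)."""
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            channels = in_channels + i * growth_rate  # k0 + k*i inputs
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))
            features.append(out)
        return torch.cat(features, dim=1)

block = SimpleDenseBlock(in_channels=16, growth_rate=4, num_layers=4)
y = block(torch.randn(1, 16, 32, 32))  # -> (1, 16 + 4*4, 32, 32)
```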
4. Implementation and Analysis
4.1 Dataset Analysis
The data is provided by the FGVC5 workshop's iMaterialist challenge. The dataset covers images of 128 furniture and home goods classes with one ground-truth label per image: 194,828 images for training, 6,400 for validation, and 12,800 for testing. The number of training examples per class is heavily skewed, with about 4,000 examples in class 20 and fewer than 500 in class 83.
To visualize the data, we used a notebook written by Andrew Ribeiro.
Eight examples each of categories 22 and 23 are shown in Figures 6 and 7. The data presents several challenges for classification:
- slight differences between similar sub-categories, such as Holland chair vs. French chair;
- huge variety within one class, such as view-angle variation, illumination conditions, and intra-class variation;
- some images partially covered by artifacts;
- some images taken in natural environments, while others are isolated from their background.
4.2 Implementing AlexNet Based on TensorFlow
1. Set up the convolutional layers.
2. Define the AlexNet network parameters at each layer.
3. Train the AlexNet network.
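A condensed sketch of these three steps in TensorFlow (Keras API; the layer setup mirrors Section 3.1, but the names and the commented data pipeline are illustrative assumptions, not our exact project code):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Steps 1-2: set up the convolutional layers and network parameters.
model = models.Sequential([
    layers.Conv2D(96, 11, strides=4, activation='relu',
                  input_shape=(227, 227, 3)),
    layers.MaxPooling2D(3, strides=2),
    layers.Conv2D(256, 5, padding='same', activation='relu'),
    layers.MaxPooling2D(3, strides=2),
    layers.Conv2D(384, 3, padding='same', activation='relu'),
    layers.Conv2D(384, 3, padding='same', activation='relu'),
    layers.Conv2D(256, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(3, strides=2),
    layers.Flatten(),
    layers.Dense(4096, activation='relu'),
    layers.Dense(4096, activation='relu'),
    layers.Dense(128, activation='softmax'),  # 128 furniture classes
])

# Step 3: train with plain gradient descent (cf. Section 4.4).
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(train_images, train_labels, epochs=10,
#           validation_data=(val_images, val_labels))
```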
4.3 Implementing DenseNet Based on PyTorch
1. Choose the DenseNet201() model from PyTorch.
2. Define the DenseNet network parameters; use CUDA to run on GPUs.
3. Train the DenseNet network.
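A minimal sketch of these steps (our own example using torchvision; the commented training loop assumes a standard train_loader):

```python
import torch
import torch.nn as nn
from torchvision import models

# Step 1: choose the DenseNet-201 model from torchvision.
model = models.densenet201(pretrained=True)

# Step 2: replace the classifier head for the 128 classes and use CUDA.
model.classifier = nn.Linear(model.classifier.in_features, 128)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

# Step 3: train with Adam (cf. Section 4.4).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
# for images, labels in train_loader:
#     images, labels = images.to(device), labels.to(device)
#     optimizer.zero_grad()
#     loss = criterion(model(images), labels)
#     loss.backward()
#     optimizer.step()
```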
4.4 Tests and Results
The network parameters have not yet been tuned optimally. The results so far are as follows:
AlexNet was trained for 10 epochs with plain gradient descent, reaching 65% accuracy.
DenseNet was trained for 20 epochs with the Adam optimizer, reaching 79% validation accuracy.
4.5 Analysis
The accuracy of both AlexNet and DenseNet can be improved by tailoring training to the characteristics of the furniture dataset:
- In the training set, some of the 128 furniture classes have far more samples than others, leading to imbalanced training data. Imbalanced classes can have a detrimental effect on classification performance.
Solution: down-sampling and over-sampling can be used to reduce the imbalance of the dataset [9]; one concrete sketch follows this list.
- Adjust the hyperparameter settings empirically to optimize the neural network.
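As one concrete way to implement over-sampling in PyTorch, a WeightedRandomSampler can draw rare-class images more often (a sketch under the assumption that train_labels holds the class index of every training image and train_dataset is the matching Dataset object):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# Assumption: train_labels lists the class index of every training image.
labels = np.asarray(train_labels)
class_counts = np.bincount(labels)           # images per class
sample_weights = 1.0 / class_counts[labels]  # rare classes weigh more

sampler = WeightedRandomSampler(
    weights=torch.as_tensor(sample_weights, dtype=torch.double),
    num_samples=len(sample_weights),
    replacement=True,  # over-samples small classes relative to large ones
)
loader = DataLoader(train_dataset, batch_size=64, sampler=sampler)
```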
5. Conclusions and Observations
1. Compared with AlexNet (2012), DenseNet (2017) has several advantages:
Fewer parameters: to achieve the same accuracy on the dataset, DenseNet requires about 1/30 of AlexNet's parameters. For industry, small models significantly save bandwidth and reduce storage overhead.
Less computation: to achieve the same accuracy as AlexNet, DenseNet requires much less computational effort.
Good resistance to overfitting: DenseNet is especially suitable for applications where training data is relatively scarce.
2. Details to note when using DenseNet:
- The bottleneck (1×1 convolution) at the beginning of each layer is very useful for reducing the number of parameters and computation. The approach is to add 1×1×96 and 1×1×16 convolution kernels before the 3×3 and 5×5 kernels, and a 1×1×32 kernel after pooling. The parameter count is computed the same way: the number of input channels multiplied by the current kernel size and the number of output channels. The parameter count of the module in question is about 160K, so a 1×1 convolution kernel can reduce parameters by shrinking the channel dimension (a worked example follows this list).
- Making each layer narrower reduces DenseNet's computational efficiency on the GPU, but it may improve efficiency on the CPU.
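As a worked example of this parameter arithmetic (illustrative numbers of our own, not taken from the paper): a 3×3 convolution mapping 256 channels to 256 channels costs 256 × 3 × 3 × 256 ≈ 590K parameters, while inserting a 1×1 bottleneck that first reduces 256 channels to 64 costs 256 × 1 × 1 × 64 + 64 × 3 × 3 × 256 ≈ 164K parameters, roughly a 3.6× reduction.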
3. About the dataset provided by Kaggle:
- Many of the downloaded images could not be read, and we handled this in code (see the sketch below). The URLs of many dataset images (about 5%) are invalid.
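One simple way to filter out unreadable files (a sketch of the idea, not our exact project code) is to verify each image with PIL and skip the failures:

```python
import os
from PIL import Image

def find_readable_images(image_dir):
    """Return the paths of images PIL can decode, skipping files that
    were truncated or whose URLs returned invalid content."""
    readable = []
    for name in os.listdir(image_dir):
        path = os.path.join(image_dir, name)
        try:
            with Image.open(path) as img:
                img.verify()  # raises an exception on corrupt files
            readable.append(path)
        except Exception:
            continue
    return readable
```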
4. Although the computer vision course project has ended, we will continue to optimize the network parameters and try other CNN architectures.
References:
[1] N. Zhang, J. Donahue, R. Girshick, and T. Darrell. Part-based R-CNNs for fine-grained category detection. In ECCV 2014, Part I, LNCS 8689, pages 834–849, 2014.
[2] J. Gu et al. Recent advances in convolutional neural networks. arXiv:1512.07108, 2015.
[3] M. D. Zeiler, G. W. Taylor, and R. Fergus. Adaptive deconvolutional networks for mid and high-level feature learning. In ICCV, pages 2018–2025, 2011.
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
[5] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich. Going Deeper with Convolutions. In CVPR, pages 1-9, 2015.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv:1512.03385, 2015.
[7] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. arXiv:1608.06993, 2016.
[8] X.-S. Wei, C.-W. Xie, and J. Wu. Mask-CNN: Localizing parts and selecting descriptors for fine-grained image recognition. arXiv:1605.06878, 2016.
[9] M. Buda, A. Maki, and M. A. Mazurowski. A systematic study of the class imbalance problem in convolutional neural networks. arXiv:1710.05381, 2017.