Computer Vision: Image Classification of Furniture & Home Goods
Group Members:
Li Qin, lemeoqin@gwu.edu GWID: G24923341
Meixin Wang, meixin_wang@gwu.edu GWID: G23968837
This is a class project for CSCI 4527/6527 Computer Vision at George Washington University.
1. Introduction
1.1. Problem and real-world application
The boom of e-commerce enables consumers to shop anywhere. Shoppers are moving from offline to online stores for the convenience of searching a massive catalog of goods and finding the best deal.
Online shopping presents challenges in indexing and search, which require accurate and fast recognition of product categories. In traditional e-commerce, tags are defined and added by sellers. Manual classification has two main problems: on the one hand, a great deal of effort is spent processing a mass of data; on the other, seller-defined attributes may not sufficiently characterize all the visual style elements. On platforms like eBay in particular, the same item is often listed with different tags by different users.
The problems of labor cost and label inconsistency can be addressed by automatic image recognition and classification using deep learning. However, many of today's recognition systems struggle in real-world applications that require fine-grained visual categorization. It is difficult for a machine to perceive subtle differences between attribute sub-categories, such as stock pot vs. saucepan, yet these differences can be important for shopping decisions. Tackling issues like this is also the subject of the iMaterialist Challenge (Furniture) at FGVC5 on Kaggle.
Many applications can be enhanced by fine-grained furniture classification:
More specific search. For example, Craigslist users could search for “antique Chinese dish set” rather than “plate,” because computer vision can automatically add more tags to a large number of images.
Similar-style retrieval. In Taobao, the online store operated by Alibaba, the “look for similar items” function is popular: shoppers can find items similar to those seen in popular TV shows. Automatic labeling can provide accurate results based on subtle visual differences.
Interior design recommendation. Amazon could recommend a loft bed to shoppers who have viewed images of kids' plates in its online store, and market to those target consumers.
1.2. Example Output
For example, on Craigslist, after a seller uploads an image, a fine-grained tag would be generated automatically. Images would then be listed under different sub-categories of an attribute to enable more accurate search and browsing.
2. Related Work
Image classification is the task of predicting the label of an input image from a fixed set of categories. This is one of the core problems in Computer Vision. Challenges associated with this task include viewpoint variation, scale variation, deformation, occlusion, illumination conditions, background clutter and intra-class variation.
Intra-class variation refers to relatively broad subclasses within one class of interest, which fine-grained classification addresses. In contrast to general object recognition, it is more efficient to learn the differences in critical parts within the same class to discriminate sub-classes, rather than learning from the whole image. [1]
2.1 A Review of CNN
A Convolutional Neural Network (CNN) is a kind of deep, feed-forward neural network that has attracted a great deal of attention in recent years. Jiuxiang Gu et al. [2] reported (as shown in Figure 2) that CNNs are used in many fields, such as image classification, object detection, pose estimation, visual saliency detection, and action recognition. In these applications, a CNN is mostly used to extract and identify information from images. Convolution is a mathematical operation that produces a third function from two other functions; here, the convolution kernel is a matrix that acts as an image filter, extracting specific information from a picture. Two further important CNN concepts are translation invariance and weight sharing. For example, if a picture contains a cat, a CNN can identify it whether the cat is on the left or the right of the picture; this is translation invariance. To achieve it, the weights and biases must be the same at every position within a layer's filter, which is weight sharing.
Figure 3 shows how a CNN works. For an image, the filter (also called a kernel) is the same across one feature map, following weight sharing. There can be several different filters, each extracting and identifying different information from the input image. After the convolutions there is an operation called subsampling or pooling, which combines the outputs of a cluster of neurons into a single neuron. For example, max pooling passes the maximum value of each 2×2 cluster to one neuron in the next layer. After repeating these two steps, a fully connected network produces the final result.
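To make these two building blocks concrete, here is a minimal PyTorch sketch (our own illustration, not code from [2]) that applies a convolutional layer and 2×2 max pooling to a toy image; all sizes are arbitrary example values:

```python
import torch
import torch.nn as nn

# One convolutional layer followed by 2x2 max pooling. The same 3x3
# kernels (weights shared across all spatial positions) slide over the
# image, which is what gives the translation invariance described above.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
pool = nn.MaxPool2d(kernel_size=2, stride=2)  # keep the max of each 2x2 cluster

x = torch.randn(1, 3, 32, 32)   # a batch of one 32x32 RGB image
features = conv(x)              # -> (1, 8, 32, 32): 8 feature maps
pooled = pool(features)         # -> (1, 8, 16, 16): subsampled maps
print(features.shape, pooled.shape)
```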
2.2 Related Work Based on CNN
Convolutional descriptors are mostly used to obtain localized information, because they extract more mid-level features such as object parts, compared with the features of fully connected layers (i.e., the whole image). [3]
There are several famous CNN architectures:
LeNet, the first famous CNN application, realized by Yann LeCun in the 1990s;
AlexNet, implemented by Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton, consists of 5 convolutional layers (some followed by max-pooling layers) and 3 fully connected layers with a final 1000-way softmax. [4]
GoogLeNet, implemented by Christian Szegedy et al., reduced parameters drastically compared with AlexNet (from 60M to 4M). [5]
VGGNet, realized by Karen Simonyan and Andrew Zisserman, with variants of 16 and 19 weight layers.
ResNet, realized by Kaiming He et al., utilizes skip connections (short-cuts) to jump over some layers, avoiding the vanishing-gradient problem. [6]
DenseNet, created by Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger, is a densely connected convolutional network. [7]
However, many part-based fine-grained approaches use the deep convolutional descriptors directly, regardless of whether the part descriptors are useful. Since fine-grained recognition does not benefit from most deep descriptors, Xiu-Shen Wei et al. [8] proposed a novel end-to-end Mask-CNN model to select the useful dimensions inside feature vectors.
3. The Approach
As the related work shows, fine-grained image classification is an extremely challenging branch of image classification. In both academia and industry, CNN-based methods are generally used to solve fine-grained classification problems today.
After studying computer vision in class and reading many papers on computer vision and deep learning, we decided to use two CNN architectures, AlexNet and DenseNet, to tackle the fine-grained classification of furniture in this project. This choice also reflects the development of convolutional neural networks in computer vision.
3.1 Introduction to AlexNet Approach
AlexNet is the model used by Hinton and his student Alex Krizhevsky [4] in the 2012 ImageNet challenge, where it set a new record for image-classification accuracy. Figure 4 shows AlexNet's network structure, drawn with Caffe's draw_net function.
AlexNet comprises 8 layers in total: the first 5 are convolutional and the last 3 are fully connected. The paper reports that removing any one of the convolutional layers significantly degrades the results. [4] The composition of each layer is as follows:
The first convolutional layer takes a 227×227×3 input image and applies 96 kernels of size 11×11×3, sliding right or down in steps of 4 pixels, producing 55×55 outputs per kernel. The result is then response-normalized (Local Response Normalization) and pooled. AlexNet splits computation across two GPUs, so the first convolutional layer's output consists of two halves. With pool_size=(3,3) and a sliding step of 2 pixels, this yields 96×27×27 features.
The second convolutional layer uses 256 kernels (distributed across the two GPUs, 128 kernels of 5×5×48 each) with pad_size=(2,2), moving in steps of 1 pixel, producing 27×27 convolved maps. After Local Response Normalization and pooling over 3×3 rectangles with a step of 2 pixels, this yields 256×13×13 features.
The third and fourth layers have neither Local Response Normalization nor pooling, and the fifth layer has pooling only. The third layer uses 384 kernels of 3×3×256 with pad_size=(1,1) (padding the 13×13 input to 15×15); convolving with a step of 1 pixel gives 384×13×13 features. The fourth layer uses 384 kernels (3×3, pad_size=(1,1), step 1 pixel) to get 384×13×13 features. The fifth layer uses 256 kernels (3×3, pad_size=(1,1)) to get 256×13×13 features, then pools with pool_size=(3,3) and a step of 2 pixels to get 256×6×6 features.
Fully connected layers: the first two have 4096 neurons each, and the final output is a 1000-way softmax (the 1000 ImageNet classes).
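The layer sizes above translate almost directly into code. The following is a minimal PyTorch sketch of a single-GPU version of the architecture (our own illustration; Local Response Normalization, Dropout, and the original two-GPU split are omitted for brevity):

```python
import torch.nn as nn

# Single-GPU AlexNet sketch following the layer sizes described above.
alexnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4),    # 227x227x3 -> 96x55x55
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),         # -> 96x27x27
    nn.Conv2d(96, 256, kernel_size=5, padding=2),  # -> 256x27x27
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),         # -> 256x13x13
    nn.Conv2d(256, 384, kernel_size=3, padding=1), # -> 384x13x13
    nn.ReLU(inplace=True),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), # -> 384x13x13
    nn.ReLU(inplace=True),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), # -> 256x13x13
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),         # -> 256x6x6
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),  # 1000-way output; softmax applied in the loss
)
```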
3.2 Introduction to DenseNet Approach
In 2012, the appearance of AlexNet made deep learning a new research hotspot. AlexNet introduced several new techniques: Dropout, ReLU, and training on GPUs with big data. Subsequent developments include increasing network depth, enhancing the convolution modules, moving to detection tasks, and adding new functional units.
On ImageNet, networks have grown deeper as errors have dropped. Before ResNet, networks rarely exceeded 20 layers. ResNet alleviates the vanishing-gradient problem and allows deeper networks to be trained well: the Lth layer is obtained by transforming the (L-1)th layer through a function H (comprising Conv, BN, ReLU, and Pooling) and then adding a direct connection from the previous layer, which lets gradients propagate better.
DenseNet won the CVPR 2017 Best Paper award. One obvious difference from ResNet is that ResNet sums feature maps while DenseNet concatenates them: the input of each layer includes the outputs of all preceding layers. The number of input channels of layer L equals k0 + k × (L − 1), where k is the growth rate (the number of channels each layer contributes) and k0 is the number of input channels; for example, the schematic in Figure 5 uses a growth rate of k = 4.
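In equation form (our notation, loosely following [7]), the two connection patterns and the resulting channel count are:

```latex
x_L = H_L(x_{L-1}) + x_{L-1}            % ResNet: element-wise sum
x_L = H_L([x_0, x_1, \ldots, x_{L-1}])  % DenseNet: channel concatenation
c_L = k_0 + k \times (L - 1)            % input channels of layer L
```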
DenseNet improves the flow of information and gradients through the network: each layer obtains gradients directly from the loss function and receives the input signal directly, so deeper networks can be trained. This network structure also has a regularizing effect. In contrast to networks that improve performance through depth or width, DenseNet improves performance through feature reuse.
In a DenseNet, there is a direct connection between any two layers: the input of each layer is the concatenation of the outputs of all previous layers, and the feature maps learned by each layer are passed directly to all subsequent layers as input. Figure 5 is a schematic of DenseNet from the DenseNet paper [7].
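To make the connection pattern concrete, here is a simplified dense block in PyTorch (our own sketch; the official implementation also includes bottleneck and transition layers):

```python
import torch
import torch.nn as nn

class SimpleDenseBlock(nn.Module):
    """Each layer receives the concatenation of the block input and
    the outputs of all earlier layers (concatenation, not summation)."""
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            channels = in_channels + i * growth_rate  # k0 + k*i inputs
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))
            features.append(out)
        return torch.cat(features, dim=1)

block = SimpleDenseBlock(in_channels=16, growth_rate=4, num_layers=4)
y = block(torch.randn(1, 16, 32, 32))  # -> (1, 16 + 4*4, 32, 32)
```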
4. Implementation and Analysis
4.1 Dataset Analysis
The data is provided by the FGVC5 workshop's iMaterialist challenge. The dataset covers images of 128 furniture and home goods classes with one ground-truth label per image: 194,828 images for training, 6,400 for validation, and 12,800 for testing. The number of training examples per class is heavily skewed, with about 4,000 examples in class 20 and fewer than 500 in class 83.
To visualize the data, we used a notebook written by Andrew Ribeiro.
Eight examples each of categories 22 and 23 are shown in Figures 6 and 7. The data presents several challenges for classification:
- slight differences between similar sub-categories, such as Holland chair vs. French chair;
- huge variety within one class, such as view-angle variation, illumination conditions, and intra-class variation;
- some images partially covered by artifacts;
- some images taken in natural environments, while others are isolated from their background.
4.2 Implementing AlexNet Based on TensorFlow
1. Set up the convolutional layers.
2. Define the AlexNet network parameters at each layer.
3. Train the AlexNet network.
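A condensed sketch of these three steps in TensorFlow (Keras API; the layer setup mirrors Section 3.1, but the names and the commented data pipeline are illustrative assumptions, not our exact project code):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Steps 1-2: set up the convolutional layers and network parameters.
model = models.Sequential([
    layers.Conv2D(96, 11, strides=4, activation='relu',
                  input_shape=(227, 227, 3)),
    layers.MaxPooling2D(3, strides=2),
    layers.Conv2D(256, 5, padding='same', activation='relu'),
    layers.MaxPooling2D(3, strides=2),
    layers.Conv2D(384, 3, padding='same', activation='relu'),
    layers.Conv2D(384, 3, padding='same', activation='relu'),
    layers.Conv2D(256, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(3, strides=2),
    layers.Flatten(),
    layers.Dense(4096, activation='relu'),
    layers.Dense(4096, activation='relu'),
    layers.Dense(128, activation='softmax'),  # 128 furniture classes
])

# Step 3: train with plain gradient descent (cf. Section 4.4).
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(train_images, train_labels, epochs=10,
#           validation_data=(val_images, val_labels))
```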
4.3 Implementing DenseNet Based on PyTorch
1. Choose the DenseNet201() model from PyTorch.
2. Define the DenseNet network parameters; use CUDA to run on GPUs.
3. Train the DenseNet network.
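A minimal sketch of these steps (our own example using torchvision; the commented training loop assumes a standard train_loader):

```python
import torch
import torch.nn as nn
from torchvision import models

# Step 1: choose the DenseNet-201 model from torchvision.
model = models.densenet201(pretrained=True)

# Step 2: replace the classifier head for the 128 classes and use CUDA.
model.classifier = nn.Linear(model.classifier.in_features, 128)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

# Step 3: train with Adam (cf. Section 4.4).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
# for images, labels in train_loader:
#     images, labels = images.to(device), labels.to(device)
#     optimizer.zero_grad()
#     loss = criterion(model(images), labels)
#     loss.backward()
#     optimizer.step()
```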
4.4 Tests and Results
The network parameters have not yet been tuned optimally. The results so far are as follows:
AlexNet was trained for 10 epochs with plain gradient descent, reaching 65% accuracy.
DenseNet was trained for 20 epochs with the Adam optimizer, reaching 79% validation accuracy.
4.5 Analysis
The accuracy of both AlexNet and DenseNet can be improved by tailoring training to the characteristics of the furniture dataset:
- In the training set, some of the 128 furniture classes have far more samples than others, leading to imbalanced training data. Imbalanced classes can have a detrimental effect on classification performance.
Solution: down-sampling and over-sampling can be used to reduce the imbalance of the dataset [9]; one concrete sketch follows this list.
- Adjust the hyperparameter settings empirically to optimize the neural network.
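As one concrete way to implement over-sampling in PyTorch, a WeightedRandomSampler can draw rare-class images more often (a sketch under the assumption that train_labels holds the class index of every training image and train_dataset is the matching Dataset object):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# Assumption: train_labels lists the class index of every training image.
labels = np.asarray(train_labels)
class_counts = np.bincount(labels)           # images per class
sample_weights = 1.0 / class_counts[labels]  # rare classes weigh more

sampler = WeightedRandomSampler(
    weights=torch.as_tensor(sample_weights, dtype=torch.double),
    num_samples=len(sample_weights),
    replacement=True,  # over-samples small classes relative to large ones
)
loader = DataLoader(train_dataset, batch_size=64, sampler=sampler)
```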
5. Conclusions and Observations
1. Compared with AlexNet (2012), DenseNet (2017) has several advantages:
Fewer parameters: to achieve the same accuracy on the dataset, DenseNet requires about 1/30 of AlexNet's parameters. For industry, small models significantly save bandwidth and reduce storage overhead.
Less computation: to achieve the same accuracy as AlexNet, DenseNet requires much less computational effort.
Good resistance to overfitting: DenseNet is especially suitable for applications where training data is relatively scarce.
2. Details to note when using DenseNet:
- The bottleneck (1×1 convolution) at the beginning of each layer is very useful for reducing the number of parameters and computation. The approach is to add 1×1×96 and 1×1×16 convolution kernels before the 3×3 and 5×5 kernels, and a 1×1×32 kernel after pooling. The parameter count is computed the same way: the number of input channels multiplied by the current kernel size and the number of output channels. The parameter count of the module in question is about 160K, so a 1×1 convolution kernel can reduce parameters by shrinking the channel dimension (a worked example follows this list).
- Making each layer narrower reduces DenseNet's computational efficiency on the GPU, but it may improve efficiency on the CPU.
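As a worked example of this parameter arithmetic (illustrative numbers of our own, not taken from the paper): a 3×3 convolution mapping 256 channels to 256 channels costs 256 × 3 × 3 × 256 ≈ 590K parameters, while inserting a 1×1 bottleneck that first reduces 256 channels to 64 costs 256 × 1 × 1 × 64 + 64 × 3 × 3 × 256 ≈ 164K parameters, roughly a 3.6× reduction.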
3. About the dataset provided by Kaggle:
- Many of the downloaded images could not be read, and we handled this in code (see the sketch below). The URLs of many dataset images (about 5%) are invalid.
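One simple way to filter out unreadable files (a sketch of the idea, not our exact project code) is to verify each image with PIL and skip the failures:

```python
import os
from PIL import Image

def find_readable_images(image_dir):
    """Return the paths of images PIL can decode, skipping files that
    were truncated or whose URLs returned invalid content."""
    readable = []
    for name in os.listdir(image_dir):
        path = os.path.join(image_dir, name)
        try:
            with Image.open(path) as img:
                img.verify()  # raises an exception on corrupt files
            readable.append(path)
        except Exception:
            continue
    return readable
```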
4. Although the computer vision course project has ended, we will continue to optimize the network parameters and try other CNN architectures.
References:
[1] N. Zhang, J. Donahue, R. Girshick, and T. Darrell. Part-based R-CNNs for fine-grained category detection. In ECCV 2014, Part I, LNCS 8689, pages 834–849, 2014.
[2] J. Gu et al. Recent advances in convolutional neural networks. arXiv:1512.07108, 2015.
[3] M. D. Zeiler, G. W. Taylor, and R. Fergus. Adaptive deconvolutional networks for mid and high-level feature learning. In ICCV, pages 2018–2025, 2011.
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
[5] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich. Going Deeper with Convolutions. In CVPR, pages 1-9, 2015.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv:1512.03385, 2015.
[7] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. arXiv:1608.06993, 2016.
[8] X.-S. Wei, C.-W. Xie, and J. Wu. Mask-CNN: Localizing parts and selecting descriptors for fine-grained image recognition. arXiv:1605.06878, 2016.
[9] M. Buda, A. Maki, and M. A. Mazurowski. A systematic study of the class imbalance problem in convolutional neural networks. arXiv:1710.05381, 2017.