Unsupervised Learning of Visual Representations using Videos
Xiaolong Wang, Abhinav Gupta
The Robotics Institute, Carnegie Mellon University
{xiaolonw, abhinavg}@cs.cmu.edu

arXiv:1505.00687v1 [cs.CV] 4 May 2015

Abstract
Is strong supervision necessary for learning a good visual representation? Do we really need millions of semantically-labeled images to train a ConvNet? In this paper, we present a simple yet surprisingly powerful approach for unsupervised learning of ConvNets. Specifically, we use hundreds of thousands of unlabeled videos from the web to learn visual representations. Our key idea is to track millions of patches in these videos; visual tracking provides the key supervision. That is, two patches connected by a track should have a similar visual representation in deep feature space, since they probably belong to the same object or object part. We design a Siamese-triplet network with a ranking loss function to train this ConvNet representation. Without using a single image from ImageNet, just using 100K unlabeled videos and the VOC 2012 dataset, we train an ensemble of unsupervised networks that achieves 52% mAP (no bounding box regression). This performance comes tantalizingly close to its ImageNet-supervised counterpart, an ensemble which achieves 54.4% mAP. We also show that our unsupervised network performs competitively on other tasks such as surface-normal estimation.







Figure 1. Overview of our approach. (a) Given unlabeled videos, we perform unsupervised tracking on patches in them. (b) Triplets of patches, consisting of the query patch in the initial frame of a track, the tracked patch in the last frame, and a random patch from another video, are fed into our Siamese-triplet network for training. (c) The learning objective: the distance between the query and tracked patch in feature space should be smaller than the distance between the query and random patches.

1. Introduction
What is a good visual representation and how can we learn it? At the start of this decade, most computer vision research focused on "what" and used hand-designed features such as SIFT [29] and HOG [5] as the underlying visual representation. Learning was often the last step, where these low-level feature representations were mapped to semantic/3D/functional categories. However, the last three years have seen a resurgence of learning visual representations directly from pixels using deep learning and ConvNets [25, 21, 20]. At the heart of ConvNets is a completely supervised learning paradigm: often millions of examples are first labeled using Mechanical Turk, followed by data augmentation to create tens of millions of training instances. ConvNets are then trained using gradient descent and back-propagation. But one question remains: is strong supervision necessary for training these ConvNets? Do we really need millions of semantically-labeled images to learn a good visual representation? Humans seem to learn visual representations with little or no semantic supervision, yet our current learning approaches remain completely supervised.

In this paper, we explore the alternative: how can we exploit unlabeled visual data on the web to train ConvNets (e.g., AlexNet [21])? In the past, there have been several attempts at unsupervised learning using millions of static images [23, 41] or frames extracted from videos [50, 44, 31]. The most common architecture is an auto-encoder, which learns representations based on its ability to reconstruct the input images [32, 3, 45, 33]. While these approaches can learn V1-like filters from unlabeled data, they are still far behind supervised approaches on tasks such as object detection. So what is the missing link? We argue that static images themselves might not contain enough information to learn a good visual representation. But what about videos? Do they have enough information to learn visual representations? In fact, humans learn their visual representations not from millions of static images but from years of dynamic sensory input. Can we give ConvNets similar learning capabilities?

We present a simple yet surprisingly powerful approach for unsupervised learning of ConvNets using hundreds of thousands of unlabeled videos from the web. Visual tracking is one of the first capabilities that develops in infants, often before semantic representations are learned.1 Taking a leaf from this observation, we propose to exploit visual tracking for learning ConvNets in an unsupervised manner. Specifically, we track millions of "moving" patches in hundreds of thousands of videos. Our key idea is that two patches connected by a track should have a similar visual representation in deep feature space, since they probably belong to the same object. We design a Siamese-triplet network with a ranking loss function to train the ConvNet representation. This ranking loss enforces that, in the final deep feature space, the first-frame patch is much closer to the tracked patch than to any other randomly sampled patch.

We demonstrate the strength of our learning algorithm with extensive experimental evaluation. Without using a single image from ImageNet [34], just using 100K unlabeled videos and the VOC 2012 dataset, we train an ensemble of AlexNet networks that achieves 52% mAP (no bounding box regression). This performance is similar to its ImageNet-supervised counterpart, an ensemble which achieves 54.4% mAP. We also show that our network trained on unlabeled videos achieves performance similar to its completely supervised counterpart on other tasks such as surface normal estimation. To the best of our knowledge, among unsupervised approaches, the results reported in this paper come closest to standard supervised ConvNets trained with millions of semantically-labeled images.

1 http://www.aoa.org/patients-and-public/good-vision-throughout-life/childrens-vision/infant-vision-birth-to-24-months-of-age

2. Related Work
Unsupervised learning of visual representations has a rich and diverse history, starting from the original auto-encoder work of Olshausen and Field [32] and early generative models. Most work in this area can be broadly divided into three categories. The first class of algorithms focuses on learning generative models with strong priors [17, 49, 42]; these algorithms essentially capture co-occurrence statistics of features. The second class of algorithms uses manually defined features such as SIFT or HOG and performs clustering over training data to discover semantic classes [39, 35]. Some recent algorithms in this line focus on learning mid-level representations rather than discovering semantic classes themselves [38, 6, 7]. The third class of algorithms, most related to our paper, learns visual representations from the pixels themselves using deep learning [18, 23, 41, 36, 26, 43, 8, 30, 2, 45].

Starting from the seminal work of Olshausen and Field [32], the goal is to learn visual representations that are (a) sparse and (b) reconstructive. Olshausen and Field [32] showed that with these criteria one can learn V1-like filters directly from data; however, their work only learned a single layer. Hinton and Salakhutdinov [18] extended this idea to train a deep belief network in an unsupervised manner by stacking layer-by-layer RBMs. Similarly, Bengio et al. [3] investigated stacking both RBMs and autoencoders. As a next step, Le et al. [23] scaled up the learning of multi-layer autoencoders to large-scale unlabeled data, and demonstrated that although the network is trained in an unsupervised manner, neurons in high layers can still respond strongly to semantic objects such as human heads and cat faces. Sermanet et al. [36] applied convolutional sparse coding to pre-train a model layer-by-layer in an unsupervised manner; the model is then fine-tuned for pedestrian detection on labeled datasets.

However, it is not clear whether static images are the right data for learning visual representations. Researchers have therefore started to focus on learning feature representations from videos [24, 40, 50, 14, 44, 31]. Early work such as [50] incorporated video-based constraints into the auto-encoder framework; the most common is the smoothness constraint, which enforces that the learned representation be temporally smooth. Similarly, Goroshin et al. [14] proposed to learn auto-encoders based on a slowness prior. Other approaches, such as Taylor et al. [44], train convolutional gated RBMs to learn latent representations from pairs of successive images. This was extended in recent work by Srivastava et al. [40], who learn an LSTM model in an unsupervised manner: given a few consecutive frames, the LSTM is trained to reconstruct the given frames and predict future frames. Our work differs from this body of work in two aspects: (a) we train our model with patches obtained from tracking; and (b) instead of training auto-encoders, we train a deep ConvNet that can be transferred to different challenging vision tasks.


Figure 2. The process of patch mining in videos. Given a video of buses (the "bus" label is not used), we perform IDT on it. In the top three pairs of images, red points represent SURF feature points and green lines represent their trajectories. We reject frames with little motion (middle pair) as well as frames with large camera motion, e.g., when the camera zooms in (right pair). Given a selected frame, we slide a window over it to find the bounding box containing the most moving SURF points. As illustrated in the bottom row, given the initial bounding box, we track it through the video for 30 frames; the query patch in the first frame and the tracked patch in the last frame are collected as one pair of training samples.

Finally, our work is also related to metric learning via deep networks [47, 28, 4, 15, 13, 19]. For example, Chopra et al. [4] proposed to learn convolutional networks in a siamese architecture for face verification, and Wang et al. [47] introduced a deep triplet ranking network to learn fine-grained image similarity. However, all of these methods require labeled data. Our work is also related to [27], which used ConvNets pre-trained on the ImageNet classification and detection datasets as initialization and performed semi-supervised learning in videos to solve object detection in the target domain. In contrast, we propose an unsupervised approach rather than a semi-supervised algorithm.


3. Overview
Our goal is to train convolutional neural networks using hundreds of thousands of unlabeled videos from the Internet. We follow the AlexNet architecture [21] to design our base network. However, since we do not have labels, it is not clear what the loss function should be or how we should optimize it. In the case of videos, though, we have another source of supervision: time. For example, the scene does not change drastically within a short time in a video, and the same object instance appears in multiple frames. How do we exploit this information to train a ConvNet-based representation?

We sample millions of patches in these videos and track them over time. Since we are tracking these patches, we know that the first and last tracked frames correspond to the same instance of a moving object or object part. Therefore, any visual representation we learn should keep these two data points close in feature space. But this constraint alone is not sufficient: all points could be mapped to a single point in feature space. Therefore, for training our ConvNet, we sample a third patch to create a triplet. For training, we use a loss function [47] that enforces that the two patches connected by tracking are closer in feature space than the first patch and a randomly sampled one.

However, a network trained with such triplets converges quickly, since the task is easy to overfit to. One option is to increase the number of training triplets. However, after initial convergence most triplets already satisfy the loss function, so back-propagating gradients from them is inefficient. Instead, analogous to hard-negative mining, we select, from multiple candidates, the third patch that violates the constraint the most (maximum loss). Selecting this patch leads to more meaningful gradients and faster learning.

4. Patch Mining in Videos
The first step in our learning procedure is to extract training instances from videos. In our case, every training instance consists of three patches. The loss function enforces that the pair of patches connected by a track has a more similar representation than any two randomly selected patches. But what do these tracked patches correspond to? Since our videos are unlabeled, the locations and extents of objects in each frame are unknown. Therefore, instead of trying to extract patches corresponding to semantic objects, we use motion information to extract moving image patches. Note that these patches might contain an object or part of an object, as shown in Figure 2. These patches are then tracked over time to obtain a second patch, and the initial and tracked patches are grouped together. The details are explained below.

Given a video, we want to extract patches of interest (patches with motion, in our case) and track them to create training instances. One obvious way to find patches of interest is to compute optical flow and use regions with high flow magnitude.


Figure 3. Examples of patch pairs we obtain via patch mining in the videos.

However, since YouTube videos are noisy and contain a lot of camera motion, it is hard to localize moving objects using raw optical flow magnitudes alone. We therefore follow a two-step approach. In the first step, we obtain SURF [1] interest points and use Improved Dense Trajectories (IDT) [46] to obtain the motion of each SURF point. Since IDT applies homography estimation (video stabilization), it reduces the problems caused by camera motion. Given the trajectories of the SURF interest points, we classify a point as moving if its flow magnitude is more than 0.5 pixels. We also reject a frame if (a) very few (< 25%) of its SURF interest points are classified as moving, since that motion might just be noise; or (b) the majority (> 75%) of its SURF interest points are classified as moving, since that usually corresponds to camera motion. Once we have the moving SURF interest points, in the second step we find the best bounding box of fixed size h × w by sliding a window over the frame and taking the box that contains the largest number of moving SURF interest points. In our experiments, we set h = 227, w = 227 for frames of size 448 × 600. A sketch of this mining step is given below.

Tracking. Given the initial bounding box, we perform tracking using the KCF tracker [16]. After tracking for 30 frames in the video, we obtain the second patch, which acts as the patch similar to the query patch in the triplet. Note that the KCF tracker does not use any supervised information apart from the initial bounding box.
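The following sketch illustrates the frame-selection and sliding-window step described above. It is not the authors' implementation; the function names, the window stride, and the array layout are our own assumptions, and the SURF/IDT extraction is assumed to have already produced per-point trajectory magnitudes.

```python
import numpy as np

def frame_is_useful(flow_mag, motion_thresh=0.5, lo=0.25, hi=0.75):
    """Keep a frame only if a moderate fraction of SURF points are moving.

    flow_mag: (N,) trajectory magnitudes (in pixels) for the N SURF points.
    Returns (keep_frame, moving_mask)."""
    moving = flow_mag > motion_thresh           # "moving" = displacement > 0.5 px
    frac = moving.mean() if moving.size else 0.0
    return (lo <= frac <= hi), moving           # reject < 25% (noise) or > 75% (camera motion)

def best_box(points, moving, frame_hw=(448, 600), box_hw=(227, 227), stride=16):
    """Slide a fixed-size window and return the box covering the most moving points."""
    pts = points[moving]                        # (M, 2) array of (x, y) locations
    H, W = frame_hw
    h, w = box_hw
    best, best_count = None, -1
    for y in range(0, H - h + 1, stride):
        for x in range(0, W - w + 1, stride):
            inside = ((pts[:, 0] >= x) & (pts[:, 0] < x + w) &
                      (pts[:, 1] >= y) & (pts[:, 1] < y + h))
            if inside.sum() > best_count:
                best_count, best = int(inside.sum()), (x, y, w, h)
    return best
```

The selected box could then be tracked for 30 frames with an off-the-shelf KCF tracker (for example, OpenCV's KCF tracker) to obtain the paired patch.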

5. Learning Via Videos

In the previous section, we discussed how we can use tracking to generate pairs of patches, where the first patch (query) is initialized based on motion and the second patch is obtained by tracking the query patch for 30 frames. We use this procedure to generate millions of such pairs (see Figure 3 for examples of mined pairs). We now describe how we use these pairs as training instances for learning our visual representation.

5.1. Siamese Triplet Network

Our goal is to learn a feature space in which the query patch is closer to the tracked patch than to any other randomly sampled patch. To learn this feature space, we design a Siamese-triplet network, which consists of three base networks that share the same parameters (see Figure 4). For our experiments, we take images of size 227 × 227 as input. The base network follows the AlexNet architecture [21] for the convolutional layers. We then stack two fully connected layers on the pool5 outputs, with 4096 and 1024 neurons respectively. Thus the final output of each base network is a 1024-dimensional feature f(·), on which we define the loss function.

Figure 4. Siamese-triplet network. Each base network in the Siamese-triplet network shares the same architecture and parameter weights (convolutional layers with 96, 256, 384, 384 and 256 filters, followed by fully connected layers with 4096 and 1024 neurons). The architecture is adapted from AlexNet by using only two fully connected layers. Given a triplet of training samples, we obtain their features from the last layer by forward propagation and compute the ranking loss.
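A minimal PyTorch sketch of the shared-weight base network described above is given below. It is a sketch under assumptions, not the authors' implementation; padding and stride values follow the standard AlexNet configuration and are not specified in the paper.

```python
import torch
import torch.nn as nn

class TripletNet(nn.Module):
    """One base network; all three triplet branches reuse it, so weights are shared."""
    def __init__(self, feat_dim=1024):
        super().__init__()
        # AlexNet-style convolutional trunk (96-256-384-384-256, as in Figure 4).
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, 11, stride=4), nn.ReLU(inplace=True), nn.MaxPool2d(3, 2),
            nn.Conv2d(96, 256, 5, padding=2), nn.ReLU(inplace=True), nn.MaxPool2d(3, 2),
            nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(3, 2),
        )
        # Two fully connected layers (4096 -> 1024) on top of pool5 (256 x 6 x 6 for 227 x 227 input).
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, feat_dim),
        )

    def embed(self, x):
        return self.fc(self.features(x))        # 1024-d feature f(x)

    def forward(self, query, tracked, negative):
        return self.embed(query), self.embed(tracked), self.embed(negative)
```

Because all three branches call the same module, the "shared weights" in Figure 4 are automatic: a triplet is simply three forward passes through one network, which is also how the batch-level trick in Section 5.6 works.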

5.2. Ranking Loss Function

Given the set of patch pairs S sampled from the videos, we propose to learn an image similarity model in the form of a ConvNet. Specifically, given an image X as input to the network, we obtain its feature in the final layer as f(X). We then define the distance between two image patches X_1 and X_2 based on the cosine distance in the feature space:

D(X_1, X_2) = 1 - \frac{f(X_1) \cdot f(X_2)}{\|f(X_1)\| \, \|f(X_2)\|}.    (1)

We want to train the ConvNet to obtain a feature representation f(·) such that the distance between the query patch and the tracked patch is small, while the distance between the query patch and other random patches is large. Formally, given the patch set S, where X_i is the original query patch (the first patch in the track), X_i^+ is the tracked patch, and X_i^- is a random patch, we want to enforce D(X_i, X_i^-) > D(X_i, X_i^+). Given a triplet of image patches (X_i, X_i^+, X_i^-) as input, where (X_i, X_i^+) is a tracked pair and X_i^- is obtained from a different video, the loss of our ranking model is defined by the hinge loss:

L(X_i, X_i^+, X_i^-) = \max\{0, D(X_i, X_i^+) - D(X_i, X_i^-) + M\},    (2)

where M is the gap parameter between the two distances; we set M = 0.5 in our experiments. Our objective function for training can then be written as

\min_W \; \frac{\lambda}{2} \|W\|_2^2 + \sum_{i=1}^{N} \max\{0, D(X_i, X_i^+) - D(X_i, X_i^-) + M\},    (3)

where W denotes the parameter weights of the network, i.e., the parameters of f(·), N is the number of triplets, and λ is the weight decay constant, which is set to λ = 0.0005.
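The distance of Eq. (1) and the hinge loss of Eq. (2) can be written compactly as below. This is a sketch in PyTorch, not the authors' implementation; variable names are our own, and the λ term of Eq. (3) is assumed to be handled by the optimizer's weight decay.

```python
import torch
import torch.nn.functional as F

def cosine_distance(a, b):
    # Eq. (1): D(X1, X2) = 1 - cos(f(X1), f(X2)), computed row-wise over a batch.
    return 1.0 - F.cosine_similarity(a, b, dim=1)

def ranking_loss(f_query, f_tracked, f_negative, margin=0.5):
    # Eq. (2): hinge on D(query, tracked) - D(query, negative) + M, with M = 0.5.
    d_pos = cosine_distance(f_query, f_tracked)
    d_neg = cosine_distance(f_query, f_negative)
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()
```

For the full objective of Eq. (3), the term (λ/2)‖W‖² (λ = 0.0005) would typically be supplied through the optimizer, e.g. torch.optim.SGD(net.parameters(), lr=1e-3, momentum=0.9, weight_decay=5e-4).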

5.3. Hard Negative Mining for Triplet Sampling

One non-trivial part of learning to rank is selecting negative samples. Given a pair of similar patches (X_i, X_i^+), how can we select the patch X_i^-, a negative match for X_i, from the large pool of patches? We first select negative patches randomly, and then find hard examples (in a process analogous to hard negative mining).

Random Selection: During learning, we perform mini-batch Stochastic Gradient Descent (SGD). For each pair (X_i, X_i^+), we randomly sample K negative matches in the same batch B, giving K triplets. For every triplet, we calculate the gradients with respect to its three patches and back-propagate. Note that we shuffle all images randomly after each epoch of training, so each pair (X_i, X_i^+) sees different negative matches each time.

Hard Negative Mining: While one can continue to sample random patches to create triplets, it is more efficient to search for negative patches smartly. After 10 epochs of training with randomly selected negatives, we make the problem harder in order to obtain more robust feature representations. Analogous to the hard-negative mining procedure in SVMs, where gradient descent is performed only on hard negatives (not all possible negatives), we search for the negative patches for which the loss is maximum and use them to compute and back-propagate gradients. Specifically, the sampling of negative matches is the same as in random selection, except that we now select according to the loss (Eq. 2). For each pair (X_i, X_i^+), we calculate the loss for all other negative matches in batch B and select the top K with the highest losses. We apply the loss on these K negative matches as our final loss and calculate the gradients over them. Notice that since the feature of each sample has already been computed during the forward pass, we only need to evaluate the loss over these features, so the extra computation for hard negative mining is very small. For the experiments in this paper, we use K = 4.
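A sketch of the in-batch hard negative selection described above: since the embeddings of the whole batch are already available after the forward pass, picking the top-K highest-loss negatives is just a matter of scoring candidate pairs. The batching details here are our own simplification, not the authors' code.

```python
import torch
import torch.nn.functional as F

def hard_negatives(f_query, f_tracked, f_all, video_ids, qid, K=4, margin=0.5):
    """Pick the K in-batch negatives with the highest ranking loss (Eq. 2).

    f_query, f_tracked: (D,) embeddings of one tracked pair.
    f_all: (B, D) embeddings of every patch in the batch.
    video_ids: (B,) source-video id of each patch; qid: the pair's own video id."""
    d_pos = 1.0 - F.cosine_similarity(f_query, f_tracked, dim=0)           # scalar
    d_neg = 1.0 - F.cosine_similarity(f_query.unsqueeze(0), f_all, dim=1)  # (B,)
    losses = torch.clamp(d_pos - d_neg + margin, min=0.0)
    losses = losses.masked_fill(video_ids == qid, -1.0)   # never pick same-video patches
    topk = torch.topk(losses, K).indices                  # indices of the K hardest negatives
    return topk, losses[topk].clamp(min=0.0).mean()       # loss averaged over the K negatives
```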

5.4. Adapting for Supervised Tasks

Given the ConvNet learned from unsupervised data, we want to transfer the learned representation to tasks with supervised data. In our experiments, we apply our model to two tasks: object detection and surface normal estimation. In both tasks, we take the base network from our Siamese-triplet network (based on the AlexNet architecture) and adjust the fully connected layers and outputs accordingly. We introduce two ways to fine-tune and transfer the information obtained from unsupervised data to supervised learning.

One straightforward approach is to directly apply our ranking model as a pre-trained network for the target task. More specifically, we use the parameters of the convolutional layers in the base network of our triplet architecture as initialization for the target task, and initialize the fully connected layers randomly. This method of transferring feature representations is very similar to the approach used in RCNN [12]; however, RCNN uses a network pre-trained on ImageNet classification data. In our case, the unsupervised ranking task is quite different from object detection and surface normal estimation, so we need to adapt the learning rate in the fine-tuning procedure introduced in RCNN. We start with a learning rate of 0.01 instead of 0.001 and use the same learning rate for the convolutional and fully connected layers. This setting is crucial: we want the pre-trained features to serve as an initialization for supervised learning while still adapting the features to the new task.

In this paper, we explore one more way to transfer/fine-tune the network. Specifically, we note that there might be more juice left in the millions of unsupervised training instances that could not be captured in the initial learning stage. Therefore, we use an iterative fine-tuning scheme, sketched below. Given the initial unsupervised network, we first fine-tune it on the PASCAL VOC data. Given this fine-tuned network, we use it to re-adapt to the ranking triplet task, again transferring only the convolutional parameters. Finally, the re-adapted network is fine-tuned on the VOC data again, yielding a better model. We show in the experiments that this circular approach improves performance, and we notice that the network converges after two iterations of this procedure.
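The iterative fine-tuning scheme can be summarized with the following pseudocode-style sketch. The helpers finetune_on_voc, train_ranking, and copy_conv_layers are hypothetical placeholders for the standard fine-tuning and triplet-training procedures, not functions defined in the paper.

```python
def iterative_finetune(unsup_net, voc_data, video_triplets, rounds=2):
    """Alternate between supervised fine-tuning and re-adapting to the ranking task."""
    net = finetune_on_voc(unsup_net, voc_data)                 # round 0: plain transfer
    for _ in range(rounds):
        ranking_net = copy_conv_layers(net)                    # keep conv weights, reset FC layers
        ranking_net = train_ranking(ranking_net, video_triplets)  # re-adapt to the triplet task
        net = finetune_on_voc(copy_conv_layers(ranking_net), voc_data)
    return net
```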

Figure 5. Top regions for pool5 neurons of the base network trained in an unsupervised manner. The receptive field of a pool5 neuron is 195 × 195 pixels. Red boxes mark the receptive regions that give a neuron high responses; each of the 5 rows shows the top-response regions for one neuron.

5.5. Model Ensemble

We have proposed an approach to learn ConvNets from unlabeled videos. However, there is essentially no limit to generating training instances and pairs of tracked patches (YouTube hosts billions of videos). This opens up the possibility of training multiple ConvNets using different sets of data. Once we have trained these ConvNets, we append the fc7 features from each of them to train the final SVM. Note that the ImageNet-trained models also get an initial boost from adding more networks (see Table 1).
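A sketch of the ensembling step under the setup described above: fc7 features from each network are concatenated and a linear SVM is trained per class. The feature-extraction function is a placeholder, and the use of scikit-learn's LinearSVC is our own choice, not the paper's.

```python
import numpy as np
from sklearn.svm import LinearSVC

def ensemble_features(nets, images, extract_fc7):
    # Concatenate the fc7 features from every network in the ensemble.
    return np.concatenate([extract_fc7(net, images) for net in nets], axis=1)

def train_class_svm(nets, images, labels, extract_fc7):
    feats = ensemble_features(nets, images, extract_fc7)   # (num_images, 4096 * len(nets))
    clf = LinearSVC()                                       # one-vs-rest SVM for a single class
    clf.fit(feats, labels)                                  # labels: +1 / -1 for that class
    return clf
```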

5.6. Implementation Details

We use mini-batch SGD for training. Since the three networks share the same parameters, instead of feeding 3 samples through the triplet network each time, we perform the forward propagation for the whole batch with a single network and calculate the loss on the output features. Given a pair of patches (X_i, X_i^+), we randomly select another patch X_i^- ∈ B extracted from a different video than (X_i, X_i^+). Given their features f(X_i), f(X_i^+), f(X_i^-) from forward propagation, we compute the loss according to Eq. 2.

For learning, we download 100K videos from YouTube using the URLs provided by [27]. By applying our patch mining method to these videos, we obtain 8 million image patches. We train three different networks separately using 1.5M, 1.5M and 5M training instances, and report numbers based on these three networks. To train our Siamese-triplet networks, we set the batch size to |B| = 100 and start with a learning rate of 0.001. For the datasets with 1.5M and 5M patches, we first train the network with randomly selected negatives at this learning rate for 150K iterations, and then apply hard negative mining. When training on 1.5M patches, we reduce the learning rate by a factor of 10 every 80K iterations and train for 240K iterations; when training on 5M patches, we reduce the learning rate by a factor of 10 every 120K iterations and train for 350K iterations.

6. Experiments
We demonstrate the quality of our learned visual representation with qualitative and quantitative experiments. Qualitatively, we show the convolutional filters learned in layer 1 (see Figure 6). Our learned filters are similar to V1 filters, though not as strong; after fine-tuning on PASCAL VOC 2012, they become considerably sharper. We also show that the underlying representation is reasonable by visualizing what the neurons in the pool5 layer represent (see Figure 5). Red bounding boxes mark the receptive fields with top responses for five different neurons (one neuron per row). The clusters represented by these neurons are quite reasonable and correspond to semantic parts of objects; for example, the first neuron responds to animal heads and the second to potted plants. For quantitative evaluation, we transfer the feature representation learned in an unsupervised manner to tasks with labeled data, focusing on two challenging problems: object detection and surface normal estimation.


6.1. Object Detection
For object detection, we perform our experiments on the PASCAL VOC 2012 dataset [9]. We follow the detection pipeline introduced in RCNN [12], which takes a ConvNet pre-trained on another dataset, fine-tunes it on the VOC data, and then uses the fine-tuned ConvNet to extract features for training per-class SVMs. However, instead of using an ImageNet pre-trained network as initialization, we use our ConvNets trained in an unsupervised manner. Note that the network architecture is based on AlexNet. We fine-tune our network on the trainval set (11540 images) and train SVMs on the same images; evaluation is performed on the standard test set (10991 images). At the fine-tuning stage, we change the output layer to 21 classes and initialize the convolutional layers with our unsupervised pre-trained network. To fine-tune the network, we start with a learning rate of 0.01 and reduce it by a factor of 10 every 80K iterations; the network is fine-tuned for 200K iterations. For all experiments, no bounding box regression is performed.

We compare our method against a model trained from scratch as well as against the ImageNet pre-trained network. Notice that the VOC 2012 results reported in RCNN [12] are obtained by fine-tuning only on the train set, without using the val set; the mAP reported in [12] is 49.6%. For a fair comparison, we fine-tuned the ImageNet pre-trained network on the VOC 2012 trainval set. Moreover, since RCNN [12] reduces the learning rate with a step size of 20K and fine-tunes for 70K iterations, we also try enlarging the step size to 50K and fine-tuning for 200K iterations, and we report results for both settings.

Single Model. We show the results in Table 1. As a baseline, we train the network from scratch on the VOC 2012 dataset and obtain 44% mAP. Using our unsupervised network pre-trained with 1.5M pairs of patches and then fine-tuned on VOC 2012, we obtain 46.2% mAP (unsup + ft, external data = 1.5M). With more data, using 5M patches for pre-training and then fine-tuning, we achieve 47% mAP (unsup + ft, external data = 5M). These results indicate that our unsupervised network provides a significant boost over the scratch network, and that more unlabeled data gives better performance (a 3% gain over training from scratch).

Model Ensemble. Since looking at more external data in unsupervised pre-training boosts performance, we also combine models pre-trained on different sets of unlabeled data. Ensembling two fine-tuned networks pre-trained with 1.5M and 5M patches gives a 3.5% boost over the single model, reaching 50.5% mAP (unsup + ft (2 ensemble)). Going one step further and ensembling all three networks, pre-trained with 1.5M, 1.5M and 5M patches respectively, gives another 1.5% boost, reaching 52% mAP (unsup + ft (3 ensemble)).

Baselines. We also compare our approach with RCNN [12], which uses ImageNet pre-trained models. Following the procedure in [12], we obtain 50.1% mAP (RCNN 70K) by setting the step size to 20K and fine-tuning for 70K iterations. To generate a model ensemble, three ConvNets are first trained separately on the ImageNet dataset and then fine-tuned on the VOC 2012 dataset. Ensembling two of these networks gives 53.6% mAP (RCNN 70K (2 ensemble)), and ensembling three gives a further 0.8% improvement, reaching 54.4% mAP (RCNN 70K (3 ensemble)). For fairness, we also fine-tuned the ImageNet pre-trained model with a larger step size (50K) and more iterations (200K), obtaining 52.3% mAP (RCNN 200K (big stepsize)). Note that while the ImageNet networks show diminishing returns from ensembling, since their training data remains similar, every network in our ensemble looks at a different set of data, so we get large performance boosts.

Exploring a better way to transfer the learned representation. Given our fine-tuned model pre-trained with 5M patches (unsup + ft, external = 5M), we use it to re-learn and re-adapt to the unsupervised triplet task, and then fine-tune the resulting network on VOC 2012 again. We repeat this iterative procedure twice in our experiments and find that it converges very quickly. The final result for this single model is 48% mAP (unsup + iterative ft), which is 1% better than the initial fine-tuned network.

Table 1. Mean Average Precision (mAP) on VOC 2012. The second column, "external", gives the number of patches used to pre-train the model in an unsupervised manner.

VOC 2012 test | external | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv | mAP
scratch | 0 | 66.1 | 58.1 | 32.7 | 23.0 | 21.8 | 54.5 | 56.4 | 50.8 | 21.6 | 42.2 | 31.8 | 49.2 | 49.8 | 61.6 | 52.1 | 25.1 | 52.6 | 31.3 | 50.0 | 49.1 | 44.0
unsup + ft | 1.5M | 68.8 | 62.1 | 34.7 | 25.3 | 26.6 | 57.7 | 59.6 | 56.3 | 22.0 | 42.6 | 33.8 | 52.3 | 50.3 | 65.6 | 53.9 | 25.8 | 51.5 | 32.3 | 51.7 | 51.8 | 46.2
unsup + ft | 5M | 69.0 | 64.0 | 37.1 | 23.6 | 24.6 | 58.7 | 58.9 | 59.6 | 22.3 | 46.0 | 35.1 | 53.3 | 53.7 | 66.9 | 54.1 | 25.4 | 52.9 | 31.2 | 51.9 | 51.8 | 47.0
unsup + ft (2 ensemble) | 6.5M | 72.4 | 66.2 | 41.3 | 26.4 | 26.8 | 61.0 | 61.9 | 63.1 | 25.3 | 51.0 | 38.7 | 58.1 | 58.3 | 70.0 | 56.2 | 28.6 | 56.1 | 38.5 | 55.9 | 54.3 | 50.5
unsup + ft (3 ensemble) | 8M | 73.5 | 67.8 | 43.5 | 28.9 | 27.9 | 62.3 | 62.6 | 64.9 | 27.3 | 51.5 | 41.6 | 59.1 | 60.0 | 71.8 | 58.3 | 29.7 | 56.1 | 39.1 | 58.6 | 55.6 | 52.0
unsup + iterative ft | 5M | 67.7 | 64.0 | 41.3 | 25.3 | 27.3 | 58.8 | 60.3 | 60.2 | 24.3 | 46.7 | 34.4 | 53.6 | 53.8 | 68.2 | 55.7 | 26.4 | 51.1 | 34.3 | 53.4 | 52.3 | 48.0
RCNN 70K | – | 72.7 | 62.9 | 49.3 | 31.1 | 25.9 | 56.2 | 53.0 | 70.0 | 23.3 | 49.0 | 38.0 | 69.5 | 60.1 | 68.2 | 46.4 | 17.5 | 57.2 | 46.2 | 50.8 | 54.1 | 50.1
RCNN 70K (2 ensemble) | – | 75.3 | 68.3 | 53.1 | 35.2 | 27.7 | 59.6 | 54.7 | 73.4 | 26.5 | 53.0 | 42.2 | 73.1 | 66.1 | 71.0 | 48.5 | 21.7 | 59.2 | 50.8 | 55.2 | 58.0 | 53.6
RCNN 70K (3 ensemble) | – | 74.6 | 68.7 | 54.9 | 35.7 | 29.4 | 61.0 | 54.4 | 74.0 | 28.4 | 53.6 | 43.0 | 74.0 | 66.1 | 72.8 | 50.3 | 20.5 | 60.0 | 51.2 | 57.9 | 58.0 | 54.4
RCNN 200K (big stepsize) | – | 73.3 | 67.1 | 46.3 | 31.7 | 30.6 | 59.4 | 61.0 | 67.9 | 27.3 | 53.1 | 39.1 | 64.1 | 60.5 | 70.9 | 57.2 | 26.1 | 59.0 | 40.1 | 56.2 | 54.9 | 52.3

Figure 6. Conv1 filter visualization. (a) Filters of the first convolutional layer of the Siamese-triplet network trained in an unsupervised manner. (b) After fine-tuning the unsupervised pre-trained network on PASCAL VOC 2012, the filters become sharper.

6.2. Surface Normal Estimation
To illustrate that our unsupervised representation generalizes to different tasks, we adapt the unsupervised ConvNet to the task of surface normal estimation from a single RGB image, i.e., estimating the orientation of each pixel. We perform our experiments on the NYUv2 dataset [37], which includes 795 images for training and 654 images for testing. Each image has corresponding depth information, which is used to generate ground-truth surface normals. For evaluation and for generating the ground truth, we adopt the protocol introduced in [10], which is used by several methods [10, 22, 11] on this task.

To apply deep learning to this task, we follow the same form of outputs and loss function as the coarse network of [48]. Specifically, we first learn a codebook by performing k-means on surface normals, generating 20 codewords. Each codeword represents one class, which turns the problem into a 20-class classification for each pixel (a sketch of this codebook construction is given below). Given a 227 × 227 image as input, our network predicts surface normals for the whole scene. The output of our network is 20 × 20 pixels, each represented by a distribution over the 20 codewords, so the output dimension is 20 × 20 × 20 = 8000. The network architecture for this task is also based on AlexNet. To relieve over-fitting, we stack only two fully connected layers, with 4096 and 8000 neurons, on the pool5 layer. During training, we initialize the network with the unsupervised pre-trained network. We use the same learning rate of 1.0 × 10^-6 as in [48] and fine-tune the network for 10K iterations, given the small amount of training data. Note that unlike [48], we do not use any frames from the videos in the NYU dataset for training.

For comparison, we also train networks from scratch as well as from ImageNet pre-training. We show our results in Table 2. Compared to recent results that do not use external data [10, 22, 11], we obtain reasonable results even with this small amount of training data. Our approach (unsup + ft) is generally 2-3% better than the network trained from scratch across the 5 metrics. We show a few qualitative results in Figure 7.
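A sketch of the codebook construction and label assignment described above, using scikit-learn's KMeans; the normalization and reshaping details are our assumptions, not taken from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_normal_codebook(normals, k=20):
    """Cluster unit surface normals into k codewords (the 20 classes)."""
    n = normals.reshape(-1, 3)
    n = n / (np.linalg.norm(n, axis=1, keepdims=True) + 1e-8)   # keep normals on the unit sphere
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(n)

def normals_to_labels(codebook, normals_20x20):
    """Turn a 20x20 grid of ground-truth normals into per-cell class labels (0..19)."""
    flat = normals_20x20.reshape(-1, 3)
    return codebook.predict(flat).reshape(20, 20)   # training target for the 8000-dim output
```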

Table 2. Results on NYU v2 for per-pixel surface normal estimation, evaluated over valid pixels. Mean and Median angular error (lower is better); 11.25°, 22.5° and 30° are the percentages of pixels with error below each threshold (higher is better).

Method | Mean | Median | 11.25° | 22.5° | 30°
scratch | 38.6 | 26.5 | 33.1 | 46.8 | 52.5
unsup + ft | 35.2 | 23.2 | 34.9 | 49.4 | 55.8
ImageNet + ft | 33.3 | 20.8 | 36.7 | 51.7 | 58.1
UNFOLD [11] | 35.1 | 19.2 | 37.6 | 53.3 | 58.9
Discr. [22] | 32.5 | 22.4 | 27.4 | 50.2 | 60.2
3DP (MW) [10] | 36.0 | 20.5 | 35.9 | 52.0 | 57.8

Figure 7. Surface normal estimation results on the NYU dataset. For visualization, green represents horizontal surfaces, blue represents surfaces facing right, and red represents surfaces facing left, i.e., blue → X, green → Y, red → Z.
Acknowledgement: This work was partially supported by ONR MURI N000141010934, NSF IIS 1320083, and gifts from Google and Yahoo!. AG was partially supported by Bosch Young Faculty Fellowship. The authors would like to thank Yahoo! and Nvidia for the compute cluster and GPU donations respectively.

References
[1] H. Bay, T. Tuytelaars, and L. V. Gool. SURF: Speeded up robust features. In ECCV, 2006.
[2] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. TPAMI, 35(8):1798–1828, 2013.
[3] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In NIPS, 2007.
[4] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In CVPR, 2005.
[5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[6] C. Doersch, A. Gupta, and A. A. Efros. Mid-level visual element discovery as discriminative mode seeking. In NIPS, 2013.
[7] C. Doersch, A. Gupta, and A. A. Efros. Context as supervisory signal: Discovering objects with predictable context. In ECCV, 2014.
[8] S. M. A. Eslami, N. Heess, and J. Winn. The shape Boltzmann machine: A strong model of object shape. In CVPR, 2012.
[9] M. Everingham, L. V. Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 88(2):303–338, 2010.
[10] D. F. Fouhey, A. Gupta, and M. Hebert. Data-driven 3D primitives for single image understanding. In ICCV, 2013.
[11] D. F. Fouhey, A. Gupta, and M. Hebert. Unfolding an indoor origami world. In ECCV, 2014.
[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[13] Y. Gong, Y. Jia, T. K. Leung, A. Toshev, and S. Ioffe. Deep convolutional ranking for multilabel image annotation. CoRR, abs/1312.4894, 2013.
[14] R. Goroshin, J. Bruna, J. Tompson, D. Eigen, and Y. LeCun. Unsupervised learning of spatiotemporally coherent metrics. CoRR, abs/1412.6056, 2015.
[15] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
[16] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. TPAMI, 2015.
[17] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal. The "wake-sleep" algorithm for unsupervised neural networks. Science, 268(5214):1158–1161, 1995.
[18] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[19] E. Hoffer and N. Ailon. Deep metric learning using triplet network. CoRR, abs/1412.6622, 2015.
[20] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. CoRR, abs/1408.5093, 2014.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[22] L. Ladický, B. Zeisl, and M. Pollefeys. Discriminatively trained dense surface normal estimation. In ECCV, 2014.
[23] Q. V. Le, M. A. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, and A. Y. Ng. Building high-level features using large scale unsupervised learning. In ICML, 2012.
[24] Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In CVPR, 2011.
[25] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. In NIPS, 1990.
[26] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.
[27] X. Liang, S. Liu, Y. Wei, L. Liu, L. Lin, and S. Yan. Computational baby learning. CoRR, abs/1411.2861, 2014.
[28] S. Liu, X. Liang, L. Liu, X. Shen, J. Yang, C. Xu, X. Cao, and S. Yan. Matching-CNN meets KNN: Quasi-parametric human parsing. In CVPR, 2015.
[29] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
[30] P. Luo, X. Wang, and X. Tang. Hierarchical face parsing via deep learning. In CVPR, 2012.
[31] H. Mobahi, R. Collobert, and J. Weston. Deep learning from temporal coherence in video. In ICML, 2009.
[32] B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.
[33] M. A. Ranzato, F. J. Huang, Y.-L. Boureau, and Y. LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR, 2007.
[34] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 2015.
[35] B. C. Russell, A. A. Efros, J. Sivic, W. T. Freeman, and A. Zisserman. Using multiple segmentations to discover objects and their extent in image collections. In CVPR, 2006.
[36] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun. Pedestrian detection with unsupervised multi-stage feature learning. In CVPR, 2013.
[37] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
[38] S. Singh, A. Gupta, and A. A. Efros. Unsupervised discovery of mid-level discriminative patches. In ECCV, 2012.
[39] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering objects and their location in images. In ICCV, 2005.
[40] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. CoRR, abs/1502.04681, 2015.
[41] N. Srivastava and R. R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. In NIPS, 2012.
[42] E. B. Sudderth, A. Torralba, W. T. Freeman, and A. S. Willsky. Describing visual scenes using transformed Dirichlet processes. In NIPS, 2005.
[43] Y. Tang, R. Salakhutdinov, and G. Hinton. Robust Boltzmann machines for recognition and denoising. In CVPR, 2012.
[44] G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler. Convolutional learning of spatio-temporal features. In ECCV, 2010.
[45] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
[46] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
[47] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning fine-grained image similarity with deep ranking. In CVPR, 2014.
[48] X. Wang, D. F. Fouhey, and A. Gupta. Designing deep networks for surface normal estimation. In CVPR, 2015.
[49] M. Weber, M. Welling, and P. Perona. Unsupervised learning of models for recognition. In ECCV, 2000.
[50] W. Y. Zou, S. Zhu, A. Y. Ng, and K. Yu. Deep learning of invariant features via simulated fixations in video. In NIPS, 2012.

