Pattern Recognition 44 (2011) 1777–1784
Sparse regularization for semi-supervised classification

Mingyu Fan^a, Nannan Gu^b, Hong Qiao^b, Bo Zhang^a,*

^a LSEC and Institute of Applied Mathematics, AMSS, Chinese Academy of Sciences, Beijing 100190, China
^b Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China

Article history: received 21 June 2010; received in revised form 29 September 2010; accepted 15 February 2011; available online 22 February 2011.

Keywords: Regularization theory; Semi-supervised learning; Regularized least square classification; Dimensionality reduction

Abstract
Manifold regularization (MR) is a promising regularization framework for semi-supervised learning. It introduces an additional penalty term that regularizes the smoothness of functions on data manifolds and has proved very effective at exploiting the underlying geometric structure of data for classification. It has been shown that the performance of MR algorithms depends strongly on the design of this additional penalty term. In this paper, we propose a new approach that defines the penalty term on manifolds through sparse representations of the data instead of adjacency graphs. The construction of this penalty term has two steps. First, the best sparse linear reconstruction coefficients for each data point are computed by l1-norm minimization. Second, the learner is subject to a cost function that aims to preserve these sparse coefficients; this cost function is used as the new penalty term for regularization algorithms. Compared with previous semi-supervised learning algorithms, the new penalty term requires fewer input parameters and has strong discriminative power for classification. The least square classifier using this penalty term, called the Sparse Regularized Least Square Classification (S-RLSC) algorithm, is proposed in this paper. Experiments on real-world data sets show that our algorithm is very effective. © 2011 Elsevier Ltd. All rights reserved.
1. Introduction

Regularization theory was originally introduced to solve ill-posed inverse problems [18]. In the past decades, regularization has shown great power and has been applied in many areas of machine learning, such as regression, clustering, classification and model selection [10]. Many state-of-the-art machine learning algorithms, including Support Vector Machines (SVMs) [19], Regularized Neural Networks (RNNs) [14] and the Regularized Least Square Classifier (RLSC) [17], can be derived from the regularization framework. Recently, Belkin et al. [4] proposed a general Manifold Regularization (MR) framework for the full range of learning problems, from unsupervised and semi-supervised to supervised. The framework was developed in the setting of Reproducing Kernel Hilbert Spaces (RKHS), and a new Representer theorem was obtained in this setting for the regularization framework. In contrast to traditional regularization theory, which concentrates on the complexity of functions in the function space, the MR framework supplements the traditional regularization with an additional penalty term, based on the assumption that the data lie on an intrinsic low-dimensional manifold. This additional penalty term measures the smoothness of functions on data manifolds and will be referred to as the (penalty) term on manifolds for short. Such a term can
improve the performance of the obtained learner by exploiting the intrinsic structure of the data. The MR algorithms, including the Laplacian Regularized Least Square Classification (LapRLSC) and Laplacian SVM (LapSVM) methods [4], have proved especially useful and efficient in semi-supervised learning problems, where both labeled and unlabeled examples are available for learning. Many semi-supervised learning methods can be unified in the MR framework. The Discriminatively Regularized Least Square Classification (DRLSC) method builds the penalty term on manifolds by integrating both discriminative and geometrical information in each local region [23]. Although proposed as a supervised learning method, it can be applied to semi-supervised classification problems. The MR framework can also unify many graph-based semi-supervised learning algorithms that ignore the complexity of functions and retain only the penalty term on manifolds. Zhu et al. proposed a semi-supervised learning method called the Gaussian fields and harmonic functions (GFHF) method, based on a Gaussian random field model [24]. Wang and Zhang proposed a semi-supervised learning algorithm using local linear reconstruction coefficients, which is similar to the GFHF method [20]. Despite the success of these semi-supervised classification methods, some issues have not yet been properly addressed. In particular: (1) Neighbor selection. Many graph-based methods, including the MR framework, define the adjacency graphs using a fixed neighborhood size for all data points. However, a fixed
neighborhood size causes difficulty in parameter selection and cannot adapt to uneven data. (2) Manifold assumption. Many graph-based methods, including the MR framework, assume that high-dimensional data are distributed on a low-dimensional manifold. However, for many types of data we lack convincing evidence for a manifold structure. (3) Explicit classifier for new points. Some graph-based methods have no explicit multi-class classifier for novel examples, which limits their application in on-line decision-making tasks.

To address the above issues, we propose the sparse regularization (SR) approach for semi-supervised learning. A novel penalty term is defined using the sparse representation [22] of the data. With this penalty term, the approach can derive classifiers in the MR framework. Therefore, the proposed SR approach not only inherits the advantages of fewer parameters and high discriminative ability from the sparse representation, but also has a natural out-of-sample extension for novel examples, inherited from the MR framework. Experiments on real-world data sets demonstrate the effectiveness and the high discriminative ability of our approach.

The rest of this paper is organized as follows. Previous work is reviewed in Section 2. The proposed SR approach and the derived Sparse Regularized Least Square Classification (S-RLSC) algorithm are presented in Section 3. Experiments on benchmark real-world data sets are reported in Section 4. Finally, we conclude the paper in Section 5.

2. Previous works

In a general semi-supervised classification problem, the training data set is represented as {(x_i, z_i), x_{l+j} : i = 1, ..., l, j = 1, ..., u}, where l is the number of labeled data points, u is the number of unlabeled data points, x_i \in R^N is a data point and z_i \in \{-1, 1\} is the class label of x_i.

2.1. Regularization on explicit functions

Belkin et al. [4] proposed the MR framework based on the theory of RKHS.
Assuming that f is a real-valued function in the RKHS H_K, the MR framework can be expressed in the form

f^* = \arg\min_{f \in H_K} \frac{1}{l} \sum_{i=1}^{l} V(x_i, z_i, f) + \gamma_A \|f\|_K^2 + \gamma_I \|f\|_I^2,   (1)

where V is some loss function, \|f\|_K^2 is the norm of the function in H_K, which controls the complexity of the classifier, and \|f\|_I^2 is the penalty term that regularizes the smoothness of the function on manifolds. If

\|f\|_I^2 = \frac{1}{(l+u)^2} \sum_{i,j=1}^{l+u} (f(x_i) - f(x_j))^2 w_{ij},   (2)

where w_{ij} are edge weights in the data adjacency graph, then it follows from the Representer Theorem [4, Theorem 2] that the solution of the optimization problem (1) admits the representation

f^*(x) = \sum_{i=1}^{l+u} a_i k(x_i, x)   (3)

in terms of the labeled and unlabeled samples, where k(\cdot, \cdot) is some Mercer kernel function associated with the RKHS H_K. For different choices of the loss function V and of \|f\|_I^2, different MR algorithms can be derived from the MR framework (1). For example, if V is defined to be the square loss function V(x, z, f) = (z - f(x))^2 and \|f\|_I^2 is defined by (2), then the Laplacian Regularized Least Square Classifier (LapRLSC) is obtained; if V is chosen as the hinge loss function V(x_i, z_i, f) = \max(0, 1 - z_i f(x_i)) and \|f\|_I^2 is again given by (2), then the Laplacian Regularized Support Vector Machines (LapSVMs) are obtained (see [4]). By using the square loss function as V and by making the best use of the underlying discriminative and geometrical information of the data manifold to define the penalty term \|f\|_I^2, a new MR algorithm, the DRLSC algorithm, was obtained in [23] from the MR framework (1). Although the DRLSC algorithm was proposed as a supervised learning algorithm, it is similar to the LapRLSC algorithm and can be used as a semi-supervised learning algorithm.

2.2. Regularization on implicit functions

If the parameter \gamma_A is set to zero, then the second term of the MR framework (1), which controls the complexity of the classifier, vanishes. As a result, the feasible function f in (1) is no longer restricted to the RKHS H_K. In fact, f can be any function; in particular, it can be an unknown or implicit function satisfying z_i = f(x_i) for i = 1, ..., l, so that the error part \frac{1}{l} \sum_{i=1}^{l} V(x_i, z_i, f) vanishes. The MR framework then retains only the penalty term \|f\|_I^2 on manifolds, where f is an unknown or implicit function satisfying z_i = f(x_i) for i = 1, ..., l. For the unlabeled data points x_{l+i} (i = 1, ..., u), define implicitly z_{l+i} = f(x_{l+i}), which are unknown and are regarded as the labels of the unlabeled data points x_{l+i} (or as values of f at x_{l+i}). If we further define \|f\|_I^2 = \frac{1}{2} \sum_{i,j=1}^{l+u} w_{ij} (z_i - z_j)^2, where w_{ij} are edge weights in the data adjacency graph as defined in Section 2.1, then the labels z_{l+i} of the unlabeled data points can be computed by minimizing \|f\|_I^2. Therefore, minimizing the penalty term on manifolds can also yield new semi-supervised learning algorithms.

Belkin and Niyogi proposed a manifold learning based classifier [3], which is built from the eigenvectors of the Laplacian matrix. Zhu et al. introduced the GFHF method based on a Gaussian random field model [24]. The GFHF method is defined on a weighted graph superimposed on the whole data set, comprising both labeled and unlabeled data points. The pairwise similarities between the data points are defined as

w_{ij} = \exp\left( -\sum_{k=1}^{N} \frac{(x_{ik} - x_{jk})^2}{\sigma_k^2} \right),

where x_{ik} is the k-th component of the data point x_i and \sigma_k is the length-scale hyper-parameter for the k-th component. Let W = (w_{ij}) be the (l+u) \times (l+u) similarity matrix, let D be the diagonal matrix of order l+u with D_{ii} = \sum_{j=1}^{l+u} w_{ij}, and let L = D - W. Then the GFHF method minimizes the quadratic function

E(Z) = \frac{1}{2} \sum_{i,j} w_{ij} (z_i - z_j)^2 = Z^T L Z,   (4)

where Z = (z_1, ..., z_l, z_{l+1}, ..., z_{l+u})^T with z_{l+i} = f(x_{l+i}), i = 1, ..., u. The similarity matrix W (and likewise the diagonal matrix D) can be split into four blocks:

W = \begin{pmatrix} W_{ll} & W_{lu} \\ W_{ul} & W_{uu} \end{pmatrix}.

Assume that Z = (Z_l^T, Z_u^T)^T, where Z_l = (z_1, ..., z_l)^T and Z_u = (z_{l+1}, ..., z_{l+u})^T. Suppose Z minimizes the function in (4). Then
Z_u can be computed as follows:

Z_u = (D_{uu} - W_{uu})^{-1} W_{ul} Z_l.
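The harmonic solution above can be sketched in a few lines of NumPy. This is our own minimal illustration, not code from the paper: the function name and the single shared length-scale (the paper allows one \sigma_k per feature) are our assumptions, and the labeled points are taken to come first.

```python
import numpy as np

def gfhf_propagate(X, Z_l, sigma=1.0):
    """Harmonic-function label propagation (GFHF): a minimal sketch.

    X     : (l+u, N) array of data points, the l labeled points first.
    Z_l   : (l,) array of labels in {-1, +1} for the labeled points.
    sigma : one shared length-scale for all features.
    Returns Z_u = (D_uu - W_uu)^{-1} W_ul Z_l for the unlabeled points.
    """
    l = Z_l.shape[0]
    # Pairwise similarities w_ij = exp(-||x_i - x_j||^2 / sigma^2).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-d2 / sigma ** 2)
    D = np.diag(W.sum(axis=1))
    # Harmonic solution restricted to the unlabeled block.
    return np.linalg.solve(D[l:, l:] - W[l:, l:], W[l:, :l] @ Z_l)
```

On two well-separated one-dimensional clusters, e.g. labeled points at 0 (label -1) and 10 (label +1) with unlabeled points at 0.5 and 9.5, the propagated values take the sign of the nearby labeled point.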
Similarly, Wang and Zhang [20] proposed to minimize the function in (4) with different pairwise similarities w_{ij}. They first compute the locally linear reconstruction coefficients by

M_i = \arg\min \left\| x_i - \sum_{j \in N_K(i)} M_{ij} x_j \right\|^2  s.t.  \sum_j M_{ij} = 1,  M_{ij} \geq 0,  M_{ij} = 0 if j \notin N_K(i),

where N_K(i) denotes the index set of the K nearest neighbors of x_i. This can be easily solved as a quadratic optimization problem. The pairwise similarities are then given as w_{ij} = M_{ij} + M_{ji}. The subsequent classification procedure of the method is the same as in the GFHF method.

3. Sparse regularization

Instead of minimizing the smoothness of functions on data manifolds, we propose to retain the strong discriminative ability of the sparse representation of the data by minimizing a cost function. In this section, we first present the sparse representation method used in this paper and then give the Sparse Regularization (SR) method and the Sparse Regularized Least Square Classification (S-RLSC) algorithm. To avoid confusion, the main notations used in this paper are listed in Table 1. Throughout this paper, all data points and the corresponding label vectors are column vectors denoted by lowercase letters. Sets are represented by capital calligraphic letters. Matrices are denoted by ordinary capital letters.

3.1. Sparse representation of data

Sparse representation [2] refers to a representation that accounts for most or all of the information of a signal as a linear combination of a small number of elementary signals called basis elements. Although originally proposed for compressing and encoding signals, it is naturally discriminative [21] and thus can be used to solve classification tasks. Given sufficiently many data points, any data point from the data set lies approximately in the linear span of the other data points, which can be written as

x_i = X \alpha_i,  i = 1, ..., l+u,

where X = (x_1, ..., x_l, ..., x_{l+u}) \in R^{N \times (l+u)} is the training data matrix and \alpha_i = (\alpha_{i1}, ..., \alpha_{i,i-1}, 0, \alpha_{i,i+1}, ..., \alpha_{i,l+u})^T \in R^{l+u} is the vector of reconstruction coefficients. The sparse representation of x_i can then be found via the l0-minimization problem

\alpha_i^* = \arg\min_{\alpha_i} \|\alpha_i\|_0  s.t.  x_i = X \alpha_i,   (5)

where \|\cdot\|_0 denotes the l0-norm, defined as the number of nonzero entries of a vector. The solution \alpha_i^*, the sparse representation of x_i, describes how to express x_i as a linear combination of the smallest number of training data points. It has been shown that the training data points of the same class as the tested one are preferred in the sparse sense. Therefore, using the training data points as basis elements, the sparse representation of a tested data point has strong discriminative ability. However, problem (5) has been proved to be NP-hard [1]. Fortunately, theoretical results show that if the solution \alpha_i^* is sparse enough, then the solutions of the l1- and l0-minimization problems are equivalent [12,6,7]. Therefore, the sparse representation problem can also be addressed by the following l1-optimization problem:

\alpha_i^* = \arg\min_{\alpha_i} \|\alpha_i\|_1  s.t.  x_i = X \alpha_i.

Taking into account the effect of noise, insufficient training points or outliers, the optimization problem can be formulated as

\alpha_i^* = \arg\min_{\alpha_i} \left( \|\alpha_i\|_1 + \lambda \|e_i\|_q^p \right)  s.t.  x_i = X \alpha_i + e_i,   (6)
where e_i = (e_{i1}, ..., e_{iN})^T \in R^N is the error vector, \|e_i\|_q^p = \left( \sum_{j=1}^{N} |e_{ij}|^{1/p} \right)^q, and p and q are positive integers.

Problem (6) is a general approach to constructing the sparse representation of data. If p = 1/2 and q = 2, the traditional lasso method is obtained, while if p = q = 1 and \lambda = 1, the l1-graph method of Wright et al. [22] is obtained. For convenience, in this paper we set p = q = 1 and \lambda = 1. This gives the l1-graph formulation:

\alpha_i^* = \arg\min_{\alpha_i} \left( \|\alpha_i\|_1 + \|e_i\|_1 \right)  s.t.  x_i = X \alpha_i + e_i.   (7)

Let \psi_i = (\alpha_i^T, e_i^T)^T and \hat{X} = (X, I), where I is the N \times N identity matrix. Then the l1-minimization problem (7) can be transformed into

\psi_i^* = \arg\min_{\psi_i} \|\psi_i\|_1  s.t.  x_i = \hat{X} \psi_i.   (8)

This minimization problem can be easily solved using convex programming [5,8].

Table 1
Notations.

R^N      : the input N-dimensional Euclidean space
X        : X = (x_1, ..., x_l, ..., x_{l+u}) \in R^{N \times (l+u)} is the training data matrix; {x_i}_{i=1}^{l} are labeled points and {x_i}_{i=l+1}^{l+u} are unlabeled points
C        : the number of classes that the samples belong to
Y        : Y = (y_1, ..., y_l, 0, ..., 0) \in R^{C \times (l+u)} is the 0-1 label matrix; y_i \in R^C is the label vector of x_i, with all components of y_i equal to 0 except one equal to 1
Z        : Z = (z_1, z_2, ..., z_l)^T \in R^l is the label vector with z_i the label of x_i, where z_i \in \{-1, 1\} if C = 2
F(.)     : F(x) = (f_1(x), ..., f_C(x))^T is the discriminative vector function; the index of the class that x belongs to is that of the component with the maximum value
g(.)     : the classifier function, with g(x) \in \{1, 2, ..., C\} the index of the class that x comes from
k(u, v)  : kernel function of the variables u and v
K        : kernel matrix K = (k(x_i, x_j)) \in R^{(l+u) \times (l+u)}
B        : B = (b_1, ..., b_{l+u}) \in R^{C \times (l+u)}; its columns are the coefficients of the kernel expansion of the discriminative function F(.)
\|.\|_0  : l0-norm; \|w\|_0 counts the number of nonzero entries of a vector w \in R^m
\|.\|_1  : l1-norm; \|w\|_1 = \sum_{i=1}^{m} |w_i| for w \in R^m
\|.\|_2  : l2-norm; \|w\|_2 = (\sum_{i=1}^{m} w_i^2)^{1/2} for w \in R^m

3.2. Sparse regularization and S-RLSC algorithm

In our method, y_i is a C-dimensional label vector with entries 0 or 1, where C is the number of classes. If x_i belongs to the k-th class, then the k-th component of y_i is 1 and the remaining components are 0. Our discriminative vector function F(.) is defined as

F(x) = (f_1(x), f_2(x), ..., f_C(x))^T,
where

f_s(x) = \sum_{i=1}^{l+u} b_{si} k(x_i, x),  s = 1, ..., C.

The class of x is determined as that of the component f_s(x), s = 1, 2, ..., C, taking the maximum value.

Many existing semi-supervised learning algorithms are based on the manifold assumption that the data points lie on an intrinsic low-dimensional manifold and that if two data points x_i, x_j are close on the manifold then their labels are similar, that is, the label function varies smoothly on the manifold. However, this assumption is very restrictive and brings an extra input parameter, the neighborhood size. On the other hand, the sparse reconstruction coefficients \alpha_{ij} of the data point x_i characterize how the remaining points contribute to the sparse representation of x_i and are therefore naturally discriminative. Thus, it is meaningful to require that the discriminative function on the data points preserve the sparse reconstruction coefficients in the square loss sense. Hence, the label of a data point is reconstructed from the labels of the other data points using the sparse reconstruction coefficients. Based on this, we define the penalty term, which measures the error of the discriminative function F in preserving the sparse reconstruction coefficients in the average sense, as

\|F\|_I^2 = \frac{1}{(l+u)^2} \sum_{i=1}^{l+u} \left\| F(x_i) - \sum_{j=1}^{l+u} \alpha_{ij} F(x_j) \right\|_2^2,   (9)

where \alpha_{ij} is the j-th element of \alpha_i.

For a given Mercer kernel function k(u, v), there is an associated reproducing kernel Hilbert space (RKHS) H_K of functions. It can be constructed by considering the space of finite linear combinations \sum_i \eta_i k(u_i, \cdot) of kernels and taking the completion with respect to the inner product given by \langle k(u, \cdot), k(v, \cdot) \rangle_K = k(u, v). Thus, for any functions f, g \in H_K with f = \sum_i \eta_i k(u_i, \cdot) and g = \sum_i \rho_i k(v_i, \cdot), the inner product of f and g and the square of the norm of f can be expressed, respectively, as

\langle f, g \rangle_K = \sum_{i,j} \eta_i \rho_j k(u_i, v_j),   \|f\|_K^2 = \sum_{i,j} \eta_i \eta_j k(u_i, u_j).

In regularization theory, \|f\|_K^2 can be used as a measure of the complexity of the function in the RKHS. For vector functions F(x) = (f_1(x), f_2(x), ..., f_C(x))^T and G(x) = (g_1(x), g_2(x), ..., g_C(x))^T, the inner product can be defined as

\langle F, G \rangle_K = \sum_{i=1}^{C} \langle f_i, g_i \rangle_K.

Then the regularization term of F, which measures the complexity of the classifier in the ambient space, can be expressed as

\|F\|_K^2 = \sum_{i=1}^{C} \|f_i\|_K^2.   (10)

Similarly to the MR framework (1), using (9) and (10) as the penalty term and the regularization term for the discriminative function F, respectively, we obtain the sparse regularization method

F^* = \arg\min_{f_s \in H_K, s = 1, ..., C} \frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, F) + \gamma_A \|F\|_K^2 + \gamma_I \|F\|_I^2,   (11)

where \gamma_A and \gamma_I are the given regularization parameters and V is some loss function. For convenience, here we set V(x, y, F) = \|y - F(x)\|_2^2. Then the S-RLSC method is given as follows:

F^* = \arg\min_{f_s \in H_K, s = 1, ..., C} \frac{1}{l} \sum_{i=1}^{l} \|y_i - F(x_i)\|_2^2 + \gamma_A \|F\|_K^2 + \gamma_I \|F\|_I^2.   (12)

By matrix computation, Eq. (9) can be expressed in the matrix form

\|F\|_I^2 = \frac{1}{(l+u)^2} \mathrm{tr}(H (I - A)^T (I - A) H^T),   (13)

where A = (\alpha_1, ..., \alpha_{l+u})^T \in R^{(l+u) \times (l+u)}, H = (F(x_1), ..., F(x_{l+u})) \in R^{C \times (l+u)}, and tr(M) denotes the trace of the matrix M, that is, the sum of its diagonal elements. On the other hand, the discriminative vector function can be written as

F^*(x) = \sum_{i=1}^{l+u} b_i k(x_i, x),   (14)

where b_i = (b_{1i}, ..., b_{Ci})^T. Let B = (b_1, ..., b_{l+u}) \in R^{C \times (l+u)} be the matrix of the coefficients b_i and let K = (k(x_i, x_j)) \in R^{(l+u) \times (l+u)} be the kernel matrix. Then we have

\|F\|_K^2 = \sum_{s=1}^{C} \|f_s\|_K^2 = \sum_{s=1}^{C} \tilde{b}_s^T K \tilde{b}_s = \mathrm{tr}(B K B^T),   (15)

where \tilde{b}_s^T denotes the s-th row of B. Assume that Y = (y_1, ..., y_l, 0, ..., 0) \in R^{C \times (l+u)} is the label matrix with entries 0 or 1, where 0 \in R^C is the zero vector, and let J \in R^{(l+u) \times (l+u)} be the diagonal matrix whose first l diagonal elements are 1 and whose remaining elements are 0. Substituting (14), (13) and (15) into (12) yields the convex optimization problem

B^* = \arg\min_{B \in R^{C \times (l+u)}} \frac{1}{l} \mathrm{tr}((Y - BKJ)(Y - BKJ)^T) + \gamma_A \mathrm{tr}(B K B^T) + \frac{\gamma_I}{(l+u)^2} \mathrm{tr}(B K (I - A)^T (I - A) K B^T).   (16)

The solution of the optimization problem (16) is given by

B^* = Y \left( KJ + \gamma_A l I + \frac{\gamma_I l}{(l+u)^2} K (I - A)^T (I - A) \right)^{-1},   (17)

where I is the identity matrix. Then the solution of problem (12) is

F^*(x) = (f_1^*(x), f_2^*(x), ..., f_C^*(x))^T = \sum_{i=1}^{l+u} b_i^* k(x_i, x)   (18)

with B^* = (b_1^*, ..., b_{l+u}^*). Thus, the S-RLSC classifier is obtained as

g^*(x) = i^* = \arg\max_{i = 1, 2, ..., C} f_i^*(x).

Based on the above, the corresponding S-RLSC algorithm can be summarized as follows.

Algorithm 1. The S-RLSC algorithm
Input: the data set {(x_i, y_i), x_{l+j} : i = 1, ..., l, j = 1, ..., u}, the regularization parameters \gamma_A, \gamma_I and the kernel parameter \sigma.
1: Normalize the columns of X to have unit l2-norm.
2: Compute the best sparse reconstruction coefficients for each point in {x_i}_{i=1}^{l+u} by solving the l1-norm minimization problem (8).
3: Compute the kernel matrix K = (k(x_i, x_j)).
4: Compute the coefficient matrix B^* from the kernel matrix K by Eq. (17).
5: Compute the discriminative function F^*(x) by (18).
Output: the final classifier g^*(x) = i^* = \arg\max_{i = 1, 2, ..., C} f_i^*(x).
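Step 2 reduces each problem (8) to a standard linear program by splitting \psi into nonnegative parts. The sketch below is our own illustration, not code from the paper: the function name and the use of SciPy's linprog are our choices, and the column being coded is zeroed out so that a point cannot trivially represent itself.

```python
import numpy as np
from scipy.optimize import linprog

def sparse_coefficients(X, i):
    """Sketch of problem (8): min ||psi||_1  s.t.  x_i = [X I] psi.

    X : (N, n) data matrix whose columns are normalized to unit l2-norm.
    Returns (alpha_i, e_i): reconstruction coefficients and error vector.
    """
    N, n = X.shape
    D = X.copy()
    D[:, i] = 0.0                       # forbid self-representation
    Xhat = np.hstack([D, np.eye(N)])    # augmented dictionary [X I]
    m = Xhat.shape[1]
    # Split psi = u - v with u, v >= 0; then min 1^T (u + v) is the l1-norm.
    res = linprog(c=np.ones(2 * m),
                  A_eq=np.hstack([Xhat, -Xhat]),
                  b_eq=X[:, i],
                  bounds=(0, None))
    psi = res.x[:m] - res.x[m:]
    return psi[:n], psi[n:]
```

For example, if two columns of X are identical, the minimizer reconstructs one exactly from the other with a single unit coefficient and zero error, which is the sparsity behaviour the penalty term (9) relies on.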
3.3. Contributions

The proposed SR approach and the derived S-RLSC algorithm have three desirable features: fewer parameters, high discriminative ability and a natural out-of-sample extension for new points, as follows:

(1) Adaptive relationships between data points. Based on sparse representation theory, the relationship between each data point and the remaining data points is obtained automatically by solving an l1-norm minimization problem, whereas traditional methods, which use a fixed neighborhood size (e.g. the k-nearest-neighbor or the \epsilon-nearest-neighbor rule) to construct the relationship, cannot adapt to uneven data.

(2) High discriminative ability. The l1-minimization problem finds a sparse representation for each data point, and the represented data point and the representing data points naturally belong to the same class. This property works well in high-dimensional spaces and depends on the number of classes rather than on the number of data points.

(3) A natural out-of-sample extension for new data points. Our algorithm not only inherits the high discriminative ability of the sparse representation of data but also has a natural out-of-sample extension for newly arriving data points. This property allows much faster classification than sparse-representation classification and gives better performance than the MR algorithms.
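To make the out-of-sample extension concrete, steps 4-5 and the classifier (18) can be sketched as follows. This is our own illustration under stated assumptions, not the authors' code: the function names are ours, and the toy coefficient matrix in the usage below is a hand-built stand-in for the sparse coefficients of step 2.

```python
import numpy as np

def s_rlsc_fit(K, A, Y, l, gamma_A, gamma_I):
    """Closed-form solution (17): a minimal sketch.

    K : (n, n) kernel matrix, n = l + u (labeled points first).
    A : (n, n) matrix whose i-th row is the sparse coefficient vector alpha_i.
    Y : (C, n) 0-1 label matrix with zero columns for unlabeled points.
    """
    n = K.shape[0]
    J = np.diag(np.r_[np.ones(l), np.zeros(n - l)])
    M = np.eye(n) - A
    G = K @ J + gamma_A * l * np.eye(n) + (gamma_I * l / n ** 2) * K @ M.T @ M
    return Y @ np.linalg.inv(G)          # B* of Eq. (17)

def s_rlsc_predict(B, K_new):
    """Classifier (18): K_new holds the kernel values k(x_i, x) for new points
    x, one column per point; returns the 0-based index of the winning class."""
    return np.argmax(B @ K_new, axis=0)
```

Because B^* is fixed after training, a genuinely new point x only requires one column of kernel evaluations k(x_i, x), which is the out-of-sample extension referred to in contribution (3).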
4. Experiments

To evaluate the performance of the S-RLSC algorithm, we perform experiments on four real-world data sets: the MIT CBCL data set [11], the Intelligent Traffic System (ITS) data set [9], the Extended Yale B data set [13,16] and the USPS handwritten digit data set [15]. Comparisons are made with four important classification methods: the LapRLSC method [4], the Gaussian fields and harmonic functions (GFHF) method [24], sparse representation-based classification (SRC) [21] and the 1-nearest-neighbor (1-NN) algorithm.

4.1. Data set description

The MIT CBCL database contains 2429 face images and 4548 non-face images. Each image has 19 x 19 pixels and is transformed into a 361-dimensional vector. The database contains two classes of data points, face and non-face. In the experiments we used a subset of this database consisting of 1000 face and 1000 non-face images. Fig. 1 shows some face and non-face images from this data set.

The ITS data set contains 2000 images of two classes: 1000 images of humans walking or running and 1000 images of a common road scene without human activity. Each data point is a cropped 32 x 16 gray-scale image and is transformed into a 512-dimensional vector. Samples of both the human walking images and the non-human images are presented in Fig. 2.

The USPS database contains 8-bit gray-scale images of the digit classes '0'-'9', with 1100 data points per class. Each data point is a 16 x 16 image of a handwritten digit and is transformed into a 256-dimensional vector. For each class, we randomly selected 250 data points for our experiments, so our USPS data set contains 2500 data points. Some sample data points of class '1' from this data set are shown in Fig. 3.

The Extended Yale B face database contains 2114 frontal-face images of 38 individuals. Each data point is a cropped 64 x 64 gray-scale face image captured under various lighting conditions. There are about 60 images per class, and each image is stacked into a 1024-dimensional vector. We use the whole database as one of our experimental data sets. Fig. 4 presents some sample face images of two individuals from the data set.

4.2. Parameter selection and experimental settings

The LapRLSC and GFHF methods need the neighborhood size k as a key parameter. Experiments show that, for each data set, the results are stable over a wide range of neighborhood sizes. In our experiments, the neighborhood size k is set to 7 for all four data sets.

Fig. 1. Some image samples from the CBCL data set. The first two rows show face images and the last two rows show non-face images.
Fig. 2. Some image samples from the ITS data set. The first two rows show human walking images and the last two rows show non-human images.
Fig. 3. Some image samples from the USPS handwritten data set of class ‘1’.
Fig. 4. Some face image samples of two individuals from the Extended Yale B data set.
The LapRLSC and S-RLSC methods have regularization parameters \gamma_A and \gamma_I. Let C_A = \gamma_A l and C_I = \gamma_I l / (l+u)^2, and let the kernel function be k(x, y) = \exp(-\|x - y\|^2 / \sigma^2). In the experiments it was found that both algorithms perform well over a wide range of regularization parameters. In our experiments we set C_A = 0.005, C_I = 0.01 and \sigma = 0.5 for all four data sets. The SRC method needs only one parameter, the error tolerance \epsilon. In [21] this value was set to 0.05 throughout all experiments, so we also used this value for \epsilon in our experiments.

For each data set \mathcal{X}, we first randomly permute the data points. Then, in each class of \mathcal{X}, 15% of the data points are set aside for the out-of-sample extension experiment. Denote by \mathcal{X}_1 the remaining data points of \mathcal{X}. In each class of \mathcal{X}_1, we randomly label m data points to train the algorithms. For the SRC and 1-NN classifiers, the training set comprises only the labeled data points from \mathcal{X}_1. For the GFHF, LapRLSC and S-RLSC classifiers, the training set consists of the whole \mathcal{X}_1, including both the labeled and the unlabeled data points. After the classifiers are obtained, classification is performed first on the unlabeled data points in \mathcal{X}_1 and then on the out-of-sample extension data points. It should be noted that in all the experimental results the classification accuracy is lower for all algorithms when the number of labeled training samples is small (e.g. m = 5, 10), which is natural given the small number of training samples.
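The settings above can be written down directly. A small sketch (the function names are our own) recovering \gamma_A and \gamma_I from C_A and C_I and building the Gaussian kernel matrix:

```python
import numpy as np

SIGMA = 0.5                 # kernel width used for all four data sets
C_A, C_I = 0.005, 0.01      # reparameterized regularization constants

def regularization_params(l, u, c_a=C_A, c_i=C_I):
    """Invert C_A = gamma_A * l and C_I = gamma_I * l / (l+u)^2."""
    return c_a / l, c_i * (l + u) ** 2 / l

def gaussian_kernel(X, sigma=SIGMA):
    """K_ij = exp(-||x_i - x_j||^2 / sigma^2) for data points in the rows of X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / sigma ** 2)
```

For instance, with l = 10 labeled and u = 90 unlabeled points, C_I = 0.01 corresponds to \gamma_I = 0.01 * 100^2 / 10 = 10, which shows how strongly the fixed C_I weights the sparse penalty as the pool of unlabeled data grows.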
Fig. 6. Classification results on the unlabeled data points in \mathcal{X}_1 for the ITS data set, where m is the number of labeled data points in each class.
4.3. Experimental results

For the CBCL and ITS data sets, which contain two classes, the number m of labeled data points (in each class) varies from 5 to 650. The recognition accuracies of the algorithms on the unlabeled data points of \mathcal{X}_1 are shown in Figs. 5 and 6, respectively. As can be seen from Figs. 5 and 6, the S-RLSC algorithm clearly outperforms the other four methods. In particular, for the CBCL data set, when m >= 500 the recognition rate of the S-RLSC algorithm is over 99%. Note that the LapRLSC algorithm also achieves better recognition results than the GFHF, SRC and 1-NN algorithms. This may be because the LapRLSC algorithm takes into account the smoothness of the classifier function on the manifold. It should also be noted that the recognition result of the SRC algorithm is not very satisfactory when m is relatively small. The reason may be that SRC uses only the labeled points for training, and its underlying sparse representation may become invalid with insufficient training samples.

The recognition results of the algorithms on the unlabeled data points of \mathcal{X}_1 are shown in Figs. 7 and 8, respectively, for the USPS and
Fig. 7. Classification results on the unlabeled data points in \mathcal{X}_1 for the USPS data set, where m is the number of labeled data points in each class.
Fig. 8. Classification results on the unlabeled data points in \mathcal{X}_1 for the Extended Yale B data set, where m is the number of labeled data points in each class.
Fig. 5. Classification results on the unlabeled data points in \mathcal{X}_1 for the CBCL data set, where m is the number of labeled data points in each class.
Extended Yale B data sets. Since the data points of these two data sets belong to more than two classes, the multi-class LapRLSC method is implemented on them, with the label of x_i changed to vector form. The number m of labeled data points (in each class) varies from 5 to 190 for the USPS data set and from 5 to 50
Table 2
Out-of-sample extension results on the CBCL data set.

Method        m=5      m=10     m=50     m=250    m=450    m=650
S-RLSC (%)    74.00    75.67    95.00    99.00    99.33    98.67
LapRLSC (%)   52.33    74.66    94.67    98.00    98.67    98.67
SRC (%)       44.33    58.33    67.33    93.33    95.33    96.00
1-NN (%)      65.67    70.67    88.67    94.67    96.00    96.67
GFHF (%)      86.33    81.00    92.00    94.67    94.33    94.67
Fig. 9. Some samples of the noisy Extended Yale B images, with noise variances 0, 0.05\sigma and 0.1\sigma.
Table 3
Out-of-sample extension results on the ITS data set.

Method        m=5      m=10     m=50     m=250    m=450    m=650
S-RLSC (%)    87.94    88.49    91.31    94.78    96.31    96.82
LapRLSC (%)   74.21    64.47    82.89    91.56    93.68    94.74
SRC (%)       57.89    53.16    65.26    72.37    87.11    90.26
1-NN (%)      80.62    80.52    78.42    88.68    90.26    91.84
GFHF (%)      74.47    77.63    78.94    87.11    88.68    91.05
Table 6. Classification results on the noised Yale B data set (%).

                 S-RLSC   LapRLSC   SRC     1-NN    GFHF
Var = 0
  m=5            72.21    55.75     44.08   39.29   40.69
  m=10           84.86    72.79     62.53   55.00   52.60
  m=25           95.45    93.21     84.04   73.70   72.70
Var = 0.05σ
  m=5            55.96    24.34     3.55    23.05   26.90
  m=10           56.97    40.12     2.18    34.55   39.88
  m=25           84.31    52.94     7.45    46.67   50.59
Var = 0.1σ
  m=5            12.49    6.45      4.88    9.76    11.91
  m=10           38.71    21.92     2.48    12.82   15.05
  m=25           58.89    58.73     2.23    16.63   20.01
Table 4. Out-of-sample extension results on the USPS data set (%).

Method     m=5      m=10     m=40     m=100    m=160
S-RLSC     72.63    86.58    95.00    98.16    98.68
LapRLSC    72.63    85.00    93.68    97.36    98.94
SRC        5.26     9.47     82.11    92.11    93.68
1-NN       61.84    73.16    87.11    91.01    94.74
GFHF       68.94    79.74    88.41    93.74    94.74
Table 5. Out-of-sample extension results on the Yale B data set (%).

Method     m=5      m=10     m=20     m=30     m=40     m=50
S-RLSC     65.41    84.18    94.10    96.24    98.12    98.12
LapRLSC    48.25    71.05    85.52    91.42    95.44    97.32
SRC        36.46    58.17    76.94    90.61    96.51    98.39
1-NN       33.78    54.69    68.90    77.21    79.89    83.38
GFHF       21.18    35.65    55.22    71.58    80.70    74.72

Table 7. Out-of-sample extension results on the noised Yale B data set (%).

                 S-RLSC   LapRLSC   SRC     1-NN    GFHF
Var = 0
  m=5            71.43    48.07     40.09   39.01   37.54
  m=10           85.09    62.54     57.81   52.24   44.73
  m=25           95.29    83.13     81.57   74.90   59.60
Var = 0.05σ
  m=5            52.44    35.48     3.23    22.25   27.63
  m=10           56.99    58.64     2.98    32.75   39.62
  m=25           85.36    84.28     8.35    46.48   56.41
Var = 0.1σ
  m=5            8.28     7.88      4.53    8.47    11.52
  m=10           29.81    12.36     1.09    12.48   15.63
  m=25           48.23    22.75     3.13    15.29   20.02
for the Extended Yale B data set. It can be seen from Figs. 7 and 8 that the S-RLSC algorithm achieves the best recognition results among all the algorithms, except on the USPS data set when m = 5 and 10 and on the Extended Yale B data set when m = 50. In addition, for relatively large m the results of the SRC algorithm are satisfactory on these two data sets. The out-of-sample extension results of the algorithms on the four data sets are shown in Tables 2-5 for several values of m. As can be seen from the tables, the discriminative ability of the S-RLSC algorithm is much better than that of the other algorithms.
4.4. Data sets with noise

To evaluate the performance of the S-RLSC algorithm on noisy data, the five algorithms are implemented on the Extended Yale-B data set corrupted by noise at three different levels. At each level, each data point is corrupted by adding i.i.d. zero-mean uniform noise to its pixels. The variance of the added noise is 0, 0.05σ and 0.1σ, respectively, where σ = 2630 is the mean of the pairwise Euclidean
distances of the data set (before normalization). Three sample images of a data point, corrupted by noise at the three levels described above, are presented in Fig. 9. In these experiments, we use 50 percent of the whole data set as the training set and the remaining 50 percent as the test set. For each test, we label m = 5, 10 and 25 data points in the training set to train the algorithms. The parameter settings are the same as in the previous experiments. As seen from Tables 6 and 7, the performance of the five algorithms deteriorates rapidly as the noise level increases. The S-RLSC algorithm has the best performance among the five algorithms. It should be noted that the SRC algorithm has the worst performance on the noised Yale-B data set. There may be two reasons for this. First, the noise corrupts all the pixels of each image and is therefore not sparse; this differs from [21], where the noise is sparse and corrupts only a portion of randomly chosen pixels in each image. Secondly, the SRC algorithm is a supervised classification method, which does not use the structural information of unlabeled data points to improve its classification performance.
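The corruption procedure described above can be sketched as follows. This is our own illustrative reconstruction rather than the authors' code; it relies on the fact that a uniform distribution on [-a, a] has variance a^2/3, so taking a = sqrt(3·Var) yields noise of the requested variance (function and variable names are ours).

```python
import numpy as np

def add_uniform_noise(X, var, seed=None):
    """Corrupt every pixel with i.i.d. zero-mean uniform noise of variance `var`.

    U(-a, a) has variance a^2 / 3, so a = sqrt(3 * var) gives the
    requested variance.
    """
    rng = np.random.default_rng(seed)
    a = np.sqrt(3.0 * var)
    return X + rng.uniform(-a, a, size=X.shape)

# sigma plays the role of the paper's scale parameter: the mean
# pairwise Euclidean distance of the (unnormalized) data set.
X = np.random.rand(8, 16)                       # 8 toy "images"
diff = X[:, None, :] - X[None, :, :]
sigma = np.mean(np.sqrt((diff ** 2).sum(-1)))   # mean pairwise distance
X_noisy = add_uniform_noise(X, 0.05 * sigma, seed=0)
```

The noise level is thus controlled by a single variance parameter, matching the three settings Var = 0, 0.05σ and 0.1σ used in Tables 6 and 7.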
In summary, the above experimental results show the advantage of the proposed S-RLSC algorithm over the other four algorithms.
5. Conclusion

In this paper, we have proposed a novel semi-supervised classification algorithm called the S-RLSC algorithm. The S-RLSC algorithm assumes that the discriminative function, which contains the label information, preserves the sparse representation coefficients of the data points. The S-RLSC algorithm has been compared with four important classification algorithms, namely the LapRLSC, GFHF, SRC and 1-NN algorithms, through experiments on four real-world data sets, and the results have shown that the S-RLSC algorithm achieves better recognition results and more stable performance with respect to the parameters than the other four algorithms.
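To make the two-step construction concrete (sparse reconstruction coefficients by l1-minimization, then regularized least squares with a coefficient-preserving penalty), the following is a minimal linear-kernel sketch under our own simplifications: scikit-learn's Lasso stands in for the paper's exact l1-minimization solver, and a LapRLSC-style closed form is used with the graph Laplacian replaced by the sparsity-preserving matrix M = (I - S)^T (I - S). All names, regularization weights and the Lasso penalty are illustrative choices, not the authors' settings.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_coefficients(X, alpha=0.01):
    # Step 1: reconstruct each point from all the others under an l1
    # penalty (a Lasso stand-in for exact l1-minimization).
    n = X.shape[0]
    S = np.zeros((n, n))
    for i in range(n):
        idx = np.delete(np.arange(n), i)
        reg = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
        reg.fit(X[idx].T, X[i])          # columns = the other points
        S[i, idx] = reg.coef_
    return S

def srlsc_fit(X, y_lab, n_lab, gamma_A=1e-3, gamma_I=1e-2, alpha=0.01):
    # Step 2: regularized least squares with the sparsity-preserving
    # penalty f^T M f, M = (I - S)^T (I - S).  The first n_lab rows of
    # X are the labeled points, with labels y_lab in {-1, +1}.
    n = X.shape[0]
    S = sparse_coefficients(X, alpha)
    M = (np.eye(n) - S).T @ (np.eye(n) - S)
    J = np.diag([1.0] * n_lab + [0.0] * (n - n_lab))
    y = np.concatenate([y_lab, np.zeros(n - n_lab)])
    K = X @ X.T                          # linear kernel
    A = J @ K + gamma_A * n_lab * np.eye(n) + gamma_I * n_lab * M @ K
    coef = np.linalg.solve(A, y)
    return lambda X_new: np.sign(X_new @ X.T @ coef)
```

On a toy two-cluster problem this propagates the few given labels to the unlabeled points through the sparse-reconstruction penalty, in the same spirit as the experiments above.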
Acknowledgments

This work was partly supported by the NNSF of China Grant no. 90820007, the Outstanding Youth Fund of the NNSF of China Grant no. 60725310, the 863 Program of China Grant no. 2007AA04Z228 and the 973 Program of China Grant no. 2007CB311002. The authors thank the referees for their invaluable comments and suggestions, which helped improve the paper greatly.

References
[1] E. Amaldi, V. Kann, On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems, Theor. Comput. Sci. 209 (1998) 237-260.
[2] A. Barron, J. Rissanen, B. Yu, The minimum description length principle in coding and modeling, IEEE Trans. Inf. Theory 44 (1998) 2743-2760.
[3] M. Belkin, P. Niyogi, Semi-supervised learning on Riemannian manifolds, Mach. Learn. 56 (1-3) (2004) 209-239.
[4] M. Belkin, V. Sindhwani, P. Niyogi, Manifold regularization: a geometric framework for learning from labeled and unlabeled examples, J. Mach. Learn. Res. 7 (2006) 2399-2434.
[5] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, Cambridge, UK, 2004.
[6] E. Candes, J. Romberg, T. Tao, Stable signal recovery from incomplete and inaccurate measurements, Comm. Pure Appl. Math. 59 (8) (2006) 1207-1223.
[7] E. Candes, T. Tao, Near-optimal signal recovery from random projections: universal encoding strategies? IEEE Trans. Inf. Theory 52 (12) (2006) 5406-5425.
[8] E. Candes, J. Romberg, l1-magic: recovery of sparse signals via convex programming, 2005, http://www.acm.caltech.edu/l1magic/downloads/l1magic.pdf.
[9] X. Cao, H. Qiao, J. Keane, A low-cost pedestrian-detection system with a single optical camera, IEEE Trans. Intelligent Transport. Syst. 9 (1) (2008) 58-67.
[10] Z. Chen, S. Haykin, On different facets of regularization theory, Neural Comput. 14 (12) (2002) 2791-2846.
[11] CBCL Face Database 1, MIT Center for Biological and Computational Learning, http://www.ai.mit.edu/projects/cbcl.
[12] D. Donoho, For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution, Comm. Pure Appl. Math. 59 (6) (2006) 797-829.
[13] A.S. Georghiades, P.N. Belhumeur, D.J. Kriegman, From few to many: illumination cone models for face recognition under variable lighting and pose, IEEE Trans. Pattern Anal. Mach. Intell. 23 (6) (2001) 643-660.
[14] F. Girosi, M. Jones, T. Poggio, Regularization theory and neural networks architectures, Neural Comput. 7 (2) (1995) 219-269.
[15] J.J. Hull, A database for handwritten text recognition research, IEEE Trans. Pattern Anal. Mach. Intell. 16 (5) (1994) 550-554.
[16] K.C. Lee, J. Ho, D. Kriegman, Acquiring linear subspaces for face recognition under variable lighting, IEEE Trans. Pattern Anal. Mach. Intell. 27 (5) (2005) 684-698.
[17] T. Poggio, S. Smale, The mathematics of learning: dealing with data, Not. Am. Math. Soc. 50 (5) (2003) 537-544.
[18] A.N. Tikhonov, Regularization of incorrectly posed problems, Sov. Math. Dokl. 4 (1963) 1624-1627.
[19] V.N. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[20] F. Wang, C. Zhang, Robust self-tuning semi-supervised learning, Neurocomputing 70 (16-18) (2007) 2931-2939.
[21] J. Wright, A. Yang, A. Ganesh, S.S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2009) 210-227.
[22] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. Huang, S. Yan, Sparse representation for computer vision and pattern recognition, Proceedings of the IEEE 98 (6) (2010) 1031-1044.
[23] H. Xue, S. Chen, Q. Yang, Discriminatively regularized least-squares classification, Pattern Recognition 42 (1) (2009) 93-104.
[24] X. Zhu, Z. Ghahramani, J. Lafferty, Semi-supervised learning using Gaussian fields and harmonic functions, in: The 20th International Conference on Machine Learning (ICML), 2003, pp. 912-919.
Mingyu Fan received the B.Sc. degree from the Central University for Nationalities, Beijing, China, in 2006. He is currently working toward the Ph.D. degree in applied mathematics at the Institute of Applied Mathematics, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China. His current research interests include the theory and application of manifold learning and nonlinear dimensionality reduction.
Nan-Nan Gu received the B.Sc. degree in information and computing science from Xi'an Jiaotong University, Xi'an, China, in 2006, and the M.Sc. degree in applied mathematics from Xi'an Jiaotong University, Xi'an, China, in 2009. Currently, she is working toward the Ph.D. degree in pattern recognition at the Institute of Automation, Chinese Academy of Sciences, Beijing, China. Her research interests include the theory and application of manifold learning, nonlinear dimensionality reduction and classification.
Hong Qiao (SM'06) received the B.E. degree in hydraulics and control and the M.E. degree in robotics from Xi'an Jiaotong University, Xi'an, China, the M.Phil. degree in robotics control from the Industrial Control Center, University of Strathclyde, Glasgow, U.K., and the Ph.D. degree in robotics and artificial intelligence from De Montfort University, Leicester, U.K., in 1995. She was a University Research Fellow with De Montfort University from 1995 to 1997. She was a Research Assistant Professor from 1997 to 2000 and an Assistant Professor from 2000 to 2002 with the Department of Manufacturing Engineering and Engineering Management, City University of Hong Kong, Kowloon, Hong Kong. Since January 2002, she has been a Lecturer with the School of Informatics, University of Manchester, Manchester, U.K. Currently, she is also a Professor with the Laboratory of Complex Systems and Intelligent Science, Institute of Automation, Chinese Academy of Sciences, Beijing, China. She first proposed the concept of "the attractive region in strategy investigation," which she has successfully applied in robot assembly, robot grasping, and part recognition. The work was reported in Advanced Manufacturing Alert (Wiley, 1999). Her current research interests include information-based strategy investigation, robotics and intelligent agents, animation, machine learning (neural networks and support vector machines), and pattern recognition. Dr. Qiao was a member of the Program Committee of the IEEE International Conference on Robotics and Automation from 2001 to 2004. She is currently an Associate Editor of the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS-PART B and the IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING.
Bo Zhang received the B.Sc. degree in mathematics from Shandong University, Jinan, China, the M.Sc. degree in mathematics from Xi’an Jiaotong University, Xi’an, China, and the Ph.D. degree in applied mathematics from the University of Strathclyde, Glasgow, U.K., in 1983, 1985, and 1992, respectively. From 1985 to 1988, he was a Lecturer with the Department of Mathematics, Xi’an Jiaotong University, China. He was a Postdoctoral Research Fellow with the Department of Mathematics, Keele University, Keele, U.K. from January 1992 to December 1994, and with the Department of Mathematical Sciences, Brunel University, Uxbridge, U.K. from January 1995 to October 1997. In November 1997, he joined the School of Mathematical and Informational Sciences, Coventry University, Coventry, U.K., as a Senior Lecturer, where he was promoted to Reader in Applied Mathematics in September 2000 and to Professor of Applied Mathematics in September 2003. Currently, he is a Professor with the Institute of Applied Mathematics, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China. His current research interests include direct and inverse scattering problems, computational electromagnetics, partial differential equations, and machine learning.