Journal o Systems Engineering and Electronics, Val. 18, No. 1, 2007, pp. 142 147 f
Learning Bayesian networks using genetic algorithm*
Chen Fei, Wang Xiufeng & Rao Yimei
Coll. of Information Technical Science, Nankai Univ., Tianjin 300071, P. R. China (Received November 23,2005)
Abstract: A new method to evaluate the fitness of the Bayesian networks according to the observed data is provided. The main advantage of this criterion is that it is suitable for both the complete and incomplete cases while the others not. Moreover it facilitates the computation greatly. In order to reduce the search space, the notation of equivalent class proposed by David Chickering is adopted. Instead of using the method directly, the novel criterion, variable ordering, and equivalent class are combined,moreover the proposed mthod avoids some problems caused by the previous one. Later, the genetic algorithm which allows global convergence, lack in the most of the methods searching for Bayesian network is applied to search for a good model in this space. To speed up the convergence, the genetic algorithm is combined with the greedy algorithm. Finally, the simulation shows the validity of the proposed approach.
Keywords: Bayesian networks, Genetic algorithm, Structure learning, Equivalent class.
1. Introduction
Graphical model is a marriage between graph theory and probability theory. A Bayesian network is a graphical model or a representation of the probabilistic relations among a set of variables [71. Over the last two decades, the Bayesian network has become a popular representation for encoding uncertain knowledge in expert system. It has been applied in the fraud dete~tion'~], robot navigation"], data compression[51 and so many other areas. There are two parts for a Bayesian network model, the quality part and the quantity part. The quality part refers to the network structure while the quantity part deals with a conditioned probability given its parents. Thus, when learning a Bayesian network, we should pay attention to both of them. There are four kinds of learning issues which are listed in Table 1. The most complex one is when the structure is unknown and the cases observed is incomplete, which is what we are going to deal with. Most of the literature on learning with Bayesian networks is concerned with model selection. In these approaches, some criterion is chosen to measure the fitness of the network structure according to the data. Then a search algorithm is used to good network structures which receive high score of this criterion. The most common criterion used is log relative posterior probability log p ( D I Sh) where D denotes the data observed and Sh means the structure hypothesis. The computation of this criterion requires
the consideration of the parameters, which costs a lot of time. Much attention has been devoted to the study of Bayesian network. In Ref. [2], David Chickering presented the general formula of a search space for which the states of search correspond to the equivalent classes. Wray Buntine discussed different method of learning Bayesian networks from data. Connections were drawn between statistical, neural network and uncertain community, etc'14'. Friedman et al. extended ndive Bayes by allowing slightly less restricted structure while still optimizing likelih~od"~]. Chen R proposed a collective method for addressing the problem of learning the structure of Bayesian network from distributed data"51. Genetic algorithm is an evolutionary model based optimization algorithm[61. The main advantage of genetic algorithm is that it allows global convergence. Moreover it has the ability to find an optimum solution or a set of better ones given a reasonable time limit.
2. New computation
21 Preliminary knowledge .
One needs two things to specify a Bayesian network: the network structure and conditioned probabilities with respect to each node, also called the parameters of this network. Learning Bayesian networks refers to the issues of construction of both the networks structure and parameters.
* This project was supported by the National Natural Science Foundation of China (70572045).
Learning Bayesian networks using genetic algorithm The most common method to construct a Bayesian network is to elicit from domain experts. It works well for small networks, but when the networks grow larger it becomes helpless. Even sometimes, we have to confront with the case when no expert is available. For the above reason, the main research objective concerning the learning of Bayesian networks is how to construct Bayesian networks from data or we can say cases. A case refers to one observation of the variables in the Bayesian networks. If all the variables are observed, the case is complete, otherwise incomplete. According to whether the structure is known and whether the cases are complete, the learning of Bayesian networks from data can be categorized into four kinds, as shown in Table 1.
Table1 Four kinds of learning issues
Complete Incomplete Complete Incom lete Structure
143
(4)The parameters independence was introduced by Spiegelhalter and Lauritzen [91.
Theorem 1 log(p(D, sh))=log(p(Sh
c
))+C C
i=l j=1
n
4,
[log((<l)!)+~logT(N,,+l)logT(N, +q)]
k=l
, where
T() refers to the gamma function, q is a integer and
NGkdenotes number of cases where the value of the variable xi is equal to xl! when its parent is Pa/ Gamma function is as follows r ( i ) =1 T(x+ 1) = x T ( x )
.
Among this four kinds, the most complex one is when the structure is unknown and the cases observed is incomplete. In the following subsection, we will use D and S h to denote the cases and the network structure respectively.
"
Unknown
Using the chain rule, we can get from Eq. (1) log(p(D, S h ) )=
(2)
Unknown
log
[I, p ( S h ) p ( 8 ,I S h ) p ( DI ~,,Sh)1d8,
we Since p ( S h ) is independent of 8,, can derive the following equation from the previous one. log p ( D , S h )= (3) 1% [P(S"L P(@, I S h ) p ( DI q9S*)d8s1
2.2 New criterion
The criterion most often used for model selection is the log of relative posterior probability. Instead of employing the formula mentioned in [71, we give a new method to calculate it and propose a modified version of this criterion. It allows for the ability to deal with the situation when the cases are incomplete. Above all, we give some reasonable assumptions as follows: (1) Before we see any case, we are indifferent with the network structure and the parameter distribution given a network structure Sh . (2) If we see an incomplete case, we are indifferent with the distribution of the variables unobserved in this case in the remaining feasible space. Here the feasible space refers to the case space which does not conflict with the value of variables observed. (3) The cases are independent given the model.
Due to the case independence, we may rewrite the formula as log ( P ( D , S h ) ) =
where 6),k = p(xl! I Pu,!',6),Sh), Nijk refers to the number of cases where the value of variable xi is equal to xi" when its parent is Pa,!'. It is obvious how to compute N k when a l l the cases are complete. But G the world is not always so perfect. How to deal with the situation when the cases are not complete which we may confront most of the time? By assumption (2), we can derive the following algorithm to compute the Nijk. Algorithm 1 Initialization: for all ( i, j , k ) Ni,j,k =O, For each case: For each pair of ( i, j , k ): If the observed data does not conflict with
144
Chen Fei, Wang Xiufeng & Rao Yimei
c l o g T(Nvk+I)logT(N,+r;)]
k= 1
1 N yk. = N . . +1 N I .
Ilk
(14)
Here, I N I is the size of the feasible space according to the case. Using the parameter independence
n <
Due to the assumption (l), the prior distribution is indifferent to all network structures, and we can ignore this item when searching for a good model, now the new criteria can be given as
n
4,
p(6, ' S h ) = n n p c q j l * . . . , q j , )
i=l
j=l
C ( D , S h ) C C [ l o g ((5l)!)+ =
5
Combining Eqs.(4)and (5), we get log p ( D , S h )=log [ p ( S k ) .
log r ( N i j k +l)log r(Nu ) ] +q
k= I
3. The equivalent class
At the beginning of this section, we first introduce the definition of equivalent class defined in David Chickering ['I. Definition 1 Two dag (directed acyclic graph) g and g ' are equivalent if for every Bayesian network B = (g,@, ) , there exists a Bayesian network B' = (g',O,,) such that B and B' define the same probability distribution, and vice versa. Herckman showed that the search for the best Bayesian network model according to the cases is NPHard. So, we need a heuristic algorithm to search for a good model in solution space. This is why we recall for genetic algorithm. Moreover, Chickering '31 proved that most of the scoring metrics (including ours) return the same score of fitness according to the networks structure in the same equivalent class. Thus, if we found a representation of the equivalent class and a scoring metrics on this class, we can reduce the size of model space greatly. This is what we have done in this section.
The following formula is proved by Kevin Karplus[*'
where q is an integer regarding the value of qj. For the special case when Nuk = 0 for any (i, j , k )
(9)
Since I; is an integer, Eq.(9) is converted into 1 L7dOij=(q l)! Since
Moreover p(qjl,..,@,l,) a constant with respect is to qj ,we get
3.1 Representationof the equivalent class
Verma and Pearl "" gave the following characteristics of equivalent class: Theorem 2 Two dags are equivalent if and only if they have the same skeletons and the same vstructures. Here, the skeleton and vstructure is defined as follows. Definition 2 The skeleton of a dag results from this dag by removing the direction of its arcs. Definition 3 The vstructure refers to a triple (x, y , z ) while there is arcs from x to y and from z to y, but not arcs (remove) between x and z.
p<Ogl,***,qj, l)! 1= (<
Combining Eqs.(7) and (12), we get
(12)
log Ip(D,Sh)l= log [ p ( S h ) .
Then
Learning Bayesian networks using genetic algorithm Figure 1 shows an example of vstructure.
145

Fig1 An example of vstructure
There is a representation of equivalent classes described in Chickering "I which is called completed pdags (partial directed acyclic graph). Instead of using the pdags, we propose a new modified version of pdags which is called opdags (ordered partial directed acyclic graph). opdags is defined as follows. Definition 4 Given a pdag, we can construct the opdags in this way: we first give all the variables of this pdag an order, then we can treat each arc as follows: if the arc is directed, then we do nothing, if the arc is undirected between two nodes x and y (assuming x > y ), then we always treat this arc as x> y . The difference is that we use a ordering between variables. The reason for this will be explained later. By this definition, we can represent the equivalent classes as opdags. Definition 5 The search space of possible network structure which is described by dag is called bspace. Definition 6 The search space of possible network structure which is described by opdag is called espace.
The following shows how we can deal with the problem described above. In section 2, we get a scoring metric which is defined on a single network structure as others. Moreover, in subsection A, we get a representation of equivalent class which is called opdags. Now we can see if the metric can be applied in the opdags. The answer is yes but with a tiny modified version of algorithm 1: when we look for a variable's parents, we will only consider the variables whose order is smaller than this one. Since the transformation no longer exists, the first problem is solved. And using the metric directly defined on the representation of equivalent class, we eliminate the second problem.
4. Greedy genetic algorithm
For the reason mentioned above, we need a heuristic algorithm to search for good models. The reason why we consider genetic algorithm is that it allows global convergence and is easy to be implemented. But genetic algorithm also has its own drawback: i.e. the convergence speed is slow. In order to cope with this problem, we combine the genetic algorithm with the greedy operator.
4.1 Coding
The coding of chromosome is as follow:
P = ((4
1 9
(v)*(4
)9
(u2
1,. (41,(u, )>
9
here, di refers to the directed parents of variable i ,and
d, =(d,l,d,2,.~,d,,)is the mth directed parent . d,,
3.2 Scoring metrics
The method described in Chi~kering'~' computing for the scoring metric is as follows: First they use the pdags to represents the equivalent class, since the scoring metric is defined on the single network structure but not equivalent class, they should turn the pdags into the dags which represent single network structures and later turn it back. This may cause two problems: first, obviously the transformation between pdags and dags needs a lot of time. If we could apply the criterion directly on pdags, the transformation is not needed at all. Second, a local change of pdags may not correspond to the local change of the dags in this class. This is unfortunate since the metric can reflect each local change of the dags.
of variable i . u, is defined as the undirected parents of the ith variable. The form of u, is the same as d, . Due to the ordering, we can always ignore the (remove) dl and ul .?hereby, if there are n variables, there would be only n  1 pairs of d, and ur . So we may write the chromosome as p = ((d,),(u,),., (d"),(u,)) .
4.2 Fitness function
Here, we directly use the metric described in section 2 with some modification mentioned in section 3.
4.3 Operator
Select operator: roulette selection, which is the same in the classical genetic algorithm. In order to ensure global convergence,we use elitist selection.
146 Graph operator: the operations of this operator are described as follows: (1) For any undirected arcs, we can delete it which means that delete the variable from ui . (2) For any directed arcs, we can either delete it (means deleting it from di ) ( 3 ) For any pair of nodes that are not adjacent, we can either add a directed arc (means adding a variable in d , ) or an undirected arc (means adding a variable in
ui
Chen Fei, Wang Xiufeng & Rao Yimei
the elitist genetic algorithm, the monotone condition is also met. Now, according to theorem 3 , the greedy genetic algorithm is global convergence of probability 1. From the proof, we can get the answer of the problem which is mentioned in subsection C.
5. Simulation
The simulations are organized as follows: Firstly, we apply existing Bayesian networks to generate the cases using the Gibbs Sampling. The Gibbs sampling is a unbiased generator of cases in the sense that the probability a case is generated is equal to the probability of that case existing according to the networks. The number of cases generated is 500. Each variable has 5 possible values. The following simulation will show both the validity and efficiency of our approach. Figure2 gives the comparison of this new criterion using the new computation method and p ( D , S h ) mentioned in Ref.[7]. Figure 3 shows the efficiency of espace comparing the bspace. Moreover, we give the comparison of validity between espace and bspace. The measure we used here is the criterion we have proposed in section 2.
2000
E ,
1
Greedy operator: we make adjustments by graph operator until no adjustment can be made to improve the fitness of models. Here comes a problem: why shouldn’t we combine the greedy operator directly into each case of graph operator which seems more concise. The reason will be explained in subsection E. Mutation operator: this operation is defined in Ref.[ 6 ] . Crossover operator: we use single point crossover which is also defined in Ref.[6].
4.4 Algorithm
Algorithm 2 Step 1 Choose the parameters of genetic algorithm, such as the size of population, the stop criteria etc. Step 2 Coding and initializing the population. Step 3 Genetic operators’ operation. Step 4 Stop if the stop criterion is satisfied else return to step 3 .
I0001
ii
Node number noncquivalent .equivalent
;
I
4.5 Proof
Using the theorem mentioned in [ l l ] , we can give a proof of the global convergence for this algorithm. Theorem 3 If an evolutionary algorithm is ergodic and monotone, the algorithm will be global convergent of probability 1. Theorem 4 The greedy genetic algorithm is global convergence of probability 1. Proof Using the graph operator, we can delete any opdag into a opdag without any arc. Then by the graph operator again, we can construct any opdag from the opdag without any arcs. Then the condition of ergodicity is satisfied. Moreover, since we are using
Fig.? The computation of log(p(D,S h ) ) with and without new method
2mt
.
~
2
3
.
4 5
6
7 8
9
10
I
Learning Bayesian networks using genetic algorithm
and genetic algorithm with greedy operator. Table 2 described the convergence era of genetic algorithm using greedy operator and regular genetic algorithm.
Table 2 Speed of convergencefor genetic algorithm with greedy
operator and regular genetic algorithm Reeular Genetic algorithm genetic node number w t greedy operator ih
0
147
Conference on Uncertainty in Artijicial Intelligence. Montreal: Morgan Kaufmann Publishers, 1997: 8798. [4] Ezawa K J, Schuermann T. FraudlLTncollectible debt detection using a Bayesian network based learning system: a rare binary outcome with mixed data structures. Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence. Montreal: Morgan Kaufmann Publishers, 1997: 157166. [5] Frey B J. Graphical models for machine learning and digital communication. Cambridge, MA: MITpress, 1998. [6] Goldberg D E. Genetic algorithms in search, optimization, and machine learning. Boston: Addison Wesley,1998. [7] Heckerman D. A tutorial on learning with Bayesian networks. Microsofi Research, Tech Rep: MSRTR9506, 1995. [8] Karplus K. Regularizers for estimating distributions of amino acids from small samples. University of California , Tech Rep: UCSCCRL95 1I , 1995. [9] Lauritzen S, Spiegelhalter D. Local computations with probabilities on graphical structures and their application to expert systems. Royal Statistical Society B, 1988,50(2): 157224. 101 MacKay D J C . Introduction to monte carlo methods. In: Proceedings of NATO Advanced Study Institute on Learning in Graphical Models. MI Jordan :MIT Press, 1998: 175204. 1I] Schleuter M G. Asparagos 96 and the travelling salesman f problem. Proceedings o the Fourth International Conference on Evolutionary Computation. Cambridge, MA: The MIT Press, 1996: 25 1272. [12] Verma T, Pearl J. Equivalence and synthesis of causal models. Proceedings of Sixth Conference on Uncertainty in Artificial Intelligence. Boston, MA: Morgan Kaufmann, 1990: 220227. [13] Friedman N, Geiger D, Goldszmidt M. Bayesian network classifiers. Machine Learning, 29: 131163. [14] Buntine W. A guide to the literature for learning probabilistic network from data. IEEE Transaction on Knowledge and Data Engineer, 1996,8(2), 195208. [15] Chen R, Sivakumar K, Kargupta H. Learning bayesian network structure from distributed data. Proceedings of the 3rd SIAM fnternational Data Mining Conference, 2003:284288 .
1 2
1 1
1
3 3
10
3
1
3
3
10
4
5
19
21
6
I
28
50
50
8
9
30
50 50
100
10
100
The simulation above shows that the improvement we have made is valid.
6. Conclusions
In this paper, we propose a new method for computing the log ( P ( S , D ) ), which is used to evaluate the fitness of the network structure. Moreover, we use the notation of equivalent class which is proposed by David Chickering and make great improvement about it. At last, we introduce the greedy operator in the genetic algorithm which improve the convergence speed of our method to a considerable extend. The simulation shows the validity and efficiency of the proposed approach.
References
[l] Berler A, Shimony S E. Bayes networks for sonar sensor fusion. In: proceedings of the thirteenth conference on uncertainty in artificial intelligence. Rhode Island: Morgan Kaufmann Publishers,l997: 1421. 1 1 Chickering D M. Learning equivalent classes of Bayesian 2 networks structures. Proceedings of the Twelfh Conference on Uncertainty in Artificial Intelligence. Portland: Morgan Kaufmann Publishers, 1996: 150157. [3] Chickering D M. A transformational characterization of Bayesian network structures. In: Proceedings of Eleventh
Chen Fei was born in 1982. Now he is a doctor student of Nankai University, his research interests include Bayesian network and complex network. Email: lattice@china.cOrn.cn