On the E ect of Analog Noise in Discrete-Time Analog Computations
Wolfgang Maass Institute for Theoretical Computer Science Technische Universitat Graz May 28, 1997 Pekka Orponen Department of Mathematics University of Jyvaskylay
We introduce a model for analog computation with discrete time in the presence of analog noise that is exible enough to cover the most important concrete cases, such as noisy analog neural nets and networks of spiking neurons. This model subsumes the classical model for digital computation in the presence of noise. We show that the presence of arbitrarily small amounts of analog noise reduces the power of analog computational models to that of nite automata, and we also prove a new type of upper bound for the VC-dimension of computational models with analog noise.
Abstract
1 Introduction
Analog noise is a serious issue in practical analog computation. However there exists no formal model for reliable computations by noisy analog systems which allows this issue to be addressed in an adequate manner. We propose and investigate such model in this article. The investigation of noise-tolerant digital computations in the presence of stochastic failures of gates or wires was initiated by von Neumann, 1956]. We refer to Cowan, 1966], Pippenger, 1989] and Gal, 1991] for a small sample of the numerous results that have been achieved in this direction. In all these articles one considers computations which produce a correct output not with perfect reliability, but with probability 1 + (for some parameter 2 1 2 (0; 2 ]). The same framework (with stochastic failures of gates or wires) has been applied to analog neural nets in Siegelmann, 1994]. The abovementioned approaches are insu cient for the investigation of noise in analog computations, because in analog computations one has to be concerned not only with occasional total failures of gates or wires, but also with \imprecision", i.e. with omnipresent smaller (and occasionally larger) perturbations of analog outputs of internal computational units. These perturbations may for example be given by Gaussian distributions. Therefore we introduce and investigate in this article a notion of noise-robust computation by noisy
Klosterwiesgasse 32/2, A{8010 Graz, Austria. E-mail: maass@igi.tu-graz.ac.at. P. O. Box 35, FIN{40351 Jyvaskyla, Finland. E-mail: orponen@math.jyu. . Part of this work was done while this author was at the University of Helsinki, and during visits to the Technische Universitat Graz and the University of Chile in Santiago.
y
1
analog systems where we assume that the values of intermediate analog values are moved according to some quite arbitrary probability distribution. We consider { as in the traditional framework for noisy digital computations { arbitrary computations whose output is correct 1 1 with some given probability 2 + (for 2 (0; 2 ]) . We will restrict our attention to analog computations with digital output. Since we impose no restriction (such as continuity) on the type of operations that can be performed by computational units in an analog computational system, an output unit of such a system can convert an analog value into a binary output via \thresholding". We show in Theorem 3.1 of this article that any language recognized by such noisy analog computational system is regular. Our model and Theorem 3.1 are somewhat related to the analysis of probabilistic nite automata in Rabin, 1963], although in Rabin's case the niteness of the state space simpli es the setup considerably. Continuous-space noise models similar to ours have been used in general studies of the stability of dynamical systems a ected by random perturbations (e.g., Kifer, 1988]), but our work is to our knowledge the rst to consider the computational aspects of systems of this type. More speci c hardware-oriented models for analog noise in analog neural nets have been discussed in Phatak, Koren 1995]. Another related work is Casey, 1996], which addresses the special case of analog computations on recurrent neural nets, where the analog noise can move an internal state at most over some bounded distance , and the digital output is required to be perfectly reliable (i.e. = 1=2 in the present notation). Corollary 3.1 of Casey, 1996] states a special case of our Theorem 3.1 for the model considered in that article. The proof of Corollary 3.1 of Casey, 1996] is incorrect.1 A correct proof is contained as a special case in the proof of Theorem 3.1 in Section 3 of this article.2 Apart from Corollary 3.1 there is no further overlap between Casey, 1996] and this article. There are relatively few examples of nontrivial computations on common digital or analog computational models that can achieve perfect reliability of the output in spite of noisy internal components. Most constructions of noise-robust computational models rely on the replication of noisy computational units (see von Neumann, 1956], Cowan, 1966]). The idea of this method is that the average of the outputs of k identical noisy networks (with stochastically independent noise processes) is more reliable than the output of a single network. However there exists in general a small but nonzero probability that this average deviates strongly from its expected value. In addition, if one assumes that the computational unit that produces the output of the computation is also noisy, one cannot expect the reliability of the output of the computation to be larger than the reliability of this last computational unit. Consequently there exist many methods for reducing the error-probability of the output to a small value, but these methods cannot achieve error probability 0 at the output. In addition, if one wants to investigate computations with common noise distributions such as Gaussian noise { which may in principle move a state to any other state { it is
1 Corollary 3.1 is derived as a corollary of Theorem 3.1 in Casey, 1996], whose proof relies on the assumption that the recognized language is regular. The proof given for Corollary 3.1 is the following: \The proof of a corollary is simply to notice that by the compactness of the phase space it can contain only a nite number of disjoint sets with nonempty interior". The following counterexample shows that this argument is wrong: The intervals 1=(2i + 1); 1=2i] for i = 1; 2; ::: are in nitely many disjoint sets with nonempty interior, which are all contained in the compact set 0; 1]. 2 Actually, since there is no need to analyze probability distributions for this special case, one can prove Corollary 3.1. of Casey, 1996] more directly by considering the equivalence relation de ned at the beginning of Section 3, and by deriving a lower bound for the volume of the set of states that correspond to an equivalence class.
2
necessary to move to a computational model with less than perfect reliability of the output bit, since otherwise the model would not be able to carry out any nontrivial computations. Therefore we focus our attention in this article on the general case where the reliability of the network output is just required to be 1=2 + for some 2 (0; 1 ]. 2 Unfortunately an investigation of computations with less than perfect reliability requires a more complex mathematical analysis. In a computational model with perfect reliability of the output it cannot happen that an intermediate state q occurs at some step t both in a computation for an input x that leads to output \0" , and at step t in a computation for the same input x that leads to output \1" . Hence an analysis of perfectly reliable computations can focus on partitions of intermediate states q according to the computations and the computation steps where they may occur. In contrast to that, in a computational model with less than perfect reliability of the output bit the same internal state q may occur at an intermediate step in computations paths that yield di erent output bits. Hence for such a model one has to analyze probability distributions over intermediate states q . Consider for example the special case of a sigmoidal neural net (with thresholding at the output), where for each input the output of an internal noisy sigmoidal gate is distributed according to some Gaussian distribution (perhaps restricted to the range of all possible output values which this sigmoidal gate can actually produce). In this case an intermediate state q of the computational system is a vector of values which have been produced by these Gaussian distributions for di erent sigmoidal gates. Obviously each such intermediate state q can occur at any xed step t in any computation (in particular in computations with di erent network output for the same network input). Hence perfect reliability of the network output is unattainable in this case. For an investigation of the actual computational power of a sigmoidal neural net with Gaussian noise one has to drop the requirement of perfect reliability of the output, and instead analyze how probable it is that a particular network output is given, and how probable it is that a certain intermediate state is assumed. Hence one has to analyze for each network input and each step t the di erent probability distributions over intermediate states q that are induced by computations of the noisy analog computational system. In fact, one may view the set of these probability distributions over intermediate states q as a generalized set of \states" of a noisy analog computational system. In general the mathematical structure of this generalized set of \states" is substantially more complex than that of the original set of intermediate states q . In Section 2 below, we de ne a rigorous mathematical model for this type of noisy analog computations and introduce some basic methods for analyzing this generalized set of \states". The preceding remarks may illustrate that if one drops the assumption of perfect reliability then a more complex variety of computations becomes possible, and the computational power of a system may potentially increase. In fact, in theoretical computer science a substantial number of constructions rely on the premise that the computational power of a digital computational system does in fact increase if it gets access to random bits and less than perfect reliability of the output bit is tolerated. This is relevant for the discussions of this article, since internal noise of a noisy computational system may also be viewed as something positive: as a \free" source of random numbers, which may actually be helpful for certain computations. In Section 3 we prove an upper bound for the computational power of noisy analog computational systems that limits the potential impact of such e ects in analog computation. We show that, under mild constraints on the noise characteristics, noisy analog systems 3
with bounded nite-dimensional state spaces have at most the computational power of nite automata. This upper bound is quite general, and it also covers practically relevant special cases such as systems with dependencies among di erent sources of stochasticity, as well as noisy computations in hybrid analog/digital computational models, such as a neural net combined with a binary register, or a network of noisy spiking neurons where a neuron may temporarily assume the discrete state \not- ring". One goal of our investigation of the e ects of analog noise is to nd out which features of the noise process have the most detrimental e ect on the computational power of an analog computational system. This turns out to be a nontrivial question. For example, one might think that analog noise which is likely to move an internal state over a large distance is more harmful than another type of analog noise which keeps an internal state within its neighborhood. However this intuition is deceptive. Consider the extreme case of analog noise in a sigmoidal neural net which moves a gate output x 2 ?1; 1] to a value in some "-neighborhood of ?x, and compare it with noise which moves x to an arbitrary value in the 10"-neighborhood of x. The rst type of noise moves some values x over large distances, but is likely to be less harmful for noise-robust computing than the second type, as the large jump from x to ?x represents just a recoding of the output value. As a rst step towards characterizing those aspects and parameters of analog noise that have a strong impact on the computational power of a noisy analog system, the proof of Theorem 3.1 below provides an explicit bound on the number of states of any nite automaton that can be implemented by an analog computational system with a given type of analog noise. It is quite surprising to see on which speci c parameters of the analog noise the bound depends (cf. Remark 3.3 following the proof). In Section 4 we prove a partial converse to the upper bound result in Section 3, by showing that if one only considers bounded noise processes (where the analog noise can move an internal state at most over a distance , for a su ciently small value of ), then any nite automaton can be simulated with perfect ( = 1=2) reliability by a recurrent analog neural net of the type discussed in Anderson et al., 1977], Siegelmann, Sontag, 1991]. Other embeddings of nite automata in recurrent sigmoidal networks include Frasconi et al., 1996] and Omlin, Giles, 1996] which discuss, respectively, implementations of automata in noisefree radial basis function networks, and in second-order networks with synaptic noise. In Section 5 we establish a new type of upper bound for the VC-dimension of computational models with analog noise. We show that in the presence of arbitrarily small amounts of analog noise there exists an upper bound for the VC-dimension of, e.g. neural nets that is independent of the total number of units in the case of a feedforward architecture, and independent of the computation time in the case of a recurrent neural net. This contrasts with the anomaly that in the noise-free setting the classes of, e.g. nite recurrent analog neural nets Siegelmann, Sontag, 1991] and nite recurrent networks of spiking neurons Maass, 1996] have in nite VC-dimension, and are thus strongly \unlearnable" from the point of view of learning theory. Again, the proofs of the Theorem 5.1, and its Corollaries 5.3 and 5.4 provide explicit (although very large) upper bounds for the VC-dimension of noisy analog neural nets with batch input, which depend on speci c parameters of the analog noise.
4
2 Preliminaries: Computational Systems and Noise Processes
We shall de ne our computational model rst in the noise-free setting, and then consider the e ect of noise on computations separately. An analog discrete-time computational system (brie y: computational system) M is de ned in a general way as a 5-tuple h ; p0; F; ; si, where , the set of states, is a bounded subset of Rd , p0 2 is a distinguished initial state, F is the set of accepting states, is the input domain, and s : ! is the transition function. To avoid unnecessary pathologies, we impose the conditions that and F are Borel subsets of Rd , and for each a 2 , s(p; a) is a measurable function of p. We also assume that contains a distinguished null value t, which may be used to pad the actual input to arbitrary length. The nonnull input domain is denoted by 0 = ? ftg. The intended noise-free dynamics of such a system M is as follows. The system starts its computation in state p0 , and on each single computation step on input element a 2 0 moves from its current state p to its next state s(p; a). After the actual input sequence has been exhausted, M may still continue to make pure computation steps, which lead it from a state p to the state s(p; t). The system accepts its input if it enters a state in the class F at some point after the input has nished. (We shall give a more precise de nition of the dynamics, including the e ects of noise, later.) For instance, the recurrent analog neural net model of Siegelmann, Sontag, 1991] (also known as the \Brain State in a Box" model of Anderson et al., 1977]) is obtained from this general framework as follows. For a network N with d neurons and activation values between ?1 and 1, the state space is = ?1; 1]d. The input domain may be chosen as either = R or = f?1; 0; 1g (for \online" input) or = Rn (for \batch" input). In each case the value zero (or the zero vector) serves conveniently as the null value t. For simplicity, we treat here formally only the cases where R; the extensions to the case = Rn are straightforward. The transition function s : ! is in this model given in terms of a d d weight matrix W = (wij ), a d-component bias vector h = (hi ), a d-component input weight vector c = (ci), and a neuron activation function : R ! ?1; 1]. For any p 2 and a 2 , we de ne s(p; a) = p+ , where for each i = 1; : : :; d,
p+ = ( i
Xd
Both Anderson et al., 1977] and Siegelmann, Sontag, 1991] use the saturated-linear sigmoid activation function 8 > ?1; if u < ?1; < (u) = > u; if ? 1 u 1; : 1; if u > 1; but one may obviously also de ne the model with respect to other activation functions, notably the \standard sigmoid" (u) = tanh u, or the discontinuous signum function ( sgn(u) = ?1; if u < 0; 1; if u 0; the latter choice yielding the model of recurrent threshold logic networks. The initial state in each of these models may be chosen as p0 = (?1; : : :; ?1), and the set of accepting states is determined by the activity of some speci c output unit, say unit 1, so that F = fp 2 j p1 > g, for some threshold value > 0. 5
j =1 wij pj + hi + cia)
In the sequel, we shall use to denote also the componentwise extension of the chosen activation function to state vectors, so that for any p 2 , (p) := ( (p1); : : :; (pd)). This convention lets us write the transition function s de ned above compactly as s(p; a) = (Wp + h + ac): Feedforward analog neural nets may also be modeled in the same manner, except that in this case one may wish to select as the state set := ( ?1; 1] fdormantg)d, where dormant is a distinguished value not in ?1; 1]. This special value is used to indicate the state of a unit whose inputs have not all yet been available at the beginning of a given computation step (e.g. for units on the l-th layer of a net at computation steps t < l). The completely di erent model of a network of m stochastic spiking neurons (see e.g. Gerstner, van Hemmen, 1994] or Maass, 1997]) is also a special case of our general frameS work. In this case one wants to set sp := ( lj =1 0; T )j fnot- ringg)m , where T > 0 is a su ciently large constant so that it su ces to consider only the ring history of the network during a preceding time interval of length T in order to determine whether a neuron res (e.g. T = 30 ms for a biological neural system). If one partitions the time axis into discrete time windows 0; T ) ; T; 2T ) ; : : : ; then in the noise-free case the ring events during each time window are completely determined by those in the preceding one. A component pi 2 0; T )j of a state in this set sp indicates that the corresponding neuron i has red exactly j times during the considered time interval, and it also speci es the j ring times of this neuron during this interval. Due to refractory e ects one can choose l < 1 for biological neural systems, e.g. l = 15 for T = 30 ms. With some straightforward formal operations one can also write this state set sp as a bounded subset of Rd for d := l m. Let us then consider the e ect of noise on computations. Let Z (p; B ) be a function that for each state p 2 and Borel set B indicates the probability of noise corrupting state p into some state in B. The function Z is called the noise process a ecting M , and it should satisfy the mild conditions of being a stochastic kernel Feller, 1971, p. 205], i.e., for each p 2 , Z (p; ) should be a probability distribution, and for each Borel set B, Z ( ; B) should be a measurable function. We assume that there is some measure over so that Z (p; ) is absolutely continuous with respect to for each p 2 , i.e. (B ) = 0 implies Z (p; B ) = 0 for every measurable B . By the Radon{Nikodym theorem Feller, 1971, p. 140], Z then possesses a density kernel with respect to , i.e. there exists a function z ( ; ) such that for any state p 2 and Borel set B , Z Z (p; B) = z(p; q) d : We assume that this function z ( ; ) has values in 0; 1) and is measurable. (Actually, in view of our other conditions this can be assumed without loss of generality.) The dynamics of a computational system M a ected by a noise process Z is now de ned as follows. If the system starts in a state p, the distribution of states q obtained after a single computation step on input a 2 is given by the density kernel a (p; q ) = z (s(p; a); q ). (Note that as a composition of two measurable functions, a is again a measurable function.) The long-term dynamics of the system is given by a Markov process, where the distribution starting in state p is xa (p; q ) of states after jxaj computation steps with input xa 2 de ned recursively by Z xa (p; q ) = x (p; r) a (r; q ) d :
r2 q2B
6
One easily can verify by induction on juj that
xu (p; q ) =
Z
for all x; u 2 of length 1 . Let us denote by x(q ) the distribution x (p0; q ), i.e. the distribution of states of M after it has processed string x, starting from the initial state p0 . Let > 0 be the required reliability level. In the most basic version the system M accepts (rejects) some input x 2 0 R if F x (q ) d 1 + (respectively 1 ? ). In less trivial cases the system may also perform 2 2 pure computation steps after it has read all of the input. Thus, we de ne more generally that the system M recognizes a set L 0 with reliability if for any x 2 0 : Z 1 + for some u 2 ftg x2L , xu (q ) d 2 ZF 1 ? for all u 2 ftg : x2L , = xu (q ) d 2
F
r2
x (p; r)
u (r; q ) d
This covers also the case of batch input, where jxj = 1 and 0 is typically quite large (e.g. 0 = Rn ). One gets a reasonably realistic model for noise in an analog neural net with state space = ?1; 1]d by de ning the noise process Z so that it re ects a \clipped Gaussian distribution". Without more speci c knowledge about the noise source, this appears to be the most appropriate model for analog noise in an analog neural net. One assumes in this model that for any computation step the intended output pi 2 ?1; 1] of the ith unit of the net is replaced by a \clipped Gaussian distribution" of values qi 2 ?1; 1], where values < ?1 (> 1) are rounded to ?1 (respectively 1). If one assumes that this rounding occurs independently for each of the d units i in the network, and if one assumes for simplicity that all the underlying Gaussians have the same variance, then one arrives in our general framework for a noisy computational system M at a noise process Z where Z (p; ) is de ned for each p 2 = ?1; 1]d by a symmetric Gaussian distribution with density z(p; q) = (k q ? p k) around p, but with all values qi < ?1 (qi > 1) of the occurring states hq1 ; : : :; qd i rounded to ?1 (respectively 1). (Here k v k denotes the Euclidean norm of a vector v , and is the density function of some symmetric d-variate Gaussian distribution.) Since such a rounding process will assign probability > 0 to the lower-dimensional bounding hyperrectangles of , we cannot simply de ne as the Lebesgue measure over in order to subsume this type of analog noise under our general noise model. Rather one has to decompose into components 1 ; : : :; k (representing the interior 1 and lower dimensional bounding hyperrectangles 2 ; : : :; k of = ?1; 1]d), and de ne as a sum of measures 1 + : : : + k , where 1 is the Lebesgue measure over 1 and 2 ; : : :; k are Lebesgue measures for the lower dimensional spaces 2 ; : : :; k . In the case of a network of spiking neurons, the noise model has to take into account that not only the ring time of a neuron is subject to some jitter (which can be modeled by a Gaussian distribution), but also neurons may randomly fail to re, or they may re \spontaneously" (i.e. even when they would not re in the corresponding deterministic model). All these e ects can be modeled by a suitable noise process Z de ned on the state space sp discussed earlier, with a measure over de ned via a decomposition of similarly as in the case of analog neural nets. 7
3 An Upper Bound for the Computational Power of Systems with Analog Noise
It has been shown for various concrete models of analog computation without noise (such as generalized shift maps Moore, 1990], recurrent neural nets Siegelmann, Sontag, 1991] and networks of spiking neurons Maass, 1996]) that they can simulate a universal Turing machine, and hence have immense computational power. It has long been conjectured that their computational power collapses to that of a nite automaton as soon as one assumes that they are subject to even small amounts of analog noise. We provide in this section a proof of this conjecture. Furthermore we make explicit (see Remark 3.3) on which parameters of the analog noise the required number of states of a simulating nite automaton depends. Our proof requires a mild continuity assumption for the density functions z (r; ) , which is satis ed in all concrete cases that we have considered. We do not require any global continuity property over for the density functions z (r; ) because of the previously discussed concrete cases, where the state space is a disjoint union of subspaces 1 ; : : :; k with di erent measures on each subspace. We only assume that for some arbitrary partition of into Borel sets 1 ; : : :; k the density functions z (r; ) are uniformly continuous over each j , with moduli of continuity that can be bounded independently of r . In other words, we require that z ( ; ) satis es the following condition: A function ( ; ) from 2 into R is called piecewise equicontinuous if for every " > 0 there is a > 0 such that for every r 2 , and for all p; q 2 j , j = 1; : : :; k: kp?q k implies j (r; p) ? (r; q )j ": (1) Note that because the state space is bounded, any restriction (r; ) of a piecewise equicontinuous function ( ; ) to xed r 2 has bounded range. If z ( ; ) satis es condition (1), we call also the resulting noise process Z piecewise equicontinuous. Our preceding discussions suggest that all practically relevant noise processes Z have this property. To formulate our result, we need a notion of regular sets of sequences over arbitrary domains 0 , which we de ne as follows. Let L 0 be a set of sequences over an input domain 0. Sequences x; y 2 0 are equivalent with respect to L if one has xw 2 L , yw 2 L for all w 2 0 . The set L is regular if this equivalence relation has only nitely many equivalence classes. By the Myhill{Nerode theorem Hopcroft, Ullman, 1979, pp. 65{67], for nite alphabets 0 this de nition coincides with the usual de nition of regular sets via nite automata. From the point of view of computational complexity theory, machine models that only accept regular sets belong to the most \primitive" class of models. In contrast to Turing machines and other \universal" computational models, the number of internal states of such machine models is xed, independently of the length of the input string.
Theorem 3.1 Let L
be a set of sequences over an arbitrary input domain 0 . Assume that some computational system M , a ected by a piecewise equicontinuous noise process Z , recognizes L with reliability , for some arbitrary > 0. Then L is regular.
0
Proof: Let M = h ; p0; F; ; si, where =
L. We shall show that there are only nitely many equivalence classes of sequences with respect to L. We begin by observing that if for two sequences x; y 2 0 , the distributions x ( ) and y ( ) are su ciently close, then x and y are equivalent. To see this, assume that
8
0
ftg, be the system in question recognizing
R
, and suppose for a contradiction that x and y are not equivalent. Then there exists some w 2 0 with xw 2 L , yw 2 L.R Without loss of generality, = 1 assume that xw 2 L. Thus there exists some u 2 ftg with F xwu (q ) d 2 + and R 1 ? . This yields the contradiction F ywu (q ) d 2
r2
j x(r) ? y (r)j d
Z
2 = =
x (r) wu (r; q ) d d ? y (r) wu (r; q ) d d q2FZ r2 q2F r2 Z j x(r) ? y (r)j wu(r; q) d d
Z
Z
q2F
xwu (q ) d
Z
?
Z
q2F
ywu (q ) d
Z Z
q2F r2
r2
j x(r) ? y (r)j
R
Z
implies that x; y 2 0 are equivalent. Thus we have shown that r2 j x (r) ? y (r)j d Next we observe that all the density functions x ( ) for x 2 are piecewise uniformly continuous, with the same bounds on their moduli of continuity as the noise density functions z(r; ) have. This is veri ed by induction on jxj. Given " > 0, let > 0 be such that the density function z ( ; ) satis es condition (1) for all r 2 and j = 1; : : :; k. We then have for any x 2 + , a 2 , and all p; q 2 j such that k p ? q k :
:
q2F
wu (r; q ) d
d
j xa(p) ?
xa (q )j
= =
Z
Zr2
x(r) x(r)
j a(r; p) ? a(r; q)j d jz(s(r; a); p) ? z(s(r; a); q)j d
The preceding observation now implies that the space of all functions x( ) for x 2 0 can be partitioned into nitely many classes C so that any two functions x ( ), y ( ) in the R , and hence correspond to sequences that are same class C satisfy r2 j x(r) ? y (r)j d equivalent with respect to L. Such a partition can for example be achieved in the following way. Using the piecewise uniform continuity of the x( ), choose from within each component j of a nite set (or \grid") Gj that is so dense that for each r 2 j , if tr 2 Gj is the gridpoint closest to r, then j x (r) ? x(tr )j =4 ( ). (To see that such a nite Gj always exists, note that given the value > 0 corresponding to " = =4 ( ) in condition (1), one can by the Bolzano{Weierstrass theorem choose only a nite number of points t from within the bounded set jSso that any two distinct chosen points t, t0 are more than a distance apart.) Take G = k=1 Gj . Now partition the (bounded!) range of all functions x ( ) into j nitely many intervals I of length =2 ( ), and place two functions x ( ), y ( ) in the same class C if for every gridpoint t 2 G the values of x (t) and y (t) fall into the same interval I . Then for any two functions x ( ), y ( ) in the same class C it is the case that for any r 2 j , j = 1; : : :; k, j x(r) ? y (r)j j x(r) ? x(tr )j + j x(tr ) ? y (tr )j + j y (tr ) ? y (r)j = ( ); R R ( = ( )) r2 d = . and thus r2 j x (r) ? y (r)j d 9
" r2 = ":
r2Z
x(r) d
Maass, 1996] for the noise-free case, the preceding Theorem implies that both recurrent analog neural nets and recurrent networks of spiking neurons with online input from 0 can only recognize regular languages in the presence of any reasonable type of analog noise, even if their computation time is unlimited and if they employ arbitrary real-valued parameters.
Remark 3.2 In stark contrast to the results of Siegelmann, Sontag, 1991] and
Remark 3.3 The proof of Theorem 3.1 relies on an analysis of the space of probability dens-
ity functions over the state set . An upper bound on the number of states of a deterministic nite automaton that simulates M can be given in terms of the number k of components j of the state set , the dimension and diameter of , a bound on the values of the noise density function z , and the value of corresponding to " = =4 ( ) in condition (1).
Let us say that a noise process Z de ned on a set Rd is bounded by if it can move a state p only to other states q that have a distance from p in the L1 -norm over Rd , i.e. if its density kernel z has the property that for any p = hp1; : : :; pdi and q = hq1 ; : : :; qd i 2 , z(p; q) > 0 implies that jqi ? pi j for all i = 1; : : :; d. As a partial converse to the upper bound result of the previous Section, we now prove that regular languages over the alphabet 1 f?1; 1g can be recognized with perfect reliability (i.e. = 2 ) by recurrent analog neural nets, as long as the noise process a ecting the computation is bounded by a certain constant > 0. The basic idea of our proof is simply to rst construct a threshold logic network T recognizing the regular language under consideration, and then simulate T with a noisetolerant analog neural net. However, in order to obtain the tolerance vs. delay tradeo results in a uniform manner, we derive them as corollaries from a general result on simulating threshold logic networks by noisy recurrent analog neural nets. Consider a d-unit threshold logic network T (cf. Section 2) with transition function s(p; a) = sgn(Wp + h + ac), where W 2 Rd d is the weight matrix of T , h 2 Rd is the bias vector, and c 2 Rd is the input weight vector. Let us say that T has separation , if at each unit, the argument to the signum function is always at least away from zero, i.e., if jwiT p + hi + ciaj always holds, for every i = 1; : : :; d, p 2 f?1; 1gd, and a 2 f?1; 0; 1g. Any threshold logic network operating on the input alphabet f?1; 0; 1g may be modi ed to have some nonzero separation value by adjusting the bias values appropriately. An important special case are networks with integer weights, which may be adjusted to have separation 1. (On input values a 2 f?1; 1g this is straightforward; dealing with the value a = 0 may in some cases require modifying the network structure.)
4 Noisy Analog Neural Nets Recognize Regular Languages
Theorem 4.1 Let a language L f?1; 1g be recognized by some d-unit threshold logic network T with separation P 0, and let wmax be the maximum total input weight to any > unit of T (i.e., wmax = maxi j jwij j). Let be a constant satisfying < =wmax. Then L can also be recognized by a d-unit recurrent analog neural net N that has perfect reliability ( = 1 ) when a ected by any noise process Z bounded by . The activation function of N 2 may be any function satisfying (u) ! ?1 for u ! ?1 and (u) ! 1 for u ! 1.
10
Proof: The idea of the proof is simply to simulate the threshold logic network T with an analog neural network N by forcing the analog units to always operate close to saturation
(i.e. in states u such that (u) is within of 1, for some small constant ), so that they in e ect function as threshold logic units. This is achieved by multiplying the weights in N by a su ciently large constant m. Thus, let a language L be recognized by an d-unit threshold logic network T with transition function p+ = s(p; a) = sgn(Wp + h + ac), and separation . Let and u be constants such that the noise bound is < =wmax ? , and for all u u , j1 ? (u)j , and for all u ?u , j(?1) ? (u)j . Now consider the analog network N obtained from T by multiplying all the weights and thresholds by a constant
m
and replacing the signum nonlinearities by the sigmoids. We claim that N reproduces the behavior of T exactly, in the sense that the state of N at each time step, before noise is applied, is within of the corresponding state of T . Assume that the claim is true at some given time, when the state of T is some p 2 f?1; 1gd, and that of N correspondingly p = p + r, for some r 2 ? ; ]d. Consider then the update of ~ N rst with a noise vector e = q ? p, where q is generated according to some componentwise ~ ~ ~ -bounded noise density z (~; q ), and then with the network transition function p~
? wmax( + ) ;
u
p+ = ~
= =
(mW q + mh + mac) ~ (mW (p + r + e) + mh + mac) (m(Wp + h + ac) + mW (r + e)):
Considering the argument vector to the sigmoid componentwise, we obtain for each i = 1; : : :; d the bound:
jm(wiT p + hi + cia) + mwiT (r + e)j
m ? mwmax( + )
u:
By our choice of the value u , we are thus again assured that the components of the new state vector p+ of N are within of the corresponding components of the state vector p+ of ~ T . The claim follows by induction. One technicality concerning the choice of nal states in the network N still needs to be pointed out. Even though in the network T the nal states may be de ned as, say, FT = fp 2 f?1; 1gd j p1 = 1g, noise in the network N also a ects the state of the output unit, and so the nal states there should be de ned as FN = fp 2 ?1; 1]d j p1 1 ? g, if the noise is bounded by .
Corollary 4.2 For every regular language L f?1; 1g there is a constant > 0 such that
L can be recognized with perfect reliability (i.e. spite of any noise process Z bounded by .
1 = 2 ) by a recurrent analog neural net in
e.g. Minsky, 1972, pp. 55{57], one can easily construct from this automaton a threshold logic network T with 2m + 1 units that recognizes L. In Minsky's construction there is one 11
Proof: Let L be recognized by some nite automaton with m states. As presented in
threshold logic unit for each (state, input symbol) pair of the simulated automaton, plus one unit that tests for the acceptance condition. (Actually, our present model mandates testing also for input termination, which requires adding a few extra units.) A unit is activated (goes to state 1) when it receives an excitatory signal both from some preceding (state, symbol) unit and its input line. All the nonzero weights in T have absolute value 1, and the units have fan-in at most 2m + 1. Since this network satis es the conditions of Theorem 4.1 with = 1, wmax = 2m + 1, we may choose any value of < 1=(2m + 1). The next corollary shows that we can increase the noise tolerance of a network by slowing down the computation. Given an integer constant 1, let us say that a network N recognizes a language L with delay , if for every string x = a1 : : :ak 2 f?1; 1g , x 2 L if and only if N accepts the string a1 : : :ak (i.e. each input symbol ai is repeated times before the next one is presented). Corollary 4.3 For every regular language L f?1; 1g there is a constant delay value 1 such that for any < 1 , L can be recognized with delay with perfect reliability (i.e. = 2 ) 2 by a recurrent analog neural net that may be subject to any noise process Z bounded by . Proof: Let again L be recognized by some nite automaton with m states. The threshold logic units used in the simulation of Corollary 4.2 simply test for the simultaneous activity on any one of the lines coming from the preceding (state, symbol) units, and the appropriate input line. Thus, each such unit can be replaced by a tree of fan-in 2 OR gates, and a concluding AND gate. Considering that the maximum fan-in of the original units is 2m + 1, the AND-OR trees may be constructed to have height = dlog2 me + 2. The resulting network then has integer weights, with wmax = 2, and recognizes the language L with delay :
Remark 4.4 One can obtain di erent noise-tolerance vs. delay tradeo s using the recent more advanced simulations of nite automata by threshold logic networks ( Alon, Dewdney, Ott, 1991], Horne, Hush, 1996], Indyk, 1995]). For instance, Horne, Hush, 1996] presents a simulation of m-state nite automata by p 1, threshold logic networks with O( m log m) units, connection weights p and delay 4. Thus, one can in Corollary 4.3 achieve a noise-tolerance bound of = O(1= m log m) with delay = 4. Remark 4.5 The precise values of the bounds obtained above are proportional to the width of the interval used to encode unit states in the analog neural net model. The results are here formulated using the interval ?1; 1], and changes in this interval would have the proportional e ects on the values. For instance, if the interval 0; 1] were used (as in e.g. 1 Siegelmann, Sontag, 1991]), the bound in Corollary 4.3 would decrease from 2 to 1 . 4
5 A Novel Upper Bound for the VC-Dimension of Various Types of Neural Nets with Analog Noise
In this section we provide an example for the e ect of analog noise on discrete time analog computations with batch-input. We focus our attention on the most common types of analog 12
neural nets and show that in the presence of arbitrarily small amounts of analog noise there exists an upper bound for the VC-dimension of such neural nets that is independent of the total number of gates in the case of a feedforward architecture, and independent of the computation time in the case of a recurrent neural net. It only depends on the structure of the rst layer of the neural net (or alternatively of any other xed layer.) This novel type of upper bound depends apart from the analog noise only on those parameters of the net that are relevant for its rst computation step, and it holds for arbitrary real-valued batch-inputs and arbitrary real-valued \programmable parameters" (i.e. weights etc.). The resulting upper bounds for the required sample size of a noisy multi-layer sigmoidal neural net extend a preceding result by Haussler. He had shown in Corollary 3 in section 7 of Haussler, 1992] that even in the noise-free case an upper bound for the VC-dimension can be given that only depends on the maximal absolute value of weights for gates on layers 2 and on their maximal fan-in. In the present result all dependence on parameters that concern gates on layers 2 is removed. It will become obvious from the proof of Theorem 5.1 that our upper bound is actually of a quite general nature, and it can also be applied to various other models for discrete-time analog computation with analog noise that are not related to neural nets. The VC-dimension (abbreviated VC-dim(F )) of an arbitrary class F of functions f : Rn ! f0; 1g is de ned as follows. One says that F shatters a nite set S Rn if for every subset A S there exists a function f 2 F with f (x) = 1 for x 2 A and f (x) = 0 for x 2 S ? A. The VC-dimension of F is de ned as VC-dim(F ) := sup fjS j : S Rn is shattered by Fg. The VC-dimension of F may be viewed as a measure for the expressibility (or \degrees of freedom") of F . In particular it provides for arbitrary nite sets D Rn an upper bound of the form jDjO(VC-dim(F )) for the number of functions D ! f0; 1g that can be written as a restriction of a function in F to this nite domain D. As a consequence, the VC-dimension of F is the key parameter for estimating the number of randomly chosen examples that are needed to \learn" arbitrary target functions g : Rn ! f0; 1g from randomly chosen examples hx; g (x)i for g by a learning algorithm that uses functions from F as hypotheses (see Haussler, 1992], Vapnik, Chervonenkis, 1971], Blumer et al., 1989], Maass, 1995]). It should be noted that this does not only hold for the \classical" PAC-learning model where the target function g is required to belong to the class F , but according to Haussler, 1992] also in the general case of agnostic PAC-learning where g : Rn ! f0; 1g can be any function. Of course the latter case is much more relevant for the theory of learning with neural nets, where the class F of possible \hypotheses" is xed by the architecture of the neural net on which we run a learning algorithm, whereas the examples hx; g (x)i may arise from some arbitrary \real-world" classi cation problem for which we train the neural net. It is obvious from the results of Siegelmann, Sontag, 1991] and Maass, 1996] that there exist nite recurrent analog neural nets and nite recurrent networks of spiking neurons with batch-input and parameters from Q that have in nite VC-dimension (consider networks that can simulate a universal Turing machine, with each input bit-string encoded into a rational number). From the point of view of learning theory an in nite VC-dimension is commonly interpreted as information-theoretic evidence that there exists no \learning algorithm" for such networks (not even one with unlimited computation time). We will show in this section that this \anomaly" disappears as soon as one takes into account that the neural net is subject to analog noise, even if the amount of such noise is arbitrarily small. 13
For technical reasons we will talk in this section also about the pseudo-dimension P-dim(G ) of a class G of real-valued functions g : Rn ! R. One can de ne P-dim(G ) as the VCdimension of the following associated class
F := ff : Rn+1 ! f0; 1g : 9 g 2 G (f (x; y) = 1 if g(x) y and f (x; y) = 0 if g(x) < y)g
of Boolean-valued functions. Consider now the computation of a system M = h ; p0; F; Rn ; si on a batch input vector x 2 Rn, a ected by some piecewise equicontinuous noise process Z whose density function z has values in some range 0; B]. The distribution of states of M after k 1 computation steps is given by the density function xu (p), where jxj = 1 and u = tk?1 . For k > 1, R this density can be decomposed as q2 x (p0; q ) u (q; p) d , and for k = 1 we have simply xu (p) = x (p0; q ). This decomposition of the density function for the state-distribution of M will be essential for our subsequent results. We show in Theorem 5.1. that there exists a nite upper bound for the VC-dimension of the class F of functions computable by a class M of such systems M (which receive arbitrary real-valued batch input), that does not depend on the \complexity" of the class H of functions z ( ; ) that describe the second part of the computations of these systems M after their rst computation step. Let M be a class of such systems, a ected by the same piecewise equicontinuous noise process Z . For example M can be the class of systems M that result from di erent weightassignments to some feedforward or recurrent analog neural net with some xed architecture. Denote by G the class of all density kernels of the form (x; q ) := x (p0; q ) for systems M 2 M, and by H the class of density kernels of the form !(q; p) := u (q; p), for systems M 2 M and sequences u 2 ftg . (As a special case, we include also the constant function 1 in H.) Then all the Boolean functions computed with reliability by the systems M 2 M are included in the class F of functions f : Rn ! f0; 1g that are composed of a function 2 G and a function ! 2 H so that for any x 2 Rn the integral R R 1 p2F q2 (x; q ) ! (q; p) d d has a value 2 + if f (x) = 1; (2) and else a value 1 ? : 2 Actually, the class F contains somewhat more than just the functions computed by systems from M, because the two component functions and ! in (2) may come from two di erent systems in M (for example from two di erent weight-assignments to a recurrent analog neural net). In Theorem 5.1 we consider an even more general set-up where one has two bounded state sets Rd and 0 Rd0 , measures over and 0 over 0 , as well as a Borel set F 0 of accepting nal states. (In applications is typically the set of possible intermediates states after a xed number l (e.g. l = 1) of computation steps, and 0 is the set of possible output states of a computation. One has d 6= d0 if for example the number d of units on the rst hidden layer of a feedforward sigmoidal neural net di ers from the number d0 of output nodes of the net; see Corollary 5.3.). We assume in Theorem 5.1 that G is an arbitrary class of piecewise equicontinuous density kernels : Rn ! 0; B ] with uniformly bounded moduli of continuity (as in condition (1)), 0 ! R+ , that > 0 is an arbitrary that H is an arbitrary class of density kernels ! : given parameter, and that F is the class of functions f : Rn R! fR ; 1g for which there exist 0 functions 2 G and ! 2 H so that for any x 2 R the integral p2F q2 (x; q ) ! (q; p) d d 0 1 has a value 2 + if f (x) = 1 , and otherwise a value 1 ? . 2 14
Because of our assumption about the function class G , one can (as in the proof of Theorem 3.1) superimpose on the space a nite grid G, such that for any ; ~ 2 G and x; x 2 Rn: ~ j (x; q) ? ~(~; q)j =5 ( ) for all q 2 G implies that x R x q2 j (x; q ) ? ~ (~; q )j d < =2: The size jGj of the grid (i.e. the number of gridpoints) depends in general on the reliability parameter , the common moduli of continuity of the functions in G , and the volume and shape of the state space .
Theorem 5.1 Let G , H, and F be function classes as speci ed above, and assume in addition that the class G has nite pseudo-dimension . Then one can give a nite upper bound for the VC-dimension of F in terms of , B , jGj, , and ( ). Obviously this bound does not depend on the complexity of the function class H (except via parameters related to the state
set ) .
of these functions to the nite domain S G. We also consider for := =10 ( ) and any class A of functions with range R+ the class A of all \ -discretizations" g of functions g 2 A, where g (z) := g(z) for any z in the domain of g :
Proof: Let S Rn be some arbitrary nite set shattered by F . For any subset A S we x functions A 2 G and !A 2 H so that for any x 2 S the integral R R 1 0 p2F q2 A (x; q ) !A (q; p) d d has a value 2 + if x 2 A; and else a value 1 ? : 2 We write GS for the class of all functions A 2 G for A S , and GS for the class of restrictions
In particular for the class GS the functions A 2 GS map S G into f0; : : :; b ? 1g for b := bB= c + 1. One should note that by our assumptions on G , for any ; ~ 2 G and any x; x 2 S ~ R x the condition 8q 2 G (j (x; q ) ? ~ (~; q )j 1) implies that j (x; q ) ? ~ (~; q )j d < =2. x One can get an upper bound for the \complexity" of GS by applying to GS a generalization of \Sauer'sP ? due to Alon et al., 1993]. Given integers m, b, and , de ne Lemma" (m; b; ) := log2 i=1 m bi . Lemma 15 of Alon et al., 1993] states that if A is any class i of functions obtained as the discretizations of the functions in a class A of pseudo-dimension , such that the functions in A have a domain D of size m and range f0; : : :; b ? 1g, then A must contain an \L1 2-cover" B A of size at most
jB j 2 (mb2) (m;b; ): ~ That is, for every f 2 A there is some f 2 B such that jf (z ) ? f~(z )j < 2 (and hence 1) for every z 2 D. (The result holds for general values of , , m, and b.) Applied to our context (with A := GS ) this result implies that there exists a set G GS whose cardinality can be bounded in terms of the pseudo-dimension of G as jG j 2 (jS j jGj b2) (jSj jGj;b; ); (3)
15
such that for every 2 GS there exists some ~ 2 G with j (x; q ) ? ~ (x; q )j 1 for all x 2 S and all q 2 G. With the help of the 2-cover of GS which is induced by G we can now show that the cardinality jS j of the shattered set S can be bounded through the inequality
(4) 2jS j 2 (jS j jGj b2 ) (jS j jGj;b; ) 2bjGj : It is obvious that this inequality yields an upper bound for jS j which does not depend on the complexity of the function class H (except for parameters related to ) . Let us consider for each ! 2 H the discrete map ! : f0; : : :; b ? 1gG ! f0; 1g which is ^ induced by ! through the following de nition: !(^) has value 0 for ^ 2 f0; : : :; b ? 1gG ^ if there exist some 2 G and x 2 S with j^(q) ? (x; q)j 1 for all q 2 G and R R 0 1? : p2F q2 (x; q ) ! (q; p) d d 2 Else we set ! (^ ) = 1: ^
Since we have the upper bound (3) on the size of the cover G , and there exist at most 2bjGj di erent functions ! , it su ces for proving (4) to show that the following claim holds. ^ Claim: Let A1; A2 S . If some function ~ 2 G covers both A1 and A2 , in the sense that j ~ (x; q ) ? A1 (x; q )j 1 and j ~ (x; q ) ? A2 (x; q )j 1 for all x 2 S and all q 2 G, and moreover !A1 = !A2 , then A1 = A2. ^ ^ In order to prove this claim, let us assume that A1 6= A2 , but both A1 and A2 are covered by the same function ~ 2 G . We shall show that then !A1 6= !A2 . ^ ^ Fix some x0 2 S so that either x0 2 A1 ? A2 or x0 2 A2 ? A1 ; without loss of generality we may assume that x0 2 A1 ? A2 . Let ^ : G ! f0; : : :; b ? 1g be de ned by ^ (q ) = ~ (x0; q ). Then we have !A2 (^ ) = 0, since by assumption j ~ (x0; q ) ? A2 (x0 ; q )j 1 for all q 2 G and ^ Z Z 1? : A2 (x0; q ) !A2 (q; p) d d 0 2 p2F q2 Assume for a contradiction that also !A1 (^ ) = 0 for this function ^ . This implies that ^ there exist some 2 G and some x1 2 S with Z Z (x1; q ) !A1 (q; p) d d 0 1 ? and (5) 2 p2F q2 j^ (q) ? (x1; q)j 1 for all q 2 G: The latter implies by our choice of G and and the de nition of ^ that
Z
On the other hand the assumptions on ~ 2 G imply that j ~ (x0 ; q ) ? A1 (x0 ; q )j 1 for all q 2 G, hence
Z
q2
j~(x0; q) ? (x1; q)j d j~(x0; q) ?
A1 (x0; q )j d
<
=2:
(6)
q2
<
=2 :
(7)
16
Furthermore since x0 2 A1 , we have by choice of !A1 that
Z Z
p2F q2
Z
A1 (x0; q )
!A1 (q; p) d d 0
A1 (x0 ; q )j d
Z Z
1+ : 2
(8)
The inequalities (6) and (7) imply that
q2
Z Z
j (x1; q) ?
< :
This inequality yields in combination with (5) and (8) the contradiction
0 (x1; q ) !A1 (q; p) d d 0 ? A1 (x0; q ) !A1 (q; p) d d p2FZ q2 p2F q2 Z j (x1; q) ? A1 (x0; q)j !A1 (q; p) d d 0
=
Z
p2F q2
Z
q2 q2
j (x1; q) ? j (x1; q) ?
A1 (x0 ; q )j A1 (x0 ; q )j
Z
< :
d
p2F
!A1 (q; p) d 0 d
This contradiction implies that !A1 (^ ) = 1, hence !A1 6= !A2 . Thus we have veri ed the ^ ^ ^ preceding claim, and the proof of Theorem 5.1 is now complete.
Remark 5.2 It follows from Alon et al., 1993] that instead of a nite upper bound for the pseudo-dimension of G it su ces for Theorem 5.1 to assume a nite upper bound for the -dimension P -dim(G ) of G for = =20 ( ). Corollary 5.3 There exists a nite upper bound for the VC-dimension of layered feedforward
sigmoidal neural nets and feedforward networks of spiking neurons with piecewise equicontinuous analog noise (for arbitrary real-valued inputs, Boolean output computed with some arbitrary reliability > 0, and arbitrary real-valued \programmable parameters") which does not depend on the size or structure of the network beyond its rst hidden layer.
layered feedforward sigmoidal neural nets with n input-nodes, d units on their rst hidden layer, and d0 output nodes. Thus the nets in N may have arbitrary numbers of layers and gates, and arbitrary real-valued weight-assignments. We assume that the d gates on the rst layer are a ected by some piecewise equicontinuous noise process with density kernel z : 2 ! R+ , where := ?1; 1]d . Let F : Rn Rm ! be the function whose value F (x; w) is the vector of outputs of the d rst hidden layer units (without noise), for arbitrary network inputs x 2 Rn and arbitrary assignments w 2 Rm to the weights and biases of these units. We take as the class G of functions considered in the proof of Theorem 5.1 all functions of the form (x; q ) = z (F (x; w); q ) for arbitrary parameters w 2 Rm . The results presented in Karpinski, Macintyre, 1995] imply that the pseudo-dimension of this class G of functions is 17
Proof: We rst consider for some arbitrary given parameters n; d; d0 2 N the class N of all
bounded by a polynomial in m, for all common choices of activation functions of the sigmoidal units and all practically relevant density kernels z for the noise process (even involving the exponential function). In the case where the activation functions and density kernels are piecewise polynomial one can apply the results of Goldberg, Jerrum, 1995] to get a slightly better nite upper bound for the pseudo-dimension of G . We de ne for = ?1; 1]d and 0 = ?1; 1]d0 the class H as the class of all density kernels 0 ! R+ that describe the computations of the remaining layers of networks in N !: with arbitrary noise processes (and arbitrary real-valued weights). It follows from Theorem 5.1 that the nite VC-dimension bound obtained for the class F of functions computed with reliability > 0 by networks in the class N does not depend on the complexity of the function class H , and hence not on the number of layers, the number of units beyond the rst layer, or the noise process on later layers of these networks. In the case of a network N of noisy spiking neurons the programmable parameters consist of the \weights" of synapses, time-delays for postsynaptic potentials, and parameters which determine other aspects of the functional form of response functions (i.e. postsynaptic potentials) and threshold functions. The pseudo-dimension of the class G which arises when one applies (as described in Section 2) the here considered framework to the rst layer of a network N of noisy spiking neurons can be bounded with the help of the same tools as for the case of sigmoidal neural nets.
Corollary 5.4 There exists a nite upper bound for the VC-dimension of recurrent sigmoidal neural nets and networks of spiking neurons with analog noise (for arbitrary real valued inputs, Boolean output computed with some arbitrary reliability > 0, and arbitrary real valued \programmable parameters") which does not depend on the computation time of the network, even if the computation time is allowed to vary for di erent inputs.
now consists of the class of all state-distributions that arise from the rst computation step of the total network, and H consists of all possible state-transformations that can arise from the rest of the computations of the same network.
Proof: One proceeds in the same manner as for the proof of Corollary 5.3, except that G
6 Conclusions
We have introduced a new framework for the analysis of analog noise in discrete-time analog computations that is better suited for \real-world" applications and more exible than previous models. In contrast to preceding models it also covers important concrete cases such as analog neural nets with a Gaussian distribution of noise on analog gate outputs, noisy computations with less than perfect reliability, and computations in networks of noisy spiking neurons. Furthermore we have introduced adequate mathematical tools for analyzing the e ect of analog noise in this new framework. These tools di er quite strongly from those that have previously been used for the investigation of noisy computations. We show that they provide new bounds for the computational power and VC-dimension of analog neural nets and networks of spiking neurons in the presence of analog noise. 18
Finally we would like to point out that our model for noisy analog computations can also be applied to completely di erent types of models for discrete time analog computation than neural nets, such as arithmetical circuits Turan, Vatan, 1994], the random access machine (RAM) with analog inputs, the parallel random access machine (PRAM) with analog inputs, various computational discrete-time dynamical systems ( Moore, 1990], Koiran et al., 1994], Asarin, Maler, 1994], Orponen, Matamala, 1996]) and (with some minor adjustments) also the BSS model Blum, Shub, Smale, 1989], Koiran, 1993]. Our framework provides for each of these models an adequate de nition of noise-robust computation in the presence of analog noise, and our results provide upper bounds for their computational power and VC-dimension in terms of characteristics of their analog noise.
Acknowledgement
We would like to thank Peter Auer for helpful conversations.
References
Alon et al., 1993] N. Alon, N. Cesa-Bianchi, S. Ben-David, D. Haussler, Scale-sensitive dimensions, uniform convergence, and learnability. Proceedings of the 34th Annual IEEE Symposium on Foundations of Computer Science, 292{301. IEEE Computer Science Press, New York, NY, 1993. Alon, Dewdney, Ott, 1991] N. Alon, A. K. Dewdney, T. J. Ott, E cient simulation of nite automata by neural nets. J. Assoc. Comput. Mach. 38 (1991), 495{514. Anderson et al., 1977] J. A. Anderson, J. W. Silverstein, S. A. Ritz, R. S. Jones, Distinctive features, categorical perception, and probability learning: some applications of a neural model. Psychological Review 84 (1977), 413{451. Reprinted in Anderson, Rosenfeld, 1988], pp. 287{325. Anderson, Rosenfeld, 1988] J. A. Anderson and E. Rosenfeld (eds.), Neurocomputing: Foundations of Research. The MIT Press, Cambridge, MA, 1988. Asarin, Maler, 1994] E. Asarin, O. Maler, On some relations between dynamical systems and transition systems. Proceedings of the 21st International Colloquium on Automata, Languages, and Programming, 59{72. Lecture Notes in Computer Science 820, SpringerVerlag, Berlin, 1994. Blum, Shub, Smale, 1989] L. Blum, M. Shub, S. Smale, On a theory of computation over the real numbers: NP-completeness, recursive functions and universal machines. Bulletin of the Amer. Math. Soc. 21 (1989), 1{46. Blumer et al., 1989] A. Blumer, A. Ehrenfeucht, D. Haussler, M. K. Warmuth, Learnability and the Vapnik-Chervonenkis dimension. J. Assoc. Comput. Mach. 36 (1989), 929{965. Casey, 1996] M. Casey, The dynamics of discrete-time computation, with application to recurrent neural networks and nite state machine extraction. Neural Computation 8 (1996), 1135{1178. 19
Cowan, 1966] J. D. Cowan, Synthesis of reliable automata from unreliable components. Automata Theory (E. R. Caianiello, ed.), 131{145. Academic Press, New York, 1966. Feller, 1971] W. Feller, An Introduction to Probability Theory and Its Applications, Vol. 2, Second Edition. J. Wiley & Sons, New York, NY, 1971. Frasconi et al., 1996] P. Frasconi, M. Gori, M. Maggini, G. Soda, Representation of nite state automata in recurrent radial basis function networks. Machine Learning 23 (1996), 5{32. Gal, 1991] A. Gal, Lower bounds for the complexity of reliable Boolean circuits with noisy gates. Proceedings of the 32th Annual IEEE Symposium on Foundations of Computer Science, 594{601. IEEE Computer Science Press, New York, NY, 1991. Gerstner, van Hemmen, 1994] W. Gerstner, J. L. van Hemmen, How to describe neuronal activity: spikes, rates or assemblies? Advances in Neural Information Processing Systems 6, 463{470. Morgan Kaufmann, San Mateo, CA, 1994. Goldberg, Jerrum, 1995] P. W. Goldberg, M. R. Jerrum, Bounding the Vapnik-Chervonenkis dimension of concept classes parameterized by real numbers. Machine Learning 18 (1995), 131{148. Haussler, 1992] D. Haussler, Decision theoretic generalizations of the PAC-model for neural nets and other learning applications. Information and Computation 100 (1992), 78{150. Hopcroft, Ullman, 1979] J. E. Hopcroft, J. D. Ullman, Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading, MA, 1979. Horne, Hush, 1996] B. G. Horne, D. R. Hush, Bounds on the complexity of recurrent neural network implementations of nite state machines. Neural Networks 9 (1996), 243{252. Indyk, 1995] P. Indyk, Optimal simulation of automata by neural nets. Proceedings of the 12th Annual Symposium on Theoretical Aspects of Computer Science, 337{347. Lecture Notes in Computer Science 900, Springer-Verlag, Berlin, 1995. Karpinski, Macintyre, 1995] M. Karpinski, A. Macintyre, Polynomial bounds for VCdimension of sigmoidal and general Pfa an neural networks. J. of Computer and System Sciences, to appear. Kifer, 1988] Y. Kifer, Random Perturbations of Dynamical Systems. Birkhauser, Boston, MA, 1988. Koiran, 1993] P. Koiran, A weak version of the Blum, Shub and Smale model. Proceedings of the 34th Annual IEEE Symposium on Foundations of Computer Science, 486{495. IEEE Computer Science Press, New York, NY, 1993. Koiran et al., 1994] P. Koiran, M. Cosnard, M. Garzon, Computability with low-dimensional dynamical systems. Theoret. Comput. Sci. 132 (1994), 113{128. Maass, 1995] W. Maass, Vapnik-Chervonenkis dimension of neural nets. The Handbook of Brain Theory and Neural Networks (M. A. Arbib, ed.), 1000{1003. The MIT Press, Cambridge, MA, 1995. 20
Maass, 1996] W. Maass, Lower bounds for the computational power of networks of spiking neurons. Neural Computation 8 (1996), 1{40. Maass, 1997] W. Maass, Fast sigmoidal networks via spiking neurons, Neural Computation 9, (1997), 279{304. Minsky, 1972] M. L. Minsky, Computation: Finite and In nite Machines. Prentice-Hall, Englewood Cli s, NJ, 1972. Moore, 1990] C. Moore, Unpredictability and undecidability in physical systems. Phys. Review Letters 64 (1990), 2354{2357. Omlin, Giles, 1996] C. W. Omlin, C. L Giles, Constructing deterministic nite-state automata in recurrent neural networks. J. Assoc. Comput. Mach. 43 (1996), 937{972. Orponen, Matamala, 1996] P. Orponen, M. Matamala, Universal computation by nite twodimensional coupled map lattices. Proceedings of the Workshop on Physics and Computation, PhysComp'96, 243{247. New England Complex Systems Institute, Boston, MA, 1996. Paz, 1971] A. Paz, Introduction to Probabilistic Automata. Academic Press, New York, NY, 1971. Phatak, Koren 1995] D. S. Phatak, I. Koren, Complete and partial fault tolerance of feedforward neural nets. IEEE Transactions on Neural Networks 6 (1995), 446{456. Pippenger, 1989] N. Pippenger, Invariance of complexity measures for networks with unreliable gates. J. Assoc. Comput. Mach. 36 (1989), 531{539. Rabin, 1963] M. Rabin, Probabilistic automata. Information and Control 6 (1963), 230{245. Siegelmann, 1994] H. T. Siegelmann, On the computational power of probabilistic and faulty networks. Proceedings of the 21st International Colloquium on Automata, Languages, and Programming, 23{34. Lecture Notes in Computer Science 820, Springer-Verlag, Berlin, 1994. Siegelmann, Sontag, 1991] H. T. Siegelmann, E. D. Sontag, Turing computability with neural nets. Appl. Math. Letters 4(6) (1991), 77{80. Cf. also: H. T. Siegelmann, E. D. Sontag, On the computational power of neural nets. J. of Computer and System Sciences 50 (1995), 132{150. Turan, Vatan, 1994] G. Turan, F. Vatan, On the computation of Boolean functions by analog circuits of bounded fan-in. Proceedings of the 35th Annual IEEE Symposium on Foundations of Computer Science, 553{564. IEEE Computer Science Press, New York, NY, 1994. Vapnik, Chervonenkis, 1971] V. N. Vapnik, A. Y. Chervonenkis, On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications 16 (1971), 264-280.
21
von Neumann, 1956] J. von Neumann, Probabilistic logics and the synthesis of reliable organisms from unreliable components. Automata Studies (C. E. Shannon, J. E. McCarthy, eds.), 329{378. Annals of Mathematics Studies 34, Princeton University Press, Princeton, NJ, 1956.
22