当前位置:首页 >> 英语学习 >>

The transition from asynchronous to synchronous system operation An approach for distribute


The Transition from Asynchronous to Synchronous System Operation: An Approach for Distributed Fault-Tolerant Systems ?
Wilfried Steiner Michael Paulitsch

Real-Time Systems Group, Technische Universit¨ t Wien a Treitlstr. 3/182, A-1040 Vienna, Austria, {willi,michael}@vmars.tuwien.ac.at Abstract
Immediately after power-up, synchronous distributed systems need some time until essential timing properties, which are required to operate correctly, are established. We say that synchronous systems are initially in asynchronous operation. In this paper, we present an algorithm and architectural guidelines that assure the transition from asynchronous to synchronous operation within a bounded duration even in case of failures. scribes three different start-up algorithms and does performance tests using simulation. If a synchronous systems is to be used in safety-critical applications, timeliness guarantees need to be given also during execution of the startup algorithm. These guarantees must hold even in presence of faults and basically provide an upper bound on the time needed to establish synchronous operation. It is the objective of this paper to present an algorithm and architectural guidelines that bring embedded faulttolerant distributed systems from an asynchronous state after power-up into synchronous operation within a bounded duration. The timeliness of the algorithm will also be guaranteed in case of failures. This algorithm and guidelines will be evaluated using a time-triggered synchronous system, but may also be used for other synchronous systems. The paper begins with a description of an architecture for synchronous operation. In Section 3, we describe a start-up algorithm. Section 4 discusses different failure scenarios in the start-up phase and possible solutions to handle the presented failures. We ?nally conclude in Section 5.

1 Introduction
Asynchronous distributed systems do not require any assumptions concerning the timing behavior of system components for the system to operate correctly. Synchronous distributed systems, on the other hand, require that assumptions concerning the timing behavior of components are met by system components. Algorithms built on top of synchronous systems can take advantage of these assumptions and allow timeliness guarantees. Generally speaking, algorithms are easier to implement on top of synchronous systems than on top of asynchronous systems; some algorithms even cannot be implemented on asynchronous systems [2, 4]. Work and research has been done that takes advantage of and focuses on the properties of synchronous and asynchronous systems, respectively, such as [8, 3]. The work on synchronous systems focuses mainly on the operation of distributed systems in a state where all system properties are already guaranteed. However, after system start every system is basically asynchronous. Consequently, special algorithms, which we call start-up algorithms, must bring a system from asynchronous into synchronous operation. Work in this ?eld has been done, e.g., by L¨ nn who simulated algorithms that o lead to a synchronous system after power-up [10]. He de? This work has been supported by the European projects DSoS (proj. No IST-1999-11585), PAMELA (proj. No G4RD-CT1999-00086), and NEXT TTA (proj. No IST-2001-32111). We thank G¨ nther Bauer for u fruitful discussions.

2 Architecture for Synchronous Operation
This section starts with a de?nition of a synchronous system. We then give an example of an architecture that meets the presented attributes, and describe a fault-tolerant architecture that operates under a given fault hypothesis.

2.1 Structure of a Distributed System
We identify the following components in our system: Node: A node is a physical unit and a unit of failure that comprises a communication controller and a computing resource. A node creates, analyzes, sends, and receives messages and performs computational activities. Communication Medium: A communication medium is a connection between nodes for the purpose of message transmission. Figure 1 depicts a system of nodes as an example.

node 0

node 1

communication medium

These identi?ers are called node ID and slot ID, respectively. The slots are ordered by their identi?ers in ascending order. By de?nition the node ID equals the slot ID and, thus, a sending node is also identi?ed by its slot. Figure 2 depicts the access scheme of a system with nodes and its numbering.
TDMA round (n-1) TDMA round n
assigned to node3 assigned to node0 assigned to node1 assigned to node2 assigned to node3 assigned to node0

node 2 node 3

TDMA round (n+1)

Figure 1. Distributed system with 4 nodes

2.2 Characteristics of Synchronous Systems
A distributed system is classi?ed as synchronous if it satis?es the following conditions [5]: Bounded Transmission Delay: There is a known upper bound ? on message delay. ? consists of the time it takes for sending, transporting, and receiving a message over a communication medium. Bounded Clock Drift: Every node ? has a local clock ? with known bounded rate of drift ? ? with respect to physical time. Bounded Processing Time: There are known upper and lower bounds on the time required by a process to execute a processing step.

12:00 slot3 slot0

3:00 slot1

5:00 slot2

9:00 slot3

12:00 slot0

3:00

Figure 2. Access scheme of a TDMA system with 4 nodes In order to work correctly, each node must have a local copy of the global transmission schedule, so each node knows when to send and when the others send and, thus, when to receive frames. In the described system, a frame contains an arbitrary number of messages, however, as already mentioned, a node can only send ? frame within ? TDMA round. Each frame carries the following information: Global Time. the slot. Node ID. The current global time at the beginning of

2.3 An Example of a Synchronous Architecture
We use a time-division multiple-access (TDMA) architecture such as the Time-Triggered Architecture (TTA) and its safety-critical communication protocol TTP/C [7] as an example of a synchronous system. A TDMA architecture obtains its synchronous behavior from the progression of real time, i.e., we assume the availability of a global system time, which is used for the arbitration of the communication medium. In a time-triggered system this global time is established using the local clocks of the nodes. Nevertheless, synchronous systems can also be constructed without knowledge of real time, e.g., with token-passing mechanisms. In an architecture using a TDMA scheme, time is split up into (non-overlapping) pieces of not necessarily equal durations, which are called sending slots (for short called slots). In the architecture we focus on, each of the ? nodes of the system is assigned a priori one unique slot, where ? equals the number of nodes of the system. When the time of a node’s slot is reached, the node is allowed to broadcast one frame on the communication medium for the duration of the slot, ×??? whereas ? ? . The sequence of the slots of a system is called a TDMA round. After the end of one TDMA round the next TDMA round starts, i.e., after the sending of the node in the last slot of a TDMA round, the node that is allowed to send in the ?rst slot sends again. Consequently, each node starts sending every ???? time ?? ?? ×??? . units, whereas ???? ? We identify each node and slot by a unique integer; the numbering starts at ? and is incremented by ? for each node.

The node ID of the sending node. 1

Data. The messages that a node sends. Since the transmission pattern is de?ned and coordinated among the nodes and the delay of the transport itself is given by the communication medium and its physical length and has an upper bound, there exists a calculable upper bound on the transmission of frames. We assume that the hardware on which the architecture is running has clocks with known bounded drift rates and the execution times of computations and data accesses have an upper bound. This allows us to conclude that the execution time of the process at the node that performs computational activities, e.g., in order to generate the messages, has an upper bound. With respect to the points mentioned in Section 2.2, we can say that the described architecture is synchronous.

2.4 Fault Tolerance in Synchronous Operation
For all modes of operation, we assume that the following fault hypothesis holds: Only ? single fault occurs during startup phase affecting at most ? node and/or a part of the communication medium in spatial proximity. The singlefault hypothesis is justi?ed, because the architecture aims at systems where the inter-arrival time of failures is higher than the system recovery time. We assume that nodes are not in spatial proximity.
1 Due to the node ID, the delay of the transmission can be corrected precisely. This becomes important for the startup of large-scale systems.

2.4.1 Masking of Failures Since a fault can affect a part of a communication medium, the communication medium consists of two replicated channels. The channels are routed physically separately in order to avoid both channels being affected by a single fault. In order to be able to detect failures of the communication medium in the value domain each frame is protected by a checksum. A received frame is said to be correct at a receiving node if the calculation of the checksum by the receiving node equals the checksum of the frame. Otherwise it is said to be incorrect. A failure of a node is tolerated (i.e., it is masked to the application) by replication of nodes and the use of, e.g., Triple-Modular Redundancy (TMR) and correct voting algorithms [8]. 2.4.2 Fault Isolation A fault containment region (FCR) is ”a set of components that is considered to fail (a) as an atomic unit, and (b) in a statistically independent way with respect to other FCRs” [6]. In order to assure that a node is an FCR, i.e., that a fault affects only a node and that a failure that occurred in a component is not able to affect another component, we have to provide fault isolation. E.g., a fault that may affect both channels can occur close to a node where the two channels are in spatial proximity, because there they are connected to the node. An example for a fault affecting both channels is the short circuit of two channels close to a node. We use a star topology and assume the placement of devices into the star center of each of the channels, which isolate failures, to ensure that the consequences of such a failure will in?uence the operation of at most ? node. We call such a device a guardian. Actually, there must be at least ? guardians to cope with the faults of the single-fault hypothesis. Moreover, a correct guardian must not be able to produce correct frames on its own [12, 1].
node 0 node 1

order to provide a consistent view on a frame to all other nodes of a cluster and to avoid Slightly-off-Speci?cation (SoS) failures [9].

3 The Transition from Asynchronous to Synchronous Operation
In the preceding sections, we have explained the normal operation of the system, i.e., the synchronous operation. In this section, we will focus on the transition from asynchronous to synchronous operation. After power-up, the clock of a node is initially asynchronous to the one of other nodes, we say a node is asynchronous with respect to the synchronized system time or in asynchronous operation. If nodes were left in asynchronous operation after power-up, collisions of frames could occur due to the missing coordination of the nodes’ communication medium access. This would lead to an unboundable transmission delay. Thus, the provision of an algorithm that synchronizes the different times of the nodes after startup is essential for the operation of synchronous systems. We call such an algorithm startup algorithm. The subject of this section is to describe a startup algorithm.

3.1 Start-Up Algorithm
Init Listen Cold Start Active

1

2

3

4

Figure 4. Finite State Machine of the Startup Algorithm For the startup algorithm, we de?ne three timeouts: ×? ???? is unique to each node. It is Startup Timeout given by the duration of all TDMA slots prior to the slot of the node with the node ID beginning at a TDMA round start. ? ? ×? ???? ? ×??? (1)

guardian 0

guardian 1

communication medium node 2 node 3

?? ×? ?? of a node is given by the Cold Start Timeout sum of the node’s Startup Timeout ×? ???? and a complete TDMA round ???? : ?? ×? ?? ????
·

?

??

?

Figure 3. System of 4 nodes with replicated channels and guardians The task of the guardian is to prevent a malicious node from sending outside its slot, which prevents a failure in the time domain. We assume that the guardian has access to the global transmission schedule and the system structure. This enables a guardian to analyze a frame that is sent from a node and to check whether this frame contains information that may lead to a failure at another node, such as the wrong time. Furthermore, a guardian performs signal reshaping in

×? ????

(2)

? ×? ? of a node i is given by the sum of Listen Timeout the node’s Startup Timeout ×? ???? and two TDMA rounds ? ? ???? : ? ×? ?
?

? ????

·

×? ????

(3)

To explain the system startup we refer to the ?nite state machine in Figure 4. We distinguish between the following states: Init, Listen, Cold Start, Active.

Init State: After power-up, a node starts in Init State. After its internal initialization it transits to Listen State. Listen State: The node starts its Listen Timeout and begins listening on the bus. If an incorrect frame or any noise has been received while listening, the Listen Timeout is restarted. If a correct frame has been received, the receiving node accepts the global time and node ID of the incoming frame. Then, the node enters Active State. If no traf?c has been detected on the communication medium while the node was listening and the Listen Timeout expires, the node enters Cold Start State. Cold Start State: The node generates a frame using its local time as global time and its node ID (we call such a frame a Cold Start Frame). The node then starts its Cold Start Timeout, begins sending the Cold Start Frame on the communication medium, proceeds the transmission schedule for one TDMA round, and listens for a frame on the communication medium. If during this TDMA round no frames from other nodes have been received, the node expects that no other node was in Listen State. It waits for its Cold Start Timeout to expire (i.e., it waits for the duration of the remaining unique Startup Timeout) and sends another Cold Start Frame. If during this TDMA round a correct frame has been received and the global time in this frame equals the node’s view of the global time, the node enters the Active State. The node enters Listen State if it received a faulty frame, other noise has been detected, or a correct frame that contains a global time different from the node’s view of the global time. Active State: In Active State the node is in synchronized operation and proceeds its transmission schedule cyclically, i.e., it waits until the time for its slot is reached and starts sending, as described in Section 2.3.

We de?ne ?? ×? ? as the point in time, when node i starts its Listen Timeout ? ×? ? . Thus, the point in time to send the ?rst Cold Start Frame, ? ?? ×? ??? , is given per node by:
? ??

×? ???

??

×? ? · ? ×? ?

(4)

We de?ne the class of nodes, ? ?? ×? ?? , that send their ?rst Cold Start Frames that collide as follows:

? ?? ×? ??

? ?? ?

×? ??? ? ? ??? ?? ×? ???? ? ??? ?? ×? ??? ? · ? ?

(5)

3.2 Analysis of the Startup Algorithm
In this section, we analyze the startup algorithm in the failure-free case. 3.2.1 Collision When at least two nodes start to send on the communica?, which is given by tion medium within an interval ? the communication medium’s propagation delay, the frames collide, which is realized as noise at the receivers. In the startup algorithm presented in Section 3.1, there exists just one class of scenarios for collision, which depends on the nodes’ power-up sequence. We will show that in the fault-free case there can be at most one collision. We assume that the drift rates of the clocks can be neglected (the circumstances where this assumption holds will be described in Section 3.2.2). Furthermore, we assume that a node cannot send and receive concurrently. While this is technically feasible, it imposes additional cost to the bus driver and/or cabling depending on the physical architecture.

whereas ? ? and ? ? , ? equals the number of nodes of the system, and ? represents the maximum propagation delay of the communication medium. We refer to the nodes of the class ? ?? ×? ?? as Cold Start Nodes. In scenarios in which ? ?? ×? ?? ? no collision occurs, because in the interval ? ??? ?? ×? ??? ? ? ??? ?? ×? ??? ? · ? ? only one node sends a Cold Start Frame and this ensures that no other node sends before it has received the sent Cold Start Frame. ?, the sent Cold In scenarios in which ? ?? ×? ?? Start Frames collide. The sending nodes do not recognize that there was a collision, for they cannot listen on the communication medium while sending a frame. They proceed their transmission pattern for one TDMA round. The listening nodes receive noise due to the collision and restart their Listen Timeouts. After one TDMA round the Cold Start Nodes wait for their Cold Start Timeouts to expire while they are listening on the communication medium. The node with the shortest Startup Timeout is then allowed to send its second Cold Start Frame. All other Cold Start Nodes, which still proceed their Startup Timeouts, receive the sent Cold Start Frame, interpret it as a sign that a collision occurred, and fall back into Listen State. Every node that is in Listen State at this time transits into Active State. We see that in the collision scenarios the node with the shortest Startup Timeout wins. The following ?gures depict the timing of nodes during startup. The start of a transmission of a node is symbolized by a thick line, collisions are symbolized by a ?ash-sign.
Listen Timeout n1 1
n1:
Cold Start Timout n1

2

3

0

1

2

3

0

1

2

3

0

n2:

2

3

0

1

2

3

0

1

2

3

0

1

Listen Timeout n2

Cold Start Timout n2

Real Time

t1

t2

Figure 5. Collision of frames of node 1 and node 2 in a system of 4 nodes Figure 5 present a collision scenario of a system of nodes. We expect that node ? is started ?rst and after an ×? ×? interval of ? ???? ? ? ???? node ? is started. As one can see, these two nodes collide at point ?? . Node ?, which has

the shorter Startup Timeout, is the ?rst one to send its next Cold Start Frame at ?? and forces node ? back into Listen State. 3.2.2 Clock Drifts The subject of this section is to show that clock drifts can cause additional collisions if they are more than a speci?c bounding value. In order to determine this bounding value, we will derive the maximum duration of a TDMA round as a function of the maximum clock drift rate, the minimum slot duration, and the maximum propagation delay of the communication medium. Figure 6 depicts a collision scenario caused by clock drifts. In contrast to Section 3.2.1, we expect the clocks of node ? and node ? to drift. We can also see a reference clock with drift rate ?. According to the reference clock, node ? would send its Cold Start Frame at ?? and node ? would send its Cold Start Frame at ?? . Node ?, however, has a slower clock than the reference clock and, thus, sends its Cold Start Frame at ? . Node ?, however, has a faster clock than the reference clock and, thus, ends its Cold Start Timeout at ? , and also sends a Cold Start Frame on the communication medium. As we can see, we can construct a scenario in which clock drifts can cause additional collisions.
Listen Timeout n2 2
drift of node 2 = +ρmax

?

Node (? ? ?) and node (? ? ?) collide, because they have the longest Cold Start Timeouts and, thus, the clock drifts are maximal.

? ?) sends its second Cold Start Frame ? ? ×??? ? ? ? after the collision, node (? ? ?) sends its ×??? second Cold Start Frame ? ? ???? ? ? ? ? ? after the
????
Node (?

collision. So the sum of the clock drifts of both nodes must ×??? be lower than the duration of the minimum slot, ? ? , minus a constant ? , which accounts for the propagation delay of the communication medium.

×??? ?? ??

(6) Algebraic transformations lead to an upper bound for the duration of one TDMA round. Given the maximum clock ×??? drift rate ? ? , the minimum slot duration ? ? , and the maximum propagation delay of the communication medium ? the maximum duration of one TDMA round ???? is given by:

? ? ? ?? ? ???? ? ? ? ?? ? ????

×??? ? ? ?? · ×??? ? ? ? ? ??

????

?·?

? ? ? ×??? ? ? ? ? ? ??? ? ? ? ??

(7)

Cold Start Timeout n2

3

0

1

0

1

Startup Timeout 2

3
drift of node 3 = ?ρmax

0

1

2

0

1

2

Startup Timeout 3 Cold Start Timeout n3

Listen Timeout n3

drift of reference clock = 0

TDMA round

0

1

2

3

t1
drift2

t2 td t3

real time

Example: In the current prototype implementation of TTP/C [7], a communication protocol for synchronous systems, the minimal frame length is ? bits. If we assume a bit rate of ? Mbit/s, the transmission phase of such a frame equals ? ×, and take a duration of ??? × for protocol execution, when no frames are sent (called inter-frame gap), ×??? the smallest slot duration ? ? is ?? ×. If we further assume typical communication medium parameters, i.e., a maximum length of the communication medium of ???? and a signal transmission speed of ? ? ?? ? , the propaga× tion delay of the communication medium ? is ? ×. As maximum drift rate ? ? we choose ??? × ×, because common quartz oscillators used in communication systems provide a drift rate in the order of ??? × ×. Using this assumptions and Equation 7, we get an upper × for one TDMA round ???? . Since, this bound of ? ? upper bound is larger than TDMA round durations for common applications, we will neglect clock drifts for the further analysis. Moreover, we can assume that clock drifts will not cause collisions. 3.2.3 Worst-Case Startup Time We de?ne the duration of the startup as the time after at least two correct nodes are powered up until at least two correct nodes operate synchronously. In the fault-free case, the worst-case duration of the startup (WCSUP) occurs if: ? only the two nodes with the longest Listen and Cold Start Timeouts are powered up ?rst and no other nodes are powered up until these two nodes operate synchronously, and ? those nodes send their Cold Start Frames at approximately the same time, so that the sent frames collide (as described in Section 3.2.1). ?? So, we have two correct nodes powered up ?×??? before ?? ×? ?? after the the collision. The collision occurs at ?? . ? ??

drift3

Figure 6. Drift of node 2 and node 3 leads to collisions We will now analyze a system of N nodes: ×??? Given a minimal slot duration of ? ? in the TDMA ???? and a maximal clock drift rate round with duration ? ? for all nodes, we are able to calculate an upper bound for ???? . The requirements for the worst case are:

? ?

Node (? ? ?) and node (? ? ?) have slots of minimal duration. The clock drift rates of the nodes must be maximal, ? ?.

collision, node (N-2) sends it second Cold Start Frame on the medium, forcing node (N-1) into Listen State. Node (N ?? 2) sends the next Cold Start Frame at ?? ?? ·? ? ? ??×? ?? . At ?? , node (N-1) is in Listen State. Consequently, the nodes (N-1) and (N-2) start operating synchronously at ?? , and the worst-case duration for startup in the fault-free case is given by:
? ?? ?

???? ?

? ×? ? ?? ×? ?? ? ?? · ? ? ? ??

(8)

If the guardian is equipped with enough intelligence, i.e., it can decide if a node is faulty after the ?rst transmission of a faulty node and block further transmissions, the WCSUP is not prolonged by a faulty node. The worst-case scenario in the case of a present faulty node is as follows: Nodes (N-1) and (N-2) are in Listen State and would change into Cold Start State at ?? . How? ever, a faulty node sends a faulty frame a suf?cient amount of time, ( is marginally larger than ? de?ned in Section 3.2.2), before ?? . Nodes (N-1) and (N-2) receive this ? frame, which prevents them from transiting into Cold Start State. Consequently, both correct nodes restart their Listen Timeouts. Since we bene?t of an intelligent guardian, the faulty node can be prevented form sending on the medium after ?? . Node (N-2) sends the next Cold Start Frame at ?? ? ? ? (which is ?×??? after ?? ), node (N-1) receives this frame ? ? and starts operating synchronously to node (N-2) at ?? . ? Thus, the WCSUP in presence of a faulty node is:
? ?? ?

can disturb the timeout pattern. In the value domain, there can be errors in the syntax or the semantic of frames. Syntactically incorrect frames are of no consequence, because they are detectable by a guarding instance that calculates the checksum, and may only lead to restart of timeouts. Semantically incorrect frames have more impact on the startup process. Furthermore, outgoing link failures of one node can prolong the startup phase. After careful consideration, we see that non-trivial failures during the startup phase can be restricted to the following scenarios. The consequences of the transmission of syntactically correct but semantically incorrect frames are considered. The problem of a deaf node, i.e., a node that wont receive any frames, is explained. Furthermore we will show how the violation of the startup timeouts will affect the startup process.

4.1 Violation of the Node’s Startup Timeout
In the presented startup algorithm, there can be at most one collision, because of the unique Startup Timeouts. The question is: what can happen, if one of these timeouts are incorrect, we say the timeouts are violated? As we see in Figure 7, node ? violates its Startup Timeout in a way that it causes additional collisions. Because the Startup Timeout of node ? is not unique anymore, the collision cannot be solved and results in another collision, which cannot be solved either. This leads to continuous collisions. To handle this type of failure, we present two possible solutions: The ?rst solution is the introduction of a counter. Each time a node is sending a Cold Start Frame it increases the counter. However, each node is only allowed to send if the counter is less than an a priori speci?ed value, which should be unique per node, and which restricts the number of retries of sending Cold Start Frames per node. This solution only prohibits continuous collisions if the node adheres to the rule of restricted number of sends. If a failure that leads to the incorrect timeout value does also effect the correct protocol execution, this solution will not be effective.

???? ??

?

? ? ?×??? ?

(9)

which is less then ? ?? ? ???? ? . This astonishing result can be explained as follows: The sending of a frame by a faulty node is a synchronization event for all correct nodes. From this time on, all nodes have synchronized their start of the timeouts and, consequently, no further collisions can occur.

4 Failure Scenarios in the Startup Phase
In this section, we focus on all failures that can occur during startup and possibly lead to more than ? collision. We will present solutions that show how those failures can be handled and that assure a startup in a bounded duration. For this analysis, we use a system of nodes without restricting the analysis. The failure scenarios in a system of N nodes can be mapped onto a system of three nodes, since there may occur only one single failure during startup, according to the fault hypothesis, and regardless how many nodes there are in startup, there can be only one Cold Start Node on which other nodes can synchronize. The number of nodes that actually synchronize is of no relevance for the analysis. We assume that the guarding instance is a passive component [11] and is not able to send itself. Furthermore we assume that a failure in one of the guarding instances can be compensated by its replicas. Possible failures during startup may occur in the value and in the time domain. In the time domain, a faulty node

Listen Timeout n1 1
n1:

Cold Start Timout n1

2

3

0

1

2

3

0

1

2

3

0

1

n2:

2

3

0

1

2

3

0

1

2

3

0

1

2

Bad Listen Timeout n2

Bad Cold Start Timeout

Real Time

t1

t2

Figure 7. Violation of the Startup Timeout of Node 2 The second solution is the introduction of a component that protects the communication medium from nodes that violate their timeouts. As we use a guardian for the communication medium during normal operation to protect the

nodes from sending outside of their slot, it is straight forward to use this instance for controlling the Startup Timeouts. In order to offer this extra service, the guardian must have knowledge of the node states and the current global time. While the nodes state can be inferred by analyzing the sent and received frames on the communication medium, the global time of the guardian has to be synchronized with the Cold Start Node at power-up time. If synchronization is not done, the ?rst collision caused by an incorrect timeout cannot be handled, but every following attempt of node ? to send a Cold Start Frame too early would be terminated by the guardian.

where a frame stems from. Since the node ID is used to determine the slot position at startup, this may lead to additional collisions that may violate the bounded duration for startup. We propose a solution to this problem at the end of the section. 4.3.1 Possible Shutdown of Mislead Node
Listen Timeout n0 0
n0:

1

2

3

0

1

2

3

0

1

2

3

n2:

2

3

0

1

2

3

0

1

2

3

0

4.2 A Deaf Node
n2 in Listen State

As the Cold Start Node is acknowleged by a node which received the sent Cold Start Frame, we describe a situation where the Cold Start Node is deaf (i.e., can send but cannot receive frames), and therefore cannot be acknowleged. A deaf Cold Start Node is not able to proceed into Active State and thus would continously send Cold Start Frames. A node that has received a Cold Start Frame from this node while it is in Listen State would expect to be synchronous with the Cold Start Node and proceed its transmission schedule. The Cold Start Node, however, will not be able to receive the acknowledgement and would again send after its Cold Start Timeout. In Figure 8, node ?, which is the Cold Start Node in this situation, is deaf. In the scenario, the deaf node will cause a collision by sending outside of its slot. A solution for this kind of failure is the use of a guardian as described in Section 4.1, i.e., a guardian that is able to analyze sent and received frames. The guardian is not deaf, and thus knows when the Cold Start Node has been acknowledged. So it is able to protect the communication medium from a faulty Cold Start Node to send other Cold Start Frames after at least one node has synchronized using a Cold Start frame.
Listen Timeout n1
1 2 3 0 1 2 3 0 1
Cold Start Timout n1

Real Time

Figure 9. Possible shutdown of mislead node As depicted in Figure 4.3.1, node ? sends a Cold Start Frame with a wrong node ID, i.e., a node ID differing from ?, say ?. We assume that node 1 is not powered up yet, but node ? is. Node ? receives the faulty Cold Start Frame and assumes that this Cold Start Frame was sent by node ? and starts to proceed the transmission schedule at the position that equals the node ID sent in the faulty Cold Start Frame. If by chance the transmission pattern is built up like in Figure 4.3.1, i.e., all slots are of equal length, node ? would send a frame, which would force node ? back into Listen State. If more than one node have synchronized on the faulty Cold Start Frame, the nodes will continue synchronous operation. The faulty node can integrate on a frame that it receives from the other nodes. If, however, no other node has synchronized on the faulty Cold Start Frame, the correct node, node ?, will not get any frame as acknowledgement for one TDMA round (but will receive another Cold Start Frame that it interprets as incorrect frame). The consequence is a shutdown and restart of the mislead node after one TDMA round. 4.3.2 Masquerading Node and Mislead Node Send Same Node ID

t 1

t 2

t 3

2

3

0

n1:

n2:
2 3 0 1 2 3 0 1 1 2 3 0 1

Listen Timeout n1 1
n1:

n2 in Listen State
2 3 0 1 2 3 0 1 2 3 0

Real Time
t1

t2

t3

n2:

2

3

0

1

2

3

0

2

3

0

1

2

Figure 8. Cold Start Node is deaf

n2 in Listen State

4.3 Masquerading of a Node
Real Time

In this section, we will study the behavior of the system, if the Cold Start Node sends a wrong node ID, i.e., a node ID differing from its assigned node ID, in its Cold Start Frame. We call a node that sends a wrong node ID a masquerading node and a node that receives a frame with a wrong node ID a mislead node. Please note that in the startup phase a node cannot judge whether a received frame contains a wrong node ID, because a node does not know

t1

t 2

Figure 10. Masquerading node and mislead node send same node ID Figure 10 depicts a scenario in which node ? is the Cold Start Node and sends a faulty Cold Start Frame with node ID ?. Node ?, which receives this frame containing an ID equal to its own ID, has to wait for one TDMA round before it is allowed to send.

If no other node was in Listen State while node ? sent the Cold Start Frame, node ? is the next node to send a frame at ?? , because node ? has to wait for its Cold Start Timeout to expire. Depending on the chosen implementation of the architecture the faulty Cold Start Node can be forced back into Listen State or it is allowed to synchronize on the sent frame of node 2. We discuss the scenario in which the Startup Timeout of the Cold Start Node equals ? in Section 4.3.3. 4.3.3 Collisions Caused by Masquerading
Listen Timeout n0 0
n0:

used for a future design and implementation of a guardian for the time-division multiple-access protocol TTP/C.

References
[1] G. Bauer, H. Kopetz, and P. Puschner. Assumption coverage under different failure modes in the Time-Triggered Architecture. 8th IEEE Int. Conf. on Emerging Technologies and Factory Automation, Antibes Juan-les-pins, France, pages 333–341, Oct. 2001. [2] T. Chandra, V. Hadzilacos, S. Toueg, and B. Charron-Bost. On the impossibility of group membership. In Proc. of the 15th Annual ACM Symp. on Principles of Distributed Computing, pages 322–330, New York, USA, May 1996. [3] B. Charron-Bost, R. Guerraoui, and A. Schiper. Synchronous system and perfect failure detector: Solvability and ef?ciency issues. In Proc. of Int. Conf. on Dependable Systems and Networks, pages 5–13. IEEE, June 2000. [4] M. Fischer, N. Lynch, and M. Paterson. Impossibility of distributed consensus with one faulty processor. Journal of the ACM, 32(2):374–382, 1985. [5] V. Hadzilacos and S. Toueg. Distributed Systems. AddisonWesley, 1993. [6] C. Jones, M.-O. Killijian, H. Kopetz, E. Marsden, N. Moffat, M. Paulitsch, D. Powell, B. Randell, A. Romanovsky, and R. Stroud. Revised version of DSoS conceptual model. Proj. Deliv. DSoS/Research Report 35/2001, Techn. Univ. Wien, Inst. f. Techn. Informatik, Vienna, Austria, 2001. [7] H. Kopetz. TTP/C Protocol – Version 0.5. TTTech Computertechnik AG, Vienna, Austria, July 1999. Available at http://www.ttpforum.org. [8] H. Kopetz and G. Bauer. Transparent redundancy in the Time-Triggered Architecture. In Proc. of Int. Conf. on Dependable Systems and Networks, pages 5–13, June 2000. [9] H. Kopetz, G. Bauer, and S. Poledna. Tolerating arbitrary node failures in the Time-Triggered Architecture. SAE 2001 World Congress, Detroit, MI, USA, Mar. 2001. [10] H. L¨ nn. Initial synchronization of TDMA communication o in distributed real-time systems. In 19th IEEE Int. Conf. on Distributed Computing Systems, pages 370–379, Gothenburg, Sweden, 1999. [11] J. Rushby. Bus architectures for safety-critical systems. In Proc. of EMSOFT, Lake Tahoe, CA, USA, Oct. 2001. Springer Verlag. LNCS Vol. 2211. [12] J. Rushby. Partitioning in the Time-Triggered Architecture. Deliv. 3 for SRI proj. 11003, CS Lab., SRI Int., Menlo Park, CA, USA, Mar. 2001. [13] W. Steiner. Startup algorithm of TTP/C: Analysis and simulation. Master’s thesis, Inst. f. Techn. Informatik, Techn. Univ. Wien, Vienna, Austria, Apr. 2001. [14] W. Steiner and M. Paulitsch. The transition from asynchronous to synchronous system operation: An approach for distributed fault-tolerant systems (including simulation). Research Report 26/2001, Techn. Univ. Wien, Inst. f. Techn. Informatik, Vienna, Austria, 2001.

1

2

3

0

1

2

3

0

1

2

3

n2:

2

3

0

1

2

3

0

2

3

0

1

n2 in Listen State

Real Time

t1

t2

Figure 11. Collisions caused by masquerading We describe a startup scenario in Figure 4.3.3 in which a faulty Cold Start Frame causes a collision. As we see node ? sends a Cold Start Frame at ?? with a faulty node ID, ?. Node ? receives this Cold Start Frame containing an ID equal to its own ID and synchronizes on it. Thus, node ? expects itself to be synchronous with the Cold Start Node but in fact it is asynchronous to it. At ?? node ? sends its next Cold Start Frame and node ? send a frame according to its transmission pattern. This leads to a collision that cannot be solved without additional mechanisms, because the two nodes are not aware of the collision and will collide after one TDMA round again. 4.3.4 Solution to Masquerading We can use the guardian to analyze the Cold Start Frames. Because the guardian sits in the star center and knows on which incoming line which node sends, the guardian can block Cold Start Frames with wrong node IDs from being sent on the communication medium to other nodes.

5 Conclusions
We have presented a fault-tolerant synchronous architecture and an algorithm that ensures the transition from asynchronous to synchronous operation within a bounded duration at startup. We have also analyzed this algorithm for failure cases that violate the transition within a bounded duration. Furthermore, we have presented possible solutions that enable the transition in a bounded duration for implementation even in case of failures. Our analysis has shown that with the help of a special device, called guardian, the synchronous system will operate synchronously within a bounded duration even in single failures cases. Furthermore, simulations validated the presented concepts and solutions [13, 14]. The results of this paper are


赞助商链接
相关文章:
更多相关标签: