


Journal of Parallel and Distributed Computing 61, 581-608 (2001) doi:10.1006/jpdc.2001.1708, available online at http://www.idealibrary.com on

Resource Management in Dataflow-Based Multithreaded Execution 1
Lucas Roh, Bhanu Shankar, Wim Bohm, and Walid Najjar
Department of Computer Science, Colorado State University, Fort Collins, Colorado 80523 Received March 1, 1997; revised September 6, 2000; accepted November 7, 2000

Due to the large amount of potential parallelism, resource management is a critical issue in multithreaded execution. The challenge in code generation is to control the parallelism without reducing the machine's ability to exploit it. Controlled parallelism reduces idle time, communication, and delay caused by synchronization. At the same time it increases the potential for exploitation of program data structure locality. In this paper, we evaluate the performance of methods to control program parallelism and resource usage in the context of the fine-grain dataflow execution model. The methods are in themselves not new, but their performance analysis is. The two methods to control parallelism here are slicing and chunking. We present the methods and their compilation strategy and evaluate their effectiveness in terms of run time and matching store occupancy. Communication is categorized in memory, loop, call, and expression communication. Input and output message locality is measured. Two techniques to reduce communication are introduced. Grouping allocates loop and function bodies on one processor and bundling combines messages with the same sender and receiver into one. Their effects on the total communication volume are quantified. © 2001 Academic Press

Key Words: multithreaded architectures; code generation; quantitative evaluation; control of parallelism.

1. INTRODUCTION

Fine-grain multithreading attempts to exploit instruction-level locality implicit in the von Neumann model as well as the latency tolerance and fast synchronizations of the dataflow model. It tolerates latency by rapidly switching among a set of ready threads, thus improving the processor utilization. Both interprocessor communication and remote data access latencies can be masked, and therefore multithreading is especially suitable as an execution model for massively parallel processors. Current fine-grain multithreading models lie on various points along the von Neumann-dataflow design spectrum. As designs move closer to the von Neumann world, thread
1 This work is supported in part by NSF Grant MIP-9113268 and by DARPA Contract DABT63-950093. E-mail: {roh, bohm}@cs.colostate.edu, najjar@cs.colostate.edu.




size tends to become larger and data structure locality can be better exploited. Examples of these designs include HEP [1], Tera [2], J-Machine [4], and M-Machine [5]. As designs move closer to dataflow, latencies are better tolerated and parallelism is more easily exploited. Examples are Monsoon [6], *T [7], EM-4 [8], and the EARTH project [9]. There also exist software abstractions of fine-grain multithreading, as exemplified by TAM [10], that can be implemented on traditional multiprocessors such as the CM-5. In multithreading models with a strong dataflow heritage, the code generator typically produces small thread grains that require frequent synchronizations, coupled with explosive parallelism. At any given time there may be many threads either sending or waiting to receive messages, or waiting for the execution unit to become available, all of which occupy some resources. Hence, the management of machine resources such as memory and the communication network becomes more important. In many cases, the management of resources takes advantage of different forms of locality. In this paper, we quantitatively evaluate a set of compiler optimization techniques, namely slicing, chunking, grouping, and bundling, that attempt to address the above issues. Slicing mainly attacks the problem of controlling parallelism. Chunking mainly attempts to exploit data locality, similar to the traditional vectorization and loop unrolling methods. Simulation results are presented that compare the effectiveness of these techniques against code with unrestrained parallelism. The results indicate that the chunking method helps reduce the execution time and also shows an appreciable decrease in the utilization of the synchronization unit. The slicing method shows lower average and maximum matching store occupancies at the expense of increased execution time. By combining both techniques, it is possible to balance speedup with resource utilization. Grouping reduces the amount of internode communication by allocating function or loop bodies on one processing node at the expense of possible load imbalance. Bundling combines tokens with the same sender thread and the same receiver thread into one message. For our experiments, tokens are classified into Memory, Call, Loop, and Expression tokens. The results show that grouping does not significantly reduce parallelism nor lead to poor load balancing, but eliminates most of the Expression tokens from the network traffic. On the other hand, bundling reduces the number of Call and Loop tokens. Together, the average reduction is about 80%. In the Monsoon compiler study we use the simpler Livermore Loops [11]. In the other experiments, we use a set of larger Sisal benchmarks with sizes ranging from 500 to 2700 lines of source code.

• AMR is an unsplit integrator taken from an adaptive mesh refinement code at Lawrence Livermore National Laboratory.
• BMK11A is a particle transport code developed to evaluate Cray Computer systems at Los Alamos National Laboratory.
• FFT is a one-dimensional Fast Fourier Transform code.
• HILBERT computes the condition number for Hilbert matrix coefficients. It uses Linpack routines.


• PSA is a parallel scheduler code using a variation of simulated annealing to solve the problem.
• SDD solves an elliptic partial differential equation using the Symmetric Domain Decomposition method.
• SGA is a genetic algorithm program finding a local minimum of a bowl-shaped function, developed at Colorado State University.
• SIMPLE is a Lagrangian 2-D hydrodynamics code that simulates the behavior of a fluid in a sphere, developed at Lawrence Livermore National Laboratory.
• WEATHER is a one-level barotropic weather prediction code and was originally developed at the Royal Melbourne Institute of Technology.

The rest of this paper is organized as follows. In Section 2 we describe the execution model, including a basic processor model. In Section 3 we briefly summarize threaded code generation. Section 4 briefly describes the compiler transformations for the control of parallelism and reports on their effects on performance. This section contains an initial study of Monsoon code generation. Section 5 discusses communication reduction techniques and their effect on performance. Related work is discussed in Section 6. Concluding remarks are given in Section 7. Earlier versions of these results have been presented in [12] and [13].

2. EXECUTION MODEL

The multithreaded execution model used in this study is based on dynamic dataflow scheduling, where each actor represents a sequentially executing thread. A thread is a statically determined sequence of RISC-style instructions operating on registers. Threads are dynamically scheduled to execute, based upon the availability of data. Once a thread starts executing, it runs to completion without blocking and with a bounded execution time. The bounded execution time implies that each instruction must have a fixed execution time and cannot incur latency inside a thread. Therefore, latency-incurring instructions have their consumers in other threads. Register values do not live across threads. Inputs to a thread comprise all the data values required to execute the thread to its completion. A thread is enabled to execute only when all the inputs to the thread are available. Multiple instances of a thread can be enabled at the same time and are distinguished from each other by a unique ``color.'' The thread enabling condition is detected by the matching/synchronization mechanism which matches inputs to a particular instance of a thread. Data values are carried by tokens. Each token consists of a continuation, an input port number to the thread, and one or more data values. A continuation uniquely identifies an activation of a single thread and consists of a color and a pointer to the start of the thread. A unique color is generated for each activation of a code block such as a function or a loop. Data structures, such as arrays and records, are stored in a logically shared structure store. Results of thread execution are either written to the structure store or directly sent to their destination thread(s). A given thread activation can be executed on any processor. Since each thread is relatively small (10 to 30 instructions), global (dynamic)


FIG. 1. Abstract model of a processing node.

scheduling and near perfect load balancing are achieved by a simple hashing of the continuation. The logical structure of the processor model is presented in Fig. 1. The local memory of each node consists of an Instruction Memory which is read by the Execution Unit and a Data Memory which is accessed by the Synchronization Unit and the Execution Unit. Inputs to a thread are stored in the Matching Store; when all inputs have arrived, the corresponding thread is enabled. The Ready Queue contains the continuations representing enabled threads. There may be different contexts of the same thread that may be enabled at any given time, either on the same node or on different nodes. The Structure Memory may be either distributed among the nodes or among dedicated memory modules arranged in a dancehall configuration. The MemUnit handles the structure memory requests. The following machine configuration is simulated using a cycle-level, discrete-event simulator. The machine has 10 processing nodes, each with a 4-way issue superscalar CPU with the instruction latencies of the Motorola 88110 and the synchronization latencies of the EM-4: a pipelined synchronization unit with a throughput of one synchronization per cycle and a latency of three cycles on the first input. Problem sizes in our benchmarks are chosen to give reasonable simulation times and realistic processor utilization. All internode communications take 50 CPU cycles in network transit time. Every structure memory read takes the minimum of two network transits (one to send the request and another to send the reply). We assume that


all structure store reads and writes go through the interconnection network. Obviously, some of these messages can be made local by a judicious allocation of data structures. This, however, requires extensive static analysis in the compiler, which is beyond the scope of this paper. The size of the matching store is unlimited and can therefore handle any amount of parallelism.
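The matching mechanism just described can be summarized in a small illustrative model. The following Python sketch is for exposition only; the MatchingStore class, its field names, and the example thread are assumptions, not part of the simulator used in the paper.

```python
from collections import deque

class MatchingStore:
    """Toy matching/synchronization unit: tokens carry a continuation
    (color, thread pointer), a port number, and a value; a thread instance
    is enabled once tokens have arrived on all of its input ports."""

    def __init__(self, thread_arity):
        self.thread_arity = thread_arity   # thread pointer -> number of input ports
        self.partial = {}                  # continuation -> {port: value}
        self.ready_queue = deque()         # enabled thread activations

    def receive(self, continuation, port, value):
        color, thread_ptr = continuation   # color identifies the activation
        slot = self.partial.setdefault(continuation, {})
        slot[port] = value
        if len(slot) == self.thread_arity[thread_ptr]:   # all inputs matched
            self.ready_queue.append((continuation, self.partial.pop(continuation)))

# A two-input thread "add" activated under color 7 becomes ready
# only after both of its input tokens have arrived.
ms = MatchingStore({"add": 2})
ms.receive((7, "add"), 0, 3)
ms.receive((7, "add"), 1, 4)
print(ms.ready_queue.popleft())   # ((7, 'add'), {0: 3, 1: 4})
```

As in the execution model above, a thread activation is identified by its continuation and is moved to the ready queue only when its last input arrives.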

3. CODE GENERATION

Programs are represented in a dataflow graph form called MIDC [14]. Each node of the graph represents a thread of straight-line, von Neumann type instructions. Edges represent data paths along which tokens travel. In addition to the nodes and edges, there are pragmas and other specifiers to encode information (e.g., program-level constructs) that may be helpful to postprocessors and program loaders. Code generation is guided by the following objectives: minimize synchronization overhead, maximize intrathread locality, assure nonblocking (and deadlock-free) threads, and preserve functional and loop parallelism in programs. The first two objectives call for very large threads that maximize the locality within a thread and decrease the synchronization overhead. The thread size, however, is limited by the last two objectives. In fact, it was reported in [15] that blind efforts to increase the thread size, even when they satisfy the nonblocking and parallelism objectives, can result in a decrease in overall performance. Larger threads tend to have larger numbers of inputs and can result in a larger input latency, defined as the time delay between the arrival of the first token to a thread instance and that of the last token, at which time the thread can start executing [16]. Our nonblocking threads are generated from Sisal programs. Sisal [17] is a pure, first-order, functional programming language with loops and arrays. Sisal programs are initially compiled into a functional, block-structured, acyclic, data dependence graph form, IF1 [18]. The functional semantics of IF1 prohibits the expression of copy-avoiding optimizations. This causes new data structures to be defined and their elements copied even when a single data element is modified, which leads to a large amount of code just to copy data elements from one physical location to another even when it is unnecessary to do so. An extension of IF1, called IF2 [19], allows operations that explicitly allocate and manipulate memory in a machine-independent manner through the use of buffers. A buffer is comprised of a buffer pointer into a contiguous block of memory and an element descriptor that defines the constituent type. All scalar values are operated on by value and therefore copied to wherever they are needed. On the other hand, all of the fanout edges of a structured type are assumed to reference the same buffer; that is, each edge is not assumed to represent a distinct copy of the data. IF2 edges are decorated with pragmas to indicate when an operation such as ``update-in-place'' can be done safely, which dramatically improves the run-time performance of the system. A top-down cluster generation process transforms IF2 into MIDC [20]. This phase breaks up the complex block-structured IF2 graphs so that threads can be


generated and wires the threads together. Initial reduction values are generated in the appropriate threads. Threads terminate at control graph interfaces for loops and conditionals, and at nodes, such as memory accesses, for which the execution time is not statically determinable, in order to satisfy the deterministic execution time objective. Threads do not cross function or loop boundaries and therefore useful forms of parallelism are preserved. Although it is not strictly necessary to have threads bounded by branches at this stage, doing so provides more flexibility in later stages. The generated MIDC code is further optimized via a bottom-up stage [21] at both the intrathread and interthread levels. Intrathread optimizations consist of traditional optimizations including dead code elimination, constant folding/copy propagation, redundant instruction elimination, and instruction scheduling to exploit the instruction-level parallelism. Global optimizations include global versions of the above optimizations as well as redundant edge elimination and merge operations that attempt to create larger threads by combining neighboring threads. The merging of threads also takes place across branch instructions. The benchmark codes used in our experiments have thread sizes ranging from 10 to 30 MIDC instructions. There are two types of loops in SISAL. Iterative loops have loop-carried dependencies and termination tests. Parallel loops have data-independent loop bodies and known loop counts. Only the parallel loops are considered for parallelization and vectorization. Since most parallel loops deal with arrays, it is instructive to know the layout of these data structures in memory. Figure 2 shows the layout of an array data structure containing an array descriptor and the data elements. SISAL arrays can start at any lower bound and can be of a variable size, and this information is encoded in the array descriptor. In order to reduce copy operations (e.g., when concatenating arrays), additional memory may be allocated on either side of the array data elements and several arrays can therefore be ``built in place.'' This requires an ``offset'' value to specify where the logical array starts. Thus, the start of the array is given by adding the offset value to the data pointer. All elements of that array are indexed off this resultant start address. With this layout, two memory latencies are required in order to fetch a single array element.

FIG. 2. Layout of arrays in MIDC.
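To make the addressing concrete, here is a minimal sketch of the element-address computation implied by Fig. 2. It is hypothetical illustration code; the field names data_ptr, offset, and lower_bound are assumptions, not taken from MIDC.

```python
def element_address(descriptor, i):
    """Address of logical element i under the Fig. 2 layout (field names assumed)."""
    start = descriptor["data_ptr"] + descriptor["offset"]   # where the logical array begins
    return start + (i - descriptor["lower_bound"])          # index off the start address

# An array built in place with two spare cells before its first logical element
# (lower bound 1) and a data block starting at address 1000:
desc = {"data_ptr": 1000, "offset": 2, "lower_bound": 1}
print(element_address(desc, 1))   # 1002: address of the first logical element
```

The two memory latencies mentioned above correspond to one access for the descriptor and a second access for the element itself.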


4. CONTROL OF PARALLELISM

4.1. Loop Chunking

Loop chunking stripmines a loop into a doubly nested loop, where the inner loop consists of fixed-sized, consecutive chunks of loop bodies. This optimization is much like vectorization or loop unrolling (see Fig. 3). For the loop to be chunkable, the loop bodies must access consecutive array elements. For a loop of iteration space n and a machine chunk size of c, the number of workers is ⌊n/c⌋ + 1 if n mod c ≠ 0, or n/c otherwise. The code dealing with the irregularly sized chunk is executed only if such a chunk exists. A split-phase FetchChunk operator is used to fetch a chunk of data from structure memory. The semantics of this operation is defined as follows: memory is reserved in the target processor's data memory to hold the chunk. SISAL is a strict language; hence, all the data elements of an array will be available when the fetch occurs.

4.2. Loop Slicing

Loop slicing distributes a parallel loop over a fixed number of worker processors (see Fig. 4). Slicing reduces the resource load on the system in terms of the number of colors or activations required, with each slice taking one color rather than each iteration. All workers perform at least ⌊n/k⌋ work (n is the size of the iteration space, k is the number of workers), and n mod k workers perform an additional iteration of work. It should be noted that any parallel loop can be sliced.
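The index arithmetic behind both transformations can be sketched compactly. The following Python fragment is illustrative only (the compiler emits MIDC, not Python) and uses the worker-count formulas given above.

```python
def chunks(n, c):
    """Yield the (start, end) ranges of consecutive chunks of size c over n iterations."""
    for start in range(0, n, c):
        yield start, min(start + c, n)         # the last chunk may be irregularly sized

def slices(n, k):
    """Yield the block of iterations assigned to each of k workers."""
    base, extra = divmod(n, k)                 # every worker gets at least n // k iterations
    start = 0
    for w in range(k):
        size = base + (1 if w < extra else 0)  # the first n mod k workers get one extra
        yield start, start + size
        start += size

print(list(chunks(10, 4)))   # [(0, 4), (4, 8), (8, 10)]  ->  floor(10/4) + 1 = 3 workers
print(list(slices(10, 4)))   # [(0, 3), (3, 6), (6, 8), (8, 10)]
```

Chunking fixes the size of each piece and lets the number of workers grow with n, whereas slicing fixes the number of workers and lets the size of each piece grow with n.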

FIG. 3. Chunk control in loops.


FIG. 4. Slice control in loops.

Due to the semantics of SISAL there is no potential deadlock in the sliced code. This loop execution scheme is similar to K-bounding in Id [22]. The difference is that in slicing, the loops are block distributed, whereas in K-bounding the loops are cyclically distributed. The cyclic distribution is necessary for Id, because the loops can have loop-carried dependencies via nextified variables. Each portion of the iteration space assigned to workers is executed in a sequential fashion. Inputs to all iterations of the iteration space are equal except for the index value. The index value port is identified and is updated at each execution of the loop body. The reduction required in the parent loop is also divided over the iteration spaces, reducing the amount of serial reduction. Reduction operators in SISAL (sum, product, max, and min) are commutative and associative. Thus, they can be reduced in any order.

4.3. Performance Evaluation of Slicing and Chunking

We evaluate the dynamic properties of our code before and after applying various combinations of slicing and chunking. The characteristics of the benchmark programs we use in this section, in terms of the number of parallel loops and chunkable loops, are given in Table 1. We note that SDD has the lowest percentage of chunkable parallel loops, around 29%, and FFT has the highest percentage, with 79%. We compare four combinations of slicing and chunking: R, unconstrained parallel threaded code, used as the base case for all comparisons; C, loops chunked; S, loops sliced; and CS, loops chunked and other loops sliced. The following measures are used to evaluate performance:


TABLE 1 Program Characteristics of the Benchmarks
Program     Problem size [time steps]    Parallel loops    Chunkable loops
AMR         4 x 80 x 40 [4]              87                34
FFT         2^15                         13                11
HILBERT     100 x 100                    65                30
SDD         26                           80                23
SIMPLE      100 x 100 [5]                78                42
WEATHER     840 km [5]                   81                29

• Time: the number of cycles taken by the program to execute.
• O_avg: the average occupancy of the matching store in terms of the number of threads that are waiting for inputs.
• O_max: the maximum occupancy of the matching store in terms of the number of threads that are waiting for inputs at any given time.
• U_P%: the processor utilization, i.e., the percentage of time the processors are busy.
• U_S%: the utilization of the synchronization unit, i.e., the percentage of time the synchronization unit is busy.

In evaluating the comparative performance between the different sets of parallelism control methods, the following measures will be used.

• Imp%: the percentage improvement of execution time (e) over the execution time of the unconstrained case (u), i.e., ((u/e) - 1) × 100.
• R(O_avg)% and R(O_max)%: the ratios of space utilization over the space utilization of the unconstrained case.

The baseline performance of the unconstrained model is given in Table 2. The table shows that the processor utilizations range from a low of 15.3% for HILBERT up to 93.5% for AMR. The low utilization for HILBERT is due to the fact that a typical parallel loop body is relatively small with only one or two threads, and a
TABLE 2 Performance with Unconstrained Parallelism (R )
Bench       Time       U_P%    U_S%    O_avg    O_max
AMR         2294940    93.5    45.7    3774     22148
FFT         1154821    70.7    36.1    778      7199
HILBERT     2474720    15.3    13.1    259      3730
SDD         2907122    57.2    36.2    1265     11716
SIMPLE      7040310    54.5    38.1    28485    101758
WEATHER     1427385    67.8    54.1    4102     20775


significant fraction of the time is spent in the serial reductions of those parallel loops. Also, in general, the synchronization unit utilization is of the same order as, albeit smaller than, the processor utilization. 4.3.1. Performance of chunking. Chunking exploits data locality and hence should decrease the execution time in addition to restraining parallelism. The improvement of the various benchmarks is given in Table 3. The best performing chunk size for each benchmark is also presented. The experiment was conducted with chunk sizes of 8, 16, and 32. HILBERT showed the lowest improvement of 0.3%. FFT showed the best improvement of 34.7%. The chunkable loops in HILBERT have much smaller loop bodies than the nonchunkable loops and therefore the impact of chunking is minimal. The data access pattern of FFT is very regular and hence highly chunkable. Since parallelism is controlled only in those loops that can be vectorized/chunked, the occupancy of the matching store memory does not show any significant decrease. However, the table shows a noticeable reduction in the synchronization unit utilization. Since the matching is done a chunk at a time rather than a single value at a time, the utilization of the synchronization unit should be reduced. 4.3.2. Performance of slicing. Slicing is used to control matching store memory occupancy. All parallel loops, including chunkable loops, in the benchmarks have been sliced in this experiment. The experiment has been conducted with the loops sliced with 10 or 20 workers each. In this experiment, space refers to matching store memory. In most of these experiments, we are trading time for space. The more the parallelism is throttled, the less space the program uses and the more time it takes to complete the execution, in general. This is evidenced in Table 4, where the slice sizes that favor execution time over space are chosen, and in Table 5, where space is favored over time. In the tables, slice sizes are specified in terms of the number of worker processes that are used to execute the parallel loops. For instance, slice size 10,20 indicates that the innermost parallel loop would be split up between 10 worker processes and the outer, second level parallel loop is split up between 20 worker processes. Table 4 shows that the processor utilizations are approximately equal to that of the unconstrained case. Good processor utilization implies that the parallelism has
TABLE 3 Performance with Inner Loops Chunked (C )
Bench       Chunk size    U_P%    U_S%    Imp%
AMR         32            95.2    29.4    17.5
FFT         32            76.8    15.8    34.7
HILBERT     8             15.0    12.8    0.3
SDD         32            55.1    31.9    3.3
SIMPLE      16            51.9    32.1    7.3
WEATHER     16            63.8    46.3    3.5


TABLE 4 Performance with Loops Sliced, with Best Execution Times
Bench       Slice size    U_P%    U_S%    Imp%     R(O_avg)%    R(O_max)%
AMR         10, 20        93.8    58.5    -8.2     28.8         15.1
FFT         20, 20        70.4    52.0    -12.5    21.5         10.9
HILBERT     20, 20        16.8    18.2    -18.1    60.2         77.4
SDD         20, 20        64.2    49.2    -5.4     73.2         33.5
SIMPLE      20, 20        51.5    41.1    -12.5    5.8          8.7
WEATHER     20, 20        74.9    64.4    -8.8     78.0         86.7

not been throttled to the extent where the processor is sitting idle when it should not. The improvement ranges from -5.4% in SDD to -18.1% in HILBERT. The average space used drops dramatically. The best saving is shown by SIMPLE, which utilizes on average 5.8% of what is required by the unconstrained (R) case. The worst is WEATHER, which requires 78.0% of the space occupied in the unconstrained case. When maximal occupancy is considered, the best saving is still shown by SIMPLE, utilizing just 8.7% of the maximum occupied in the unconstrained case. The worst saving is shown by WEATHER, saving just 13.3% of the space. Table 5 shows the results for those cases where processor utilization remains fairly high but the space utilized is the lowest. In this case, the improvements drop even more, with a range of -9.5% in AMR to -24.9% in HILBERT. In the average space case, the best saving is shown by SIMPLE, requiring just 2.9% of the space. The worst is shown by SDD, requiring 51.6% of the regular space. In the maximum space scenario, the best saving is again shown by SIMPLE, with a 4.2% ratio, and the worst saving is shown by WEATHER, with a 44.6% ratio. 4.3.3. Performance of combined chunking and slicing. The next logical step is to combine chunking and slicing and try for good execution times with low space utilization. Table 6 shows the effect of combining chunking, with sizes as in Table 3, and slicing with parameters as in Table 4. These settings favor better execution time over space saving and provide a good trade-off. In the case of WEATHER, the
TABLE 5 Performance with Loops Sliced, with Best Occupancy
Bench       Slice size    U_P%    U_S%    Imp%     R(O_avg)%    R(O_max)%
AMR         10, 10        92.3    57.6    -9.5     14.9         8.6
FFT         20, 10        69.5    51.3    -12.5    15.5         6.4
HILBERT     20, 10        15.4    16.6    -24.9    48.2         39.9
SDD         20, 10        60.0    46.0    -11.6    51.6         26.2
SIMPLE      20, 10        48.8    39.0    -16.9    2.9          4.2
WEATHER     10, 10        60.4    51.7    -19.2    29.3         44.6


TABLE 6 Performance with Inner Loops Chunked and Loops Sliced, with Best Processor Utilization
Bench       Chunk size    Slice size    Imp%     R(O_max)%
AMR         32            10, 20        +8.9     15.2
FFT         32            20, 20        +30.5    5.4
HILBERT     16            20, 20        -16.8    77.4
SDD         32            20, 20        -0.6     43.4
SIMPLE      16            20, 20        -1.6     5.9
WEATHER     16            20, 20        -0.6     86.6

space saving is on the order of 13% over the unbounded case for a time slowdown of 0.6%. Table 7 shows the combination of the chunking parameters from Table 3 and the slice parameters from Table 5, which favors space saving over execution time. This table shows greater savings in the matching store space utilization. However, as expected, the time taken to solve the problem increases. One extreme case is WEATHER, where space usage is about half of that shown in Table 4, but the execution took almost 20% longer. In the case of FFT, the improvement is 27.8% while only using 5.3% of the space used by the unconstrained case. An important result is that the upper bound on the space usage is determined by the slicing scheme and is no longer dependent on the problem size. 4.3.4. Determining throttle values. Here we describe the method used to arrive at the best throttle parameters. For this we have chosen the benchmark FFT. Table 8 shows the results of various experiments executed for the FFT benchmark. The first experiment is for the unconstrained parallelism case. Results indicate that processor utilization is about 71% and the maximum occupancy is 7199 thread slots in the matching memory. The next set of experiments is run for the three different chunk sizes of 8, 16, and 32. As expected, the matching store occupancy does not change much in all the cases. In this particular case, the table indicates that all three chunk sizes yield similar performance, with the chunk size 32 being the best at 34.7% improvement. Processor utilization remains fairly uniformly high in all the cases. The synchronization unit utilization drops by half.
TABLE 7 Performance with Inner Loops Chunked and Loops Sliced, with Best Occupancy
Bench       Chunk size    Slice size    Imp%     R(O_max)%
AMR         32            10, 10        +7.7     8.1
FFT         32            20, 10        +27.8    5.3
HILBERT     16            20, 10        -23.8    39.9
SDD         32            20, 10        -7.8     32.1
SIMPLE      16            20, 10        -6.2     3.5
WEATHER     16            10, 10        -18.7    44.7


TABLE 8 FFT: Arriving at the Best Throttle Value
Exp.    Chunk size    Slice size    Imp%     U_P%    U_S%    O_avg    O_max
R                                            70.7    36.1    778      7199
C       8                           32.5     78.2    18.7    411      7199
C       16                          34.3     77.5    16.8    386      7203
C       32                          34.7     76.8    15.8    371      7203
S                     10            -43.0    45.8    33.5    228      4380
S                     20            -11.9    70.9    52.2    428      5600
S                     20, 10        -13.7    69.5    51.3    121      461
S                     20, 20        -12.5    70.4    52.0    167      783
CS      32            20, 20        +30.5    77.3    23.1    79       391
CS      32            20, 10        +27.8    75.7    22.6    71       387

In the third experiment, the best slicing levels are determined. The first runs sliced only the innermost parallel loops. As stated earlier, the slice size is expressed in the number of workers that are chosen for parallel loops at the nesting level specified. The sizes chosen were 10 and 20. When the number of worker processors is 10, the throttle is too strong and processor utilization drops to 45.8%. The processor utilization with 20 workers remains fairly high at 70.9% and hence is better. The next step is to slice the parallel loops at the second level also, using the inner loop slice of 20. The experiments were run for 20,10 and 20,20. Processor utilization for both runs remains fairly high. Slicing with 20,20 is faster, but slicing with 20,10 has a greater space saving. The last set of experiments combines the chunking and slicing techniques, which will give us an acceptable execution speed at a low resource utilization. Two experiments are run for chunk size 32 and loop slicing 20,20 and 20,10, respectively. In this case, a chunking factor of 32 and a loop slicing factor of 20,20 would probably provide the best balance between the execution time and the space utilization.

4.4. Case Study: A Monsoon Implementation of SISAL MIDC

This section presents a study of threaded code generation and resource management strategies for MIDC as ported to a one-node Monsoon Machine [23]. The compiler processes the MIDC code and generates a Monsoon Assembly (.masm) code file and Id code to interface with the type system. These in turn are processed by the Monsoon Assembler and Id Compiler to generate the required Monsoon Object Code. The Id World Interface to the machine is utilized to load the programs and provide simple input and output. We use the simpler Livermore loops in this study [11], because the current compiler does not control the size of the generated threads, nor the number of activation frame slots in a thread. This has the effect that for large functions more frame slots are allocated than are available. There are methods to limit the number of frame slots, such as reusing the frame slots and splitting overly large threads. Optimized reuse of frame slots is implemented in the Id compiler, using linear


programming techniques [22]. Splitting (and merging) threads is already done in our compiler at the higher MIDC level. It just needs to be reemployed at this lower level. Before these transformations are implemented, it is important to study the effect of the naive algorithm. The Monsoon is a distributed memory multithreaded multiprocessor architecture with a shared address space. Each processing unit is called a node. Each node has a portion of the address space of the machine under its control. A node contains a Processing Element (PE) and an I-Structure Board (IS). Within each PE there can be many threads of control, eight executing simultaneously in the pipeline and up to 32,000 awaiting execution. The state of each thread is contained in a computation descriptor, or CD, with five registers: a continuation register C, a value register V, and three temporary registers T1, T2, and T3. The continuation register defines the context in which the thread executes. It contains a pointer to instruction memory that indicates the next instruction to be executed and a frame pointer, which is used as the base address of an activation frame for a procedure invocation. Data words have three presence bits associated with them defining the state a memory location is in. They affect the behavior of instructions that read and write the word. Presence bits are used in the implementation of synchronization protocols. The Join Protocol is used to synchronize two or more threads, implementing a dataflow-matching mechanism. The Imperative Protocol is used for imperative loads and stores. Any number of stores and loads may be performed on the location in any order. The I-structure Protocol synchronizes multiple consumers and a single producer. Initially, the location is empty. An I-Store operation stores a value in the location and sets the presence bits to present. Subsequently any number of I-fetch operations may be performed on the location. When a location is empty, I-fetch operations are deferred until an I-store operation takes place, after which the deferred fetches receive the stored value. The M-structure Protocol implements a mutual exclusion protocol. Initially, a location is empty. A Put operation stores a value and sets the presence bits to present. A subsequent Take operation reads the stored value and sets the presence bits back to the empty state. Multiple Take operations on an empty location are deferred, causing the location to enter into the lock-deferred state. A Put operation will satisfy exactly one of the deferred Take operations. Multiple Put operations result in an error. Functions are implemented by code blocks. Each active code block has access to a portion of local memory called the frame. The amount of frame storage used by the code block is fixed at compile time. Frame slots are required for storing temporary values and for performing synchronizations. When calling a function, a frame is allocated, the addresses of the function and the frame are combined to produce a context C, and arguments and a return address are sent to C. This Sisal calling convention is simpler than the Id version, as Sisal is a strict language and does not support curried and higher-order functions. In determining the multithreaded computation model for Monsoon, three options were tested. In the dataflow model threads are synchronized using the Join Protocol.
It is also possible to use a barrier to test for the availability of all thread inputs, as threads are strict and thus cannot start until all of their inputs are available.
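Since both barrier models discussed next rely on these presence-bit protocols, a minimal sketch of the I-structure protocol may help; it is an assumed-semantics illustration in Python, not Monsoon hardware or microcode.

```python
class IStructureCell:
    """Single-assignment cell: fetches on an empty cell are deferred until the store."""
    def __init__(self):
        self.present = False
        self.value = None
        self.deferred = []                      # consumers waiting for the value

    def i_fetch(self, consumer):
        if self.present:
            consumer(self.value)
        else:
            self.deferred.append(consumer)      # defer until an I-store takes place

    def i_store(self, value):
        assert not self.present, "I-structure locations are written at most once"
        self.present, self.value = True, value
        for consumer in self.deferred:          # satisfy all deferred fetches
            consumer(value)
        self.deferred.clear()

cell = IStructureCell()
cell.i_fetch(lambda v: print("consumer received", v))  # deferred: location still empty
cell.i_store(42)                                       # prints "consumer received 42"
```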


Instead of values being copied, as in the dataflow model, they are accessed in the activation frame where they are stored. There are two options for global memory operations. The nonstrict barrier model uses the split-phase I-structure Protocol, where the initiator and consumer of a transaction live in separate threads. The second, the strict barrier model, relies on the strict semantics of Sisal, guaranteeing that when a barrier is passed, all inputs to the threads triggered by the barrier are available; also, all array data to be read by the threads is completely defined, and thus the Imperative Protocol can be used. The three models described above are compared using the Livermore loops [11]. The dataflow model executes 30% more instructions than the nonstrict barrier model, as too many values are copied. The nonstrict barrier model in turn executes 9% more instructions than the strict barrier model, which is selected for further optimization. General-purpose MIDC threads initiating parallel loops are merged and optimized for Monsoon. Copy propagation and redundant code elimination are implemented. Contexts, and thus frames, are merged, avoiding many relatively expensive context creations. Merging contexts aggressively is very effective but causes the problem that frames get bigger than the allowed maximum. This is why some of the Livermore loops are not used at this time. The problem can be solved by reusing frame slots for variables that do not live at the same time. The analysis for this is similar to register allocation and is based on graph coloring. Table 9 shows
TABLE 9 Strict Barrier Model before and after Optimization in Machine Cycles
Program     Size    Unoptimized    With basic optimizations    Imp%
Loop 1      990     2,181,000      1,779,000                   18.43
Loop 2      101     1,648,000      1,318,000                   20.02
Loop 3      1001    1,405,000      1,109,000                   21.06
Loop 4      35      1,585,000      1,253,000                   20.84
Loop 5      1001    1,876,000      1,468,000                   21.74
Loop 6      64      2,440,000      2,014,000                   17.45
Loop 7      995     2,042,000      1,638,000                   19.78
Loop 8      20      623,016        369,558                     40.68
Loop 9      101     335,990        290,582                     13.51
Loop 10     101     530,553        487,703                     8.07
Loop 11     1001    1,824,000      1,441,000                   20.99
Loop 12     1000    2,072,000      1,686,000                   18.62
Loop 16     75      1,241,000      1,111,000                   10.47
Loop 17     101     532,146        461,974                     13.18
Loop 19     101     346,670        275,619                     20.49
Loop 20     100     629,216        541,945                     13.86
Loop 21     10      383,461        333,682                     12.98
Loop 22     101     360,993        314,201                     7.56
Loop 24     1001    1,238,000      1,046,000                   15.50


the significant improvements made due to these optimizations. The measurements include the cycles required to trap to the run-time system. 4.4.1. Chunking and slicing on Monsoon. A total of three chunking and slicing experiments are performed. The first deals with the effect of chunking, the second with the effect of slicing, and the third with their combined effect. Chunking on Monsoon stripmines an inner loop into a nested loop with a constant-sized, tight, highly optimizable inner loop. The MIDC FetchChunk operator computes the start address of a chunk. Chunking creates fewer and bigger threads and allows loop inputs to be shared. The effect of chunking on the Livermore Loops with varying chunk sizes is given in Table 10 and is compared to the unbounded case. The results show that chunking has an enormous effect on performance. Loops 2, 4, and 5 are essentially iterative loops, but still benefit from chunking. In the cases of loops 2 and 5 this is because the initial array can be constructed in a chunkable parallel loop and there is a chunkable reduction. Loop 4 is an iterative outer loop with very small parallel inner loops giving rise to one optimizable chunk. For chunking to be most effective, the size of the chunk should be approximately one-eighth of the size of the iteration space of the innermost loop. This creates eight
TABLE 10 Effect of Chunking and Slicing on Livermore Loops
            Chunking                                Slicing
Loop        Chunk sizes    Imp% (per chunk size)    Number of slices    Imp%
Loop 1      64/128/256     687 / 727 / 595          8                   472
Loop 2      64/128/256     546 / 569 / 510          8                   413
Loop 3      64/128/256     1057 / 1114 / 922        8                   596
Loop 4      64/128/256     712 / 752 / 649          8                   491
Loop 5      64/128/256     302 / 305 / 294          8                   258
Loop 6      8/16/32        108 / 108 / 108          8                   280
Loop 7      64/128/256     703 / 753 / 556          8                   464
Loop 8      8/16/32        146 / 147 / 144          8                   130
Loop 9      8/16/32        207 / 225 / 227          8                   252
Loop 10     8/16/32        135 / 135 / 143          8                   211
Loop 11     64/128/256     309 / 312 / 300          8                   263
Loop 12     64/128/256     859 / 904 / 729          8                   531
Loop 16     8/16/32        280 / 320 / 290          8; 8,8              289; 315
Loop 17     8/16/32        173 / 185 / 186          8                   172
Loop 19     8/16/32        163 / 170 / 170          8                   171
Loop 20     8/16/32        163 / 174 / 173          8                   177
Loop 21     8/16/32        178 / 186 / 17           8; 8,8              156; 157
Loop 22     8/16/32        167 / 179 / 178          8                   216
Loop 24     64/128/256     213 / 216 / 201          8                   199


large threads for a particular loop and optimally utilizes the pipeline of the machine. Slicing on Monsoon allocates a fixed number of contexts to execute a loop. Each context executes a slice of loop bodies sequentially. In establishing the effect of slicing, the lesson learned from the chunking experiment is kept in mind: the number of threads of execution should be around eight. Table 10 compares the results of slicing to the unbounded case. In two of the benchmarks, Loop 16 and Loop 21, improvements are shown when they are sliced at the second level. In 13 of the benchmarks chunking performs better than slicing. In these benchmarks, the inner parallel loops are chunkable, and control for chunking is very simple. In the loops where slicing performs better, the innermost loops are not chunkable. Slicing helps in reducing the execution time of all loops, both chunkable and nonchunkable. Chunking and slicing are not mutually exclusive and can be combined. The codes where slicing performs better than chunking are run with both slicing and chunking turned on. Table 11 shows that in four cases the combination performs better than either of the two, and in two cases (loops 6 and 10) the overhead of chunking is larger than its benefit and pure slicing is the best approach. 4.4.2. Further optimizations. The Sisal compiler presented by no means generates the best possible code. There are shortcomings that need to be addressed to generate much more efficient code. As mentioned earlier, reuse of frame slots will reduce frame sizes. As a simple example, a frame slot can be reused by different threads in the same frame that are guaranteed not to run concurrently. The current implementation of parallel reduction uses an M-structure array to store the intermediate results that are reduced in parallel in a separate thread, which requires its own control steps. Eliminating this array and using single M-structure elements in a parallel reduction step would help reduce the load on the heap and could result in the ability to execute larger problems. This optimization would also reduce the number of instructions executed. The sequential loop schema utilized in the implementation is not very efficient, as evidenced by the difference between the execution times of Livermore Loop 2. Any scheme to tighten the control of these loops will considerably speed up the code.
TABLE 11 Chunking and Slicing Combined
Program     Size    Imp%
Loop 6      64      273
Loop 9      101     283
Loop 10     101     209
Loop 19     101     180
Loop 20     100     186
Loop 22     101     228


5. COMMUNICATION OVERHEAD: CHARACTERISTICS AND REDUCTION

5.1. Token Characteristics

We classify tokens into two major categories: Memory tokens represent structure store reads and writes, and Regular tokens represent nonmemory-related control and data tokens. The Regular tokens are broken down into three types: Loop, Call, and Expression. Loop tokens represent tokens (data and control) used in distributing parallel work and in gathering their results, including simple reductions. Loop tokens could represent a significant fraction of regular tokens, especially for programs with a large number of parallel forall loops and small loop bodies with large loop counts. Call tokens represent all tokens (data and control) involved in calling a function and returning its results. Expression tokens represent all other tokens. Relevant characteristics of the benchmarks used in this section are given in Table 12. The first and second columns give the number of threads executed and tokens generated per run. The third column gives the average number of matches performed per MIDC instruction executed (MPI), which is around 0.5 matches per instruction. Table 13 shows the breakdown of tokens according to their types. Memory tokens represent a significant fraction of the total traffic, amounting to roughly a third on the average. This implies that techniques that reduce memory tokens are essential in achieving a significant reduction in the amount of network communication. Among Regular tokens it can be observed that Loop tokens comprise 30 to 97% of the total. For AMR, FFT, and HILBERT, the loop tokens represent the vast majority of the regular tokens, implying a significant amount of traffic devoted to handling parallelism. These results are not surprising considering that threads, in our model, are constrained to be no larger than a loop body or function body and are terminated at structure store accesses. Several optimizations at the global level attempt to make threads as large as possible [21]. These three programs are highly parallel with each parallel loop containing only a few (1-3) threads.
TABLE 12 Benchmark Characteristics
            Threads      Tokens        MPI
AMR         207,178      3,207,898     0.50
FFT         225,706      4,148,429     0.41
HILBERT     744,047      3,286,375     0.56
PSA         1,670,678    7,098,854     0.44
SDD         1,716,892    10,520,838    0.53
SGA         1,409,905    8,713,308     0.54
SIMPLE      1,333,209    9,269,106     0.56
WEATHER     882,508      8,045,883     0.56


TABLE 13 Breakdown of All Messages in Percentages
            Regular                                      Memory
Types       Loops    Calls    Expression    Total        Reads    Writes    Total
AMR         65.0     1.1      5.5           71.6         21.7     6.7       28.4
FFT         68.9     0.0      2.0           70.9         14.9     14.3      29.2
HILBERT     59.6     0.1      7.6           67.3         26.3     6.4       32.7
PSA         17.9     14.4     23.0          55.3         31.8     12.9      44.7
SDD         44.5     0.1      13.2          57.8         36.1     6.2       42.3
SGA         24.4     4.3      49.1          77.8         13.2     9.0       21.2
SIMPLE      28.0     5.0      32.7          65.7         29.0     5.2       34.2
WEATHER     38.6     0.8      35.5          74.9         19.8     5.3       25.1
Average     43.4     3.2      21.1          67.7         24.1     8.2       32.3

Because our code generation and optimization in-lines all small functions, Call tokens represent a relatively small proportion of all tokens. Therefore, except in the case of PSA, most regular tokens are either Loop or Expression tokens.

5.2. Input Locality

We define the input locality of a thread as the inverse of input latency. Input latency refers to the time delay between the arrival of the first token to a thread instance and that of the last token, at which time the thread can be enabled.

FIG. 5. Input latency in cycles (cumulative distributions). For each benchmark, the indicated line plots the percentage of threads whose waiting time is within a given time. In the left half of the figure, each horizontal tick mark represents 10 cycles, and in the right half, each tick mark represents 100 cycles.


Figure 5 shows the cumulative distribution of input latencies in terms of processor cycles. The trend shows that for most programs, about 30% of threads have inputs that arrive within 10 cycles of each other, and for 50-60% of the threads inputs arrive within 100 cycles of each other. This is significant considering that each network transit takes 50 cycles. On the other hand, between 5 and 35% of the threads have input latencies greater than 900 cycles. These results are most useful in the design of a token storage hierarchy where short input latency threads are allocated in the cache and long input latency threads are migrated to the slower, longer term main memory. A cache architecture that exploits these properties is described and evaluated in [24].

5.3. Output Locality

Output locality quantifies the amount of locality in the destinations of the tokens generated by threads. It is defined as the ratio of the total number of tokens generated by a thread (M) and the number of distinct destination threads to which these tokens are sent (T). The values of M, T, and their ratio are shown in Table 14 for our benchmarks. The table indicates that there exists a significant amount of output locality in most programs. It shows that, on the average, about three to four tokens are sent to each target thread. Except for AMR and FFT, the average number of targets is less than two. For both AMR and FFT, each thread generates a much larger number of regular tokens sent to a larger number of threads when compared with other benchmarks. However, the ratio M/T is only slightly larger than some of the other benchmarks. AMR and FFT both have large average thread sizes and each loop body has a smaller number of threads, as revealed in the next section. The results in Table 14 indicate that this output locality can be exploited by bundling tokens destined for the same thread into one message in order to reduce the overhead in message setup and reception. We present two techniques to reduce the amount of regular token traffic: grouping and bundling. Note that we do not consider ways to reduce the remote memory accesses. For example, static data partitioning can reduce the amount of remote accesses. This requires extensive compile-time analysis
TABLE 14 Output Locality
            M       T       M/T
AMR         11.9    2.7     4.40
FFT         15.2    3.4     4.47
HILBERT     3.2     1.14    2.80
PSA         2.7     1.13    2.39
SDD         3.8     1.28    2.96
SGA         5.3     1.82    2.92
SIMPLE      4.8     1.50    3.21
WEATHER     7.2     1.94    3.71


and goes beyond the scope of this paper. Instead, we assume that all structure store reads and writes go through the network.

5.4. Grouping

As a first step toward reducing the regular token traffic, we observe that Loop and Call tokens are the primary means by which parallelism is exploited. Expression tokens represent intraloop and intrafunction thread communication that could be localized to a processor. In essence, by making sure that threads within a loop or a function body are allocated to the same processor, the communication between these threads is localized. This allocation strategy should also result in a higher locality and should better exploit the memory hierarchy. Although the grouping idea is rather obvious, our objective is to quantify its possible benefit against the cost it might entail. Particularly, there is a potential for reducing exploitable parallelism, and we need to check whether this reduction is acceptable. The allocation of a thread group, comprising either a loop or function body, to a processor is done by hashing only the color portion of a token tag, rather than the entire tag (a sketch of this allocation follows Table 15). In addition, the compiler generates code for the threads to directly write to the data memory, rather than form and send tokens. The order of execution of threads within a group is still determined by the availability of data. A policy which schedules all enabled threads from the same group together could better exploit the memory hierarchy. This needs further investigation. Also, the grouping of threads is currently performed independent of the size and internal parallelism of the function or loop body. In summary, the execution of thread groups is characterized as follows.

• A thread group is a statically determined set of threads that are allocated on a single node (processor).
• Different activations of the same group could be allocated on different nodes; their allocation is determined dynamically at run time (by the hashing on the color field).
• The order of thread execution within a group is purely determined by the availability of data.

The characteristics of the resulting thread groups are shown in Table 15. S_G represents the average number of threads in a group, and Π_G represents the average parallelism of a group in terms of threads. The table shows that each group consists of a relatively small number of threads. The table also shows the average number of instructions in a thread and the average parallelism within a thread, represented by S_T and Π_T, respectively. FFT in particular has very large threads and large internal parallelism. Note that the average instruction-level parallelism within a thread, Π_T, is very close to four instructions per cycle. The second column indicates that intragroup parallelism is nearly one for all the benchmarks. In other words, even when there is an infinite number of processors that can exploit all interthread parallelism, threads within a group execute nearly sequentially. Therefore, no significant parallelism is lost by grouping threads on a single processor.


TABLE 15 Thread and Group Characteristics
            S_G     Π_G     S_T      Π_T
AMR         2.47    1.00    30.79    5.29
FFT         1.65    1.00    44.35    4.77
HILBERT     3.91    1.00    7.95     2.41
PSA         4.72    1.04    9.65     2.69
SDD         3.77    1.01    11.61    3.05
SGA         5.43    1.01    11.45    3.39
SIMPLE      5.05    1.02    12.42    3.40
WEATHER     3.65    1.00    16.35    4.22
Average     3.83    1.01    18.1     3.65
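The allocation rule behind grouping can be made concrete with a short, purely illustrative sketch (the function names and the example color are assumptions): hashing the full continuation scatters the threads of one activation over the machine, whereas hashing only the color keeps them on one node.

```python
NUM_NODES = 10   # the simulated configuration in Section 2 uses 10 processing nodes

def node_ungrouped(color, thread_ptr):
    """Baseline allocation: hash the whole continuation (color + thread pointer)."""
    return hash((color, thread_ptr)) % NUM_NODES

def node_grouped(color, thread_ptr):
    """Grouped allocation: hash only the color, so a whole activation stays on one node."""
    return hash(color) % NUM_NODES

# Two threads of the same loop-body activation (color 12345) always share a node:
print(node_grouped(12345, "thread_a") == node_grouped(12345, "thread_b"))   # True
```

Different activations still spread over the nodes because each activation receives a fresh color, which preserves the dynamic load balancing described in Section 2.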

The effect of thread grouping on interprocessor communication is shown in Table 16. The first two columns show the number of Regular tokens before and after thread grouping. The last column shows the percentage reduction of the Regular tokens. The table shows that there are significant reductions in the amount of regular tokens generated (nearly 30%). Grouping only affects the Expression tokens, in fact eliminating most of these. A main exception arises from sequential forinit loops passing data across iterations. As expected, some benchmarks, such as AMR and FFT, do not gain much from grouping due to their small number of Expression tokens.

5.5. Bundling

The strong output locality suggests that it may be profitable to combine separate tokens, destined to the same thread, into one large message. This message vectorization should reduce the number of network packets generated at the expense of a larger message size. It is worth noting that bundling does not introduce any additional latency since the target thread cannot be enabled until all the inputs arrive anyway.
TABLE 16 The Effect of Grouping on Regular Tokens
            Tokens (millions)    Tokens (millions)    Reduction (%)
            before grouping      after grouping
AMR         2.30                 2.12                 7.5
FFT         2.94                 2.86                 2.8
HILBERT     2.21                 1.96                 11.3
PSA         3.93                 2.43                 38.0
SDD         6.08                 4.67                 23.2
SGA         6.78                 2.71                 60.0
SIMPLE      6.09                 3.54                 41.9
WEATHER     6.03                 3.37                 44.1
Total       36.36                23.66                35.0


TABLE 17 The Reduction in Tokens after Grouping and Bundling
            Loop tokens (×10^6)    Call tokens (×10^6)    Expression tokens (×10^6)    Reduction (%)
Bench       Before    After        Before    After        Before    After
AMR         2.09      0.47         0.04      0.01         0.18      0.00                79
FFT         2.86      0.64         0.00      0.00         0.01      0.00                78
HILBERT     1.96      0.70         0.00      0.00         0.25      0.00                68
PSA         1.27      0.41         1.02      0.53         1.63      0.07                74
SDD         4.68      1.58         0.01      0.00         1.39      0.00                74
SGA         2.13      0.65         0.37      0.21         4.28      0.07                86
SIMPLE      2.60      0.71         0.46      0.12         3.03      0.22                82
WEATHER     3.11      0.84         0.06      0.02         2.86      0.05                85
Total       20.70     6.00         1.96      0.89         13.63     0.41                78

Grouping is effective for Expression tokens, optimizing their messages away. Therefore, bundling could be applied to Loop and Call tokens. Table 17 shows the result of combined bundling and grouping. It shows the number of messages generated before and after applying the techniques. In the bundling technique, we limited the maximum message size to contain no more than five data values. The reduction is most significant for Loop tokens. This is due to the fact that activating a loop iteration requires sending a lot of information to the same node. Almost all the Expression tokens are eliminated due to grouping. The percentage reduction in the number of regular messages is given in the last column of the table. The results are fairly uniform in that all programs result in significant reductions, ranging from 68.2% to 86.2% with an average of nearly 80%. These results demonstrate that bundling and grouping are complementary techniques. When one is ineffective, the other makes up for it. For example, AMR and FFT have a small number of intragroup tokens to be eliminated, but bundling reduces the number of their Loop and Call tokens by more than a factor of four, resulting in overall reductions comparable to other benchmarks.
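A small sketch of the bundling step is given below; it is illustrative only (the token representation and the helper name are assumptions). It groups the tokens a thread produces by destination continuation and packs at most five data values per message, the limit used in the experiments above.

```python
from itertools import groupby

MAX_VALUES_PER_MESSAGE = 5

def bundle(tokens):
    """tokens: (destination_continuation, port, value) triples produced by one thread."""
    messages = []
    for dest, group in groupby(sorted(tokens, key=lambda t: t[0]), key=lambda t: t[0]):
        group = list(group)
        for i in range(0, len(group), MAX_VALUES_PER_MESSAGE):
            packed = group[i:i + MAX_VALUES_PER_MESSAGE]
            messages.append((dest, [(port, value) for _, port, value in packed]))
    return messages

# Seven tokens for the same loop-body activation become two messages (5 + 2 values):
tokens = [(("color9", "loop_body"), port, port * 10) for port in range(7)]
print(len(bundle(tokens)))   # 2
```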
TABLE 18 Overall Percentage Reduction of All Tokens and MPI
            Reduction (%)    MPI
AMR         56.6             0.19
FFT         55.4             0.14
HILBERT     45.9             0.29
PSA         41.0             0.23
SDD         42.7             0.29
SGA         68.1             0.14
SIMPLE      53.9             0.24
WEATHER     63.6             0.18
Average     53.4             0.21


In the first column of Table 18, the percentage reduction in the number of all tokens generated is given. We see that more than half of all tokens are eliminated. The second column of the table gives the resulting MPI figures and shows a large drop in matches per instruction. This ranges from about 0.14 to 0.29 with an average of 0.21. The larger average message size should result in slightly higher overhead per message. However, the network transit time should not be significantly affected if we assume a cut-through type of routing. Message bundling should therefore be effective in reducing the overall communication overhead.

6. RELATED WORK

Parallelism control was first proposed in fine-grain dataflow architectures, due to the large amount of parallelism that was being generated. This amount was even large enough to cause resource deadlocks, which may be partly the cause of the eventual abandonment of fine-grain architectures in favor of coarser-grain multithreaded architectures. However, the problem of matching program and machine parallelism has resurfaced in multithreaded architectures, as problem sizes have grown larger. The two main parallelism control mechanisms proposed for fine-grain architectures are throttling of tasks [25-27] and K-bounding [28]. In addition, Egan et al. [29] have proposed methods of slicing the iteration space. Throttling is a run-time method of controlling parallelism. New activation requests may be suspended if the run-time mechanism deems that there is too much parallelism, based on availability of resources. The suspended process is reactivated some time later. When an active process finishes, its activation name can be used by another process. When sufficient resources become available, suspended processes can be unsuspended. There is no notion of suspending an active process. Processes must be fairly large, otherwise they lead to too much throttle overhead. Processes that are too large would have large internal parallelism, which would cause resource deadlocks. The key to throttle control is to find the right balance between the exploitation of parallelism and the use of resources. K-bounding is another method proposed to control fine-grain parallelism in loops. The compiler analyzes the code and determines the maximum resource usage for a loop cycle. The run-time system decides the number of loop cycles that can be allowed to execute in parallel, based on the activity level of the machine and the static information of maximal resource usage. Machine resources are recycled and reused. The slicing proposed by Egan et al. splits the iteration space between k workers. Each of the workers executes iterations in steps of k. This involves the recycling of colors. However, some additional work must be performed to reestablish the order of the results coming from the loop body to that of the context enclosing the loop invocation. Teo and Bohm [30] have proposed a method of chunking iterative instructions by letting them generate a fixed number of tokens with incremental indices in the tag. After that, inputs to the iterative instruction are recycled, which stops them from swamping the machine with an unbounded amount of tokens and allows them to be controlled.


Most multithreaded models that use code-block-based frames [7, 8, 10, 31] group threads within a code block and bind code blocks to processors. Although the grouping technique presented in this paper uses the same code-block-based approach, our model does not preclude other methods, such as grouping according to the amount of communication between threads. Such a method might be useful in the presence of a large loop or function body and might reduce potential load balancing problems.

The use of quanta as a way to exploit locality at the interthread level was first studied by Culler et al. for TAM [32, 33]. TAM uses a software scheduling policy that favors threads within the currently executing quantum and allows these threads to share state (such as registers). Such a scheduling policy could be fruitfully applied to our model.

By exploiting communication locality, the cost of communication in large-scale multiprocessors can be reduced. In [34], an analytical model is developed to study the impact of exploiting communication locality; the potential performance gain is limited by the degree of multithreading and by the network limits. With the high degree of multithreading in our model, the network represents a major limitation. Without an expensive hardware solution that increases the capacity limits, communication locality must be exploited to reduce the effective load on the network. Another approach that also exploits communication locality is to bundle several reads into one chunk read [12].

The Filaments project [3], which addresses issues in fine-grain multithreading from a von Neumann point of view, has some interesting approaches to controlling parallelism. Filaments are clustered (``implicitly coarsened'') statically by inlining, and dynamic patterns of filament use are recognized and optimized. Filaments also takes an adaptive run-time approach to data placement: placement strategies can be adapted dynamically, and filaments that use the same data are moved to the same node.

7. CONCLUSION

There are valid reasons to control parallelism and to take advantage of data locality in existing and proposed multithreaded models. In this paper, we examined chunking and slicing techniques that address these issues. Our experimental results indicate that slicing schemes shift the upper bound on space utilization from the size of the problem to the size of the throttle that is used. Chunking improves the execution time and reduces the load on the synchronization unit, but it does not provide any space savings. The combination of the two techniques, each having different goals, helps to balance execution time with resource usage. Finding the right balance is not an easy task; in choosing the right throttle, it is important to examine the processor utilization as well as the synchronization unit utilization.

We have defined and quantified the input and output message locality of our programs. The high degree of input locality suggests that a cached implementation of the synchronization unit would be beneficial. Locality can be increased by grouping threads together on one processor (i.e., those threads belonging to the same loop or
function body). The high degree of output locality can be exploited by bundling messages. The combination of the two techniques reduces the message traffic by 80% in our benchmarks.
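As a minimal illustration of why grouping eliminates most Expression tokens, the following C sketch compares the number of intra-activation tokens that cross node boundaries when all threads of one loop or function body are kept on one node versus scattered across nodes. The activation counts and placement hashes are made-up assumptions for the example, not our compiler's allocation policy.

```c
/* Minimal sketch contrasting grouped vs. scattered thread placement. */
#include <stdio.h>

#define NNODES      8
#define ACTIVATIONS 16
#define THREADS     10   /* threads per loop or function body */

/* Grouping: every thread of one activation lands on the same node. */
static int node_grouped(int activation, int thread)
{
    (void)thread;
    return activation % NNODES;
}

/* Baseline: threads are scattered independently of their activation. */
static int node_scattered(int activation, int thread)
{
    return (activation * 31 + thread * 7) % NNODES;
}

/* Count tokens between neighboring threads of the same activation that
 * have to cross the network under a given placement policy. */
static int count_remote(int (*place)(int, int))
{
    int remote = 0;
    for (int a = 0; a < ACTIVATIONS; a++)
        for (int t = 0; t + 1 < THREADS; t++)
            if (place(a, t) != place(a, t + 1))
                remote++;
    return remote;
}

int main(void)
{
    printf("remote Expression tokens, grouped:   %d\n", count_remote(node_grouped));
    printf("remote Expression tokens, scattered: %d\n", count_remote(node_scattered));
    return 0;
}
```

Under the grouped policy no intra-activation token leaves its node, whereas the scattered policy turns every such token into network traffic; this is the effect quantified for the real benchmarks above.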

REFERENCES
1. B. J. Smith, Architecture and applications of the HEP multiprocessor computer system, SPIE (Real Time Signal Process.) 298 (1981), 241-248.
2. R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Portfield, and B. Smith, The Tera computer system, in ``Int. Conf. on Supercomputing,'' pp. 1-6, Assoc. Comput. Mach., New York, 1990.
3. V. W. Freeh, D. K. Lowenthal, and G. R. Andrews, Distributed filaments: Efficient fine-grain parallelism on a cluster of workstations, in ``First Symposium on Operating Systems Design and Implementation,'' pp. 201-212, 1994.
4. W. J. Dally, J. Fiske, J. Keen, R. Lethin, M. Noakes, P. Nuth, R. Davison, and G. Fyler, The message-driven processor: A multicomputer processing node with efficient mechanisms, IEEE Micro 12(2) (April 1992), 23-39.
5. M. Fillo, S. W. Keckler, W. J. Dally, N. P. Carter, A. Chang, Y. Gurevich, and W. S. Lee, The M-Machine multicomputer, in ``Proc. Int. Symp. on Microarchitecture,'' November 1995.
6. G. M. Papadopoulos and D. E. Culler, Monsoon: An explicit token-store architecture, in ``Proc. 17th Int. Symp. on Computer Architecture,'' pp. 82-91, June 1990.
7. R. S. Nikhil, G. M. Papadopoulos, and Arvind, *T: A multithreaded massively parallel architecture, in ``Proc. 19th Int. Symp. on Computer Architecture,'' pp. 156-167, May 1992.
8. S. Sakai, Y. Yamaguchi, K. Hiraki, Y. Kodama, and T. Yuba, An architecture of a data-flow single chip processor, in ``Proc. 16th Int. Symp. on Computer Architecture,'' pp. 46-53, May 1989.
9. H. Hum, O. Macquelin, K. Theobald, X. Tian, G. Gao, P. Cupryk, N. Elmassri, L. Hendren, A. Jimenez, S. Krishnan, A. Marquez, S. Merali, S. Nemawarkar, P. Panangaden, X. Xue, and Y. Zhu, A design study of the EARTH multiprocessor, in ``Proc. Int. Conf. on Parallel Architectures and Compilation Techniques,'' 1995.
10. D. E. Culler, A. Sah, K. E. Schauser, T. von Eicken, and J. Wawrzynek, Fine-grain parallelism with minimal hardware support: A compiler-controlled threaded abstract machine, in ``Proc. Int. Conf. on Architectural Support for Programming Languages and Operating Systems,'' pp. 164-175, 1991.
11. J. T. Feo, ``The Livermore Loops in Sisal,'' Technical Report UCID-21159, Computing Research Group, Lawrence Livermore National Laboratory, Livermore, CA, August 1987.
12. B. Shankar, L. Roh, W. Bohm, and W. A. Najjar, Control of loop parallelism in multithreaded code, in ``Proc. Int. Conf. on Parallel Architectures and Compilation Techniques,'' 1995.
13. L. Roh and W. A. Najjar, Analysis of communication and overhead reduction in multithreaded execution, in ``Proc. Int. Conf. on Parallel Architectures and Compilation Techniques,'' 1995.
14. W. Bohm, W. A. Najjar, B. Shankar, and L. Roh, An evaluation of bottom-up and top-down thread generation techniques, in ``Proc. Int. Symp. on Microarchitecture,'' 1993.
15. L. Roh, W. A. Najjar, and W. Bohm, Generation and quantitative evaluation of dataflow clusters, in ``Proc. Symposium on Functional Programming Languages and Computer Architecture,'' pp. 159-168, Copenhagen, Denmark, 1993.
16. W. A. Najjar, W. M. Miller, and W. Bohm, An analysis of loop latency in dataflow execution, in ``Proc. 19th Int. Symp. on Computer Architecture,'' pp. 352-361, Gold Coast, Australia, 1992.
17. J. McGraw, S. Skedzielewski, S. Allan, R. Oldehoeft, J. Glauert, C. Kirkham, B. Noyce, and R. Thomas, ``SISAL: Streams and Iteration in a Single Assignment Language,'' Reference Manual Version 1.2, Manual M-146, Rev. 1, Lawrence Livermore National Laboratory, Livermore, CA, March 1985.
18. S. K. Skedzielewski and J. Glauert, ``IF1: An Intermediate Form for Applicative Languages,'' Reference Manual Version 1.0, Technical Report TR M-170, Lawrence Livermore National Laboratory, July 1985.
19. M. Welcome, S. Skedzielewski, R. K. Yates, and J. Ranelleti, ``IF2: An Applicative Language Intermediate Form with Explicit Memory Management,'' Technical Report TR M-195, Lawrence Livermore National Laboratory, University of California, December 1986.
20. B. Shankar, W. Bohm, and W. A. Najjar, Top-down thread generation for SISAL, in ``Sisal '93,'' 1993.
21. L. Roh, W. A. Najjar, B. Shankar, and A. P. W. Bohm, An evaluation of optimized threaded code generation, in ``Proc. Int. Conf. on Parallel Architectures and Compilation Techniques,'' Montreal, Canada, 1994.
22. D. E. Culler, ``Managing Parallelism and Resources in Scientific Dataflow Programs,'' Ph.D. thesis, MIT, June 1989.
23. J. Hicks, D. Chiou, B. Seong Ang, and Arvind, Performance studies of Id on the Monsoon dataflow system, J. Parallel Distrib. Comput. 18 (1993), 273-300.
24. L. Roh and W. Najjar, Design of storage hierarchy in multithreaded architectures, in ``Proc. Int. Symp. on Microarchitecture,'' pp. 271-278, November 1995.
25. C. A. Ruggiero and J. Sargeant, Control of parallelism in the Manchester dataflow computer, in ``Lecture Notes in Computer Science,'' No. 274, pp. 1-15, 1987.
26. D. F. Snelling, ``The Design and Analysis of a Stateless Data-Flow Architecture,'' Ph.D. thesis, Computer Science Department, University of Manchester, Manchester, UK, 1993.
27. J. Gurd and D. Snelling, Self-regulating workload in the Manchester data-flow computer, in ``Proc. Int. Symp. on Microarchitecture,'' November 1995.
28. D. E. Culler, ``Resource Management for the Tagged Token Data Flow Architecture,'' Technical Report TR-332, Laboratory for Computer Science, MIT, January 1985.
29. G. K. Egan, N. J. Webb, and A. P. W. Bohm, Some architectural features of the CSIRAC II data-flow computer, in ``Advanced Topics in Data-Flow Computing'' (J.-L. Gaudiot and L. Bic, Eds.), pp. 143-174, Prentice Hall, New York, 1991.
30. Y. Teo and W. Bohm, Resource management and iterative instructions, in ``Advanced Topics in Data-Flow Computing'' (J.-L. Gaudiot and L. Bic, Eds.), pp. 481-500, Prentice Hall, New York, 1991.
31. R. A. Iannucci, Toward a dataflow/von Neumann hybrid architecture, in ``Proc. 15th Int. Symp. on Computer Architecture,'' pp. 131-140, 1988.
32. K. E. Schauser, D. E. Culler, and T. von Eicken, Compiler-controlled multithreading for lenient parallel languages, in ``Proc. Symposium on Functional Programming Languages and Computer Architecture'' (J. Hughes, Ed.), 1991.
33. D. E. Culler, K. E. Schauser, and T. von Eicken, Two fundamental limits on dataflow multiprocessing, in ``Proc. IFIP WG 10.3 Conf. on Architecture and Compilation Techniques for Medium and Fine Grain Parallelism, Orlando, FL, 1993'' (Cosnard, Ebcioglu, and Gaudiot, Eds.), North-Holland, Amsterdam, 1993.
34. K. Johnson and A. Agarwal, ``The Impact of Communication Locality on Large-Scale Multiprocessor Performance,'' Technical Report LCS/TM-463, MIT, 1992.

LUCAS ROH is currently the President and CEO of Hostway Corporation. Before he co-founded Hostway, he was a staff scientist at the Mathematics and Computer Science Division of Argonne National Laboratory conducting research in automatic differentiation and compilers. He received his B.A. in physics from the University of Chicago and his Ph.D. in computer science from Colorado State University.

BHANU SHANKAR received his Bachelor of Computer Science and Engineering in 1989 from the R.V. College of Engineering, Bangalore, India, and his Ph.D. from Colorado State University in 1995.
He worked at Microtec Research Incorporated and is now with Intel Corporation. His interests are in static binary translation, code generation, and optimization. He is currently the technical lead for developing and maintaining a high-performance Fortran 95 front end.

A. P. WILLEM BOHM is a professor at Colorado State University. He received his Ph.D. from the University of Utrecht. He worked at the CWI Amsterdam on Algol 68 and at Manchester University on Sisal and dataflow. His research interests are algorithm design, programming languages, and compilation for parallel machines. He is currently involved in the design and implementation of a high-level, algorithmic language targeting reconfigurable systems based on Field Programmable Gate Arrays.

WALID A. NAJJAR is an associate professor in the Department of Computer Science and Engineering at the University of California, Riverside. He received his M.S. and Ph.D. in Computer Engineering from the University of Southern California in 1985 and 1988, respectively, and a B.E. in Electrical Engineering from the American University of Beirut in 1979. His research interests are reconfigurable computing and multithreaded and parallel computer architecture.

