03964.com

文档资料库 文档搜索专家

文档资料库 文档搜索专家

Author's personal copy

Available online at www.sciencedirect.com

Parallel Computing 33 (2007) 720–740 www.elsevier.com/locate/parco

High performance combinatorial algorithm design on the Cell Broadband Engine processor

David A. Bader *, Virat Agarwal, Kamesh Madduri, Seunghwa Kang

Georgia Institute of Technology, Atlanta, GA 30332, United States Received 25 April 2007; received in revised form 9 September 2007; accepted 27 September 2007 Available online 5 October 2007

Abstract The Sony–Toshiba–IBM Cell Broadband Engine (Cell/B.E.) is a heterogeneous multicore architecture that consists of a traditional microprocessor (PPE) with eight SIMD co-processing units (SPEs) integrated on-chip. While the Cell/B.E. processor is architected for multimedia applications with regular processing requirements, we are interested in its performance on problems with non-uniform memory access patterns. In this article, we present two case studies to illustrate the design and implementation of parallel combinatorial algorithms on Cell/B.E.: we discuss list ranking, a fundamental kernel for graph problems, and zlib, a data compression and decompression library. List ranking is a particularly challenging problem to parallelize on current cache-based and distributed memory architectures due to its low computational intensity and irregular memory access patterns. To tolerate memory latency on the Cell/B.E. processor, we decompose work into several independent tasks and coordinate computation using the novel idea of Software-Managed threads (SM-Threads). We apply this generic SPE work-partitioning technique to e?ciently implement list ranking, and demonstrate substantial speedup in comparison to traditional cache-based microprocessors. For instance, on a 3.2 GHz IBM QS20 Cell/B.E. blade, for a random linked list of 1 million nodes, we achieve an overall speedup of 8.34 over a PPE-only implementation. Our second case study, zlib, is a data compression/decompression library that is extensively used in both scienti?c as well as general purpose computing. The core kernels in the zlib library are the LZ77 longest subsequence matching algorithm and Hu?man data encoding. We design e?cient parallel algorithms for these combinatorial kernels, and exploit concurrency at multiple levels on the Cell/B.E. processor. We also present a Cell/B.E. optimized implementation of gzip, a popular ?le-compression application based on the zlib library. For our Cell/B.E. implementation of gzip, we achieve an average speedup of 2.9 in compression over current workstations. ? 2007 Elsevier B.V. All rights reserved.

Keywords: Combinatorial algorithms; Cell Broadband Engine processor; Multicore; List ranking; zlib; Parallel algorithms; Graph algorithms; Performance; Novel architectures

Corresponding author. E-mail addresses: bader@cc.gatech.edu (D.A. Bader), virat@cc.gatech.edu (V. Agarwal), kamesh@cc.gatech.edu (K. Madduri), s.kang@gatech.edu (S. Kang). 0167-8191/$ - see front matter ? 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.parco.2007.09.005

*

Author's personal copy

D.A. Bader et al. / Parallel Computing 33 (2007) 720–740

721

1. Introduction The Cell Broadband Engine (or the Cell/B.E.) [17] is a novel architectural design by Sony, Toshiba, and IBM (STI), primarily targeting high performance multimedia and gaming applications. It is a heterogeneous multicore chip that is signi?cantly di?erent from conventional multiprocessor or multicore architectures. It consists of a traditional microprocessor (called the PPE) that controls eight SIMD co-processing units called synergistic processor elements (SPEs), a high speed memory controller, and a high bandwidth bus interface (termed the element interconnect bus, or EIB), all integrated on a single chip. There are several unique architectural features in Cell/B.E. that clearly distinguish it from current microprocessors: the Cell/B.E. chip is a computational workhorse, and it o?ers a theoretical peak single-precision ?oating-point performance of 204.8 GFlop/s. We can exploit parallelism at multiple levels on the Cell/B.E.: each chip has eight SPEs with two-way instruction-level parallelism on each SPE. Further, the SPE supports both scalar as well as singleinstruction, multiple data (SIMD) computations [14]. The on-chip interconnection network elements have been specially designed to cater for high performance on bandwidth-intensive applications (such as those in gaming and multimedia). All these features make the Cell Broadband Engine attractive for scienti?c computing, as well as an alternative architecture for general purpose computing. Williams et al. [25] analyzed the performance of Cell/B.E. for key scienti?c kernels such as dense matrix multiply, sparse matrix vector multiply and 1D and 2D fast Fourier transforms. They demonstrate that the Cell/B.E. performs impressively for applications with predictable memory access patterns, and that communication and computation can be overlapped more e?ectively on the Cell/ B.E. than on conventional cache-based approaches. Noting that while Cell/B.E. is architected for multimedia applications with regular processing requirements, we are interested in its performance on problems with predominantly integer-based computations and non-uniform memory access patterns. We challenge the general perception that the Cell/B.E. architecture is not suited for problems that involve ?ne-grained memory accesses, and where there is insu?cient computation to hide memory latency. In this article, we present two representative case studies from the class of combinatorial computing problems, the list ranking kernel and zlib data compression and decompression library, for which we achieve impressive performance on the Cell/B.E. processor. List ranking [7,23,15] is a fundamental paradigm for the design of many parallel combinatorial and graphtheoretic applications. Given an arbitrary linked list that is stored in a contiguous area of memory, the list ranking problem determines the distance from each node to the head of the list. In prior work, using list ranking as a sub-routine, we designed fast parallel algorithms for shared memory computers, and demonstrated speedups compared with the best sequential implementation for graph-theoretic problems such as ear decomposition, tree contraction and expression evaluation [4], spanning tree [1] and minimum spanning forest [2]. Due to its low computational intensity and lack of locality, it is di?cult to attain parallel speedup for this problem. By applying our new techniques for latency tolerance and load balancing with Software-Managed threads (SM-Threads), our list ranking implementation on the Cell/B.E. processor achieves signi?cant speedup. We conduct an extensive experimental study comparing our Cell/B.E. code with implementations on current microprocessors and high-end shared memory and multithreaded architectures. Our main results are summarized here: ? Our latency-hiding technique boosts Cell Broadband Engine performance by a factor of about 4.1 for both random and ordered lists. ? By tuning just one algorithm parameter, our list ranking implementation is load-balanced across the SPEs with high probability, even for random lists. ? The Cell/B.E. achieves an average speedup of eight over the performance on current cache-based microprocessors (for input instances that do not ?t into the L2 cache). ? On a random list of 1 million nodes, we obtain a speedup of 8.34 compared to a single-threaded PPE-only implementation. For an ordered list (with stride-1 accesses only), the speedup over a PPE-only implementation is 1.56. We discuss the list ranking case study in Section 4, and introduce SM-threads and our latency-hiding technique in Section 4.3.

Author's personal copy

722

D.A. Bader et al. / Parallel Computing 33 (2007) 720–740

Data compression and decompression applications are extensively used in scienti?c and general purpose computing, and the open-source zlib package [9] is a popular library for compression/decompression. zlib is based on the LZ77 subsequence matching algorithm: data is compressed by ?rst identifying repeating subsequences of characters, and then replacing duplicate strings with reference to their previous occurrences. The strings and references are further Hu?man-coded to improve compression. The performance of the compression algorithm is data-dependent, while the decompression algorithm involves a high percentage of table lookups for ?nding matches. We exploit concurrency at multiple levels for both compression and decompression. Since the SPEs support SIMD instructions, we identify compute-intensive routines and vectorize them. At a higher level, we also partition the computation among various SPEs. For performance evaluation, we design a Cell/B.E. optimized implementation of gzip, a popular ?le-compression application based on the zlib library. We use the full ?ushing technique supported by gzip to partition a data ?le across multiple SPEs, and achieve an average speedup of ?ve over current cache-based microprocessors. We present a detailed discussion of our design and implementation of the zlib library in Section 5. The heterogeneous processors, limited on-chip memory and multiple avenues for parallelism on the Cell/ B.E. processor make algorithm design and implementation a new challenge, with potentially high payo?s in terms of performance. Analyzing algorithms using traditional sequential complexity models like the RAM model fail to account for several Cell/B.E. architectural intricacies. There is currently no simple and accepted model of computation for the Cell/B.E., and in general, for multicore architectures. We use a simple complexity model for the design and analysis of parallel algorithms on the Cell Broadband Engine architecture. We express the algorithm complexity on the Cell/B.E. processor using the triplet hTC, TD, TBi, where TC denotes the computational complexity, TD the number of DMA requests, and TB the number of branching instructions, all expressed in terms of the problem size. We explain the rationale behind the choice of these three parameters in Section 3. We then present a systematic methodology for analyzing algorithms using this complexity model, and illustrate this with an example of matrix multiplication. 2. Cell Broadband Engine architecture The Cell Broadband Engine (Cell/B.E.) is a heterogeneous multicore chip that is signi?cantly di?erent from conventional multiprocessor or multicore architectures. It consists of a traditional microprocessor (called the PPE) that controls eight SIMD co-processing units called synergistic processor elements (SPEs), a high speed memory controller, and a high bandwidth bus interface (termed the element interconnect bus, or EIB), all integrated on a single chip. Fig. 1 gives an architectural overview of the Cell/B.E. We refer the reader to [21,10,18] for additional details. The PPE is a 64-bit PowerPC core with a vector multimedia extension (VMX) unit, 32 KB L1 instruction and data caches, and a 512 KB L2 cache. It is a dual issue, in-order execution design, with two-way simultaneous multithreading. Ideally, all the computation should be partitioned among the SPEs with PPE handling the control ?ow.

PowerPC Processing Element (PPE) L1 cache

RAM Interrupt Controller SPE 0 LS SPE 1 LS SPE 2 LS SPE 3 LS Memory Controller System Memory RAM

Element Interconnect Bus (EIB) IO Device I/O Controller IO Device

SPE 4 512 KB L2 cache LS

SPE 5 LS

SPE 6 LS

SPE 7 LS

Fig. 1. Cell Broadband Engine architecture.

Author's personal copy

D.A. Bader et al. / Parallel Computing 33 (2007) 720–740

723

Each SPE consists of a synergistic processor unit (SPU) and a memory ?ow controller (MFC). The MFC includes a DMA controller, a memory management unit (MMU), a bus interface unit, and an atomic unit for synchronization with other SPUs and the PPE. The SPU is a micro-architecture designed for high performance data streaming and data intensive computation. It includes a 256 KB local store (LS) memory to hold SPU program’s instructions and data. The SPU cannot access main memory directly, but it can issue DMA commands to the MFC to bring data into the local store or write computation results back to the main memory. DMA is non-blocking so that the SPU can continue program execution while DMA transactions are performed. The SPU is an in-order dual-issue statically scheduled architecture. Two SIMD [14] instructions can be issued per cycle: one compute instruction and one memory operation. The SPU branch architecture does not include dynamic branch prediction, but instead relies on compiler-generated branch hints using prepare-to-branch instructions to redirect instruction prefetch to branch targets. Thus branches should be minimized on the SPE as much as possible. The MFC supports naturally aligned transfers of 1,2,4, or 8 bytes, or a multiple of 16 bytes to a maximum of 16 KB. DMA list commands can request a list of up to 2048 DMA transfers using a single MFC DMA command. Peak performance is achieved when both the e?ective address and the local storage address are 128 bytes aligned and the transfer is an even multiple of 128 bytes. In the Cell/B.E. processor, each SPE can have up to 16 outstanding DMAs, for a total of 128 across the chip, allowing unprecedented levels of parallelism in communication. Kistler et al. [18] analyze the communication network of the Cell/B.E. processor and state that applications that rely heavily on random scatter and or gather accesses to main memory can take advantage of the high communication bandwidth and low latency. With a clock speed of 3.2 GHz, the Cell/B.E. processor has a theoretical peak performance of 204.8 GFlop/ s (single precision). The EIB supports a peak bandwidth of 204.8 GB/s for intrachip transfers among the SPEs. The memory interface controller (MIC) provides a peak bandwidth of 25.6 GB/s to main memory. The I/O controller provides peak bandwidths of 25 GB/s inbound and 35 GB/s outbound. 3. Algorithm design and analysis 3.1. A complexity model for Cell/B.E. There are several architectural features of the Cell/B.E. processor that can be exploited for performance: ? The (SPEs) are designed as compute-intensive co-processors, while the PowerPC unit (the PPE) orchestrates the control ?ow. So it is necessary to partition the computation among the SPEs, and an e?cient SPE implementation should also exploit the SIMD instruction set. ? The SPEs operate on a limited on-chip memory (256 KB local store) that stores both instructions and data required by the program. Unlike the PPE, the SPE cannot access memory directly, but has to transfer data and instructions using asynchronous coherent DMA commands. Algorithm design must account for DMA transfers (i.e., the latency of DMA transfers, as well as their frequency), which may be a signi?cant cost. ? The SPE also di?ers from conventional microprocessors in the way branches are handled. The SPE does not support dynamic branch prediction, but instead relies on compiler-generated branch hints to improve instruction prefetching. Thus, there is a signi?cant penalty associated with branch misprediction, and branching instructions should be minimized for designing an e?cient implementation. We present a complexity model to simplify the design of parallel algorithms on the Cell/B.E.. Let n denote the problem size. We model the execution time using the triplet hTC, TD, TBi, where TC denotes the computational complexity, TD the number of DMA requests, and TB the number of branching instructions. We consider the computation on the SPEs (TC,SPE) and PPE (TC,PPE) separately, and TC denotes the sum of these terms. TC,SPE is the maximum of TC,SPE(i) for 1 6 i 6 p, where p is number of SPEs. In addition, we have TD, an upper bound on the number of DMA requests made by a single SPE. This is an important parameter, as the latency due to a large number of DMA requests might dominate over the actual computation. In cases when the complexity of TC,SPE dominates over TD, we can ignore the overhead due to DMA requests. Sim-

Author's personal copy

724

D.A. Bader et al. / Parallel Computing 33 (2007) 720–740

ilarly, branch mispredictions constitute a signi?cant overhead. Since it may be di?cult to compute the actual percentage of mispredictions, we just report the asymptotic number of branches in our algorithm. For algorithms in which the misprediction probability is low, we can ignore the e?ects of branching. ? ? Our model is similar to the Helman–JaJa model for SMPs [11] in that we try to estimate memory latency in addition to computational complexity. Also, our model is more tailored to heterogeneous multicore systems than general purpose parallel computing models such as LogP [8], BSP [24] and QSM [22]. The execution time is dominated by the SPE that does the maximum amount of work. We note that exploiting the SIMD features results in only a constant factor improvement in the performance, and does not a?ect the asymptotic analysis. This model does not take into account synchronization mechanisms such as on-chip mailboxes, and SPU operation under the isolated mode. Also, our model does not consider the e?ect of ?oating-point precision on the performance of numerical algorithms, which can be quite signi?cant [25]. 3.2. A procedure for algorithm analysis We now discuss a systematic procedure for analyzing algorithms using the above model: (1) We compute the computational complexity TC,SPE. (2) Next, we determine the complexity of DMA requests TD in terms of the input parameters: ? If the DMA request complexity is a constant, then typically computation would dominate over memory accesses and we can ignore the latency due to DMA transfers. ? Otherwise, we need to further analyze the algorithm, taking into consideration the size of DMA transfers, as well as the computational granularity. (3) It is possible to issue non-blocking DMA requests on the SPE, and so we can keep the SPE busy with computation while waiting for a DMA request to be completed. However, if there is insu?cient computation in the algorithm between a DMA request and its completion, the SPE will be idle. We analyze this e?ect by computing the computational complexity in the average case between a DMA request and its completion. If this term is a function of the input parameters, this implies that memory latency can be hidden by computation. (4) Finally, we compute the number of branching instructions TB in the algorithm. These should be minimized as much as possible in order to design an e?cient algorithm. We present a simple example to illustrate the use of our model, as well as the above algorithm analysis procedure. Pn Matrix multiplication (C = A*B, cij ? k?1 aik ? bkj , for 1 6 i, j 6 n) on the Cell/B.E. is analyzed as follows: ? We partition computation among the eight SPEs by assigning each SPE the Matrix B and n rows of Matrix A. p ? Let us assume that each SPE can obtain b rows of Aand b columns of B in a single DMA transfer. Thus, we 2 n2 would require O n2 DMA transfers, and TD is O pb2 . b Chen et al. describe their Cell/B.E. implementation of this algorithm in [6]. The algorithmic complexity is 3 n2 given by T C ? O np , T D ? O pb2 and T B ? O?n2 ?. Using the analysis procedure, we note that the DMA request complexity is not a constant. Following Step 3, we compute the average case of the computational complexity between the DMA request and its completion, assuming non-blocking DMAs and double-bu?ering. This is given by O(nb2) (as the complexity of computing b2 elements in C is O(n)). Thus, we can ignore the constant DMA latency for each transfer, and the algorithm running time is dominated by computation for su?ciently large n. However, note that we have O(n2) branches due to the absence of a branch predictor on the SPE, which might degrade performance if they result in mispredicts. Using SIMD and dual-issue features within the SPE, it is possible to achieve a peak CPI of 0.5. Chen et al. in fact obtain a CPI of 0.508 for their implementation of the above algorithm, incorporating optimizations such as SIMD, double-bu?ering and software pipelining.

Author's personal copy

D.A. Bader et al. / Parallel Computing 33 (2007) 720–740

725

Algorithm design and analysis is more complex for irregular, memory-intensive applications, and problems exhibiting poor locality. List ranking is representative of this class of problems, and is a fundamental technique for the design of several combinatorial and graph-theoretic applications on parallel processors. After a brief introduction to list ranking in the next section, we describe our design of an e?cient algorithm for the Cell/B.E. 4. List ranking using the Cell Broadband Engine Given an arbitrary linked list that is stored in a contiguous area of memory, the list ranking problem determines the distance of each node to the head of the list (see Fig. 2). For a random list, the memory access patterns are highly irregular, and this makes list ranking a challenging problem to solve e?ciently on parallel architectures. Implementations that yield parallel speedup on shared memory systems exist [11,3], yet none are known for distributed memory systems. 4.1. Parallel algorithm List ranking is an instance of the more general pre?x problem [3]. Let X be an array of n elements stored in arbitrary order. For each element i, let X(i) ? value denote its value and X(i) ? next the index of its successor. Then for any binary associative operator ?, compute X(i) ? pre?x such that X(head) ? pre?x = X(head) ? value and X(i) ? pre?x = X(i) ? value ? X(predecessor) ? pre?x, where head is the ?rst element of the list, i is not equal to head, and predecessor is the node preceding i in the list. If all values are 1 and the associative operation is addition, then pre?x reduces to list ranking. We assume that we know the location of the head h of the list, otherwise we can easily locate it. The parallel algorithm for a canonical parallel computer with p processors is as follows: (1) Partition the input list into s sublists by randomly choosing one node from each memory block of n/(s ? 1) nodes, where s is X(p log n). Create the array Sublists of size s. (2) Traverse each sublist computing the pre?x sum of each node within the sublists. Each node records its sublist index. The input value of a node in the Sublists array is the sublist pre?x sum of the last node in the previous Sublists. (3) The pre?x sums of the records in the Sublists array are then calculated. (4) Each node adds its current pre?x sum value (value of a node within a sublist) and the pre?x sum of its corresponding Sublists record to get its ?nal pre?x sums value. This pre?x sum value is the required label of the leaves. We map this to the Cell/B.E. and analyze it as follows. Assume that we start with eight sublists, one per SPE. Using DMA fetches, the SPEs continue obtaining the successor elements until they reach a sublist end, or the end of the list.

Ordered List

Ranks

1

2

3

4

5

6

7

8

9

10

11

12

Random List

Ranks

1

8

7

2

9

3

6

10

4

5

12

11

Fig. 2. List ranking for ordered (top) and random (bottom) list.

Author's personal copy

726

D.A. Bader et al. / Parallel Computing 33 (2007) 720–740

4.2. Applying the complexity model to list ranking Analyzing the complexity of this algorithm using our model, we have T C ? O n , T D ? O n and p p T B ? O?1?. From Step 2 of the procedure, since the complexity of DMA fetches is a function of n, we analyze the computational complexity in the average case between a DMA request and its completion. This is clearly O(1), since we do not perform any signi?cant computation while waiting for the DMA request to complete. This may lead to processor stalls, and since the number of DMA requests is O(n), stall cycles might dominate the optimal O(n) work required for list ranking. Our asymptotic analysis o?ers only a limited insight into the algorithm, and we have to inspect the algorithm at the instruction-level and design alternative approaches to hide DMA latency. 4.3. A novel latency-hiding technique for irregular applications Due to the limited local store (256 KB) within a SPE, memory-intensive applications that have irregular memory access patterns require frequent DMA transfers to fetch the data. The relatively high latency of a DMA transfer creates a bottleneck in achieving performance for these applications. Several combinatorial problems, such as the ones that arise in graph theory, belong to this class of problems. Formulating a general strategy that helps overcome the latency overhead will provide direction to the design and optimization of irregular applications on the Cell/B.E. Since the Cell/B.E. supports non-blocking memory transfers, memory transfer latency will not be a problem if we have su?cient computation between a request and completion. However, if we do not have enough ? ? computation in this period (for instance, the Helman–JaJa list ranking algorithm), the SPE will stall for the request to be completed. A generic solution to this problem would be to restructure the algorithm such that the SPE does useful computation until the memory request is completed. This essentially requires identi?cation of an additional level of parallelism/concurrency within each SPE. Note that if the computation can be decomposed into several independent tasks, we can overcome latency by exploiting concurrency in the problem. Our technique is analogous to the concept of tolerating latency in architectures such as the Cray MTA-2 and NVIDIA G80 that use massive multithreading to hide latency. The SPE does not have support for hardware multithreading, and so we manage the computation through Software-Managed threads. The SPE computation is distributed to a set of Software-Managed threads (SM-Threads) and at any instant, one thread per SPE is active. We keep switching software contexts so that we do computation between a DMA request and its completion. We use a round-robin schedule for the threads (see Fig. 3). Through instruction-level pro?ling, it is possible to determine the minimum number of SM-Threads that are needed to hide the memory latency. Note that utilizing more SM-Threads than required incurs a signi?cant

SPE 1

SPE 2

SPE i

SPE p SPE computation

SPE computation divided among SM-Threads

Round-robin policy to schedule the SM-Threads

Fig. 3. Illustration of the technique.

Author's personal copy

D.A. Bader et al. / Parallel Computing 33 (2007) 720–740

727

overhead. Each SM-Thread introduces additional computation and also requires memory on the limited local store. Thus, we have a trade-o? between the number of SM-Threads and latency due to DMA stalls. In the next section, we will use this technique to e?ciently implement list ranking on the Cell/B.E. 4.4. Implementation Our Cell/B.E. implementation (described in high-level in the following four steps) is similar to the Helman– ? ? JaJa algorithm. Let us assume p SPEs in the analysis: (1) We uniformly pick s head nodes in the list and assign them to the SPEs. So, each SPE will traverse s/p sublists. (2) Using these s/p sublists as independent SM-Threads, we adopt the latency-hiding technique. We divide the s/p sublists into b DMA list transfers. Using one DMA list transfer, we fetch the next elements for a set of s/ pb lists. After issuing a DMA list transfer request, we move onto the next set of sublists and so forth, thus keeping the SPU busy until this DMA transfer is complete. Fig. 4 illustrates Step 3 of this algorithm. We maintain temporary structures in the local store (LS) for these sublists, so that the LS can create a contiguous sublist out of these randomly scattered sublists, by creating a chain of next elements for the sublists. After one complete round, we manually revive this SM-Thread and wait for the DMA transfer to complete. Note that there will be no stall if we have su?cient number of SM-Threads (we determine this number in Section 4.5) to hide the latency. We store the elements that are fetched into the temporary structures, initiate a new DMA list transfer request for fetching the successors of these newly fetched elements, and move on to the next set of sublists. When these temporary structures get full, we initiate a new DMA list transfer request to transfer back these elements to the main memory. At the end of Step 2, we have the pre?x sum of each node within the sublist for each sublist within the SPU. Also, we have the randomly scattered sublists stored into a contiguous area of memory. (3) Compute the rank of each sublist head node using the PPU. The running time for Step 2 of the algorithm dominates over the rest of algorithm by an order of magnitude. In the asymptotic notation, this step is O(n). It consists of an outer loop of O(s) and an inner loop of

Fig. 4. Step 2 of list ranking on the Cell/B.E. (a) Linked list for which list ranking is to be done. Highlighted nodes in this ?gure are allocated to SPE(i); (b) View from SPE(i), it has s/p sublist head nodes to traverse concurrently; (c) This array is used to store sublists in contiguous area of memory. When this gets full, we transfer it back to the main memory.

Author's personal copy

728

D.A. Bader et al. / Parallel Computing 33 (2007) 720–740

O(length of the sublist). Since the lengths of the sublists are di?erent, the amount of work performed by each SM-Thread di?ers. For a large number of threads, we get su?cient computation for the SPE to hide DMA ? ? latency even when the load is imbalanced. Helman and JaJa [11,12] established that with high probability, no processor would traverse more a?s? n elements for a(s) P 2.62. Thus, the load is balanced among various p SPEs under this constraint. In our implementation, we incorporate recommended software strategies [5] and techniques to exploit the architectural features of the Cell/B.E. For instance, we use manual loop unrolling, branch hints, and design our implementation for a limited local store. 4.5. Performance results We report our performance results from actual runs on a IBM BladeCenter QS20, with two 3.2 GHz Cell/ B.E. processors, 512 KB Level 2 cache per processor, and 1 GB memory (512 MB per processor). We use one processor for measuring performance and compile the code using the gcc compiler provided with Cell SDK 2.0, with level 3 optimization. Similar to [11,3] we use two classes of lists to test our code, Ordered and Random. An ordered list representation places each node in the list according to its rank. Thus node i is placed at position i, and its successor is at position i + 1. A random list representation places successive elements randomly in the array. Our signi?cant contribution in this paper is a generic work partitioning technique to hide memory latency. We demonstrate the results of this technique for list ranking: we use SM-threads to vary the number of outstanding DMA requests on each SPE, as well as partition the problem and allocate more sublists to each SPE. Fig. 5 shows the performance boost we obtain as we tune the DMA parameter. From instruction-level pro?ling of our code we determine that the exact number of computational clock cycles between a DMA transfer request and its completion are 75. Comparing this with the DMA transfer latency (90 ns, i.e. about 270 clock cycles) suggests that four outstanding DMA requests should be su?cient for hiding the DMA latency. Our results con?rm this analysis and we obtain an improvement factor of 4.1 using eight DMA bu?ers. In Fig. 6, we present the results for load balancing among the eight SPEs, as the number of sublists are varied. For ordered lists, we allocate equal chunks to each SPE. Thus, load is balanced among the SPEs in this case. For random lists, since the length of each sublist varies, the work performed by each SPE varies. We achieve a better load balancing by increasing the number of sublists. Fig. 6 illustrates this: load balancing is better for 64 sublists than the case of eight sublists. We present a performance comparison of our implementation of list ranking on the Cell/B.E. with other single processor and parallel architectures. We consider both random and ordered lists with 8 million nodes. Fig. 7 shows the running time of our Cell/B.E. implementation compared with e?cient implementations of list ranking on the following architectures: Intel_x86: Intel_i686: Intel_ia64: SunUS_III: Intel_WC: MTA-[1,2,8]: SunUS-[1,2,8]: 3.2 GHz Intel Xeon processor, 1 MB L2 cache, Intel C compiler v9.1. 2.8 GHz Intel Xeon processor, 2 MB L2 cache, Intel C compiler v9.1. 900 MHz Intel Itanium 2 processor, 256 KB L2 cache, Intel C compiler v9.1. 900 MHz UltraSparc-III processor, Sun C compiler v5.8. 2.67 GHz Intel Dual-Core Xeon 5150 processor (Woodcrest), Intel C compiler v9.1. 220 MHz Cray MTA-2 processor, no data cache. We report results for 1, 2, 8 processors. 400 MHz UltraSparc II Symmetric Multiprocessor system (Sun E4500), 4 MB L2 cache, Sun C compiler. We report results for 1, 2 and 8 processors.

For Intel Xeon 5150 (Woodcrest) we use a parallel implementation [3] running on two threads (see Table 1). Table 2 gives the comparison of running time of various steps in the list ranking algorithm. The table shows that Step 2 dominates the entire running time of the algorithm. Finally, we demonstrate a substantial speedup of our Cell/B.E. implementation over a sequential implementation using the PPE-only. We compare the best

Author's personal copy

D.A. Bader et al. / Parallel Computing 33 (2007) 720–740

729

Fig. 5. Achieving latency tolerance through DMA parameter tuning for ordered (top) and random (bottom) list of size 220.

sequential approach that uses pointer-chasing to our algorithm using di?erent problem instances. Fig. 8 shows that for random lists we get an overall speedup of 8.34 (1 million vertices), and even for ordered lists we get a speedup of 1.5. 5. Data compression/decompression on the Cell Broadband Engine We now discuss the next case study of optimizing compression/decompression on the Cell/B.E. processor using zlib [9].

Author's personal copy

730

D.A. Bader et al. / Parallel Computing 33 (2007) 720–740

Fig. 6. Load balancing among SPEs for ordered (top) and random (bottom) for list ranking on the Cell/B.E. with a list of size 220.

5.1. Description of zlib algorithm The zlib library is based on the LZ77 algorithm [26] and Hu?man coding [13]. In the LZ77 algorithm, a data stream is compressed by ?rst identifying the longest repeating string, and then replacing the duplicated string with a reference to its previous occurrence. This reference is represented by a length–distance pair, where length speci?es the length of the repeated string and distance gives the o?set of this occurrence. During compression, a hash table is constructed and used to make the process of ?nding matches faster. For strings that do not have a previous match, the input literal contained in the string is added to the compressed data. A further compression is achieved by Hu?man coding. Hu?man coding enhances the compression ratio by

Author's personal copy

D.A. Bader et al. / Parallel Computing 33 (2007) 720–740

731

Fig. 7. Performance of list ranking on the Cell/B.E. as compared to other single processor and parallel architectures for lists of size 8 million nodes. The speedup of the Cell/B.E. implementation over the architectures is given above the respective bars.

encoding symbols with varying bit lengths according to the frequency of occurrences. In particular, shorter codes are assigned for more frequent symbols, and longer codes are assigned for less frequent symbols. During decompression, length–distance pairs and literals are retrieved from Hu?man-coded data and the reference pairs are replaced to obtain original data from the compressed data. In Fig. 9, we have an illustration of the LZ77 algorithm. Suppose the current sliding window covers literals starting with index 8 in the input sequence. The hash function generates a hash key by using the ?rst 3 bytes (index from 8 to 10) of the input string that helps locate the most recent occurrence of this string. The algo-

Author's personal copy

732

D.A. Bader et al. / Parallel Computing 33 (2007) 720–740

Table 1 Speedup of our Cell/B.E. implementation of list ranking as compared with other architectures Architecture Intel_x86 Intel_i686 Intel_ia64 SunUS_III Intel_WC MTA-1 SunUS_II-1 SunUS_II-2 SunUS_II-8 Speedup (ordered) 0.85 0.9 1.5 5.4 0.82 5.0 42.5 23.9 10.8 Speedup (random) 4.15 4.1 5.5 12.1 2.5 1.05 14.5 8.5 3.1

Table 2 Running time comparison of the various steps of list ranking on Cell/B.E. for ordered and random lists of size 8 million nodes Task Step 1: Identi?cation of sublist head nodes Step 2: Traversal of sublists Step 3: Computing the ranks Ordered (%) 0.09 99.84 0.07 Random (%) 0.02 99.96 0.02

rithm also maintains a backward chain for the strings with the identical hash key value using a separate data structure. This is used during a backward traversal, during which a comparison is made with each string to identify the match length. The string that has the longest match is located and the repeated string is replaced by a reference to the located string. Also, the hash table is updated to point to this location and the backward chain is updated to start from this location. Several input parameters in zlib result in a trade-o? between compression ratio, speed and memory requirements. One such parameter is the window size, which determines the size of the recent input data to be stored for searching matches. A large window size can help achieve a better compression ratio, but results in a large working set. Another con?guration option is the hash table size. A big hash table leads to less collisions that makes the program run faster. It also helps achieve a better compression ratio, but leads to higher memory utilization. The compression/decompression stages in zlib are highly dependent in data processing which makes it dif?cult to parallelize. For example, during decompression of the Hu?man-coded data, the stream of input signals needs to be decoded sequentially. This is because the symbols have varying bit lengths and we do not know where the next symbol starts before its previous symbol is decoded. Also, Hu?man-coded data is a mix of literals and length–distance pairs. In order to determine if the next symbol is a literal, length, or distance, the previous symbol must be decoded. Another design issue is that the zlib algorithm requires a substantial number of table lookups. For example, ?nding matches and decoding the Hu?man-coded data stream require accessing the hash and code tables, respectively. Since the SPE does not support scatter and gather type memory access to the local store, this cannot be vectorized. In addition, zlib algorithm has many branches which are dependent on input data and hard to predict statically. For a stream with poor compression ratio, almost every symbol is a literal. On the other hand, for highly compressible streams, most symbols are length–distance pairs. This leads to different branch behaviors and due to the absence of a dynamic branch predictor in the SPEs, these branch mispredictions reduce the performance. 5.2. Applying the complexity model to zlib We apply the complexity model discussed in Section 3.1 to analyze zlib algorithm. The computational complexity of the sequential algorithm is O(n), as we make one pass over the data stream and the hash table lookups are constant-time. In our parallel implementation of zlib, we partition the input data into multiple blocks.

Author's personal copy

D.A. Bader et al. / Parallel Computing 33 (2007) 720–740

733

Fig. 8. Performance comparison of sequential implementation on PPE to our parallel implementation of list ranking on the Cell/B.E. for ordered (top) and random (bottom) lists.

We the work-queue strategy detailed in Section 5.4 for load-balanced computation. Thus, TC,SPE becomes use O n . p Let us assume that the DMA transfer block size is b. TD, the complexity of DMA requests, is given by O(n/pb). Note that in zlib the DMA transfer block size is larger than that of list ranking, where we use a block size of O(1). From Step 2 of the algorithm analysis procedure, since the complexity of DMA fetches is a

Author's personal copy

734

D.A. Bader et al. / Parallel Computing 33 (2007) 720–740

Fig. 9. An illustration of the LZ77 algorithm.

function of n, we analyze the computational complexity in the average case between a DMA request and its completion. For zlib this is given by O(b). Since this is not a function of the input parameters, further analysis is required to determine the amount of data bu?ering required. The zlib algorithm is branch intensive, and branch instructions on the Cell/B.E. are hard to predict statically. TB, the number of branch instructions, is O(n). However, we can still achieve speed-up by parallelization, vectorization and other Cell/B.E. processor speci?c optimizations, which suggests that the Cell/B.E. processor can be used for wider range of applications. 5.3. Optimizing zlib for the Cell Broadband Engine The ?rst step towards optimization of an algorithm on the Cell/B.E. involves the identi?cation of the compute-intensive parts. Table 3 shows the result of pro?ling for the compression routine in zlib. During compression, a new string is inserted into the hash table whenever a match is found. This step also consists of a intensive loop that iterates over for the length of the match and inserts a single byte into the hash table in each iteration. The hash key is computed using the 3 bytes starting from the inserting byte, with an overlap of 2 bytes that were considered during the previous iteration. In the original implementation, this overlap is exploited and the hash key is computed from the previous key to reduce the amount of computation. This gives performance bene?t on a scalar machine. However, this generates a false dependency which makes it tough to vectorize for the Cell/B.E. processor. We break this dependency by calculating the hash key using all 3 bytes in every iteration without using the previous hash key. This generates redundancy in computation but improves performance in overall by vectorization.

Table 3 Percentage of execution time of the various tasks during compression Task Fill window for previous data to ?nd matches Find the longest matches in LZ77 algorithm Update the output bu?ers and statistics according to the availability of the matches Perform Hu?man coding on the data Execution time (%) 11.73 32.69 36.89 11.81

Author's personal copy

D.A. Bader et al. / Parallel Computing 33 (2007) 720–740

735

Another step in compression requires ?nding the longest match where a comparison is made between the current string and the previous strings. This step consists of intensive loops with byte comparisons, which can be implemented using the 16 byte byte-wise vector comparison instruction. This optimization leads to a signi?cant speed up if data has high redundancy, whereas for a less redundant input data comparison is made with only small number of bytes which reduces the performance bene?t. In the compression process, zlib manages a sliding window for ?nding the longest match using the LZ77 algorithm. The window is divided into two parts, the ?rst half stores the previous data stream that is used to ?nd the matches and the second half stores the current stream that needs to be compressed. When the current pointer reaches to the end of the sliding window, the second half of the sliding window is copied to the ?rst half, and a new data stream is loaded to the second half. Accordingly, the hash table and backward chain, which have the pointers to the sliding window, need to be updated. This is a intensive loop and the iteration count of the loop depends on window and hash table size. We vectorized this loop and manually unroll it to achieve better pipeline utilization. The pointer variable in this loop is 2 bytes long, thus vectorization reduces the loop iterations by a factor of eight. Table 4 shows the result of pro?ling for the decompression routine in zlib. The most compute-intensive task during decompression is ?rst, converting the Hu?man-coded data to literals or length-distance pairs and then, replacing the references with the original data. The ?rst step, is highly data-dependent, which makes it hard to vectorize. The second step is essentially a byte-wise memory copy that can be vectorized. We can copy 16 bytes at once using vector instructions. However, if there is an overlap between the source and destination vectors this technique does not work (Fig. 10). To overcome this problem we shu?e the source vector using a table lookup for shu?e pattern, and copy the generated output vector to the destination. Another computation intensive step during decompression involves the calculation of CRC values. CRC algorithms for SIMD architectures have been studied by Joshi et al. [16] and Lin [19]. Lin [19] introduced the concept of ‘congruent equivalence’, and replaced table lookups with the shu?e instructions. Table lookups

Table 4 Percentage of execution time of the various tasks during decompression Task Convert Hu?man-coded data to literal, and length–distance pairs, and copy data from previous input according to these pairs Construct table for decompression Calculate the CRC value Parse header data Execution time (%) 77.25 8.72 7.96 5.97

Source

Destination

Original 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

Source

Destination

Output using vector copy 1

Source

2

1

2

3

4

5

6

7

8

9

10

11

12

13

14

Destination

Desired Output 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2

Fig. 10. Illustration of problem in byte-wise memory copy using vector instruction.

Author's personal copy

736

D.A. Bader et al. / Parallel Computing 33 (2007) 720–740

can be replaced with the shu?e instructions if table size is equal to or smaller than 32 bytes. Joshi et al. and Lin used nibble-wise table lookup instead of traditional byte-wise lookup to reduce the table size. These approaches are attractive for the Cell/B.E. processor as the SPE supports the shu?e instruction. We also vectorize the Adler32 algorithm [20], which can be used instead of the crc32( ) function. We use statistical approaches for branch hinting and loop optimization. Loop unrolling works for intensive loops, but adds overhead and increases code size. This does not help for loops with a small number of iterations. So we identify the compute-intensive loops and apply optimization techniques on them. This gives a performance boost and saves local store space. Branch hint instruction bene?ts especially when the branch o?set is large, and branch is highly predictable statically. When compiler encounters branch hint intrinsic, compiler inserts branch hint instruction which facilitates prefetching the instruction and also make hinted case as the fall through case if possible. We use branch hint intrinsics wherever possible to reduce the number of branch mispredictions. Reducing memory usage is also important in the optimization. In a virtual memory system, unused function code resides on disk and never loaded to the main memory or the cache. In the SPE, the entire executable image is loaded to the local store. Considering that the size of local store is only 256 KB, this is undesirable. We separate compression and decompression routines to reduce code size. 5.4. Optimizing gzip on the Cell Broadband Engine Gzip is an application that uses the zlib library for ?le-compression and decompression. The zlib algorithms are highly data-dependent which makes it hard to partition the input data for parallel processing. We use full ?ushing (supported by zlib API) to break this data dependency. We can then restart the algorithm after the full ?ush point, making the algorithm data-dependent only between two consecutive ?ush points. This helps in partitioning the data and enables parallel processing. However, during decompression a complete search of the compressed ?le is required to ?nd the ?ush points (0, 0, 0x?, 0x?). This is an additional overhead particularly for the ?les compressed without full ?ushing. By extending gzip header format we can avoid this search overhead keeping the ?le compatible with the legacy gzip decompressors. We use an extra ?eld to indicate that the ?le has multiple blocks with full ?ush between blocks. The size of the each block is also included in the header in order to avoid full search to ?nd synchronization points. During the decompression stage, the decompressor inspects the ?ags. If the sub-?eld ID for Cell/B.E. gzip extension is found on systems with the Cell/B.E. zlib, it decompresses the ?le in parallel using multiple SPEs. If not, it sequentially decompresses the ?le as a normal gzip ?le. A legacy decompressor would ignore the extra ?eld with unknown sub?eld ID and skip the bytes. Therefore, the ?le can still be decompressed correctly. Full ?ushing and additional information will slightly increase the ?le size by a negligible amount (0.2% increase for 900 KB block size). The new ?eld is structured as follows: ? ? ? ? Byte 1 stores the character ‘C’ which stands for Cell /B.E. Byte 2 stores the character ‘E’ which stands for Extension. We use 2 bytes for storing n, number of blocks. We use 4 bytes for storing the size of each block, and this requires 4n bytes.

We also consider load balancing and ?le I/O delay in our parallel implementation. It is important to note that dividing input ?le into chunks of equal size will not distribute work uniformly among the processors, as the compression time is highly dependent on the input ?le characteristics. We achieve load balancing by partitioning input ?le to multiple blocks. Each SPE is initially assigned with a work-task, and on completion it picks up another work-task from the work queue. We also create two additional threads for ?le read and write. The ?le read thread reads the data to be processed in advance and ?ll the memory bu?er. The write thread writes the computed result in background. 5.5. gzip performance results The gzip code is compiled with gcc provided with the Cell SDK 2.0 using optimization level 5 (?O5). We used the Cell SDK simulator for measuring clock cycles, and the IBM QS20 Cell/B.E. blade for running time

Author's personal copy

D.A. Bader et al. / Parallel Computing 33 (2007) 720–740 Table 5 Parameters used for compiling the zlib library zlib library Basic Compression Decompression windowBits 13 13 15

737

memLevel 6 7 –

Fig. 11. Speedup of Cell/B.E. optimized gzip application on a single SPE for compression (top) and decompression (bottom).

Author's personal copy

738

D.A. Bader et al. / Parallel Computing 33 (2007) 720–740

measurements. The parameters used to compile the original zlib library, Cell/B.E. optimized compression, and decompression, are given in Table 5. The total memory use is given by ? Compression: 2windowBits+2 + 2memLevel+9 bytes. ? Decompression: 2windowBits+2 bytes. Fig. 11 shows the speedup obtained for di?erent ?le types using our optimized SPE implementation as compared to the gzip program using original zlib library run on the single SPE. The speedup is achieved by SIMDization and other Cell/B.E. speci?c optimizations including branch hints and loop unrolling. The test ?les we have used for performance analysis consist of an already compressed ?le that has very less redundancy, Bitmap ?le 1, 2 and 3 that have increasing level of redundancy, with Bitmap ?le 3 having 100% redundancy. Please refer to Table 6 for the compression ratio of these ?le types. Fig. 11 shows that the speedup obtained, during compression and decompression stages, is highly dependent on the input ?le type where we get performance improvement of over 4 for Bitmap ?le 3 and an improvement of 1.5 for an already compressed ?le. Fig. 12 shows the performance comparison of the Cell/B.E. optimized gzip compression as compared with other single processor architectures. Please refer to Section 4.5 for details of these architectures. For Intel Xeon 5150 (Woodcrest) we ran gzip using a single thread. Due to the unavailability of a opensource parallel gzip application our performance is limited to single processor architectures.

Table 6 Compression ratio compressed file size ? 100 using gzip for the various test ?les original file size Test ?le Compressed ?le (484 KB) Bitmap ?le (355 KB) Text ?le (460 KB) Bitmap ?le 2 (546 KB) Bitmap ?le 3 (2.40 MB) Compression level 1 (%) 100 35.8 11.3 7.84 0.499 Compression level 5 (%) 100 31.0 8.16 7.35 0.158 Compression level 9 (%) 100 29.4 7.48 7.08 0.138

Fig. 12. Performance comparison of Cell/B.E. optimized gzip compression with the original zlib implementation on other single processor architectures.

Author's personal copy

D.A. Bader et al. / Parallel Computing 33 (2007) 720–740

739

6. Conclusions In summary, we present two case studies to illustrate the design and implementation of parallel combinatorial algorithms on the Cell/B.E. processor: list ranking, a fundamental kernel in graph algorithms, and zlib, a data compression and decompression library. We introduce the idea of Software-Managed threads (SMThreads), a generic work-partitioning technique for latency tolerance and load balancing on the Cell/B.E. We apply this technique to develop a fast parallel implementation of list ranking, and con?rm its e?cacy by demonstrating an improvement factor of 4.1 as we tune relevant parameters. Most importantly, we achieve an overall speedup of 8.34 of our implementation over an e?cient PPE-only sequential implementation. Similarly, for the zlib library, we exploit concurrency at multiple levels: we vectorize compute-intensive kernels in compression and decompression, and also parallelize the functions to run on multiple SPEs. Our Cell/B.E. implementation of the gzip application, based on the zlib library, achieves a speedup of 2.9 for compression over a high-end Intel Pentium-4 system. With these case studies, we highlight the potential of the Cell/B.E. processor as an accelerator for combinatorial applications. Acknowledgements This work was supported in part by NSF Grants CNS-0614915, CAREER CCF-0611589, DBI-0420513, ITR EF/BIO 03-31654, and an IBM Shared University Research (SUR) Grant. We would also like to thank Sidney Manning (IBM Corporation) and Vipin Sachdeva (IBM Research) for providing valuable inputs during the course of our research. We also acknowledge Georgia Institute of Technology, its Sony-Toshiba-IBM Center of Competence, and the National Science Foundation, for the use of Cell Broadband Engine resources that have contributed to this research. References

[1] D.A. Bader, G. Cong. A fast, parallel spanning tree algorithm for symmetric multiprocessors (SMPs), in: Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS 2004), Santa Fe, NM, April 2004. [2] D.A. Bader, G. Cong, Fast shared-memory algorithms for computing the minimum spanning forest of sparse graphs, in: Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS 2004), Santa Fe, NM, April 2004. [3] D.A. Bader, G. Cong, J. Feo, On the architectural requirements for e?cient execution of graph algorithms, in: Proceedings of the 34th International Conference on Parallel Processing (ICPP), Oslo, Norway, June 2005. [4] D.A. Bader, S. Sreshta, N. Weisse-Bernstein, Evaluating arithmetic expressions using tree contraction: a fast and scalable parallel implementation for symmetric multiprocessors (SMPs), in: S. Sahni, V.K. Prasanna, U. Shukla (Eds.), Proceedings of the Ninth International Conference on High Performance Computing (HiPC 2002), Bangalore, India, Lecture Notes in Computer Science, vol. 2552, Springer-Verlag, 2002, pp. 63–75. [5] D.A. Brokenshire, Maximizing the power of the Cell Broadband Engine Processor: 25 tips to optimal application performance, IBM developerWorks technical article, 2006. [6] T. Chen, R. Raghavan, J. Dale, E. Iwata, Cell Broadband Engine Architecture and its ?rst implementation, IBM developerWorks technical article, 2005. [7] R. Cole, U. Vishkin, Faster optimal pre?x sums and list ranking, Information and Computation 81 (3) (1989) 344–352. [8] D.E. Culler, R.M. Karp, D.A. Patterson, A. Sahay, K.E. Schauser, E. Santos, R. Subramonian, T. von Eicken, LogP: towards a realistic model of parallel computation, in: Proceedings of the Fourth Symposium Principles and Practice of Parallel Programming, ACM SIGPLAN, May 1993, pp. 1–12. [9] P. Deutsch, J.-L. Gailly, zlib compressed data format speci?cation version 3.3, Internet RFCs, 1996. [10] B. Flachs et al., A streaming processor unit for a Cell processor, in: Proceedings of the International Solid State Circuits Conference, vol. 1, San Fransisco, CA, USA, February 2005, pp. 134–135. ? ? [11] D.R. Helman, J. JaJa, Designing practical e?cient algorithms for symmetric multiprocessors, in: Algorithm Engineering and Experimentation (ALENEX’99), Baltimore, MD, Lecture Notes in Computer Science, vol. 1619, Springer-Verlag, 1999, pp. 37–56. ? ? [12] D.R. Helman, J. JaJa, Pre?x computations on symmetric multiprocessors, Journal of Parallel and Distributed Computing 61 (2) (2001) 265–278. [13] D.A. Hu?man, A method for the construction of minimum-redundancy codes, Proceedings of the IRE 40 (9) (1952) 1098–1101. [14] C. Jacobi, H.-J. Oh, K.D. Tran, S.R. Cottier, B.W. Michael, H. Nishikawa, Y. Totsuka, T. Namatame, N. Yano, The vector ?oatingpoint unit in a synergistic processor element of a Cell processor, in: Proceedings of the 17th IEEE Symposium on Computer Arithmetic, Washington, DC, USA, 2005, IEEE (ARITH’05) Computer Society, pp. 59–67. ? ? [15] J. JaJa, An Introduction to Parallel Algorithms, Addison-Wesley Publishing Company, New York, 1992.

Author's personal copy

740

D.A. Bader et al. / Parallel Computing 33 (2007) 720–740

[16] S.M. Joshi, P.K. Dubey, M.A. Kaplan, A new parallel algorithm for CRC generation, IEEE International Conference on Communications 3 (2000) 1764–1768. [17] J.A. Kahle, M.N. Day, H.P. Hofstee, C.R. Johns, T.R. Maeurer, D. Shippy, Introduction to the Cell multiprocessor, IBM Journal of Research Development 49 (4/5) (2005) 589–604. [18] M. Kistler, M. Perrone, F. Petrini, Cell multiprocessor communication network: built for speed, IEEE Micro 26 (3) (2006) 10–23. [19] B. Lin, Altivec solutions to sequential problems: calculating CRC with scalable congruent equivalent compression, Freescale application note AN2926, January 2006. [20] K. Margaritis, Vectorization of algorithm Adler32 using altivec. freevec.org whitepaper, 2005. [21] D. Pham et al. The design and implementation of a ?rst-generation Cell processor, in: Proceedings of the International Solid State Circuits Conference, vol. 1, San Fransisco, CA, USA, February 2005, pp. 184–185. [22] V. Ramachandran, A general-purpose shared-memory model for parallel computation, in: M.T. Heath, A. Ranade, R.S. Schreiber (Eds.), Algorithms for Parallel Processing, vol. 105, Springer-Verlag, New York, 1999, pp. 1–18. [23] M. Reid-Miller, List ranking and list scan on the Cray C-90, Journal of Computer and System Sciences 53 (3) (1996) 344–356. [24] L.G. Valiant, A bridging model for parallel computation, Communications of the ACM 33 (8) (1990) 103–111. [25] S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, K. Yelick, The potential of the Cell processor for scienti?c computing, in: Proceedings of the Third Conference on Computing Frontiers CF’06, 2006, ACM Press, New York, NY, USA, pp. 9–20. [26] J. Ziv, A. Lempel, A universal algorithm for sequential data compression, IEEE Transactions on Information Theory 23 (3) (1977) 337–343.

相关文章:

更多相关标签:

- Vectorized Data Processing on the Cell Broadband Engine ABSTRACT
- Cell__Broadband_Engine_processor_Design_and_Implementation
- The Potential of the Cell Broadband Engine for Data Mining
- High-Performance Web Site Design Techniques
- On Characterizing Performance of the Cell Broadband Engine Element Interconnect Bus
- On Characterizing Performance of the Cell Broadband Engine Element Interconnect Bus
- Cell Broadband Engine
- The Potential of the Cell Broadband Engine for Data Mining
- Cell__Broadband_Engine_processor_Design_and_Implementation
- A 0.039um2 high performance eDRAM cell based on 32nm High-K technology
- Implementation of a cone-beam backprojection algorithm on the Cell Broadband Engine process
- Particle-in-Cell Simulation Codes in High Performance Fortran
- Chip Multiprocessing and the cell broadband engine
- A combined A431 cell membrane chromatography and online high performance
- Implementation of the mixed-precision high performance LINPACK benchmark on the CELL proces