Logarithmic Time Cost Optimal Parallel Sorting is Not Yet Fast in Practice!
Lasse Natvig Division of Computer Systems and Telematics The Norwegian Institute of Technology, The University of Trondheim N{7034 Trondheim, NORWAY (E-mail: lasse@idt.unit.no) November 29, 1996
When looking for new and faster parallel sorting algorithms for use in massively parallel systems it is tempting to investigate promising alternatives from the large body of research done on parallel sorting in the eld of theoretical computer science. Such \theoretical" algorithms are mainly described for the PRAM (Parallel Random Access Machine) model of computation 13, 26]. This paper shows how this kind of investigation can be done on a simple but versatile environment for programming and measuring of PRAM algorithms 18, 19]. The practical value of Cole's Parallel Merge Sort algorithm 10,11] have been investigated by comparing it with Batcher's bitonic sorting 5]. The O(logn) time consumption of Cole's algorithm implies that it must be faster than bitonic sorting which is O(log n) time|if n is large enough. However, we have found that bitonic sorting is faster as long as n is less than 1:2 10 , i.e. more than 1 Giga Tera items!. Consequently, Cole's logarithmic time algorithm is not fast in practice.
2 21
Abstract
In the past years there has been an increased interest in parallel algorithms. In the eld of parallel complexity theory the so called Nick's Class (NC) has been given a lot of attention. Problems belonging to this complexity class may be solved by algorithms with polylogarithmic (i.e. of O(logk (n)) where k is a constant and n is the problem size) running time and a polynomial requirement for processors. A lot of parallel algorithms have recently been presented to prove membership in this class for various problems. This kind of algorithms are frequently denoted NC-algorithms. Although the main motivation for these new algorithms often is to show that a given problem is in NC, the algorithms may also be taken as proposals for parallel algorithms that are \fast in practice". Unfortunately, it may be very hard to assess the practical value of NC-algorithms. very useful tool when the aim is to prove membership of complexity classes or if asymptotic behavior for some other reason is \detailed enough". It makes it possible to describe and analyze algorithms at a very high level. However, it also makes it possible to hide (willingly or unwillingly) a lot of details which cannot be omitted when algorithms are compared for \realistic" problems of large but limited size. 1
Parallel Complexity Theory|A Rich Source for Parallel Algorithms
The Gap Between Theory and Practice
1 Introduction and Motivation
The work reported in this paper is an attempt to lessen the gap between theory and practice within the eld of parallel computing. Within theoretical computer science, parallel algorithms are mainly compared by using asymptotical analysis (O-notation). This paper gives an example on how the analysis of implemented algorithms on nite problems provides new and more practically oriented results than those traditionally obtained by asymptotical analysis.
Asymptotical analysis Order{notation is a
Unrealistic machine model Few of these
theoretical algorithms are described as implementations on real machines. Their descriptions are based on various computational models. A computational model is the same as an abstract machine. One such model is the CREW PRAM.
Details are swept under the rug Most NCalgorithms originating from parallel complexity theory are presented on a very high level and in a compact manner. One reason is probably that parallel complexity theory is a eld that to a large extent overlaps with mathematics|where the elegance and advantages of compact descriptions are highly appreciated. Another reason may be found in the call for papers for the 30'th FOCS Symposium; A strict limit of 6 pages was enforced on the submitted papers. Considering the complexity of most of these algorithms, a compact description is therefore a necessity.
The certainly most used 16] model for expressing parallel algorithms in theoretical computer science is the P-RAM model proposed by Fortune and Wyllie in 1978 13]. Its simplicity and generality makes it possible to concentrate on the algorithm without being distracted by the obstacles caused by describing it for a more speci c (and realistic) machine model. Its synchronous operation makes it easy to describe and analyze programs for the model. This is exactly what is needed in parallel complexity theory, where the main focus is on the parallelism inherent in a problem.
Implementing algorithms on the PRAM model may be regarded as a paradox. One of the reasons for using such abstract machines has traditionally been a wish to avoid implementation details. However, it may be a viable step in an attempt to lessen the gap between theory and practice. The following is achieved by implementing a \theoretical" parallel algorithm on a CREW PRAM model: A deeper understanding. Making an implementation enforces a detailed study of all aspects of an algorithm. Con dence in your understanding. Verication of large parallel programs is very
di cult. In practice, the best way of getting con dent with one's own understanding of a complicated algorithm is to make an implementation that works. Of course, testing can not give a proof, but elaborate and systematic testing may give a larger degree of con dence than program veri cation (which also may contain errors).
A good help in mastering the complexity involved in implementing a complicated parallel algorithm on a real machine. A
Traditional Use of the CREW PRAM Model
CREW PRAM implementation may be a good starting point for implementing the same algorithm on a more realistic machine model. This is especially important for complicated algorithms. Going directly from an abstract mathematical description to an implementation on a real machine will often be a too large step. The existence of a CREW PRAM implementation may reduce this to two smaller steps.
Denying the practical value of an algorithm.
Implementing On an Abstract Model is Worthwhile
How can an unrealistic machine model be used to answer questions about the practical value of algorithms? It is important here to di erentiate between negative and positive answers. If a parallel algorithm A shows up to be inferior as compared with sequential and/or parallel algorithms for (more) realistic models, the use of an unrealistic parallel machine model for executing A will only strengthen the conclusion. On the other side, a parallel algorithm cannot be classi ed as a better alternative for practical use if it is based on a less realistic model. These advantages of implementing parallel algorithms on the CREW PRAM model are independent of whether a parallel computer that resembles the CREW PRAM model ever will be built.
2 The CREW PRAM Model | Programming and Simulation
This section gives a brief introduction to the CREW PRAM model, and how algorithms may be programmed, executed and measured on a CREW PRAM simulator.
The Original CREW PRAM
Unfortunately, there is a lot of di erent opinions on how the (CREW) PRAM should be programmed. My work is originally inspired by James Wyllie's well-written Ph.D. thesis The Complexity of Parallel Computations 26]. The thesis gives a high-level and succinct description of how a PRAM may be programmed.
Background The P-RAM (parallel random access machine) was rst presented by Steven
Fortune and James Wyllie 13]. It was further elaborated in Wyllie's Ph.D. thesis 26]. The P-RAM is based on random access machines (RAMs) operating in parallel and sharing a common memory. Thus it is in a sense the model in the world of parallel computations that corresponds to the RAM (Random Access Machine) model, which certainly is the prevailing model for sequential computations. Today, the mostly used name on the original P-RAM model is probably CREW PRAM. CREW is an abbreviation for the very central concurrent read exclusive write property. (Several processors may at the same time step read the same variable (location) in global memory, but they may not write to the same global variable simultaneously.)
ming language SIMULA 8, 22]. PIL includes features for processor allocation, activation and synchronization, for interaction with the simulator etc. The powerful and exible standard SIMULA source code level debugger may be used on the PIL programs 15]. The simplicity and generality of the PRAM model, combined with the high-level language and the debugger| have made the development of synchronous parallel MIMD programs to a surprisingly easy task. The system has also been used in connection with teaching of parallel algorithms.
About the Simulator and Time Modelling
Main Properties A CREW PRAM is a very simple and general model of a parallel computer. It has an unbounded number of equal Monitoring facilities The simulator proprocessors. Each has an unbounded local mem- vides the following features for producing data ory, an accumulator, program counter, and a necessary for doing algorithm evaluation: ag indicating whether it is running or not. A Global clock. The synchronous operation CREW PRAM has an unbounded global memof a CREW PRAM implies that one global ory shared by all processors. An unbounded time concept is valid for all the processors. number of processors may write into global The global clock may be read and reset. memory as long as they write to di erent loThe default output from the simulator gives cations. An unbounded number of processors information about the exact time for all may read any location at any time. All procesevents which produce any form of output. sors execute the same program. At each global time step in the computation, each running processor executes the instruction given by its own Specifying time consumption. In the curlocal program counter in one unit of time|and rent version of the simulator, the time used the model is therefore best classi ed as a synon local computations inside each processor chronous MIMD machine. More details may be must be explicitly given in the PIL program found in one of 13, 19, 26]. by the user. This makes it possible for the user to choose an appropriate level of \granParallel Programming May be Easy! ularity" for the time modelling. Further, The algorithms are implemented as PIL prothe explicit time modelling implies the adgrams and executed on a CREW PRAM simvantage that the programmer is free to asulator prototype. PIL is a very simple extensume any particular instruction set or other sion of the high-level, object oriented programway of representing time consumption.
The CREW PRAM simulator has been developed in SIMULA with good help of the DEMOS discrete event simulation package 7]. It runs on SUN workstations under the UNIX operating system. Time is measured in number of CREW PRAM unit time instructions|denoted CREW PRAM time units or simply time units. Each processor is assumed to have a rather simple and small instruction set. (Nearly all instructions use one time unit, some few (such as divide) use more. The instruction set time requirement is de ned as parameters in the simulator|and therefore easy to change.) The time consumption is speci ed as part of the PIL program.
The number of processors used in the algorithm is explicitly given in the PIL program, and may therefore easily be reported together with other performance data. The amount of global memory used may be obtained by reading the user stack pointer. Global memory access statistics. The simulator counts and reports the number of reads and the number of writes to the global memory. DEMOS data collection facilities. The general and exible data collection facilities provided in DEMOS are available; COUNT may be used to record incidences, TALLY may be used to record time independent variables, with various statistics (mean, estimated standard deviation etc.) maintained over the samples, and HISTOGRAM provides a convenient way of displaying measurement data. Note that the monitoring facilities does not interfere with the algorithm being evaluated. Algorithm or experiment speci c monitoring are easily de ned by the user. Further details about PIL and the simulator may be found in 18].
Processor and global memory requirement. 9]. Detailed descriptions of many parallel sort-
ing algorithms is found in Akl's book devoted to the subject 3]. In 1986, there was published a bibliography containing nearly four hundred references 24]. Nevertheless, it is still appearing new interesting parallel sorting algorithms. There are mainly two (theoretical) computational models that have been considered for parallel sorting algorithms | the circuit model and the PRAM model. An early and important result for the circuit model was the odd-even merge and bitonic merge sorting networks presented by Batcher in 1968 5]. A bitonic merge sort network sorts n numbers in O(log n) time. The cost of a parallel algorithm is commonly de ned as the product of its execution time and processor requirement. A parallel sorting algorithm is often said to be optimal (with respect to cost), if its cost is O(n log n).
2
The AKS O(n log n) sorting network The
3 Investigating the Practical Value of Cole's Parallel Merge Sort Algorithm
This section starts by explaining why Richard Cole's parallel merge sort algorithm is an important contribution to the eld of parallel sorting. We describe the main principles of the algorithm, and outlines the implementation. A straightforward CREW PRAM implementation of Batcher's bitonic sorting is presented and compared with Cole's algorithm. The comparison shows how the relative simplicity of bitonic sorting makes it to a better algorithm in practice|in spite of being inferior to Cole's algorithm \in the theory".
3.1 Parallel Sorting|The Continuing Search for Faster Algorithms
The literature on parallel sorting is very extensive. A good survey is A Taxonomy of Parallel Cole's CREW PRAM sorting algorithm Sorting by Bitton, DeWitt, Hsiao and Menon The PRAM model is more powerful than the
rst parallel sorting algorithm using only O(log n) time was presented by Ajtai, Komlos and Szemeredi in 1983 2]. This algorithm is often called the three Hungarians's algorithm, or the AKS-network. The original variant of the AKS-network used O(n log n) processors and was therefore not cost-optimal. However, Leighton showed in 1984 that the AKS-network can be combined with a variation of the oddeven network to give an optimal sorting network with O(logn) time and O(n) processors 17]. Leighton points out that the constant of proportionality of this algorithm is immense and that other parallel sorting algorithms will be faster as long as n < 10 . In spite of being commonly thought of as a purely theoretical achievement, the AKSnetwork was a major breakthrough; it proved the possibility of sorting in O(logn) time, and implied the rst cost optimal parallel sorting algorithm described by Leighton in 1984 17]. The optimal asymptotical behavior initiated a search for improvements for closing the gap between its theoretical importance and its practical use. One such improvement is the simpli cation of the AKS-network done by Paterson 21]. However, the algorithm is still very complicated and the complexity constants remain impractically large 14].
100
circuit model. (Even an EREW PRAM may implement a sorting circuit without loss of e ciency.) Also for the PRAM model, there has been a search for a parallel sorting algorithm with optimal cost. (Some of the important results are reported by Cole 11].) In 1986, Richard Cole presented a new parallel sorting algorithm called parallel merge sort 10]. This was an important contribution, since Cole's algorithm is the second O(logn) time O(n) processor sorting algorithm|the rst was the one implied by the AKS-network. Further, it is claimed to have complexity constants which are much smaller than that of the AKS-network.
3.2 Cole's Parallel Merge Sort
v
. .. ... ... ... .... . ... ... .. . .. . .. .. .. .. .. .. .. . . .. .. .. ..
u
.... ... ..... ..... . . .... . . ... ..... ... ... ... ... ... .... ... . ... ... ... ... ... .... .... .. ... . . .. . .. .. .. .. .. .. .. .. .. .. .. .. . .
x
level l ? 1 level
l
w
A revised version of the original paper is 11]| which has been used as the main reference for my implementation of the CREW PRAM variant of the algorithm. Cole's parallel merge sort assumes n distinct items. These are distributed one per leaf in a complete binary tree|it is assumed that n is a power of 2. The computation proceeds up the tree, level by level from the leaves to the root. Each internal node u merges the sorted sets computed at its children. The algorithm is based on the following log n merging procedure: ( 11] page 771.) \The problem is to merge two sorted arrays of n items. We proceed in log n stages. In the ith stage, for each array, we take a sorted sample of 2i? items, comprising every n=2i? th item in the array. We compute the merge of these two samples".
1 1
(a)
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Cole's algorithm|main principles
NewUp(u)
. . . . .. .. .. .
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... .... .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. . .. .. .. .. . . . . . . . . . . . . . . . . . . . . .. .. .. .. .. .. .. .. .. .. .. . . . . . . . . . . . . .... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... .. .. .. .. .. .. .. ... .. .. ... .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. . . .. .. ... .. .. .. .. .. .. .. .. .. .. .. .. .. . . . . .. .. . . . .. .. ... .. .. ... . .. . . ... .. .. ... .. .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
MergeWithHelp
Up(u)
MergeWithHelp (phase 2) MakeSamples (phase 1)
SampleUp(v)
. . .. .. .. .. .
SampleUp(w)
. . . . .. .. . ..
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... .... .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. . .. .. .. .. . . . . . . . . . . . . . . . . . . . . .. .. .. .. .. .. .. .. .. .. .. . . . . . . . . . . . . .... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... .. .. ... .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ... .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ... .. .. .. . .. . . .. . . .. .. .. .. . ..
MakeSamples
Up(v) (b)
Up(w)
Cole made two key observations:
1. Merging in constant time: Given the result of the merge from the (i ? 1)'th stage, the Figure 1: Cole's parallel merge sort algorithm. merge in the ith stage can be done in O(1) Part (a): Arbitrary node u in the binary time. computation tree. Part (b): Computation of NewUp(u) in two phases. 2. The merges at the di erent levels of the tree can be pipelined: This is possible since merged samples made at level l of the tree may be used to provide samples of the appropriate size for merging at the next level above l without losing the O(1) time merging property.
During the progress of the algorithm, each node u stores an array Up(u) of items. The goal of each node u is to make Up(u) into a sorted list containing the items initially stored in the leaves of the subtree rooted at u. Each stage of Cole's merging procedure consists of two phases, see Figure 1. In phase 1, the Up(u) arrays are sampled in a systematic manner to produce the arrays SampleUp(u). In phase 2, the two samples SampleUp(v) and SampleUp(w) (from u's two child nodes) are merged into a new sequence NewUp(u) with help of the array Up(u). Cole describes that one should have one processor standing by each item in the Up, NewUp, and SampleUp arrays. Since the size of these arrays, for each node u, change from stage to stage|the processors must be dynamically allocated to the nodes (i.e. the array elements in Up(u), NewUp(u) and SampleUp(u)) as the computation proceeds from the leaves to the root of the tree. The processor requirement is slightly less than 4n. At the end of every third stage the lowest active level moves one level up towards the top|and the algorithm has exactly 3 logn stages. Also, the highest active level will move one level upwards every second stage. Because of this di erence in \speed" the total number of active levels will increase during the computation, until the top level has been reached. The reader is referred to Cole's paper 11] for further details about the algorithm. The simplicity of the CREW PRAM model and the nice properties of synchronous programs made it relatively easy to develop an exact analytical model of the time requirement for the implementation of Cole's algorithm. Cole's succinct description of the algorithm given in 11] is at a relatively high level giving the programmer freedom to choose between the SIMD or MIMD 12] implementation paradigms. The algorithm have been programmed in a synchronous MIMD programming style, as proposed for the PRAM model by Wyllie 26]. This paper gives only a brief description of the implementation, providing a crude base for discussing its time requirement. Figure 2 outlines the main program in a notation called \parallel pseudo pascal" (PPP) 19]. This notation is inspired by parallel pidgin algol as de ned by Wyllie in 26]|with modernizations from the pseudo language notation used by Aho, Hopcroft and Ull-
CREW PRAM procedure ColeMergeSort (1) (2) (3)
begin
(4) (5) for each processor in P do begin (6) Read facts from the stack; (7) InitiateProcessors; (8) for Stage := 1 to 3 logn do begin (9) ComputeWhoIsWho; (10) CopyNewUpToUp; (11) MakeSamples; (12) MergeWithHelp; end; end; end;
Compute the processor requirement, NoOfProcs; Allocate working areas; Push addresses of working areas and other facts on the stack; assign NoOfProcs processors, name them P;
Figure 2: Main program of Cole's parallel merge sort expressed in parallel pseudo pascal. mann in 1]. When the algorithm starts, one single CREW PRAM processor is running, and the problem instance and its size, n, are stored in the global memory. Statement (1{3) are executed by this single processor. The maximum processor requirement is given by the maximum size of the Up(u), NewUp(u) and SampleUp(u) arrays for all nodes u during the computation. We have 11]:
NoOfProcs =
About the implementation
where n + n=2 + n=16 + P n=128 + : : : = 11n=7 and u jNewUp(u)j = Pu jSampleUp(u)j n + n=8 + n=64 + n=512+ : : : = 8n=7 which is slightly less than 4n. The means that the total number of array elements is bounded above by the given sum. Consider the sum given for the Up arrays. There are n processors (array elements) at the lowest active level, a maximum of n=2 processors at the next level above, and so on. This may be viewed as a pyramid of processors. Each time the lowest active level moves one level up| the pyramid of processors follows so that we
Pu jUp(u)j
X jUp(u)j + (1) u X jNewUp(u)j + X jSampleUp(u)j
u u
part of the processor alloTable 1: Time consumption for the statements forms the dynamic the active levels of the tree, cation. Since both in the implementation of ColeMergeSort. and the size of the various arrays change from stage to stage, information such as the node no, t(1; n) = 34 + 8blog (n=2)c + 8blog nc and item no in the array for that node, must be recomputed for each processor at the start t(2::3; n) = 83 of each stage. The necessary computations are t(4; n) = 42 + 23blog NoOfProcs c easily performed in O(1) time. t(5::6; n) = 13 t(7; n) = 224 + 36blog (n=2)c + 72blog nc CopyNewUpToUp is only a simple procedure that makes the NewUp arrays made in the pret(8::10; n) = 159 vious stage to the Up arrays of the current stage. t(11; n) = 48 MakeSamples produce the array SampleUp(u) t(12; n) = 781 from the array Up(u) for all active nodes in the tree, as was depicted in Figure 1. It is a relatively straightforward task. still have n processors at the lowest active level. In contrast, the O(1) time merging performed Similarly, the NewUp and SampleUp processors by MergeWithHelp is a relatively complicated may be viewed as a \sliding pyramid of proces- a air. It constitutes the major part of the algorithm description in 11], and about 90% of the sors". For a given n, the exact calculation of NoOf- code in the implementation. Of the time used Procs is done by a loop with log n iterations. by MergeWithHelp (781 time units), about 40% (Throughout this paper, logn means log n.) is needed to compute the so called cross ranks The time used by this sequential startup code (Substep 1 and 2, p. 773, 11]), and 43% is used is shown in Table 1. t(i; n) denotes the time to maintain ranks(Step 2, p. 774). used on one single execution of statement i of The time used to perform a Stage (9{12) is the discussed program when the problem size somewhat shorter for the six rst stages than is n. t(j::k; n) is a short hand notation for the numbers listed in Table 1. This is because Pii k t(i; n). some parts of the algorithm do not need to be j performed when the sequences are very short. A general procedure for processor allocation is However, time implemented in the CREW PRAM simulator by used is asfor all stages after the six'th, thetable. given by the constants in the a real CREW PRAM algorithm which is able to Stages 1{6 takes a total of 2525 time units. The allocate k processors in log k time utilizing the total time used by ColeMergeSort on n distinct (standard PRAM 13, 26]) fork instruction in a items, n = 2m may be expressed as binary tree structured \chain reaction". Thus, the time used for processor allocation (state- T(ColeMergeSort ; n) = (2) ment (4)) is as given in Table 1 and Equation t(1::7; n) + 2525 + t(8::12; n) 3((log n) ? 2) 1. Statement (3) and (6) illustrate that a dedi- The reader is referred to 20] for further details cated area (a stack) in the global memory is used about the implementation. to pass variables (such as the problem size) to the processors allocated in statement (4), and 3.3 Bitonic Sorting on a CREW PRAM activated in statement (5). Due to the concur- Batcher's bitonic sorting network 5] for sorting rent read property of the CREW PRAM, state- of n = 2m items consists of m(m + 1) columns ment (6) is easily executed in O(1) time. each containing n=2 comparators (comparison InitiateProcessors computes the static part of elements). A natural emulation on a CREW the processor allocation information. Examples PRAM is to use n=2 processors which are dyare what level (in the \pyramid" as discussed namically allocated to the one active column of above) the processor is assigned to, and the lo- comparators as it moves from the input side to cal processor no within that level. It have been the output side through the network. (The posimplemented by two \divide by 8 loops" result- sibility of sorting several sequences simultaneing in the time consumption shown in the table. ously in the network by use of pipelining is sacThe 3 logn stages each consists of four main ri ced by this method. This is not relevant in computation steps. ComputeWhoIsWho per- this comparison, since Cole's algorithm do not
8 8 8 8 8 2 = = 1 2
The algorithms start with the input in global memory and delivers the output at the same begin place. Note that we have logarithmic scale on (1) assign n=2 processors, name them P; both axes. (2) for each processor in P do begin ? (3) Initiate processors; 6k 64 ?? (4) for Stage := 1 to logn do ? (5) for Step := 1 to Stage do begin Execution time ?? (6) EmulateNetwork; ? ? (7) ActAsComparator; ?? ?? 16k end; ??? ? end; ?? end; ? ?? 4k ?? ? t(1; n) = 42 + 23blog(n=2)c ? t(2::3; n) = 38 ?? t(4; n) = 10 ? ?? t(5::7; n) = 84 1k ? ? ?? ?? Figure 3: Main program and time consumption ??? ? of bitonic sorting emulated on a CREW PRAM. 256 ?? ? ? ? have a similar possibility.) The global memory ? is used to store the sequence when the compu64 tation proceeds from one step (i.e. comparator column) to the next. The main program and its ? time requirement are shown in Figure 3. EmulationNetwork is a procedure which com4 16 64 256 1k 4k putes the addresses in the global memory corProblem size n responding to the two input lines for that comFigure 4: Time consumption (in number of (parparator in the current Stage and Step. ActAsComparator calculates which (of the allel) CREW PRAM instructions) measured two possible) comparator functions that should by running parallel sorting algorithms on the be done by the processor (comparator) in the CREW PRAM simulator for various problem current Stage and Step, performs the function, sizes n (horizontal axis). Note the scale on = Cole's algorithm and writes the two outputs to the global mem- both axes. Legend: (O(log n)), bitonic sorting (O(log n)), = ory. Both procedures are easily done in O(1) odd-even transposition sort (O(n)), = insertime. The total time requirement becomes : tion sort (worst case, O(n )), and ? = insertion T(BitonicSort ; n) = t(1::3; n)+ (3) sort (best case, O(n)). 1 t(4; n) log n + t(5::7; n) 2 logn(log n + 1) Cole's algorithm is the CREW PRAM algorithm described in Section 3.2. The implementation counts about 2000 PIL lines and was de3.4 Comparison Figure 4 shows the time used to sort n integers veloped and tested in about 40 days of work. by Cole's algorithm compared with 1-processor Bitonic sorting is the algorithm outlined in insertion sort, a n=2 processor version of odd- Section 3.3. The implementation counts about even transposition sort, and our bitonic sorting 200 PIL lines and was developed and tested in algorithm. For all algorithms, time is measured about 2 days. in number of CREW PRAM instructions. It Odd-even transposition sort is perhaps the includes the time used on processor allocation. simplest parallel sorting algorithm. Our CREW CREW PRAM procedure BitonicSort
2 2
Table 2: Performance data for the CREW Table 3: Same table as above but with problem PRAM implementations of the studied sorting size n = 256. algorithms. Problem size n is 128. P is short for number of processors used. #R is short for time P cost #R #W total number of read operations from the global Algorithm Cole 21.2 986 20863.8 104.6 65.2 memory, and #W is the total number of writes. 3.4 128 428.2 9.9 9.4 Time and cost is given in kilo CREW PRAM Bitonic 5.3 128 683.7 65.9 34.8 (unit-time) instructions, reads and writes in kilo Odd-Even Insert-worst 592.4 1 592.4 32.9 32.9 locations. Insert-Average 303.3 1 303.3 17.0 16.8 Insert-best 5.6 1 5.6 5.1 0.3 Algorithm time P cost #R #W Cole 18.2 493 8959.3 44.7 28.2 Bitonic 2.6 64 169.0 3.9 3.7 Odd-Even 2.8 64 176.5 16.6 Table 4: Calculated performance data for the 8.0 Insert-worst 148.7 1 148.7 8.3 two CREW PRAM implementations. P is short 8.3 Insert-Average 76.2 1 76.2 4.3 for number of processors. 4.2 Insert-best 2.8 1 2.8 0.3 0.1 Algorithm n time P ColeMergeSort 65536 (64k) 4:5 10 2:5 10 BitonicSort 65536 (64k) 1:2 10 3:3 10 PRAM implementation uses n=2 processors which acts as \odd" and \even" processors in ColeMergeSort 262144 (256k) 5:2 10 1:0 10 an alternating style. Readers unfamiliar with BitonicSort 262144 (256k) 1:5 10 1:3 10 the algorithm are referred to one of 4, 9, 23]. 2 205972 2:3 10 Insertion sort is the algorithm called Insertion ColeMergeSort BitonicSort 2 205194 3:0 10 Sort 2 in Programming Pearls by John Bentley 6] and presented at page 108 of that book. ColeMergeSort 2 208958 4:6 10 It was implemented as a 1-processor CREW BitonicSort 2 211107 5:9 10 PRAM algorithm. The time used by the three parallel algorithms are independent of the actual problem instance (when the problem size is xed). However, in- 2 ), for the last value of n making bitonic sortsertion sort use O(n ) time in the worst case, ing faster than Cole's algorithm, and for the rst and O(n) time in the best case (both shown in value of n making Cole's algorithm to a faster althe gure). We see that bitonic sorting is fastest gorithm. We see that our straightforward implein this comparison in a large range of n starting mentation of Batcher's bitonic sorting is faster than the implementation of Cole's parallel merge at about 256. Table 2 and 3 shows some central performance sort as long as the number of items to be sorted, 1:2 10 , i.e. more than data for small test runs, n = 128 and n = 256. n, is less than 2 1 Giga Tera items! The two rightmost columns show the quantity of the memory use.
4 4 4 4 69 69 70 70 10 2 70 21
5 4
6 5
21 20
21 20
A lot may be learned from medium-sized Bitonic sorting is faster in practice We test runs Highly concurrent algorithms with
have developed exact analytical models for the implementations of Cole's algorithm and bitonic sorting. The models have been checked against the test runs, and have been used to nd the point where Cole's O(logn) algorithm becomes faster than the O(log n) bitonic algorithm. The results are summarized in Table 4. The table shows time and processor requirement for the two algorithms for n = 64k, n = 256k; (k =
2
polylogarithmic running time are often relatively complex. One might think that evaluation of such algorithms would require processing of very large problem instances. So far, this have not been the case. In studying the relatively complex Cole's algorithm, some hundreds of processors and small sized memories have been su cient to enlighten the main aspects of the algorithm. In many cases, the need
for brute force (i.e., huge test runs) may to a large extent be reduced by the following \working rules": 1. The size of the problem instance is used as parameter to the algorithm which is made to solve the problem for all possible problem sizes. 2. Elaborate testing is performed on all problem sizes that are within the limitations of the simulator. 3. A detailed analysis of the algorithm is performed. The possibility of making such an analysis with a reasonable e ort depends strongly on the fact that the algorithm is deterministic and synchronous. 4. The analysis is con rmed with measurements from the test cases. Together, this have made it possible to use the analytical performance model by extrapolation for problem sizes beyond the limitiations of the simulator.
This work has been done during my Ph.D. studies at The Norwegian Institute of Technology (NTH) in Trondheim. I wish to thank my supervisor Professor Arne Halaas for constructive criticisms and for continuing encouragement. The work has been nancially supported by a scholarship from The Royal Norwegian Council for Scienti c and Industrial Research (NTNF).
References
4 Concluding Remarks
2
We can conclude that Batcher's well known and simple O(logn ) time bitonic sorting is faster than Cole's O(log n) time algorithm for all practical values of n. The huge value of n reported in the previous section gives also room for a lot of improvements to Cole's algorithm before it beats bitonic sorting for practical problem sizes. There are also good possibilities to improve the implementation of bitonic sorting. tem for Discrete Event Modelling on SimIn fact, Cole's algorithm is even less practical ula. The MacMillan Press Ltd., London, than depicted by the described comparison of 1979. execution time. This is because it requires about 8 times as many processors than bitonic sorting, 8] Graham M. Birtwistle, Ole-Johan Dahl, Bj rn Myhrhaug, and Kristen Nygaard. and it has a far more extensive use of the global SIMULA BEGIN, second edition. Van Nosmemory. trand Reinhold Company, New York, 1979. The method for investigating PRAM algorithms exempli ed by this paper might con- 9] Dina Bitton, David J. DeWitt, David K. tribute to lessen the gap between theory and Hsiao, and Jaishankar Menon. A Taxonpractice in parallel computing. Reducing this omy of Parallel Sorting. Computing Surgap was recently emphasized as a very imporveys, 16(3):287{318, September 1984. tant research area at the NSF - ARC Workshop on Opportunities and Constraints of Par- 10] Richard Cole. Parallel Merge Sort. In Proceedings of 27th IEEE Symposium on allel Computing 25]. Foundations of Computer Science (FOCS), pages 511{516, 1986. Acknowledgements
1] Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman. Data Structures and Algorithms. Addison-Wesley Publishing Company, Reading, Massachusetts, 1982. 2] M. Ajtai, J. Komlos, and E. Szemeredi. An O(n log n) sorting network. Combinatorica, 3(1):1{19, 1983. 3] Selim G. Akl. Parallel Sorting Algorithms. Academic Press, Inc., Orlando, Florida, 1985. 4] Selim G. Akl. The Design and Analysis of Parallel Algorithms. Prentice Hall International, Inc., Englewood Cli s, New Jersey, 1989. 5] K. E. Batcher. Sorting networks and their applications. In Proc. AFIPS Spring Joint Computer Conference, pages 307{ 314, 1968. 6] Jon L. Bentley. Programming Pearls. Addison-Wesley Publishing Company, Reading, Massachusetts, 1986. 7] Graham M. Birtwistle. DEMOS | A Sys-
11] Richard Cole. Parallel Merge Sort. SIAM 21] M. S. Paterson. Improved sorting netJournal on Computing, 17(4):770{785, Auworks with O(logn) depth. Technical regust 1988. port, Dept. of Computer Science, University of Warwick, England, 1987. Res. Re12] Michael J. Flynn. Very High-Speed Comport RR89. puting Systems. In Proceedings of the IEEE, volume 54, pages 1901{1909, De- 22] R. J. Pooley. An Introduction to Programcember 1966. ming in SIMULA. Blackwell Scienti c Publications, Oxford, England, 1987. 13] S. Fortune and J. Wyllie. Parallelism in Random Access Machines. In Proceedings 23] Michael J. Quinn. Designing E cient Alof the 10'th ACM Symposium on Theory of gorithms for Parallel Computers. McGrawComputing (STOC), pages 114{118. ACM, Hill Book Company, New York, 1987. NewYork, May 1978. 24] D. Richards. Parallel sorting|a bibliogra14] Alan Gibbons and Wojciech Rytter. E phy. ACM SIGACT News, pages 28{48, cient Parallel Algorithms. Cambridge Uni1986. versity Press, Cambridge, 1988. 25] Jorge L. C. Sanz, editor. Opportuni15] Per Holm and Magnus Taube. SIMDEB ties and Constraints of Parallel ComputUser's Guide for UNIX. Technical report, ing. Springer-Verlag, London, 1989. Papers Lund Software House AB, Lund, Sweden, presented at the NSF - ARC Workshop on 1987. Opportunities and Constraints of Parallel Computing, San Jose, California, Decem16] Richard M. Karp. A Position Paper on ber 1988. (ARC = IBM Almaden Research Parallel Computation. In Proceedings of Center, NSF = National Science Foundathe NSF - ARC Workshop on Opportunition). ties and Constraints of Parallel Computing ( 25]), pages 73{75, 1989. 26] J. C. Wyllie. The Complexity of Parallel Computations. PhD thesis, Dept. of Com17] Tom Leighton. Tight Bounds on the Computer Science, Cornell University, 1979. plexity of Parallel Sorting. In Proceedings ACM, New York, 1984. 18] Lasse Natvig. Crew pram simulator| users's guide. Technical Report 39/89, Division of Computer Systems and Telematics, The Norwegian Institute of Technology, The University of Trondheim, Norway, December 1989. 19] Lasse Natvig. The CREW PRAM Model| Simulation and Programming. Technical Report 38/89, Division of Computer Systems and Telematics, The Norwegian Institute of Technology, The University of Trondheim, Norway, December 1989. 20] Lasse Natvig. Cole's Parallel Merge Sort Implemented on a CREW PRAM Simulator. Technical Report 3/90, Division of Computer Systems and Telematics, The Norwegian Institute of Technology, The University of Trondheim, Norway, 1990. Still in preparation.
of the 16th Annual ACM Symposium on Theory Of Computing (May), pages 71{80.