


Reconfigurable Mesh Algorithms For Image Shrinking, Expanding, Clustering, And Template Matching*
Jing-Fu Jenq, University of Minnesota, and Sartaj Sahni, University of Florida
Abstract

Parallel reconfigurable mesh algorithms are developed for the following image processing problems: shrinking, expanding, clustering, and template matching. Our N×N reconfigurable mesh algorithm for the q-step shrinking and expansion of a binary image takes O(1) time. One pass of the clustering algorithm for N patterns and K centers can be done in O(MK + KlogN), O(KlogNM), and O(M + logNMK) time using N, NM, and NMK processors, respectively. For template matching using an M×M template and an N×N image, our algorithms run in O(M²) time when N² processors are available and in O(M) time when N²M² processors are available.

__________________
* This research was supported in part by the National Science Foundation under grants DCR-84-20935 and MIP-86-17374.

Keywords and Phrases

Reconfigurable mesh computer, parallel algorithms, image processing, shrinking, expanding, clustering, template matching.

1 Introduction

Miller, Prasanna Kumar, Reisis and Stout [MILL88abc] have proposed a variant of a mesh connected parallel computer. This variant, called a reconfigurable mesh with buses (RMESH), employs a reconfigurable bus to connect together all processors. Figure 1 shows a 4×4 RMESH. By opening some of the switches, the bus may be reconfigured into smaller buses that connect only a subset of the processors. The important features of an RMESH are [MILL88abc]:

_ ______________________________________________
Figure 1 4×4 RMESH (processors at the grid intersections, software-controlled switches on the bus links; PE (0,0) is at the upper left and PE (3,3) at the lower right)
_ ______________________________________________
1. An N×M RMESH is a 2-dimensional mesh connected array of processing elements (PEs). Each PE in the RMESH is connected to a broadcast bus which is itself constructed as an N×M grid. The PEs are connected to the bus at the intersections of the grid. Each processor has up to four bus switches (Figure 1) that are software controlled and that can be used to reconfigure the bus into subbuses. The ID of each PE is a pair (i,j) where i is the row index and j is the column index. The ID of the upper left corner PE is (0,0) and that of the lower right one is (N-1, M-1).

2. The up to four switches associated with a PE are labeled E (east), W (west), S (south) and N (north). Notice that the east (west, north, south) switch of a PE is also the west (east, south, north) switch of the PE (if any) on its right (left, top, bottom). Two PEs can simultaneously set (connect, close) or unset (disconnect, open) a particular switch as long as the settings do not conflict. The broadcast bus can be subdivided into subbuses by opening (disconnecting) some of the switches.

3. Only one processor can put data onto a given subbus at any time.

4. In unit time, data put on a subbus can be read by every PE connected to it. If a PE is to broadcast a value in register I to all of the PEs on its subbus, then it uses the command broadcast(I).

5. To read the content of the broadcast bus into a register R, the statement R := content(bus) is used.

6. Row buses are formed if each processor disconnects (opens) its S switch and connects (closes) its E switch. Column buses are formed by disconnecting the E switches and connecting the S switches.

7. Diagonalize a row (column) of elements is a command to move the specified row (column) elements to the diagonal positions of a specified window which contains that row (column). This is illustrated in Figure 2.

_ ______________________________________________
Figure 2 Diagonalize 4th row or 1st column elements of a 5×5 window (the elements 1 3 5 4 2 are shown (a) in the 4th row, (b) in the 1st column, and (c) on the diagonal)
_ ______________________________________________

In this paper we develop RMESH algorithms for some image processing problems. The specific problems we consider are: shrinking and expanding (Section 3), clustering (Section 4), and template matching (Section 5). In Section 2, we develop some basic data manipulation algorithms for the RMESH. These are used in subsequent sections to obtain the image processing algorithms.

2 Basic Data Manipulation Operations

In this section we define several data manipulation algorithms for RMESH multicomputers. These are used in later sections to develop algorithms for the image processing applications we consider.

2.1 Window Broadcast

The data to be broadcast is initially in the A variable of the PEs in the top left w×w submesh, i.e., the PEs with IDs (0,0) .. (w-1,w-1). The data is to tile the whole mesh in such a way that A(i,j) = A(i mod w, j mod w), where A(i,j) denotes variable A of PE (i,j). The algorithm for this is given in Figure 3. Its complexity is O(w) and is independent of the size of the RMESH.

_ ______________________________________________
procedure WindowBroadcast(A, w);
{broadcast A values in the upper left w×w submesh}
begin
  for j := 0 to w-1 do {broadcast column j of the submesh}
  begin
    diagonalize the A variables in column j of the w×w submesh so that B(i,i) = A(i,j), 0 ≤ i < w;
    set switches to form column buses;
    PE (i,i) broadcasts its B value on column bus i, 0 ≤ i < w;
    B(k, k mod w) := content(bus), 0 ≤ k < N;
    set switches to form row buses;
    PE (k, k mod w) broadcasts its B value on its row bus, 0 ≤ k < N;
    A(k,i) := content(bus) for i mod w = j and 0 ≤ k < N;
  end;
end;
Figure 3 Window broadcast
_ ______________________________________________
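For concreteness, the following sequential Python sketch (ours, not RMESH code; the function name and list-of-lists layout are our assumptions) emulates WindowBroadcast column by column and produces the tiling A(i,j) = A(i mod w, j mod w):

def window_broadcast(A, w):
    # A is an N x N matrix whose top-left w x w submesh holds the data.
    N = len(A)
    out = [row[:] for row in A]
    for j in range(w):                  # broadcast column j of the window
        # Diagonalize: B(i,i) = A(i,j); column bus i then carries A(i,j),
        # PE (k, k mod w) picks it up, and row bus k delivers it to every
        # column i with i mod w = j.
        for k in range(N):
            val = A[k % w][j]           # value received by PE (k, k mod w)
            for i in range(j, N, w):
                out[k][i] = val
    return out

Sequentially this loop nest costs O(N²); on the RMESH the same data movement takes only O(w) rounds because each round is a constant number of bus broadcasts.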

2.2 Data Sum

Initially, each PE of the N×N RMESH has an A value. Each PE is to sum up the A values of all N² PEs and put the result in its B variable. That is, following the data sum operation we have

B(i,j) = Σ_{k=0}^{N-1} Σ_{l=0}^{N-1} A(k,l), 0 ≤ i,j < N

This can be done in O(logN) time by first performing a prefix sum [MILL88abc] and then having PE (N-1,N-1) broadcast Sum(N-1,N-1) to the remaining PEs in the RMESH. For this broadcast, all switches can be closed.
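The net effect of the operation is captured by the short sequential Python emulation below (ours; the RMESH's prefix-sum machinery is collapsed into an ordinary sum):

def data_sum(A):
    # Phase 1: the RMESH obtains the total via an O(logN) prefix sum;
    # sequentially we simply add up all A values.
    total = sum(sum(row) for row in A)
    # Phase 2: PE (N-1,N-1) closes all switches and broadcasts the total,
    # which every PE stores in its B variable.
    N = len(A)
    return [[total] * N for _ in range(N)]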

2.3 Shift

Each PE has data in its A variable that is to be shifted to the B variable of the processor that is s, s > 0, units to the right on the same row. Following the shift, we have

B(i,j) = null, j < s
B(i,j) = A(i, j-s), j ≥ s

A circular shift variant of the above shift requires

B(i,j) = A(i, (j-s) mod N)

Let us examine the first variant first. It can be done in O(s) time by dividing the send and receive processor pairs ((i, j-s), (i,j)) into s+1 equivalence classes as below:

class k = {((i, j-s), (i,j)) | (j-s) mod (s+1) = k}

The send and receive pairs in each class can be connected by disjoint buses, so the shift of the data in the send processors of one class can be accomplished in O(1) time; in O(s) time all the classes can be handled. The algorithm is given in Figure 4; the number of broadcasts is s+1. The procedure is easily extended to handle left shifts: if s < 0 denotes a left shift by |s| units on the same row, this too can be done with |s|+1 broadcasts.

_ ______________________________________________
procedure Shift(s,A,B)
{Shift from A(i,j) to B(i,j+s), s > 0}
begin
  All PEs disconnect their N and S switches;
  for k := 0 to s do {shift class k}
  begin
    PE (i,j) disconnects its E switch if (j-s) mod (s+1) = k;
    PE (i,j) disconnects its W switch and broadcasts A(i,j) if j mod (s+1) = k;
    B(i,j) := content(bus) for every PE (i,j) with (j-s) mod (s+1) = k;
  end;
end;
Figure 4 Shifting by s, s > 0
_ ______________________________________________

A circular shift of s can be done in O(s) time by first performing an ordinary shift of s and then shifting A(i,N-s), ..., A(i,N-1) left by N-s. The latter shift can be done by first shifting A(i,N-s), then A(i,N-s+1), ..., and finally A(i,N-1). The exact number of broadcasts is 2s+1. Circular shifts of s, s > N/2, can be accomplished more efficiently by performing a circular shift of -(N-s) instead. For s ≤ N/2, we observe that data from PEs (i,0), (i,1), ..., (i,s-1) need to be sent to PEs (i,s), (i,s+1), ..., (i,2s-1), respectively. So, when data movement is limited to within rows, s pieces of data need to use the bus segment between PE (i,s-1) and PE (i,s), and the shift takes O(s) time. If only the data on one row of the N×N RMESH is to be shifted, the shift can be done in O(1) time by using each row of the RMESH to shift one of the elements.

The circular shift operation can be extended to shift in 1×W row windows or W×1 column windows. Let RowCircularShift(A,s,W) and ColumnCircularShift(A,s,W), respectively, be procedures that shift the A values by s units in windows of size 1×W and W×1. Let A_in and A_f, respectively, denote the initial and final values of A. Then, for ColumnCircularShift we have A_in(i,j) = A_f(q,j) where PEs (i,j) and (q,j) are, respectively, the a = (i mod W)'th and b = (q mod W)'th PEs in the same W×1 column window and b = (a-s) mod W. The strategy of Figure 4 is easily extended so that RowCircularShift and ColumnCircularShift are done using 2s+1 broadcasts.

2.4 Data Accumulation

In this operation PE (i,j) initially has a value I(i,j), 0 ≤ i,j < N. Each PE is required to accumulate M I values in its array A as specified below:

A[q](i,j) = I(i, (j+q) mod N), 0 ≤ q < M

This can be done using 2M-1 broadcasts. The algorithm is given in Figure 5.

_ ______________________________________________
procedure Accumulate(A,I,M)
{each PE accumulates in A the next M I values}
begin
  {accumulate from the right}
  Each PE (i,j) disconnects its S switch and connects its W switch, 0 ≤ i,j < N;
  for k := 0 to M-1 do
  begin
    {PEs (i,j) with j mod M = k broadcast to the PEs on their left that need their I value}
    PE (i,j) disconnects its E switch if j mod M = k and then broadcasts I(i,j);
    A[(k+M-(j mod M)) mod M](i,j) := content(bus);
  end;
  {accumulate from the left}
  Each PE (i,j) disconnects its S switch and connects its W switch, 0 ≤ i,j < N;
  for k := 0 to M-2 do
  begin
    PE (i,k) broadcasts I(i,k), 0 ≤ i < N;
    A[q+k](i,N-q) := content(bus), 1 ≤ q < M-k;
  end;
end;
Figure 5 Data accumulation
_ ______________________________________________
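To make the two preceding bus schedules concrete, here is a sequential Python emulation (ours; one row only, names assumed). shift_row follows Figure 4's round-per-class schedule and asserts that the subbuses used within a round are disjoint; accumulate_row simply states the accumulation semantics:

def shift_row(A, s):
    # Right shift by s > 0 on one row; one broadcast round per class k.
    N = len(A)
    B = [None] * N
    for k in range(s + 1):
        busy = [False] * N                      # columns on some subbus this round
        for j in range(N):
            if j % (s + 1) == k and j + s < N:  # PE j is a class-k sender
                seg = range(j, j + s + 1)       # subbus from sender to its receiver
                assert not any(busy[c] for c in seg), "bus conflict"
                for c in seg:
                    busy[c] = True
                B[j + s] = A[j]                 # receiver j+s reads the broadcast
    return B

def accumulate_row(I, M):
    # Returns A with A[j][q] = I[(j+q) mod N]; the RMESH does this
    # with 2M-1 broadcasts.
    N = len(I)
    return [[I[(j + q) % N] for q in range(M)] for j in range(N)]

assert shift_row(list(range(8)), 2) == [None, None, 0, 1, 2, 3, 4, 5]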
2.5 Consecutive Sum

Assume that an N×N RMESH is tiled by 1×M blocks (M divides N) in a natural manner with no blocks overlapping, so that processor (i,j) is the (j mod M)'th processor in its block. Each processor (i,j) of the RMESH has an array X[0..M-1](i,j) of values. If j mod M = q, then PE (i,j) is to compute S(i,j) such that
S(i,j) = Σ_{r=0}^{M-1} X[q](i, (j div M)·M + r)
That is, the q'th processor in each block sums the q'th X value of the processors in its block. The consecutive sum operation is performed by having each PE in a 1×M block initiate a token that will accumulate the desired sum for the processor to its right in its block. More specifically, the token generated by the q'th PE in a block will compute the sum for the ((q+1) mod M)'th PE in the block, 0 ≤ q < M. The tokens are shifted left circularly within their 1×M block until each token has visited every PE in its block and arrived at its destination PE. The algorithm is given in Figure 6. The number of broadcasts is 3M-3, as each row circular shift of -1 takes 3 broadcasts.

_ ______________________________________________
procedure ConsecutiveSum(X,S,M);
{Consecutive sum of X in 1×M blocks}
begin
  S(i,j) := X[((j mod M)+1) mod M](i,j), 0 ≤ i,j < N;
  for k := 2 to M do
  begin {circularly shift S in 1×M blocks and add terms}
    RowCircularShift(S,M,-1);
    S(i,j) := S(i,j) + X[((j mod M)+k) mod M](i,j), 0 ≤ i,j < N;
  end;
end;
Figure 6 Consecutive sums in 1×M blocks
_ ______________________________________________
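A sequential check of the consecutive-sum semantics (our sketch; X[j][q] denotes the q'th value held at row position j, and N must be a multiple of M):

def consecutive_sum(X, M):
    # The (j mod M)'th PE of each 1xM block sums the (j mod M)'th X value
    # over its block: S[j] = sum_r X[(j div M)*M + r][j mod M].
    N = len(X)
    return [sum(X[(j // M) * M + r][j % M] for r in range(M)) for j in range(N)]

On the RMESH this is achieved with the 3M-3 broadcasts of Figure 6 rather than the O(N·M) work of the sequential loop.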

2.6 Adjacent Sum

We consider two forms of this operation: row adjacent sum and column adjacent sum. In each, PE (i,j) begins with an array X[0..M-1](i,j) of values. In a row adjacent sum, PE (i,j) is to compute

S(i,j) = Σ_{q=0}^{M-1} X[q](i, (j+q) mod N), 0 ≤ i,j < N

while in a column adjacent sum it is to compute
S(i,j) = Σ_{q=0}^{M-1} X[q]((i+q) mod N, j), 0 ≤ i,j < N

Since the algorithms for the two forms are similar, we discuss only the one for row adjacent sum. The strategy is similar to that for consecutive sum. Each processor initiates a token that will accumulate the desired sum for the processor that is M-1 units to its left. That is, PE (i,j) initiates the token that will eventually hold the desired value of S(i, (N+j-M+1) mod N), 0 ≤ i,j < N. The tokens are shifted left circularly one processor at a time until they reach their destination PE. The details are given in Figure 7. As each circular shift by -1 requires 3 broadcasts, the algorithm of Figure 7 requires 3(M-1) broadcasts.

_ ______________________________________________
procedure RowAdjacentSum(S,X,M);
begin
  S(i,j) := X[M-1](i,j);
  for k := M-2 downto 0 do
  begin
    RowCircularShift(S,N,-1);
    S(i,j) := S(i,j) + X[k](i,j);
  end;
end;
Figure 7 Row adjacent sum
_ ______________________________________________

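The token-passing order of Figure 7 is mirrored by the following sequential Python sketch (ours; one row, wraparound at N, with a plain list rotation standing in for RowCircularShift):

def row_adjacent_sum(X, M):
    # X[j][q]: the q'th value of PE j; result S[j] = sum_q X[(j+q) mod N][q].
    N = len(X)
    S = [X[j][M - 1] for j in range(N)]
    for k in range(M - 2, -1, -1):
        S = S[1:] + S[:1]       # circular shift by -1 (3 broadcasts on the RMESH)
        S = [S[j] + X[j][k] for j in range(N)]
    return S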
3 Shrinking And Expanding

Let I[0..N-1, 0..N-1] be an N×N image and let B_{2q+1}[i,j] represent the block of pixels

B_{2q+1}[i,j] = {[u,v] | 0 ≤ u,v < N, max{|u-i|, |v-j|} ≤ q}

Rosenfeld [ROSE87] has shown that the q-step expansion, E^q, and the q-step shrinking, S^q, of I are given by the equations:

E^q[i,j] = max_{[u,v] ∈ B_{2q+1}[i,j]} {I[u,v]}, 0 ≤ i,j < N

S^q[i,j] = min_{[u,v] ∈ B_{2q+1}[i,j]} {I[u,v]}, 0 ≤ i,j < N

Rosenfeld [ROSE87] presents an algorithm for the pyramid computer which computes S^{2^k-1} and E^{2^k-1} at coarsely resampled points in O(k) time. The complexity is valid for both binary and gray scale images. For unresampled binary image expanding and shrinking in one dimension he developed an O(k²) algorithm to compute S^{2^k-1} and E^{2^k-1}; the generalization to two dimensions results in a complexity of O(2^k). Ranka and Sahni [RANK89] show how S^{2^k-1} and E^{2^k-1} may be computed in O(k) time on an N² PE SIMD hypercube. Their algorithms apply to both binary and gray scale images. E^q and S^q are easily computed in O(q) time on an N×N mesh connected computer.

In this section we develop an algorithm to compute E^q in O(1) time on an N×N RMESH. Our algorithm is for the case of a binary image; S^q, for binary images, can be computed similarly in O(1) time. Since shrinking and expanding are computationally equivalent, we consider only expansion explicitly. From the equation for E^q, it follows that

E^q[i,j] = max_{[u,j] ∈ B_{2q+1}[i,j]} {R^q[u,j]}, 0 ≤ i,j < N

where

R^q[u,j] = max{I[u,v] : |j-v| ≤ q}, 0 ≤ u,j < N

The computation of R^q may be decomposed into subcomputations as below:

left^q[u,j] = max{I[u,v] : 0 ≤ j-v ≤ q}
right^q[u,j] = max{I[u,v] : 0 ≤ v-j ≤ q}
R^q[u,j] = max{left^q[u,j], right^q[u,j]}

Similarly, we may decompose the computation of E^q from R^q:

top^q[i,j] = max{R^q[u,j] : 0 ≤ i-u ≤ q}
bottom^q[i,j] = max{R^q[u,j] : 0 ≤ u-i ≤ q}
E^q[i,j] = max{top^q[i,j], bottom^q[i,j]}

The steps involved in computing R^q for a binary image I are given in Figure 8. The complexity is readily seen to be O(1). E^q is similarly computed from R^q in O(1) time. The algorithm of Figure 8 assumes that all switches are initially connected and that if a processor reads a bus and finds no value on it, the value ∞ is used.

_ ______________________________________________
{Compute R^q[i,j] in variable R of PE (i,j)}
{Assume that I(i,j) = I[i,j] initially}
Step 1 {Compute left^q[i,j] in variable left of PE (i,j): find the nearest 1 on the left}
  PE (i,j) disconnects its N and S switches;
  if I(i,j) = 1 then PE (i,j) disconnects its W switch and broadcasts j on its bus;
Step 2
  PE (i,j) reads its bus and puts the value read in its T variable;
  if j - T(i,j) ≤ q then left(i,j) := 1 else left(i,j) := 0;
Step 3 {Compute right^q[i,j] by finding the nearest 1 on the right}
  PE (i,j) connects its E and W switches;
  if I(i,j) = 1 then PE (i,j) disconnects its E switch and broadcasts j on its bus;
  {the N and S switches remain disconnected from Step 1}
Step 4
  PE (i,j) reads its bus and puts the value read in its T variable;
  if T(i,j) - j ≤ q then right(i,j) := 1 else right(i,j) := 0;
Step 5 {Compute R}
  R(i,j) := left(i,j) or right(i,j)
Figure 8 Computing R^q for a binary image
_ ______________________________________________
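For reference, the decomposition above can be checked against the following sequential Python sketch (ours); it computes E^q directly as a row pass followed by a column pass, which is what Figure 8 and its column analogue achieve in O(1) time on the RMESH:

def expand(I, q):
    # E^q[i][j] = max of I over the (2q+1) x (2q+1) block centered at (i,j),
    # computed as R^q (row maxima) followed by column maxima of R^q.
    N = len(I)
    R = [[max(I[u][max(0, j - q):j + q + 1]) for j in range(N)]
         for u in range(N)]
    return [[max(R[u][j] for u in range(max(0, i - q), min(N, i + q + 1)))
             for j in range(N)] for i in range(N)]

Shrinking S^q is the same computation with max replaced by min.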

4 Clustering

The input to the clustering problem is an N×M feature matrix F[0..N-1, 0..M-1]. Row i of F defines the feature vector for pattern i; thus there are N patterns represented in F. Each column of F corresponds to a pattern attribute, so M is the number of pattern attributes. The objective of clustering is to partition the N patterns into K sets S_0, S_1, ..., S_{K-1}; each S_i is called a cluster. Different methods to cluster have been proposed in [BALL85], [DUDA73], [FUKU72], [FU74], [ROSE82], and [TOU74]. Here, we consider the popular squared error clustering technique. In this we begin with an initial (possibly random) partitioning (i.e., clusters) of the patterns and iteratively improve the clusters as described below.

The center of cluster S_i is a 1×M vector which gives the average of the attribute values for the members of this cluster. The centers of the K clusters may be represented by a K×M matrix C[0..K-1, 0..M-1] where C[i,*] is the center of the i'th cluster S_i and

C[i,j] = (1/|S_i|) Σ_{q ∈ S_i} F[q,j], 0 ≤ i < K, 0 ≤ j < M

The squared distance, d²[i,k], between pattern i and cluster k is defined to be the quantity

d²[i,k] = Σ_{q=0}^{M-1} (F[i,q] - C[k,q])²
One pass of the iterative cluster improvement process is given in Figure 9. The serial complexity of one pass is readily seen to be O(NMK). Parallel algorithms for this problem have been developed by several researchers. Hwang and Kim [HWAN87] have developed an algorithm for a multiprocessor with orthogonally shared memory; Ni and Jain [NI85] have proposed a systolic array; Li and Fang [LI86] have developed an O(KlogNM) algorithm for an SIMD hypercube with NM processors; and Ranka and Sahni [RANK88a] have developed an O(K + logNMK) algorithm for an SIMD hypercube with NM processors as well as an O(logNMK) algorithm for an SIMD hypercube with NMK processors.

_ ______________________________________________
Step 1 [Cluster reassignment]
  NewCluster[i] := j such that d²[i,j] = min_{0 ≤ q < K} {d²[i,q]}, 0 ≤ i < N
  [In case of a tie, pick the least j]
Step 2 [Termination and update]
  if NewCluster[i] = OldCluster[i], 0 ≤ i < N, then terminate
  else OldCluster[i] := NewCluster[i], 0 ≤ i < N
Step 3 [Update cluster centers]
  Compute C[*,*] based on the new clusters
Figure 9 One pass of the iterative cluster improvement algorithm
_ ______________________________________________

_ ______________________________________________
D²(i,j) := ∞, 0 ≤ i,j < √N;
for q := 0 to K-1 do
begin
  PE (i,j) with i√N + j = q broadcasts C[*](i,j) (i.e., cluster center q) to all PEs;
  every PE (a,b) reads the broadcast cluster center into its array A[*](a,b) and then computes
    X(a,b) := Σ_{s=0}^{M-1} (F[s](a,b) - A[s](a,b))²;
  if X(a,b) < D²(a,b) then [NewCluster(a,b) := q; D²(a,b) := X(a,b)];
end;
Figure 10 N processor RMESH algorithm for cluster reassignment
_ ______________________________________________

In [JENQ90], we have developed clustering algorithms for RMESHs with, respectively, N, NM, and NMK processors. The time complexity of these algorithms is, respectively, O(MK + KlogN), O(KlogMN), and O(M + logNMK). Because of space limitations, we discuss only the N processor algorithm here.

We assume that the N processor RMESH is configured as a √N × √N array. Initially, each processor contains one row of the feature matrix and the K cluster centers are stored one per processor. For definiteness, assume that F[q](i,j) = F[i√N + j, q], 0 ≤ i,j < √N, 0 ≤ q < M (note that F[q](i,j) denotes the q'th entry in the array F in processor (i,j) of the RMESH while F[r,s] denotes an entry in the feature matrix F), and that C[q](i,j) = C[i√N + j, q], 0 ≤ i√N + j < K, 0 ≤ q < M.

The computation of the new cluster assignments (step 1 of Figure 9) can be done as in Figure 10. In iteration q of the for loop, the PE that contains cluster center q broadcasts it to all RMESH PEs. Following this, each PE computes the squared distance between this center and the pattern it contains. If this squared distance is smaller than the least squared distance computed so far for this pattern, then cluster q becomes the candidate for the new cluster assignment of this pattern. The correctness of Figure 10 is easily established. Each iteration of the for loop takes O(M) time, as the center broadcast involves the transmission of M values over a bus that includes all N processors and the computation of the squared distance involves O(M) arithmetic per processor. The overall time for Figure 10 is therefore O(MK).
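Sequentially, the loop of Figure 10 is just a nearest-center search; the Python sketch below (ours; F is the N×M feature matrix as nested lists, C the K×M center matrix) mirrors its broadcast-one-center-at-a-time structure:

def reassign(F, C):
    # Track, per pattern, the best squared distance seen so far while the K
    # centers are considered one at a time, as in Figure 10. Strict '<'
    # makes ties go to the least center index, matching step 1 of Figure 9.
    n, K = len(F), len(C)
    new_cluster, d2 = [None] * n, [float("inf")] * n
    for q in range(K):                  # center q is "broadcast" to all PEs
        for a in range(n):              # each PE does O(M) arithmetic
            x = sum((f - c) ** 2 for f, c in zip(F[a], C[q]))
            if x < d2[a]:
                new_cluster[a], d2[a] = q, x
    return new_cluster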

Step 2 of Figure 9 can be done in O(1) time as in Figure 11.

_ ______________________________________________
Step 1 if NewCluster(a,b) = OldCluster(a,b) then mark(a,b) := false
  else [mark(a,b) := true; OldCluster(a,b) := NewCluster(a,b)], 0 ≤ a,b < √N
Step 2 PEs (i,0) for i even disconnect their S switch;
  PEs (i, √N-1) for i odd disconnect their S switch;
  PEs (i,j) for j ∉ {0, √N-1} disconnect their S switch;
  {This results in a snake-like bus that connects all PEs}
Step 3 if mark(i,j) then PE (i,j) disconnects its W switch in case i is odd and its E switch in case i is even, 0 ≤ i,j < √N
  {Following this, the bus of Step 2 extends only up to the first PE on the bus with mark(i,j) = true}
Step 4 if mark(i,j) then PE (i,j) broadcasts 1 on its bus;
  PE (0,0) reads its bus; if a 1 is not read then PE (0,0) terminates the program;
Figure 11 N processor RMESH algorithm to update cluster assignments and terminate if necessary
_ ______________________________________________

The new cluster centers are computed one cluster at a time. To compute the new center of cluster i we need to sum, componentwise, the feature vectors of all patterns in cluster i and divide by the number of such patterns. The number of patterns in cluster i can be determined by having each PE set its A variable to 1 if its pattern is in this cluster and to 0 if it isn't. Then the data sum operation of Section 2.2 can be used to sum up the A's in O(logN) time. Let us therefore concentrate on computing the sum of the feature vectors of the patterns in cluster i. For this, we consider two cases: M ≥ √N and M < √N.

If M ≥ √N, the algorithm of Figure 12 is used. Steps 1 through 3 sum the feature vectors in cluster i along the rows of the RMESH; this sum is eventually stored in the column 0 processors. The strategy employed is quite similar to that used in Section 2.5 to compute consecutive sums. Step 4 then sums these row sums and step 5 sends the overall sums to the PE that stores cluster i's center. Each step of Figure 12 other than step 2 takes O(M) time; step 2 takes O(1) time. So, the overall complexity is O(M).

_ ______________________________________________
Step 1 PE (q,0) initiates M tokens of type 1, one at a time. These tokens travel along row q of the RMESH in a pipelined manner. The j'th token accumulates the sum of the j'th feature value of all patterns in row q that are in cluster i, 0 ≤ q < √N.
Step 2 Set up row buses.
Step 3 PE (q, √N-1) broadcasts the tokens back to PE (q,0).
Step 4 PE (0,0) initiates M tokens of type 2, one at a time. These travel along column 0 of the RMESH. The j'th token of type 2 accumulates the sum of the j'th tokens of type 1 in the column 0 processors.
Step 5 PE (√N-1, 0) broadcasts the M type 2 tokens it has received to PE (a,b), a√N + b = i.
Figure 12 Feature vector sum for cluster i
_ ______________________________________________

If M < √N, tile the √N × √N RMESH with 1×M blocks. Consider, first, the case when M divides √N. Now the feature vector sums can be computed using the algorithm of Figure 13, which is self-explanatory. Steps 1 and 6 each take O(M) time, steps 2 and 5 each take O(logN) time, and steps 3 and 4 each take O(1) time. The overall complexity of the algorithm of Figure 13 is therefore O(M + logN).

_ ______________________________________________
Step 1 The consecutive sum operation of Section 2.5 is used in each 1×M block of PEs. The data is the feature vectors in the block that correspond to patterns in cluster i. The result is stored in each PE's A variable: the j'th PE in each 1×M block computes, in its A variable, the sum of the j'th feature value of all patterns in its block that are also in cluster i.
Step 2 The A values are summed along the columns of the RMESH. The result is stored in the B variables of the processors in row 0.
Step 3 The B values are broadcast down column buses. Following this, the B values of all PEs in the same column are the same.
Step 4 PE (a,b) sets its B value to 0 if a > M-1 or a ≠ b mod M. Following this, the nonzero B values in row a, a < M, all correspond to feature a.
Step 5 Sum the values on each row using the data sum operation of Section 2.2. Put the result in the D variable of the column 0 processors.
Step 6 PEs (a,0), 0 ≤ a < M, broadcast their D values serially to PE (c,d), c√N + d = i. PE (c,d) saves the received M values as the sum of the feature vectors of cluster i.
Figure 13 Summing the feature vectors in cluster i
_ ______________________________________________

When M < √N and M does not divide √N, the last two blocks of each row can be considered as a single block of length at most 2M-1 (i.e., the *'d block in each row of Figure 14(b) is combined with the 1×M block immediately to its left). The algorithm is essentially that of Figure 13 except that the consecutive sum operation (step 1) is modified to handle the larger block at the end of each row. This modification still takes O(M) time. Further, the rightmost √N mod M columns of the RMESH are not involved in steps 2 through 6. The complexity remains O(M + logN).

O(M + logN) time suffices to compute each cluster center, so computing all K of them takes O(MK + KlogN) time. Since cluster reassignment and update take O(MK) and O(1) time, respectively, the time for one pass of the iterative cluster improvement algorithm is O(MK + KlogN) on an RMESH with N PEs.
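Sequentially, the center update that Figures 12 and 13 carry out in O(M + logN) time per cluster is just a componentwise mean; a minimal sketch (ours, assuming cluster i is nonempty):

def new_center(F, cluster, i):
    # Sum the feature vectors of the patterns assigned to cluster i and
    # divide by their number.
    members = [F[p] for p in range(len(F)) if cluster[p] == i]
    return [sum(f[j] for f in members) / len(members) for j in range(len(F[0]))]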
5 Template Matching

The inputs are an N×N image matrix I[0..N-1, 0..N-1] and an M×M template matrix T[0..M-1, 0..M-1]. The output is the two dimensional convolution, C2D, of I and T, which is defined as:

C2D[i,j] = Σ_{u=0}^{M-1} Σ_{v=0}^{M-1} I[(i+u) mod N, (j+v) mod N] · T[u,v], 0 ≤ i,j < N
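A direct sequential evaluation of this definition (our reference sketch, O(N²M²) time) is useful for checking the RMESH algorithms characterized below:

def c2d(I, T):
    # C2D[i][j] = sum over u,v of I[(i+u) mod N][(j+v) mod N] * T[u][v].
    N, M = len(I), len(T)
    return [[sum(I[(i + u) % N][(j + v) % N] * T[u][v]
                 for u in range(M) for v in range(M))
             for j in range(N)] for i in range(N)]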

The RMESH algorithms we have developed [JENQ90] have the following characteristics:

Algorithm 1: N² processors, O(M) memory per processor, O(M²) complexity.
Algorithm 2: N² processors, O(1) memory per processor, O(M²) complexity.
Algorithm 3: N²M² processors, O(1) memory per processor, O(M) complexity.

While one can obtain O(M²) RMESH template matching algorithms that use N² processors by simulating the known mesh connected computer algorithm of this complexity ([RANK88b]), the algorithms we propose are considerably simpler.

6 Conclusions

We have developed efficient parallel RMESH algorithms for shrinking, expanding, clustering, and template matching.

7 References

[BALL85] D. H. Ballard and C. M. Brown, Computer Vision, Prentice Hall, New Jersey, 1985.
[DUDA73] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, John Wiley and Sons, New York, 1973.
[FU74] K. S. Fu, Syntactic Methods in Pattern Recognition, Academic Press, New York, 1974.
[FUKU72] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, New York, 1972.
[HWAN87] K. Hwang and D. Kim, "Parallel pattern clustering on a multiprocessor with orthogonally shared memory", Proceedings of the 1987 International Conference on Parallel Processing, The Pennsylvania State University Press, 1987, pp 913-916.
[JENQ90] J. Jenq and S. Sahni, "Reconfigurable mesh algorithms for image shrinking, expanding, clustering, and template matching", University of Florida, 1990.
[LI86] X. Li and Z. Fang, "Parallel algorithms for clustering on hypercube SIMD computers", Proceedings IEEE Conference on Computer Vision and Pattern Recognition, 1986, pp 130-133.
[MILL88a] R. Miller, V. K. Prasanna Kumar, D. Reisis and Q. Stout, "Data movement operations and applications on reconfigurable VLSI arrays", Proceedings of the 1988 International Conference on Parallel Processing, The Pennsylvania State University Press, 1988, pp 205-208.
[MILL88b] R. Miller, V. K. Prasanna Kumar, D. Reisis and Q. Stout, "Meshes with reconfigurable buses", Proceedings 5th MIT Conference on Advanced Research in VLSI, 1988, pp 163-178.
[MILL88c] R. Miller, V. K. Prasanna Kumar, D. Reisis and Q. Stout, "Image computations on reconfigurable VLSI arrays", Proceedings IEEE Conference on Computer Vision and Pattern Recognition, 1988, pp 925-930.
[NI85] L. M. Ni and A. K. Jain, "A VLSI systolic architecture for pattern clustering", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-7, no. 1, Jan. 1985, pp 80-89.
[RANK88a] S. Ranka and S. Sahni, "Clustering on a hypercube multicomputer", TR 88-15, University of Minnesota, 1988.
[RANK88b] S. Ranka and S. Sahni, "Convolution on SIMD mesh connected multiprocessors", Proceedings 1988 International Conference on Parallel Processing, The Pennsylvania State University Press, 1988, pp 212-217.
[RANK89] S. Ranka and S. Sahni, "Hypercube algorithms for image transformations", Proceedings of the 1989 International Conference on Parallel Processing, The Pennsylvania State University Press, 1989, pp 24-31.
[ROSE82] A. Rosenfeld and A. C. Kak, Digital Picture Processing, Academic Press, New York, 1982.
[ROSE87] A. Rosenfeld, "A note on shrinking and expanding operations in pyramids", Pattern Recognition Letters, vol. 6, no. 4, 1987, pp 241-244.
[TOU74] J. T. Tou and R. C. Gonzalez, Pattern Recognition Principles, Addison-Wesley, 1974.
