Design of 2D Filters using a Parallel Processor Architecture
computer science crazy Super Moderator Posts: 3,048 Joined: Dec 2008 
21-09-2008, 11:32 AM
Two-dimensional filters are usually part of the implementation of digital image processing applications. These filters process recursive sets of instructions and require high computational speed. Optimized implementations of these filters depend on the use of Application Specific Integrated Circuits (ASICs). A system with multiple parallel processing units is a feasible design option able to achieve the required computational performance. In this paper, a loop transformation algorithm that allows the efficient utilization of a parallel multiprocessor system is presented. Uniform nested loops representing the image filters and the available processors are modeled as multidimensional data flow graphs. A new loop structure is generated so that an arbitrary number of processors available in the system can run in parallel.

INTRODUCTION: Image enhancement and edge detection are well-known digital image signal processing applications that may require two-dimensional (2-D) filter-like computational solutions. These applications usually depend on computation-intensive code sections consisting of the repetition of sequences of operations. They are also characterized by the multidimensionality of the data involved. An effective technique for improving the computing performance of such applications has been the design and use of Application Specific Integrated Circuits (ASICs). This paper presents a new technique applicable to the design of a 2-D filter system using multiple parallel processors. A multidimensional retiming algorithm embedded in this new technique provides the fully parallel utilization of the available processors, thus reducing the overall execution time of the filter function. Parallel architectures are an important tool in ASIC design. However, these architectures require a careful partitioning of the problem in order to improve the utilization of the parallel processors [2, 17, 24].
During the circuit design phase, nested loop structures can be coded using hardware description languages such as VHDL in order to reduce the design time. However, in VHDL the loop control indices represent the number of times a section of the circuit will be replicated in the final synthesis, under the assumption that there are no physical or cost constraints on the circuit implementation. In this paper, a multidimensional retiming technique is used to transform the loop in such a way as to produce a parallel solution to the problem for a given number of processing units. Such a solution can then be implemented on a standard multiprocessor architecture. Retiming was originally proposed by Leiserson and Saxe, focusing on improving the cycle time of one-dimensional problems [13]. Most work done in this area is subject to limitations imposed by the number of delays (memory elements) existing in a cycle of a data flow graph representing the problem [3, 6, 10, 11, 12, 16, 22, 25]. Other methods focus on multiprocessor scheduling and are also applicable to one-dimensional problems [7, 8, 14, 16, 18]. This study focuses on the parallelism inherent in multidimensional applications, which is ignored by the one-dimensional methods. Retiming and other loop transformations have since been applied in areas such as scheduling and parallel processing, with the main goal of exploiting fine-grain parallelism in the loop body [4, 15]. Due to their different focus in obtaining parallelism, those techniques are not aimed at improving the execution of parallel iterations in multiprocessor systems. Research by Passos and Sha extended the retiming concept to multidimensional (MD) applications [19]. The multidimensional retiming concept is used in this paper to model the partitioning of the loop among the available processors.
Multidimensional retiming brings some advantages to the process, since it is readily applicable to the multidimensional fields considered, eliminating the need for a loop transformation that converts the original problem to one dimension. Another significant advantage of MD retiming is that its applicability is unrestricted, as it is not constrained by the characteristics of the one-dimensional methods.

Use Search at http://topicideas.net/search.php wisely To Get Information About Project Topic and Seminar ideas with report/source code along pdf and ppt presentation



seminar topics Active In SP Posts: 559 Joined: Mar 2010 
16-03-2010, 07:37 AM
Design Of 2D Filters Using A Parallel Processor.doc (Size: 87.5 KB / Downloads: 68)

Design Of 2D Filters Using A Parallel Processor Architecture

ABSTRACT: Two-dimensional filters are usually part of the implementation of digital image processing applications. These filters process recursive sets of instructions and require high computational speed. Optimized implementations of these filters depend on the use of Application Specific Integrated Circuits (ASICs). A system with multiple parallel processing units is a feasible design option able to achieve the required computational performance. In this paper, a loop transformation algorithm that allows the efficient utilization of a parallel multiprocessor system is presented. Uniform nested loops representing the image filters and the available processors are modeled as multidimensional data flow graphs. A new loop structure is generated so that an arbitrary number of processors available in the system can run in parallel. An example and a description of the algorithm are presented in the paper.

INTRODUCTION: Image enhancement and edge detection are well-known digital image signal processing applications that may require two-dimensional (2-D) filter-like computational solutions. These applications usually depend on computation-intensive code sections consisting of the repetition of sequences of operations. They are also characterized by the multidimensionality of the data involved. An effective technique for improving the computing performance of such applications has been the design and use of Application Specific Integrated Circuits (ASICs). This paper presents a new technique applicable to the design of a 2-D filter system using multiple parallel processors. A multidimensional retiming algorithm embedded in this new technique provides the fully parallel utilization of the available processors, thus reducing the overall execution time of the filter function.
Parallel architectures are an important tool in ASIC design. However, these architectures require a careful partitioning of the problem in order to improve the utilization of the parallel processors [2, 17, 24]. During the circuit design phase, nested loop structures can be coded using hardware description languages such as VHDL in order to reduce the design time. However, in VHDL the loop control indices represent the number of times a section of the circuit will be replicated in the final synthesis, under the assumption that there are no physical or cost constraints on the circuit implementation. In this paper, a multidimensional retiming technique is used to transform the loop in such a way as to produce a parallel solution to the problem for a given number of processing units. Such a solution can then be implemented on a standard multiprocessor architecture. Retiming was originally proposed by Leiserson and Saxe, focusing on improving the cycle time of one-dimensional problems [13]. Most work done in this area is subject to limitations imposed by the number of delays (memory elements) existing in a cycle of a data flow graph representing the problem [3, 6, 10, 11, 12, 16, 22, 25]. Other methods focus on multiprocessor scheduling and are also applicable to one-dimensional problems [7, 8, 14, 16, 18]. This study focuses on the parallelism inherent in multidimensional applications, which is ignored by the one-dimensional methods. Retiming and other loop transformations have since been applied in areas such as scheduling and parallel processing, with the main goal of exploiting fine-grain parallelism in the loop body [4, 15]. Due to their different focus in obtaining parallelism, those techniques are not aimed at improving the execution of parallel iterations in multiprocessor systems. Research by Passos and Sha extended the retiming concept to multidimensional (MD) applications [19].
The multidimensional retiming concept is used in this paper to model the partitioning of the loop among the available processors. Multidimensional retiming brings some advantages to the process, since it is readily applicable to the multidimensional fields considered, eliminating the need for a loop transformation that converts the original problem to one dimension. Another significant advantage of MD retiming is that its applicability is unrestricted, as it is not constrained by the characteristics of the one-dimensional methods.

As an example, consider a filter with transfer function

H(z1, z2) = 1 / (1 - Σ a(n1,n2) · z1^(-n1) · z2^(-n2)), for (n1, n2) not simultaneously zero.

A multidimensional data flow graph (MDFG) can be used to represent this problem, as seen in figure 1(a). Figure 1(b) shows a digital circuit design with multiple functional units (multipliers and adders) and memory elements comprising a single-processor system designed to solve the filter problem represented in figure 1(a). The 2-D delay (1,1) in the MDFG is usually implemented by a FIFO structure equivalent to a serial implementation of z1^(-1) and z2^(-1) elements. The delay (0,1) is equivalent to z2^(-1), and the delay (1,0) to z1^(-1). The sequential execution time for this design is equivalent to the serial execution of three additions and one multiplication. In the nested loop representation of the filter, each z1^(-1) element represents only one delay, or storage element. Assuming that two identical processors, with the internal design shown in figure 1(b), are available for parallel execution of this loop, an optimization problem arises. The z2^(-1) delay becomes a direct dependency between two consecutive iterations being executed on the two processors. The same delay also produces a one-delay dependency between two consecutive utilizations of the two-processor system.
This implies a non-parallel execution of the two processors, which the 1-D retiming technique cannot change, because the number of delays in the cycle involving the two processors and containing the z2^(-1) delay remains constant. Using multidimensional retiming on the MDFG representing the parallel implementation of the circuit, that constraint can be eliminated. In this paper, the focus is on obtaining parallelism between separate iterations of the loop that can be run on different processors, rather than on the fine-grain parallelism that optimizes the execution on each individual processing element. As a result, the graph to be optimized will be representative of iterations being executed in parallel. This paper presents the new loop transformation method, based on the multidimensional retiming technique, and its applicability to the design of filters implemented in multiprocessor systems. It starts in the next section with an introduction to the basic principles involved, including an overview of multidimensional retiming. Section 3 shows the theoretical aspects of the processor allocation technique, followed by a description of the utilization of the method in a more complex example. A final section summarizes the concepts introduced in the paper.

BASIC PRINCIPLES: A multidimensional data flow graph (MDFG) G = (V, E, d, t) is a node-weighted and edge-weighted directed graph, where V is the set of computation nodes, E is the set of dependence edges equivalent to the circuit data paths, d is a function representing the MD delay between two nodes, and t is a function representing the computation time of each node. An iteration is the execution of each node in V exactly once. Iterations are described by a vector w, equivalent to a multidimensional index, starting at (0,...,0). Iteration dependencies are represented by vector-weighted edges.
An edge e from u to v with delay vector d(e) means that the computation of node v at iteration w depends on the execution of node u at iteration w - d(e). Thus, an edge with delay (0,...,0) represents a data dependence within the same iteration. A legal MDFG must have no zero-delay cycle, i.e. the summation of the delay vectors along any cycle cannot be (0,...,0). The iteration space of a loop is defined to be the region in the MD discrete Cartesian space whose points correspond one-to-one to the iteration indices. This paper considers loops that present the characteristic of constant dependence vectors, i.e. their data dependencies are at a constant distance in the iteration space. These loops are called uniform loops. Re-examining the example presented in the previous section, the 2-D filter for an image of N by N pixels can be represented by the nested loop below:

for (i = 0; i < N; i++)
  for (j = 0; j < N; j++) {
    y(i,j) = c1*y(i-1,j-1) + c2*y(i,j-1) + c3*y(i-1,j) + x(i,j)
  }

In the MDFG shown in figure 1, the nodes A, B and C represent the three required multiplications, while the remaining nodes show the necessary additions. A multidimensional retiming function r: V → Z^n applied to a graph G = (V, E, d, t) redistributes the nodes in the original iteration along the iteration space of the MDFG G. The result of this process is a new graph, Gr = (V, E, dr, t), in which each iteration still contains one instance of every node in G. This transformation is equivalent to a redistribution of the delay elements in the graph. The multidimensional retiming function r(u) for a node u ∈ G represents the offset between the original iteration containing u and the one after retiming. This offset change implies a corresponding change in the delay vectors (edge weights) associated with the MDFG, so that the original dependencies are preserved. Thus, for an edge e between nodes u and v, dr(e) = d(e) + r(u) - r(v).
PROCESSOR ALLOCATION: Given an iteration space representing the execution of a two-dimensional nested loop, the allocation of multiple processors is done along a line parallel to the i-axis (outermost index of the loop). A processor dependency graph (PDG) P, defined below, shows the dependencies among the different processors used in the execution of the nested loop.

Definition: For a given MDFG G = (V, E, d, t) and k processors, a PDG P = (P, e, d) is a node-weighted and edge-weighted directed graph, where P is the set of processors, with |P| = k, e is the set of dependence edges between processors, and d is a function representing the MD delay between two processors.

An MDFG is transformed to a processor dependency graph (PDG) according to the number of available processors established by the filter designer. The PDG shows the dependency edges among the various processors, assuming they are running consecutive iterations of the original loop code. Figure 3(a) shows the inter-iteration dependencies originating in processor P0 and required in the example seen in figure 1. Figure 3(b) shows the PDG representation of the same graph in the two-processor arrangement. Figure 3: (a) Inter-iteration dependencies (b) PDG representation.

In the allocation of processors, and implicitly in the application of the retiming technique, two properties of MDFGs that model loops written with regular constructs need to be noted.

Property 1: Given an MDFG G = (V, E, d, t), a retiming function r = (rx, ry), a node u ∈ V and a self-loop edge e ∈ E from u to u, applying the retiming function r(u) to u does not change d(e).

Property 2: Given a nested loop of the form

for (i = 0; i < N; i++)
  for (j = 0; j < N; j++) {
    a[i,j] = f(b[i',j'])
    b[i,j] = g(a[i'',j''])
  }

with dependence vectors (x,y), where x indicates the iteration distance according to the outermost loop and y represents the innermost loop, then x ≥ 0.
Property 3: Given a nested loop like the one shown in Property 2, represented by a PDG P = (P, e, d), and a pair of processors p, q ∈ P assigned to iteration instances (i, j) and (i', j) respectively, then for any edge e = p → q, d(e) ≥ (0,0). As a consequence of Property 3, a dependence vector such as (2,-3) will not be in the set of dependence vectors associated with the loop. Using these properties, a transformation algorithm can be applied to the PDG for optimal performance. In this algorithm, the memory access time is assumed to be uniform, and the loop structures under consideration have constant dependencies. With these assumptions, the MDFG representing the original problem is translated into a PDG according to the lemma below:

Lemma: A processor dependency graph PDG P = (P, e, d) is such that, given an MDFG G = (V, E, d, t) and k processors, then for every e ∈ E connecting nodes u, v ∈ V with d(e) = (x,y) there exists e' ∈ e connecting processor n, which contains node u, to processor m, which contains node v, with m, n ∈ P, 0 ≤ m, n < k, and:

m = (n + x) mod k
d(e') = (x', y') = (int((n + x) / k), y)

In order to guarantee the parallelism of the multiprocessor system, a multidimensional retiming function r(u) = (0, ry) is applied to the PDG to change the dependence edges, eliminating sequential processing among different processors. It can be proven that (0, ry) is always a valid multidimensional retiming vector in a multiprocessor environment and, therefore, a convenient retiming function such as (0,1) can be applied and will result in the elimination of direct dependencies from the PDG, as stated in the theorem below:

Theorem: Given a PDG P = (P, e, d) with at least one edge e ∈ e such that d(e) = (0,0), there exists a retiming function r of the form (0, ry(u)) for all u ∈ P that, applied to P, creates Pr = (P, e, dr) such that for every e ∈ e, dr(e) > (0,0).

This theorem can be proven by analyzing three possible cases and considering the properties above.
The three cases are:

- d(e) = (0,0), which requires the application of the retiming function r, resulting in a dependence dr(e) = d(e) + r > (0,0).
- d(e) = (0,y), which, according to the properties above, after retiming results in dr(e) > (0,0).
- d(e) = (x,y) with x > 0, which produces dr(e) = (x, y - ry). In this case, considering that x > 0, then dr(e) > (0,0).

Just as in the case of MDFGs, full parallelism is achieved when all edges between any two processors become non-zero-delay edges. The loop bodies are then modified according to the retiming function applied to the PDG. Figure 2(b) shows the inter-processor dependencies after the retiming technique was applied to the PDG in figure 3. A synchronization function is now needed to trigger the execution of each processor after an initialization stage, usually called a prologue, which is inherent to retiming. The synchronization function call is controlled by the values of the indices of the loop and the number of the processing unit. When a processor has executed the instructions comprising its mandatory prologue and has already initialized the data required for the parallel execution, the function is activated. The general format of the final code of the loop body for each processor is shown next:

Code for processor #n (k = number of processors):

for (i = n; i < N; i += k)
  for (j = 0; j < N; j++) {
    if ((i == n) && (j == ry(pn) - ry(pn+1)))
      SYNC(pn+1)
    y(i,j) = c1*y(i-1,j-1) + c2*y(i,j-1) + c3*y(i-1,j) + x(i,j)
  }

The entire process is then summarized in the algorithm MRA, a modified version of the previously presented OPTIMUS algorithm.

Multiprocessor Retiming Algorithm (MRA):
Input: MDFG G = (V, E, d, t), number of processors k
Output: retimed G

PDG P = (P, e, d) ← transform(G, k)
rv ← (0,1)
MC(u) ← 0 for all u ∈ P
MCmax ← 0
QueueP ← ∅
/* remove original edges with positive delays */
e ← e - {e s.t. d(e) > (0,...,0)}
/* queue non-dependent nodes */
QueueP ← QueueP ∪ {u ∈ P s.t.
indegree(u) = 0}
while QueueP ≠ ∅:
  get(u, QueueP)
  /* check if u needs to be retimed */
  if ∃ v ∈ P s.t. d(u → v) = (0,0):
    /* adjust the MC(u) value */
    MC(u) ← MC(u) + 1
    MCmax ← max{MC(u), MCmax}
  endif
  /* propagate the values to the successor nodes of u */
  for each v ∈ P such that u → v:
    indegree(v) ← indegree(v) - 1
    MC(v) ← max{MC(v), MC(u)}
    /* check for new non-dependent nodes */
    if indegree(v) = 0:
      QueueP ← QueueP ∪ {v}
    endif
  endfor
endwhile
/* compute the multidimensional retiming */
for all u ∈ P: r(u) ← (MCmax - MC(u)) * rv
End

EXAMPLE: In this section, the processor allocation algorithm is applied to the IIR section of a 2-D filter design presented earlier in the literature. This filter requires a loop with iterations containing eight multiplications and seven additions. The corresponding PDG for a three-processor system implementing this filter is shown in figure 4 (Processor Dependency Graph for the IIR problem). After applying the multidimensional retiming to the processor dependency graph, the PDG changes to the structure presented in figure 5 (Retimed PDG for the IIR problem). In this PDG, P0 has been retimed twice using the retiming function (0,1), resulting in an overall retiming value r(P0) = (0,2). P1 has been retimed once, with the overall retiming r(P1) = (0,1). P2 did not undergo retiming, and therefore its retiming value is r(P2) = (0,0). These retiming values introduce multidimensional delays in all edges representing dependencies between processors. These new delays represent stored data that allow the parallel execution of the tasks assigned to the different processing elements.
The code for processors 0 and 1 under the stated conditions becomes:

k = 3 (number of processors)

/* processor 0 */
for (i = 0; i < N; i += k)
  for (j = 0; j < N; j++) {
    if ((i == 0) && (j == 1))
      SYNC(p1)
    /* filter operations */
  }

/* processor 1 */
for (i = 1; i < N; i += k)
  for (j = 0; j < N; j++) {
    if ((i == 1) && (j == 1))
      SYNC(p2)
    /* filter operations */
  }

The SYNC command sends a signal to the named processor, informing it that its required data has been computed and allowing it to initiate its execution sequence. Processor P0 triggers processor P1 after a first iteration has been executed, while processor P1 triggers P2 after two iterations of P0 (or one of P1). The nonexistence of (0,0) delays in the resulting graph shows that the iterations can be run in parallel on the three-processor system.

CONCLUSION: This paper has introduced an algorithm based on multidimensional retiming that allows an arbitrary number of processors to work in parallel on applications that make use of nested loop constructs. In particular, this paper demonstrated the application of this new method to the case of a two-dimensional image filter. The loops are modeled in the form of multidimensional dependence graphs, which are transformed to multiprocessor-equivalent versions. Such loops are retimed in order to guarantee fully parallel execution of the nested loop. After retiming, the modified code for each processor is easily generated. A synchronization signal sent between processors guarantees the availability of initial data and correct execution of the iterations distributed across different processors. The theory and basis for the algorithm were presented, and an example was shown illustrating the use of the algorithm.

REFERENCES:
- R. Gnanasekaran, "2-D Filter Implementation for Real-Time Signal Processing," IEEE Transactions on Circuits and Systems.
- K. K. Parhi and D. G. Messerschmitt, "Fully-Static Rate-Optimal Scheduling of Iterative Data-Flow.



projectsofme Active In SP Posts: 1,124 Joined: Jun 2010 
18-12-2010, 03:18 PM
Parallel Skeltons.ppt (Size: 1.44 MB / Downloads: 62)

This article is presented by: Dilbag Singh Gill

Parallel Processors, Parallel Algorithms & Parallel Skeletons

INTRODUCTION: In today's world, people rely more than ever on computers. Computers are used for many things: communication, sharing information, ordering goods and services. Parallel computing has grown over the past several years, driven by the physical limitations of single processors, and occurs in several different forms: bit-level, instruction-level, data, and task parallelism. To make parallel programming more manageable, parallel skeletons are used. Parallel skeletons, a form of parallel computing, are created on various types of hardware and help make parallel program development easier by deriving its components from computational patterns.

What is Parallelism? As processing power continues to become cheaper, it is natural to build machines with multiple processors. Parallel processors can execute multiple programs simultaneously instead of merely concurrently. Can we also write a program such that parallel processors can work on a single program in parallel through some form of cooperation? Yes; the solution will use a parallel algorithm. Questions: How can we parallelize an algorithm? How can we handle multiple processors accessing memory at the same time? What is the computational complexity? What problems can be parallelized so that they obtain a speedup? [1] & [2]

The PRAM Model: The parallel random access memory (PRAM, pronounced "p-ram") model of parallel computation consists of p general-purpose processors, P0, P1, ..., Pp-1, all of which are connected to a large shared random access memory M, which is treated as a (very large) array of integers. [3]

PRAM vs. Hypercube and Bounded-Degree Networks: The PRAM model would be difficult and expensive to provide in actual hardware.
In it, every processor can communicate with any other in two steps: one processor writes some data to a memory location on one step, and the other processor reads that location on the next step. A model that more closely resembles some actual hardware is the hypercube. A hypercube has 2^d processors for some d (the dimension), each connected to its neighbors. In the hypercube, each processor has its own memory and communicates with the other processors by sending messages. At each step, each processor may do some computation and then send a message to one of its neighbors.

Another class of models, called bounded-degree networks, restricts the connections still further. In a bounded-degree network, each processor is directly connected to at most d other processors. There are different designs for bounded-degree networks; an 8x8 network is illustrated in Figure 3. Hypercube and bounded-degree networks are more realistic models than the PRAM model, but algorithms for them can be much harder to specify and analyze. The routing of messages among the processors, an interesting problem in itself, is eliminated in the PRAM model. [3]

Parallel Addition: log(n) steps = time needed; n/2 processors needed; speedup = n/log(n); efficiency = 1/log(n). Applicable to other operations too: +, *, <, >, ==, etc. [2]

Parallel Skeletons: A form of parallel computing, parallel skeletons are created on various types of hardware and help make parallel program development easier by deriving components from computational patterns. Parallel skeletons are composed of higher-order functions that are categorized as node-wise, bottom-up, and top-down computations, and take the form of either binary or rose tree skeletons. [8] & [12] When it comes to parallel program development, parallel skeletons are sometimes referred to as the "building blocks" of parallel programs. This is because parallel skeletons are ready-made parts that control operations used to build parallel programs.
Skeletons hide difficult parallel implementations, giving users parallel computational patterns in a clear, brief manner. [12] & [13]

Specific Hardware for Parallel Skeletons: Computer clusters: a computer cluster is basically a group of computers linked together by a local area network. [6] Distributed computing is very similar to cluster computing, in that the computers in the distributed network are each given part of a collective task. [8] & [9] Multicore: a multicore processor is a system that works on algorithms and calculations and is composed of two or more CPUs. [10] SWARM is an open-source parallel library that uses multicore processors to their full extent. The framework of SWARM consists of sequential code embedded with parallel skeletons, followed by algorithms that divide the code among processing cores. [11]

Different Classes and Primitives of Parallel Skeletons: The five most commonly used primitive parallel skeletons that clearly represent a data-parallel skeleton are the map, reduce, scan, zip, and accumulate functions. Map: applies a given function to every value in a specified region. Reduce: uses a binary operator to condense a list into a single variable or value. Scan: collects the intermediate results of the reduce function. Zip: combines two lists into a single one. Accumulate: acts on partitioned data and abstracts good combinations of primitive skeletons such as map, reduce, scan, and zip.


seminar paper Active In SP Posts: 6,455 Joined: Feb 2012 
18-02-2012, 10:36 AM
To get information about the topic "Design Of 2D Filters Using A Parallel Processor Architecture" (full report, ppt, and related topics), refer to the links below:
http://topicideas.org/howtodesignof2...chitecture
http://topicideas.org/howtodesignof2...llseminar


