Architectural modifications to enhance the floating point performance of FPGA
science projects buddy|
Active In SP
Joined: Dec 2010
26-12-2010, 09:44 AM
ARCHITECTURAL MODIFICATIONS TO ENHANCE THE FLOATING-POINT PERFORMANCE OF FPGA
DEPARTMENT OF ELECTRONICS AND COMMUNICATION
COLLEGE OF ENGINEERING
With the latest technologies, FPGAs have reached the point where they are capable of implementing complex floating-point applications. However, the use of FPGAs for scientific applications that require floating-point operations is still limited, which motivates improvements to the FPGA architecture for floating-point operation. This paper considers three architectural modifications that make floating-point operations more efficient on FPGAs. Before the modifications are discussed, the present architecture and the floating-point numbering system are reviewed. The dominant style of current FPGAs is the island-style FPGA, consisting of a 2-D lattice of CLBs, and the three modifications presented here are based on it. The first modification is an embedded FPU that implements a double-precision floating-point multiply-add operation in an island-style FPGA. A first-in, first-out (FIFO) buffer is also provided in parallel with the multiplier so that the pipelines are balanced. These coarse-grained units provide a dramatic gain in area and clock rate at the cost of dedicating significant silicon resources to hardware that is very domain specific. Other features of floating-point arithmetic lend themselves to finer-grained approaches. Floating-point arithmetic requires shifters of variable length and direction, referred to here as variable length shifters. The first alternative to lookup tables (LUTs) for implementing the variable length shifters is a coarse-grained approach: embedding variable length shifters in the FPGA fabric. These shifters provide a significant reduction in area with a modest increase in clock rate, and they are smaller and more general than embedded floating-point units. The fine-grained approach adds a 4:1 multiplexer inside each configurable logic block (CLB), in parallel with each 4-LUT. This modification provides the smallest overall area improvement but a significant improvement in clock rate, at the cost of a trivial increase in the size of the CLB.
Field-Programmable Gate Arrays (FPGAs) are currently used as a means to accelerate scientific applications. Until recently, scientific computing was the exclusive domain of microprocessors, but the performance of microprocessors is limited by their lack of customizability. In contrast, application-specific integrated circuits (ASICs) can be highly efficient at floating-point computations, but they do not have the programmability needed for typical scientific computing environments. It is now possible to implement a variety of scientific algorithms on FPGAs, thanks to increases in FPGA density and optimizations of floating-point elements for FPGAs. In spite of this, the floating-point performance of FPGAs must increase dramatically to offer a compelling advantage for this domain.
There are still significant opportunities to improve the floating-point performance of FPGAs by optimizing the device architecture. Fixed-point operations have long been common on FPGAs, and FPGA architectures have introduced targeted optimizations for them, such as fast carry-chains, cascade chains, and embedded multipliers. Xilinx has created an entire family of FPGAs optimized for the signal processing domain, which uses this type of operation intensively. Floating-point operations are becoming more common, but there have not been the same targeted architectures for floating-point as there are for fixed-point. Potential architectural modifications span a spectrum from the extremely coarse-grained to the extremely fine-grained. This paper explores ideas at three points in that spectrum. At the coarse-grained end, we evaluate the addition of IEEE 754 standard floating-point multiply-add units as an embedded block in the reconfigurable fabric. Many scientific algorithms need IEEE compliance and most of the algorithms explored can fully leverage floating-point multiply-adds. Since a fused multiply-add can often be smaller than a separate multiplier and adder, we chose the fused multiply-add as the “coarse-grained” enhancement. These coarse-grained units provide a dramatic gain in area and clock rate at the cost of dedicating significant silicon resources to hardware that is very domain specific. IEEE floating-point also has features that lend themselves to finer-grained approaches. The primary example is that floating-point arithmetic requires variable length and direction shifters. Before a floating-point addition, the mantissas of the operands must be shifted into alignment; after an addition, multiplication, or division, the result mantissa must be shifted again to renormalize it.
In highly optimized double-precision floating-point cores for FPGAs, the shifter accounts for almost a third of the logic for the adder and a quarter of the logic for the multiplier. Thus, better support for variable length shifters can noticeably improve floating-point performance. Two approaches are taken to optimize the FPGA hardware for variable length shifters. First, at the fine-grained end, we consider a minor tweak to the traditional CLB: the addition of a 4:1 multiplexer in parallel with the 4-LUT. This provides a surprisingly large clock rate improvement with a more modest area improvement and virtually no extra silicon area. Next, in the middle of the spectrum, we consider the addition of an embedded block to provide variable length shifting. This uses slightly more silicon area than the CLB modification and provides a correspondingly larger area saving, but only a modest improvement in clock rate.
VPR (Versatile Place and Route) is a widely used research tool for FPGA place and route. It uses simulated annealing with a timing-based semi-perimeter routing estimate for placement, followed by a timing-driven router. A modified version of VPR was used to place and route a set of double-precision floating-point benchmarks. Five benchmarks were used to test the performance: matrix multiply, matrix vector multiply, vector dot product, fast Fourier transform (FFT), and LU decomposition. Each benchmark was built in five versions: CLB only, embedded multiplier, embedded shifter, multiplexer, and embedded floating-point unit.
2. FLOATING-POINT NUMBERING SYSTEM AND OPERATIONS
The IEEE-754 standard specifies a representation for single- and double-precision floating-point numbers. It is currently the standard used for real numbers on most computing platforms. A floating-point number consists of three parts: a sign bit, an exponent, and a mantissa. The mantissa is stored as a fraction (f), which is combined with an implied one to form the mantissa (1.f); the value of the number is this mantissa multiplied by the base (two) raised to the exponent. The representations of single- and double-precision numbers are as follows.
A sign bit, an 8-bit exponent, and a 23-bit mantissa are specified for a single-precision floating-point number. Double-precision floating-point has a sign bit, an 11-bit exponent and 52-bit mantissa.
Figure 1. Single-precision IEEE floating-point number
Figure 2. Double-precision IEEE floating-point number
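The field layout described above can be checked with a short sketch. It is only an illustration, not part of the original report: the function names are made up for this example, and only normalized numbers are handled (zero, denormals, infinity and NaN are ignored).

```python
import struct

def decode_single(x):
    """Split a value stored as IEEE single precision into (sign, exponent, fraction)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF          # 8-bit biased exponent
    fraction = bits & ((1 << 23) - 1)       # 23-bit stored fraction f
    return sign, exponent, fraction

def decode_double(x):
    """Split a value stored as IEEE double precision into (sign, exponent, fraction)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF         # 11-bit biased exponent
    fraction = bits & ((1 << 52) - 1)       # 52-bit stored fraction f
    return sign, exponent, fraction

def value_from_fields(sign, exponent, fraction, frac_bits, bias):
    """Rebuild a normalized value as (-1)^sign * 1.f * 2^(exponent - bias)."""
    mantissa = 1 + fraction / (1 << frac_bits)   # implied one plus stored fraction
    return (-1) ** sign * mantissa * 2.0 ** (exponent - bias)

print(decode_single(0.15625))   # 0.15625 = 1.25 * 2^-3
print(decode_double(-2.5))      # -2.5 = -1.25 * 2^1
```

The exponent bias is 127 for single precision and 1023 for double precision, so `value_from_fields(*decode_double(-2.5), 52, 1023)` gives back -2.5.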
Floating-point operations include addition, subtraction, multiplication and division. For the sake of explanation we will go through floating-point multiplication only, using the single-precision format (exponent bias 127). The steps involved are given below,
1. Multiply the mantissas, including the implied 1, as 1.f × 1.f.
2. Subtract the bias from each stored exponent (E − 127) to get the true exponents.
3. Add the true exponents and re-bias the sum by adding 127.
4. Normalise the resulting mantissa to the standard form 1.f.
5. To normalise, shift the binary point left or right and add 1 to or subtract 1 from the exponent, respectively, for every position shifted.
6. The sign of the result is the exclusive OR of the two sign bits.
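The multiplication steps above can be sketched on the raw bit fields of single-precision numbers. This is a simplified illustration under stated assumptions, not a complete IEEE 754 multiplier: it handles only normalized operands and results, truncates instead of rounding, and the helper names are invented for this example.

```python
import struct

BIAS = 127        # single-precision exponent bias
FRAC_BITS = 23    # stored fraction bits

def to_fields(x):
    """Return (sign, biased exponent, fraction) of x stored as single precision."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def from_fields(sign, exponent, fraction):
    """Pack the three fields back into a single-precision value."""
    bits = (sign << 31) | (exponent << 23) | fraction
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def fp_multiply(a, b):
    sa, ea, fa = to_fields(a)
    sb, eb, fb = to_fields(b)
    # Multiply the mantissas 1.f * 1.f as 24-bit integers (value in [1, 4)).
    product = ((1 << FRAC_BITS) | fa) * ((1 << FRAC_BITS) | fb)
    # Remove the bias from both exponents, add them, and re-bias.
    exponent = (ea - BIAS) + (eb - BIAS) + BIAS
    # Normalise: if the product is >= 2, shift right once and bump the exponent.
    if product >> (2 * FRAC_BITS + 1):
        product >>= 1
        exponent += 1
    # Drop the implied one and the low bits (truncation, an assumption here).
    fraction = (product >> FRAC_BITS) & 0x7FFFFF
    return from_fields(sa ^ sb, exponent, fraction)

print(fp_multiply(1.5, 2.5))   # 3.75
print(fp_multiply(3.0, 3.0))   # 9.0, needs one normalisation shift
```

Because both inputs are normalized, the product of the mantissas lies between 1 and 4, so at most one right shift is needed here. Addition is different: the operands must first be aligned by a variable amount, which is where the variable length shifters discussed later become important.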
3. BASIC FPGA ARCHITECTURE
FPGAs have the following basic architectural components, which are needed to meet the requirements of designs; the description below follows the Spartan-3E family. Simply put, an FPGA is a two-dimensional array of logic realised using look-up tables inside CLBs. The basic architectural components are,
1. CONFIGURABLE LOGIC BLOCKS (CLBS)
3. INPUT/OUTPUT BLOCKS (IOBS)
4. THE STORAGE ELEMENT
5. DISTRIBUTED RAM
6. BLOCK RAM
7. DIGITAL CLOCK MANAGER (DCM) BLOCKS
8. DEDICATED MULTIPLIERS
3.1. CONFIGURABLE LOGIC BLOCKS (CLBS)
CLBs contain flexible Look-Up Tables (LUTs) that implement logic plus storage elements used as flip-flops or latches. CLBs perform a wide variety of logical functions as well as store data. The Configurable Logic Blocks (CLBs) constitute the main logic resource for implementing synchronous as well as combinatorial circuits. Each CLB contains four slices, and each slice contains two Look-Up Tables (LUTs) to implement logic and two dedicated storage elements that can be used as flip-flops or latches. The LUTs can be used as a 16x1 memory (RAM16) or as a 16-bit shift register (SRL16), and additional multiplexers and carry logic simplify wide logic and arithmetic functions. Most general-purpose logic in a design is automatically mapped to the slice resources in the CLBs. Each CLB is identical, and the Spartan-3E family CLB structure is identical to that for the Spartan-3 family.
Each CLB comprises four interconnected slices. These slices are grouped in pairs. Each pair is organized as a column with an independent carry chain. The left pair supports both logic and memory functions and its slices are called SLICEM. The right pair supports logic only and its slices are called SLICEL. Therefore half the LUTs support both logic and memory (including both RAM16 and SRL16 shift registers) while half support logic only, and the two types alternate throughout the array columns. The SLICEL reduces the size of the CLB and lowers the cost of the device, and can also provide a performance advantage over the SLICEM.
3.3. INPUT/OUTPUT BLOCKS (IOBS)
IOBs control the flow of data between the I/O pins and the internal logic of the device. Each IOB supports bidirectional data flow plus 3-state operation, and supports a variety of signal standards, including four high-performance differential standards. Double Data-Rate (DDR) registers are also included. The Input/Output Block (IOB) provides a programmable, unidirectional or bidirectional interface between a package pin and the FPGA’s internal logic. The unidirectional input-only block has a subset of the full IOB capabilities, so it has no connections or logic for an output path. The following paragraphs assume that any reference to output functionality does not apply to the input-only blocks. The number of input-only blocks varies with device size, but is never more than 25% of the total IOB count.
3.4. THE STORAGE ELEMENT
The storage element, which is programmable as either a D-type flip-flop or a level-sensitive transparent latch, provides a means for synchronizing data to a clock signal, among other uses. The storage elements in the top and bottom portions of the slice are called FFY and FFX, respectively. FFY has a fixed multiplexer on the D input selecting either the combinatorial output Y or the bypass signal BY. FFX selects between the combinatorial output X and the bypass signal BX.
3.5. DISTRIBUTED RAM
The LUTs in the SLICEM can be programmed as distributed RAM. This type of memory affords moderate amounts of data buffering anywhere along a data path. One SLICEM LUT stores 16 bits (RAM16). Multiple SLICEM LUTs can be combined in various ways to store larger amounts of data, including 16x4, 32x2, or 64x1 configurations in one CLB. The fifth and sixth address lines required for the 32-deep and 64-deep configurations, respectively, are implemented using the BX and BY inputs, which connect to the write enable logic for writing and the F5MUX and F6MUX for reading. Writing to distributed RAM is always synchronous to the SLICEM clock (WCLK for distributed RAM) and enabled by the SLICEM SR input which functions as the active-High Write Enable (WE). The read operation is asynchronous, and, therefore, during a write, the output initially reflects the old data at the address being written.
3.6. BLOCK RAM
Block RAM provides data storage in the form of 18-Kbit dual-port blocks. Spartan-3E devices incorporate 4 to 36 dedicated block RAMs, which are organized as dual-port configurable 18-Kbit blocks. Block RAM synchronously stores large amounts of data while distributed RAM, previously described, is better suited for buffering small amounts of data anywhere along signal paths. Each block RAM is configurable by setting the content’s initial values, default signal value of the output registers, port aspect ratios, and write modes. Block RAM can be used in single-port or dual-port modes.
3.7. DIGITAL CLOCK MANAGER (DCM) BLOCKS
The DCM blocks provide self-calibrating, fully digital solutions for distributing, delaying, multiplying, dividing, and phase-shifting clock signals. Digital Clock Managers (DCMs) provide flexible, complete control over clock frequency, phase shift and skew. To accomplish this, the DCM employs a Delay-Locked Loop (DLL), a fully digital control system that uses feedback to maintain clock signal characteristics with a high degree of precision despite normal variations in operating temperature and voltage. This section provides a fundamental description of the DCM.
3.8. DEDICATED MULTIPLIERS
The devices provide 4 to 36 dedicated multiplier blocks each. The multipliers are located together with the block RAM in one or two columns depending on device density. The multiplier blocks primarily perform two’s complement numerical multiplication but can also perform some less obvious applications, such as simple data storage and barrel shifting. Logic slices also implement efficient small multipliers and thereby supplement the dedicated multipliers. Each multiplier performs the principal operation P = A × B, where ‘A’ and ‘B’ are 18-bit words in two’s complement form, and ‘P’ is the full-precision 36-bit product, also in two’s complement form.
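The operation P = A × B on 18-bit two’s complement words can be modelled in a few lines. This is a behavioural sketch only; the function names are invented for this example and nothing is implied about the internal structure of the real multiplier block.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def mult18x18(a_bits, b_bits):
    """Return the 36-bit two's complement product of two 18-bit words."""
    product = to_signed(a_bits, 18) * to_signed(b_bits, 18)
    return product & ((1 << 36) - 1)

# 0x3FFFF is -1 as an 18-bit word, so the product below is -2.
print(to_signed(mult18x18(0x3FFFF, 2), 36))

# Barrel shifting: multiplying by 2^n shifts the other operand left by n.
print(bin(mult18x18(0b1011, 1 << 4)))
```

The second call shows why a multiplier can stand in for a barrel shifter: a one-hot operand 2^n turns the multiplication into a left shift by n. With 18-bit signed operands this sketch only covers n up to 16, since 2^17 is a negative number in two’s complement.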
Figure 3. Internal architecture
The elements described above are organized as shown in the figure. A ring of IOBs surrounds a regular array of CLBs. Each device has two columns of block RAM. Each RAM column consists of several 18-Kbit RAM blocks. Each block RAM is associated with a dedicated multiplier. The DCMs are positioned in the center with two at the top and two at the bottom of the device. The XC3S100E has only one DCM at the top and bottom, while the XC3S1200E and XC3S1600E add two DCMs in the middle of the left and right sides. The family features a rich network of traces that interconnect all five functional elements, transmitting signals among them. Each functional element has an associated switch matrix that permits multiple connections to the routing.
4. MODIFICATIONS ON ARCHITECTURE
In the previous sections we discussed the requirements of floating-point operations, which point to specialised architectural features that can improve performance. With the basic FPGA architecture in mind, we suggest the following modifications to the architecture.
1. EMBEDDED FPU
2. EMBEDDED SHIFTER
3. ADDITION OF 4:1 MULTIPLEXER
4.1. EMBEDDED FPU
A floating-point unit that performs specific floating-point operations directly in hardware is the most obvious addition to the architecture. This leads to the addition of an embedded floating-point unit. The embedded FPU implements a double-precision floating-point multiply-add operation as shown in Figure 4. The FPU can be configured to implement a double-precision multiply, add, or multiply-add operation. As seen in Figure 4, inputs and outputs are registered at the attachment point to the reconfigurable fabric, but the individual functional units (adder and multiplier) are pipelined internally (not shown). The mode input selects data paths to configure the unit as an adder, multiplier, or multiply-add. A first-in, first-out (FIFO) buffer is also provided in parallel with the multiplier so that the pipelines are balanced.
The size of the unit includes both the computation logic and the programmable routing. The amount of programmable routing in each floating-point unit depends on the height of the unit, since it must have vertical channels at the edge to interface to the rest of the device and a horizontal channel to pass data across the unit. Thus, the area of the embedded floating-point unit was increased to accommodate one vertical side of the floating-point unit (FPU) being filled with connection blocks (assumed to be as large as a “CLB”). This made the true area of the FPU dependent on the shape chosen. The latency of the FPUs was more difficult to estimate appropriately. The final testing of the architecture shows that the embedded FPU improves the results to a great extent.
Figure 4. Block diagram of FPU
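The three operating modes of the embedded FPU can be illustrated with a small behavioural model. Several things here are assumptions made for the sketch and are not taken from the report: the mode names, the choice of which operands feed the adder in add mode, and the modelling of the multiply-add as a single rounding of the exact result. Pipelining and the balancing FIFO are not modelled.

```python
from fractions import Fraction

def embedded_fpu(a, b, c, mode):
    """Behavioural model of a multiply / add / multiply-add unit on doubles."""
    if mode == "mul":
        return a * b                      # multiplier only
    if mode == "add":
        return a + c                      # adder only (operand choice assumed)
    if mode == "madd":
        # Compute a*b + c exactly, then round to double once at the end.
        exact = Fraction(a) * Fraction(b) + Fraction(c)
        return float(exact)
    raise ValueError("unknown mode: " + mode)

x = 1 + 2.0 ** -52
y = 1 - 2.0 ** -52
print(embedded_fpu(x, y, -1.0, "madd"))   # keeps the tiny exact result
print(x * y - 1.0)                        # two roundings lose it
```

The last two lines show the numerical difference between one rounding and two: the exact product is 1 - 2^-104, which a separate multiplier rounds to 1.0 before the subtraction, whereas the single-rounding model returns -2^-104.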
4.2. EMBEDDED SHIFTER
As discussed earlier, the mantissa has to be shifted for floating-point operations. The mantissa can be shifted either left or right, by any distance up to the full length of the mantissa. This means that up to a 24-bit shift can be required for IEEE single precision and up to 53 bits of shift can be required for IEEE double precision. However, in hardware, shifters tend to be implemented in powers of two. Therefore, shifters of length 32 and 64 bits were implemented for single- and double-precision floating-point operations, respectively. Even though floating-point operations only require a logical shift, the embedded shifter should be versatile enough to be used for a wider variety of applications. The embedded shifter that was used has five modes: shift left logical/arithmetic, rotate left, shift right logical, shift right arithmetic, and rotate right. During the shifting that accompanies the normalization of floating-point numbers it is necessary to calculate a sticky bit. The sticky bit is the result of the logical OR of all of the bits that are lost during a right shift, and it is an integral part of the shift operation. Adding the necessary logic to the shifter to compute the sticky bit increases the size of the shifter by less than 1%. Thus, the logic for the sticky bit calculation is included in each shifter. The sticky bit outputs are undefined when a shift other than a logical right shift is performed. The embedded shifter also has optional registers on the inputs and outputs of the data path. There are a total of 83 inputs and 66 outputs. The 83 inputs include 16 control bits, 64 data bits, and 3 register control bits (clock, reset, and enable). The 66 outputs include 64 data bits and 2 sticky bits (two independent sticky bit outputs are needed when the shifter is used as two independent 32-bit shifters). Internally, the combinational delay of the shifter was 1.52 ns, which is far from the limiting timing path. The total area of the shifter logic is 1.27 times the size of the CLB and its associated routing; however, this does not account for the area needed for the additional connections of the embedded shifter compared to the CLB or the area needed for connections to the routing structure.
Figure 5. Block diagram of embedded shifter
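The five shifter modes and the sticky bit can be sketched behaviourally as follows. The mode names and the function interface are hypothetical (the report does not give the encoding of the 16 control bits), and the sticky output is returned as None for modes where the report says it is undefined.

```python
def embedded_shift(value, amount, mode, width=64):
    """Behavioural model of the embedded shifter; returns (result, sticky)."""
    if not 0 <= amount < width:
        raise ValueError("shift amount must be in 0..width-1")
    mask = (1 << width) - 1
    value &= mask
    if mode == "sll":                      # shift left logical/arithmetic
        return (value << amount) & mask, None
    if mode == "rol":                      # rotate left
        return ((value << amount) | (value >> (width - amount))) & mask, None
    if mode == "srl":                      # shift right logical, with sticky bit
        lost = value & ((1 << amount) - 1)
        return value >> amount, 1 if lost else 0
    if mode == "sra":                      # shift right arithmetic (sign fill)
        result = value >> amount
        if value >> (width - 1):
            result |= mask & ~(mask >> amount)
        return result, None
    if mode == "ror":                      # rotate right
        return ((value >> amount) | (value << (width - amount))) & mask, None
    raise ValueError("unknown mode: " + mode)

print(embedded_shift(0b1011, 2, "srl"))    # low bits 11 are lost, sticky = 1
print(embedded_shift(0b1000, 2, "srl"))    # only zeros are lost, sticky = 0
```

Calling the function with `width=32` models one of the two independent 32-bit shifters that share the 64-bit block, which is why two sticky outputs are needed.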
4.3. ADDITION OF 4:1 MULTIPLEXER
The configurable logic block was described in detail in the section on FPGA architecture. We now apply a fine-grained approach and modify the internal structure of the CLB. Figure 6 shows a simplified version of one half of the baseline CLB (the lighter shaded blocks). The baseline CLB has two four-input LUTs (4-LUTs), two flip-flops, and some logic to support carry chains (the AND and XOR gates as well as the vertically oriented data path). This allows each half of the CLB to implement any four-input function, including one bit of an add or subtract, as well as some other more eclectic functions (e.g., the addition of either of two constants to an input based on a select bit). In modifying the CLB to better implement variable length shifters, two general principles were observed: minimize the impact on the architecture, and have no impact on general purpose routing. To accomplish these goals, the only change that was made to the CLB’s architecture was to add a single 4:1 multiplexer in parallel with each 4-LUT, as shown in Figure 6.
The multiplexer and LUT share the same four data inputs. The select lines for the multiplexer are the BX and BY inputs to the CLB. Since each CLB has two LUTs, each CLB would have two 4:1 multiplexers. Since there are only two select lines, both of the 4:1 multiplexers would have to share their select lines. However, for shifters and other large data path elements it is easy to find muxes with shared select inputs. The BX and BY inputs are normally used as the independent inputs for the D flip-flops, but are blocked in the new mux mode. However, the D flip-flops can still be driven by the LUTs in the CLB, and can be used as normal when not using the mux mode. It was determined that adding the 4:1 multiplexer increased the delay of the 4-LUT by only 1.83%. A 4:1 multiplexer was also laid out and simulated. The delay of the 4:1 multiplexer was 253 ps, which is less than the 270 ps that was determined for the 4-LUT. The addition of two 4:1 multiplexers to each CLB increases the size of the CLB by less than 0.5%.
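To see why 4:1 multiplexers with shared select lines suit a variable length shifter, the sketch below builds a logical right shifter out of mux stages, each of which consumes two bits of the shift amount. It is an illustrative model only: the function names are invented here, and the way the report's floating-point cores actually map shifters onto the multiplexers may differ.

```python
def mux4(d0, d1, d2, d3, s1, s0):
    """4:1 multiplexer: pick one of four data inputs using two select bits."""
    return (d0, d1, d2, d3)[(s1 << 1) | s0]

def shift_right_mux4(bits, amount):
    """Logical right shift of a bit list (index 0 = LSB) using 4:1 mux stages."""
    width = len(bits)
    step = 1
    while step < width:
        # Every mux in this stage shares the same two select bits.
        s0 = amount & 1
        s1 = (amount >> 1) & 1
        padded = bits + [0] * (3 * step)   # zeros shifted in from the top
        bits = [mux4(padded[i], padded[i + step], padded[i + 2 * step],
                     padded[i + 3 * step], s1, s0) for i in range(width)]
        amount >>= 2
        step *= 4
    return bits

def to_bits(value, width):
    return [(value >> i) & 1 for i in range(width)]

def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))

word = 0xDEADBEEF12345678
print(hex(from_bits(shift_right_mux4(to_bits(word, 64), 12))))
```

For a 64-bit word the loop runs three times (shifts of 0 to 3, then multiples of 4, then multiples of 16). A 4-LUT on its own can implement only a 2:1 multiplexer, because a 4:1 multiplexer needs six inputs (four data and two select), so the same shifter built from 2:1 stages would need six levels of logic.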
A simplified view of bottom half of CLB is given in the figure below. Along with that the basic structure of FPGA board after the discussed modifications is also given.
Figure 6. Simplified representation of bottom half of modified CLB showing addition
Figure 7. Basic architecture
Figure 8. Embedded shifters added
Figure 9. Embedded FPU replacing multiplier
5. TESTING METHODOLOGY
The VPR (Versatile Place and Route) tool is used to test the feasibility of the modifications. It uses simulated annealing with a timing-based semi-perimeter routing estimate for placement, and a timing-driven detailed router. In previous versions, VPR supported only three types of circuit elements: input pads, output pads, and CLBs. To test the proposed architectural modifications and to incorporate the necessary architectural elements, VPR was modified to allow the use of embedded block units of parameterizable size. These embedded blocks have parameterizable heights and widths that are quantized by the size of the CLB. Horizontal routing is allowed to cross the embedded units, but vertical routing only exists at the periphery of the embedded blocks. The regular routing structure that existed in the original VPR was maintained. Additionally, a fast carry-chain was incorporated into the existing CLBs to ensure a reasonable comparison with state-of-the-art devices.
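The idea behind a semi-perimeter routing estimate and simulated-annealing placement can be sketched as below. This is a generic textbook version written for illustration, not VPR's actual cost function (which also includes timing terms and other corrections); all names are hypothetical.

```python
import math
import random

def semi_perimeter(pins):
    """Half the perimeter of the bounding box around a net's pin positions."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def placement_cost(nets, placement):
    """Sum the semi-perimeter estimate over all nets.

    nets: list of nets, each a list of block names
    placement: dict mapping block name -> (x, y) grid position
    """
    return sum(semi_perimeter([placement[block] for block in net]) for net in nets)

def accept_move(delta_cost, temperature, rng=random):
    """Annealing rule: always accept improvements, sometimes accept worse moves."""
    if delta_cost <= 0:
        return True
    return rng.random() < math.exp(-delta_cost / temperature)

nets = [["a", "b"], ["b", "c"]]
placement = {"a": (0, 0), "b": (2, 1), "c": (2, 5)}
print(placement_cost(nets, placement))
```

A placer of this kind repeatedly proposes swapping two blocks, recomputes the cost, and calls `accept_move` with a temperature that is lowered gradually, so that early moves explore freely and late moves only refine the placement.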
Five benchmarks were used to test the feasibility of the proposed architectural modification. They were matrix multiply, matrix vector multiply, vector dot product, FFT, and a LU decomposition data path. All of the benchmarks use double-precision floating-point addition and multiplication. The LU decomposition also includes floating-point division, which must be implemented in the reconfigurable fabric for all architectures. Each benchmark was sized to be of comparable, though not identical, complexity and the devices were created to be large enough to accommodate the largest benchmark without resource sharing. To explore the impacts of our proposed modifications, the following five versions of each benchmark were created.
5.3. THE FIVE VERSIONS USED
• CLB Only: All floating-point operations are performed using the CLBs. The only other units in this version are embedded RAMs and input/output (I/O).
• Embedded Multiplier: This version adds 18-bit embedded multipliers to the CLB Only version. Floating-point multiplication uses the CLBs and the embedded multipliers. Floating-point addition and division are performed using only the CLBs. This version is similar to the Xilinx Virtex-II Pro family of FPGAs, and thus is representative of what is currently available in commercial FPGAs.
• Embedded Shifter: This version further extends the embedded multiplier version with embedded variable length shifters that can be configured as a single 64-bit variable length shifter or two 32-bit variable length shifters. Floating-point multiplication uses the CLBs, embedded multipliers, and embedded shifters. Floating-point addition and division are performed using the CLBs and Embedded shifters (for normalization shifting in both cases).
• Multiplexer: While the same embedded RAMs, embedded multipliers, and I/O of the embedded multiplier version are used, the CLBs have been slightly modified to include a 4:1 multiplexer in parallel with the LUTs. Floating-point multiplication uses the modified CLBs and the embedded multipliers. Floating-point addition and division are performed using only the modified CLBs.
• Embedded FPU: Besides the CLBs, embedded RAMs, and I/O of the CLB ONLY version, this version includes embedded floating-point units (FPUs). Each FPU performs a double-precision floating-point multiply-add. Other floating-point operations are implemented using the general reconfigurable resources.
The floating-point benchmarks were written in the hardware description language VHDL. At the high level, none of the benchmarks changed from one version to the other. Instead, the back-end tools were modified to remap specific blocks to the new technology. For example, for the embedded FPU, a special, recognizable macro was inserted in place of the electronic data interchange format (EDIF) instance of the floating-point unit built from CLBs. This macro was identified as the design entered the VPR flow and was replaced with the embedded floating-point unit. Similar techniques were used for the embedded shifter and multiplexer design points.
The addition of the carry-chain was necessary to make a reasonable comparison between the different benchmark versions. Fast carry-chains were used, and VPR was modified to support them. Along with the two 4-input function generators, two storage elements, and arithmetic logic gates, each CLB has a fast carry chain affecting two output bits. The carry-out of the CLB exits through the top of the CLB and enters the carry-in of the CLB above as shown in Fig. 8. Each column of CLBs has one carry chain that starts at the bottom of the column of CLBs and ends at the top of the column. Since each CLB has logic for two output bits, there are two opportunities in each CLB to get on or off of the carry-chain.
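The carry chain described above, with two output bits per CLB and the carry passing upward through a column, can be modelled as a simple ripple adder. This is a behavioural sketch with invented function names; it says nothing about the speed of the real dedicated carry path.

```python
def clb_add(a2, b2, carry_in):
    """One CLB: add two 2-bit slices plus carry-in; return (2-bit sum, carry-out)."""
    total = a2 + b2 + carry_in
    return total & 0b11, total >> 2

def column_add(a, b, width):
    """Add two width-bit numbers using a column of width/2 CLBs."""
    if width % 2:
        raise ValueError("width must be even: each CLB handles two bits")
    result, carry = 0, 0
    for clb in range(width // 2):          # bottom of the column first
        a2 = (a >> (2 * clb)) & 0b11
        b2 = (b >> (2 * clb)) & 0b11
        s, carry = clb_add(a2, b2, carry)  # carry-out feeds the CLB above
        result |= s << (2 * clb)
    return result, carry

print(column_add(1234, 4321, 16))
print(column_add(0xFFFF, 1, 16))           # carry ripples out of the top CLB
```

A 16-bit addition therefore occupies eight vertically adjacent CLBs in one column, and the final carry leaves the top of the last CLB.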
6. PERFORMANCE STUDY
Figure 10. Benchmark clock rate
Figure 11. Benchmark area
Figure 12. Benchmark track count
6.1. EMBEDDED FPU
The embedded FPU had the highest clock rate, smallest area, and lowest track count of all the architectures. By adding embedded FPUs there was an average clock rate increase of 33.4%, average area reduction of 54.2%, and average track count reduction of 6.83% from the EMBEDDED MULTIPLIER to the EMBEDDED FPU versions. To determine the penalty of using an FPGA with embedded FPUs for non-floating-point computations, the percent of the chip that was used for each component was calculated. For the chosen FPU configuration, the FPUs consumed 17.6% of the chip. This is an enormous amount of “wasted” area for many applications and would clearly be received poorly by that community; however, this generally mirrors the introduction of the PowerPC to the Xilinx architecture.
6.2. EMBEDDED SHIFTERS
Even with a conservative size estimate, adding embedded shifters to modern FPGAs significantly reduced circuit size. Adding embedded shifters increased the average clock rate by 3.3% and reduced the average area by 14.6% from the EMBEDDED MULTIPLIER to the EMBEDDED SHIFTER versions. Even though there was an average increase in the track count of 16.5%, a track count of 58 is well within the number of routing tracks on current FPGAs. Only the floating-point operations were optimized for the embedded shifters—the control and remainder of the data path remained unchanged. If we consider only the floating-point units, the embedded shifters reduced the number of CLBs for each double-precision floating-point addition by 31% while requiring only two embedded shifters. For the double-precision floating-point multiplication the number of CLBs decreased by 22% and required two embedded shifters.
6.3. MODIFIED CLBS WITH ADDITIONAL 4:1 MULTIPLEXERS
The small modification to the CLB architecture showed surprising improvements. Even though only the floating-point cores were optimized with the 4:1 multiplexers, there was an average clock rate increase of 11.6% and average area reduction of 7.3% from the EMBEDDED MULTIPLIER to the MULTIPLEXER versions. The addition of the multiplexer reduced the size of the double-precision floating-point adder by 17% and reduced the size of the double-precision multiplier by 10%. Even though there was an average increase in the track count of 16.1%, a track count of 58 is well within the number of routing tracks on current FPGAs.
Three architectural modifications that make floating-point operations more efficient have been demonstrated. The modifications, namely adding complete double-precision floating-point multiply-add units, adding embedded shifters, and adding a 4:1 multiplexer in parallel with the LUT, each provide an area and clock rate benefit over traditional approaches, with different tradeoffs. They represent three levels of approach: coarse-grained, fine-grained, and an intermediate level between the two. At the most coarse-grained end of the spectrum, the embedded FPU is a major architectural change that consumes significant chip area but provides a dramatic advantage. The embedded FPUs provided an average reduction in area of 54.2% compared to an FPGA enhanced with embedded 18-bit × 18-bit multipliers, in addition to an average speed improvement of 33.4% over using the embedded multipliers. There is even an average reduction of 6.8% in the number of routing tracks required. At the intermediate level, the embedded shifter provided an average area savings of 14.3% and an average clock rate increase of 3.3%. At the finest-grained end of the spectrum, adding a 4:1 multiplexer in the CLBs provided an average area savings of 7.3% while achieving an average speed increase of 11.6%. The surprising fact is that the smaller change to the FPGA architecture amounts to the bigger net “win.”
1. M. J. Beauchamp, S. Hauck, K. D. Underwood, and K. S. Hemmert, “Architectural modifications to enhance the floating-point performance of FPGAs,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 2, Feb. 2008, pp. 177–187.
2. K. S. Hemmert and K. D. Underwood, “An analysis of the double precision floating-point FFT on FPGAs,” in Proc. IEEE Symp. FPGA Custom Comput. Mach., 2005, pp. 171–180.
3. K. S. Hemmert and K. D. Underwood, “Open source high performance floating-point modules,” in Proc. IEEE Symp. FPGAs Custom Comput. Mach., 2006, pp. 349–350.
science projects buddy|
Active In SP
Joined: Dec 2010
26-12-2010, 10:25 AM
FT-IR stands for Fourier Transform Infrared, the preferred method of infrared spectroscopy. In infrared spectroscopy, IR radiation is passed through a sample. Some of the infrared radiation is absorbed by the sample and some of it passes through (is transmitted). The resulting spectrum represents the molecular absorption and transmission, creating a molecular fingerprint of the sample. Like a fingerprint, no two unique molecular structures produce the same infrared spectrum. Many of the volatile organic compounds of interest to the environmental, military, and industrial hygiene communities have characteristic spectral features that can be used for identification and quantification. FTIR is a method of obtaining infrared spectra by first collecting an interferogram of a sample signal using an interferometer, then performing a Fourier transform on the interferogram to obtain the spectrum. An FTIR spectrometer is a spectral instrument that collects and digitizes the interferogram, performs the Fourier transform and displays the spectrum. The resulting spectrum is then processed by the spectral signal processing software, which also controls the motion of the moving mirror in the system.
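The step from interferogram to spectrum can be illustrated with a toy example. This is a minimal sketch under strong simplifying assumptions: the interferogram is synthetic (a sum of cosines, one per spectral line), a plain discrete Fourier transform is used, and practical steps such as apodization and phase correction are left out. The function names are invented for the example.

```python
import cmath
import math

def synthetic_interferogram(n, lines):
    """Build n samples from spectral lines given as (frequency bin, amplitude)."""
    return [sum(amp * math.cos(2 * math.pi * k * t / n) for k, amp in lines)
            for t in range(n)]

def spectrum(samples):
    """Magnitude of the discrete Fourier transform for bins 0..n/2."""
    n = len(samples)
    result = []
    for k in range(n // 2 + 1):
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        result.append(abs(acc) / n)
    return result

signal = synthetic_interferogram(64, [(5, 1.0), (12, 0.5)])
spec = spectrum(signal)
peaks = sorted(range(len(spec)), key=lambda k: spec[k], reverse=True)[:2]
print(peaks)   # the two strongest bins are the two spectral lines
```

Each cosine component of the interferogram shows up as a peak at its own bin in the spectrum, with a height proportional to its amplitude. A real system would use a fast Fourier transform on the DSP rather than this direct summation.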
A rapid scan spectrometer had been invented by Gao Zhan and Xiang Libin, but the spectrometer did not contain signal processing part. A scanning FTIR system was designed by Harig R. to detect airborne pollutants, but spectral signal processing software of the system was based on PC platforms in which a PC was used to control the whole electronic system. However, in some practical applications, PC based processors cannot be used and hence the SSPS has to be designed with embedded processors. Traditional PC based FTIR spectral signal processing software is unable to meet the demands in real-time applications. An embedded processor, PC104, instead of PC, is used in this system to control spectrometer. A new kind of spectral signal processing module, which uses a high-speed floating point Digital Signal Processor (DSP) as its central processing unit, is devised for this spectrometer. This module is called Spectral Signal Processing System (SSPS). Details on hardware and software design solutions of this SSPS module are discussed in later sections.
Design of a High-speed Spectral Signal Processing System with a Floating-point1.docx (Size: 462.69 KB / Downloads: 50)
2. WHY INFRARED SPECTROSCOPY?
Infrared spectroscopy has been a workhorse technique for materials analysis in the laboratory for over seventy years. An infrared spectrum represents a fingerprint of a sample with absorption peaks which correspond to the frequencies of vibrations between the bonds of the atoms making up the material. Because each different material is a unique combination of atoms, no two compounds produce the exact same infrared spectrum. Therefore, infrared spectroscopy can result in a positive identification (qualitative analysis) of every different kind of material. In addition, the size of the peaks in the spectrum is a direct indication of the amount of material present. With modern software algorithms, infrared is an excellent tool for quantitative analysis.
FT-IR can provide the following information:
• It can identify unknown materials
• It can determine the quality or consistency of a sample
• It can determine the amounts of the components in a mixture
3. OLDER TECHNOLOGY
The original infrared instruments were of the dispersive type. These instruments separated the individual frequencies of energy emitted from the infrared source. This was accomplished by the use of a prism or grating. An infrared prism works exactly the same as a visible prism which separates visible light into its colors (frequencies). A grating is a more modern dispersive element which better separates the frequencies of infrared energy. The detector measures the amount of energy at each frequency which has passed through the sample. This results in a spectrum which is a plot of intensity vs. frequency.
Fourier transform infrared spectroscopy is preferred over dispersive or filter methods of infrared spectral analysis for several reasons:
• It is a non-destructive technique
• It provides a precise measurement method which requires no external calibration
• It is fast, collecting a scan every second
• It is sensitive: one-second scans can be co-added to average out random noise
• It has greater optical throughput
• It is mechanically simple with only one moving part
4. WHY FT-IR?
Fourier Transform Infrared (FT-IR) spectrometry was developed in order to overcome the limitations encountered with dispersive instruments. The main difficulty was the slow scanning process. A method for measuring all of the infrared frequencies simultaneously, rather than individually, was needed. A solution was developed which employed a very simple optical device called an interferometer. The interferometer produces a unique type of signal which has all of the infrared frequencies “encoded” into it. The signal can be measured very quickly, usually on the order of one second or so. Thus, the time element per sample is reduced to a matter of a few seconds rather than several minutes. Most interferometers employ a beam splitter which takes the incoming infrared beam and divides it into two optical beams. One beam reflects off of a flat mirror which is fixed in place. The other beam reflects off of a flat mirror which is on a mechanism which allows this mirror to move a very short distance (typically a few millimeters) away from the beam splitter. The two beams reflect off of their respective mirrors and are recombined when they meet back at the beam splitter.
Because the path that one beam travels has a fixed length while the other changes constantly as its mirror moves, the signal which exits the interferometer is the result of these two beams “interfering” with each other. The resulting signal is called an interferogram. It has the unique property that every data point (a function of the moving mirror position) which makes up the signal carries information about every infrared frequency which comes from the source. This means that as the interferogram is measured, all frequencies are being measured simultaneously. Thus, the use of the interferometer results in extremely fast measurements. Because the analyst requires a frequency spectrum (a plot of the intensity at each individual frequency) in order to make an identification, the measured interferogram signal cannot be interpreted directly. A means of “decoding” the individual frequencies is required. This can be accomplished via a well-known mathematical technique called the Fourier transformation. This transformation is performed by the computer, which then presents the user with the desired spectral information for analysis.
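The “decoding” step can be illustrated with a minimal sketch. It assumes a synthetic interferogram built from two cosine components (the bin positions 5 and 12 and their strengths are made-up values, and the function names are hypothetical) and uses a plain discrete Fourier transform rather than the FFT routine of any particular instrument.

```python
import cmath
import math

def dft_magnitude(samples):
    """Magnitude of the discrete Fourier transform for bins 0 .. N/2 - 1."""
    n = len(samples)
    return [
        abs(sum(samples[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n)))
        for m in range(n // 2)
    ]

def synthetic_interferogram(n_points, lines):
    """Sum of cosines; each (bin, strength) pair stands for one spectral line."""
    return [
        sum(s * math.cos(2 * math.pi * b * k / n_points) for b, s in lines)
        for k in range(n_points)
    ]

# Every sample of the interferogram mixes both "frequencies" together.
interferogram = synthetic_interferogram(64, [(5, 1.0), (12, 0.5)])

# The transform separates them again into two distinct peaks.
spectrum = dft_magnitude(interferogram)
peaks = sorted(sorted(range(len(spectrum)), key=lambda m: spectrum[m], reverse=True)[:2])
print(peaks)
```

The two strongest bins of the computed spectrum are the two components that were mixed into the interferogram, with heights in the same 2:1 ratio as the assumed line strengths.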
5. THE SAMPLE ANALYSIS PROCESS
Figure 2: The sample analysis process
The normal instrumental process is as follows:
1. The Source: Infrared energy is emitted from a glowing black-body source. This beam passes through an aperture which controls the amount of energy presented to the sample (and, ultimately, to the detector).
2. The Interferometer: The beam enters the interferometer where the “spectral encoding” takes place. The resulting interferogram signal then exits the interferometer.
3. The Sample: The beam enters the sample compartment where it is transmitted through or reflected off of the surface of the sample, depending on the type of analysis being accomplished. This is where specific frequencies of energy, which are uniquely characteristic of the sample, are absorbed.
4. The Detector: The beam finally passes to the detector for final measurement. The detectors used are specially designed to measure the special interferogram signal.
5. The Computer: The measured signal is digitized and sent to the computer where the Fourier transformation takes place. The final infrared spectrum is then presented to the user for interpretation and any further manipulation.
6. THE INTERFEROMETER
The interferometer used is the Michelson interferometer. It is the most common configuration for optical interferometry. An interference pattern is produced by splitting a beam of light into two paths, bouncing the beams back and recombining them. The different paths may be of different lengths or be composed of different materials to create alternating interference fringes on a back detector.
Figure 3: Path of light in Michelson interferometer
A Michelson interferometer consists of two highly polished mirrors, a beam splitter, and a source that emits monochromatic light towards the beam splitter. The beam splitter splits the wave into two. One part travels towards the moving mirror and the other towards the fixed mirror. After reflecting from these mirrors, both beams recombine at the beam splitter to produce an interference pattern (assuming proper alignment) visible to the observer at the detector point.
There are two paths from the (light) source to the detector. One reflects off the semi-transparent mirror, goes to the top mirror, reflects back, and passes through the semi-transparent mirror to the detector. The other first passes through the semi-transparent mirror to the mirror on the right, reflects back to the semi-transparent mirror, and is then reflected by it into the detector. The principle is that when a parallel beam of light from a monochromatic extended source is incident on a half-silvered glass plate, it is divided into two beams of equal intensity by partial reflection and transmission.
Figure 4: The interference pattern
Both beams are coherent. In this experiment the coherent waves are thus produced by division of amplitude. If the two paths differ by a whole number (including 0) of wavelengths, there is constructive interference and a strong signal at the detector. If they differ by a whole number plus one half of wavelengths (e.g., 0.5, 1.5, 2.5 ...), there is destructive interference and a weak signal. This might appear at first sight to violate the principle of conservation of energy. However, energy is conserved, because the energy is redistributed: the energy missing at the sites of destructive interference appears at the sites of constructive interference. The effect of the interference is to alter the share of the light which heads for the detector and the remainder which heads back in the direction of the source.
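The dependence of the detector signal on path difference can be sketched as follows. This is an idealized model that assumes a monochromatic source and a lossless 50/50 beam splitter; the function names are hypothetical, and path difference and wavelength only need to share the same unit.

```python
import math

def detector_intensity(path_difference, wavelength, source_intensity=1.0):
    """Two-beam interference: intensity reaching the detector (ideal 50/50 splitter)."""
    phase = 2 * math.pi * path_difference / wavelength
    return 0.5 * source_intensity * (1 + math.cos(phase))

def returned_intensity(path_difference, wavelength, source_intensity=1.0):
    """Intensity heading back towards the source, so that total energy is conserved."""
    return source_intensity - detector_intensity(path_difference, wavelength, source_intensity)

# Path differences expressed in wavelengths: whole numbers give a strong signal,
# half-integer values give a weak one.
for wavelengths in (0.0, 0.5, 1.0, 1.5):
    print(wavelengths, round(detector_intensity(wavelengths, 1.0), 6))
```

In this model the detector and source-side intensities always add up to the source intensity, which is the energy bookkeeping described above.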
The interferogram belongs in the length domain. The Fourier transform (FT) inverts the dimension, so the FT of the interferogram belongs in the reciprocal length domain, that is, the wavenumber domain. The spectral resolution in wavenumbers (cm⁻¹) is equal to the reciprocal of the maximum retardation in cm. Using corner-cube mirrors in place of the flat mirrors is helpful because an outgoing ray from a corner-cube mirror is parallel to the incoming ray, regardless of the orientation of the mirror about axes perpendicular to the axis of the light beam.
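The resolution rule can be turned into a small worked example. The sketch below uses hypothetical function names, and the numeric values are illustrative rather than taken from the instrument described here. It also assumes a flat-mirror Michelson layout, in which the retardation is twice the mirror displacement.

```python
def spectral_resolution_cm1(max_retardation_cm):
    """Resolution in wavenumbers (cm^-1) = 1 / maximum retardation (cm)."""
    if max_retardation_cm <= 0:
        raise ValueError("maximum retardation must be positive")
    return 1.0 / max_retardation_cm

def required_mirror_travel_cm(resolution_cm1):
    """Mirror travel from zero path difference needed for a target resolution.

    Assumes the beam crosses the moving arm twice, so the retardation is
    twice the mirror displacement.
    """
    if resolution_cm1 <= 0:
        raise ValueError("resolution must be positive")
    max_retardation_cm = 1.0 / resolution_cm1
    return max_retardation_cm / 2.0

print(spectral_resolution_cm1(0.25))    # a retardation of 0.25 cm gives 4 cm^-1
print(required_mirror_travel_cm(4.0))   # 4 cm^-1 needs 0.125 cm of mirror travel
```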
7. CHALLENGES OF HIGH-SPEED DSP DESIGN
Today’s digital signal processors (DSPs) typically run at a 1 GHz internal clock rate, while the signals they transmit to and receive from external devices operate at rates higher than 200 MHz. These fast-switching signals generate a considerable amount of noise and radiation, which degrades system performance and creates electromagnetic interference (EMI) problems that make it difficult to pass the tests required for certification by the Federal Communications Commission (FCC). Good high-speed system design requires robust power sources with low switching noise under dynamic loading conditions, minimum crosstalk between high-speed signal traces, high- and low-frequency decoupling techniques, and good signal integrity with minimum transmission line effects. This document provides recommendations for meeting the many challenges of high-speed DSP system design.
8. SSPS SYSTEM OVERVIEW
The SSPS (Spectral Signal Processing System) needs to store large amounts of interferogram data, run effective processing and identification algorithms, process the stored interferogram data of the FTIR system in real time, and send interferogram data and results to the host computer or controller, either for further processing or to control the electronic and mechanical parts (such as the moving mirror) effectively.
SSPS needs two main abilities:
1. The ability to process interferogram data with high speed.
2. The ability to receive and send interferogram data with high speed.
9. HARDWARE DESIGN OF SSPS
The hardware architecture of the SSPS platform is based on a high-speed, floating-point DSP processor. The hardware architecture is divided into the following main parts.
1. Signal Processing Block
• TMS320C6713 DSP processor
2. Data Transmission Block
• Multichannel Buffered Serial Port (McBSP)
• Dual-port RAM
3. Memory Part
• Dual-port RAM
• Synchronous Dynamic RAM (SDRAM)
• Synchronous Pipelined Cache RAM (SPC RAM)
4. Other devices
• +5V power supply
• Complex Programmable Logic Device (CPLD)
Figure 6: The hardware architecture of SSPS
9.1 SIGNAL PROCESSING BLOCK
The signal processing block automatically analyzes the data acquired by the FTIR spectrometer using real-time algorithms. The TMS320C6713 (C6713), a high-speed floating-point DSP processor, is chosen as the central processing unit of the SSPS platform. In this system, the floating-point data sampled by the FTIR spectrometer is transmitted to the C6713 through dual-port RAM and stored in the memory part. The C6713 operates at 225MHz and delivers up to 1350 million floating-point operations per second. First, the interferogram is transformed into a spectrum via the Fast Fourier Transform (FFT). The spectrum is then processed with further correction steps, and the result is obtained using a pattern recognition algorithm. All of the real-time algorithms, including FFT, digital filtering, baseline correction, wavenumber correction, phase correction, and pattern recognition, are implemented in floating-point mode.
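The quoted clock rate and floating-point rate can be related with a little arithmetic. The sketch below derives the number of floating-point operations per clock cycle and a best-case time for one FFT. The FFT length of 4096 points is an assumption made purely for illustration, the 5·N·log2(N) operation count is the usual textbook estimate for a radix-2 complex FFT, and the function names are hypothetical. Real code does not sustain the peak rate, so the time is a lower bound only.

```python
CLOCK_HZ = 225e6      # C6713 clock rate quoted in the text
PEAK_FLOPS = 1350e6   # floating-point rate quoted in the text

def flops_per_cycle(peak_flops=PEAK_FLOPS, clock_hz=CLOCK_HZ):
    """Floating-point operations issued per clock cycle at the quoted rate."""
    return peak_flops / clock_hz

def radix2_fft_flops(n_points):
    """Textbook estimate of real operations in a radix-2 complex FFT: 5 * N * log2(N)."""
    if n_points < 2 or n_points & (n_points - 1):
        raise ValueError("n_points must be a power of two")
    stages = n_points.bit_length() - 1   # log2(n_points) for a power of two
    return 5 * n_points * stages

def best_case_fft_time_s(n_points, peak_flops=PEAK_FLOPS):
    """Lower bound on FFT time if every cycle ran at the quoted rate."""
    return radix2_fft_flops(n_points) / peak_flops

print(flops_per_cycle())             # operations per cycle
print(radix2_fft_flops(4096))        # estimated operations for an assumed 4096-point FFT
print(best_case_fft_time_s(4096))    # best-case seconds for that FFT
```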
9.2 DATA TRANSMISSION BLOCK
Although the interfaces are almost completely physically embedded in the DSP, the interface capabilities of the module deserve special mention. There are two types of interface for communication between the SSPS and the controller: the multichannel buffered serial port (McBSP) and dual-port RAM. Two McBSPs are used for communication between the SSPS and the host computer. In this system, the McBSPs are configured to the Universal Asynchronous Receiver/Transmitter (UART) standard, a well-established protocol for the exchange of serial data.
To interface a UART to the McBSP in serial port mode, the UART’s transmit data line is connected to both the data input and the frame synchronization input on the McBSP. This is because the UART serial data line contains both framing and data information. The UART’s receive data line is connected to the data output of the McBSP. Figure 7 illustrates the UART to McBSP connection. In order to interface the McBSP to the RS-232 port of the host computer, the data signal needs to go through an RS-232 converter that translates from CMOS logic levels to RS-232 logic levels. A MAX488 device is used in the SSPS as the logic-level converter. The SSPS gets the interferogram data from the spectrometer through the dual-port RAM, which is therefore not only a memory device but also a data transmission interface in the SSPS.
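The statement that the UART line carries both framing and data can be made concrete with a short sketch. It assumes the common 8N1 format (one start bit, eight data bits sent least significant bit first, no parity, one stop bit). The text does not state which format the SSPS uses, so this is an assumption, and the function names are hypothetical.

```python
def uart_frame_8n1(byte):
    """Bits placed on the line for one byte: start bit, 8 data bits (LSB first), stop bit."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must be in the range 0..255")
    data_bits = [(byte >> i) & 1 for i in range(8)]   # least significant bit first
    return [0] + data_bits + [1]                      # idle line is 1, so the start bit is a 1 -> 0 edge

def decode_uart_frame_8n1(bits):
    """Recover the byte from a 10-bit frame, checking the start and stop bits."""
    if len(bits) != 10 or bits[0] != 0 or bits[9] != 1:
        raise ValueError("framing error")
    return sum(bit << i for i, bit in enumerate(bits[1:9]))

frame = uart_frame_8n1(0x41)
print(frame)
print(hex(decode_uart_frame_8n1(frame)))
```

The leading 0 is the start bit. Its falling edge marks the beginning of each character on the same wire that carries the data bits, which is why that single line is wired to both the data input and the frame synchronization input.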
9.3 MEMORY PART
The memory part stores the code and data blocks of the SSPS. In addition to the internal memory of the TMS320C6713 DSP, there are four kinds of external memory devices:
1. Synchronous Dynamic RAM (SDRAM)
SDRAM is dynamic random access memory (DRAM) with a synchronous interface, meaning that it waits for a clock signal before responding to control inputs and is therefore synchronized with the computer's system bus.
2. Synchronous-Pipelined Cache RAM
Cache memory (also called buffer memory) is local memory that reduces waiting times for information stored in the RAM (Random Access Memory).
3. NOR Flash Memory
NOR flash memory is a type of non-volatile storage technology that does not require power to retain data.
4. Dual-port RAM
Dual-port RAM can simultaneously read and write different memory cells at different addresses.
The clock for the four synchronous devices is supplied by the internal Phase-Locked Loop (PLL) of the DSP. All of the interferogram and spectrum data are stored and processed in floating-point format. The original interferogram sampled by the FTIR spectrometer is stored in the dual-port RAM in floating-point format. The dual-port RAM is shared by the SSPS and the ADC controller, so both the SSPS and the spectrometer can read and write data in it.
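Storing floating-point samples in a shared memory means that both sides must agree on how a number maps to raw memory words. The sketch below assumes IEEE 754 single precision, 32-bit words, and little-endian byte order. The word width and byte order of the actual dual-port RAM are not given in the text, and the function names are hypothetical.

```python
import struct

def floats_to_words(values):
    """Encode floats as 32-bit words (IEEE 754 single precision, little-endian assumed)."""
    return [struct.unpack("<I", struct.pack("<f", v))[0] for v in values]

def words_to_floats(words):
    """Decode 32-bit words back into floats using the same assumed layout."""
    return [struct.unpack("<f", struct.pack("<I", w))[0] for w in words]

samples = [1.0, -2.0, 0.5]                  # made-up interferogram samples
words = floats_to_words(samples)            # what the writer would place in shared memory
print([hex(w) for w in words])
print(words_to_floats(words))               # what the reader would recover
```

If the two processors disagreed on byte order or precision, the reader would recover different numbers from the same words, so the layout has to be fixed as part of the interface.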
9.4 OTHER DEVICES
The SSPS operates from a single +5V external power supply connected to the main power input. Internally, the +5V input is converted to +1.26V and +3.3V using separate voltage regulators. The +1.26V supply is used for the DSP core, while the +3.3V supply is used for the I/O buffers of the DSP and all other chips on the board. A Complex Programmable Logic Device (CPLD) implements functionality specific to the SSPS. The CPLD has a register-based user interface that lets the user configure the board by reading and writing its registers.
10. SOFTWARE DESIGN
10.1 SOFTWARE ARCHITECTURE
The system software can be divided into four parts: data transmission software, signal processing software, the chip support library, and the register configuration program. The data transmission software handles the data to be transmitted to the controller or host computer and monitors the data received from the controller or the dual-port RAM. The chip support library is a collection of functions, macros, and symbols used to configure and operate the on-chip peripheral modules. The signal processing software processes the interferogram and transforms it into a spectrum through methods such as digital filtering, apodization, baseline correction, and pattern recognition. The register configuration program configures the registers of the DSP processor and of the complex programmable logic device.
Figure 8: Software architecture of SSPS
10.2 SIGNAL PROCESSING METHOD
Interferogram data has to be processed in real time in order to get a result automatically. Figure 9 illustrates the signal processing layout of SSPS.
Figure 9: Signal processing method of SSPS
The different steps in signal processing are the following.
1. FOURIER TRANSFORMATION
First, once the interferograms have been collected, signal-averaged, and stored, the next step is usually the transformation of the data into a spectrum. The interferogram and the spectrum should be regarded as a complex Fourier pair. The complex inverse Fourier transform of the spectrum, which produces the interferogram, is shown in the following equation, where B(ν) refers to the spectrum as a function of wavenumber ν and I(δ) refers to the interferogram as a function of retardation δ:
$$I(\delta) = \int_{-\infty}^{+\infty} B(\nu)\, e^{i 2\pi \nu \delta}\, d\nu$$
2. PHASE CORRECTION
Ideally, the interferogram is symmetrical about the zero path difference. Due to optical, electronic, or sampling effects, an additional term called phase error has to be added to the phase angle. As a result, sine components are introduced into the interferogram. The process of removing the phase error, which is called phase correction, is done in SSPS. Phase correction is performed by removing sine components from an interferogram. Usually part of the double-sided interferogram is used for correction.
3. APODIZATION
Because the interferogram cannot be collected from t = −∞ to +∞ and is therefore truncated, some error arises in the resulting spectrum: each line is broadened and acquires side lobes. An apodization function is applied to correct the spectral line shape by weighting the points collected in the interferogram. Boxcar truncation applies no apodization and gives the narrowest lines. Common apodization functions include Norton-Beer, cosine, and Happ-Genzel.
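The weighting step can be sketched as follows, using the Happ-Genzel function in its usual form, 0.54 + 0.46·cos(πx/L), where x is the retardation and L the maximum retardation. The sketch assumes a double-sided interferogram whose middle sample lies at zero path difference. The function names are hypothetical, and the text does not say which apodization function the SSPS actually applies.

```python
import math

def boxcar(x, max_retardation):
    """No apodization: every collected point keeps full weight."""
    return 1.0 if abs(x) <= max_retardation else 0.0

def happ_genzel(x, max_retardation):
    """Happ-Genzel weight: 1.0 at zero path difference, 0.08 at the ends."""
    if abs(x) > max_retardation:
        return 0.0
    return 0.54 + 0.46 * math.cos(math.pi * x / max_retardation)

def apodize(interferogram, window):
    """Weight a double-sided interferogram whose centre sample is at zero path difference."""
    n = len(interferogram)
    if n < 3:
        raise ValueError("need at least three points")
    centre = (n - 1) / 2
    return [v * window(k - centre, centre) for k, v in enumerate(interferogram)]

flat = [1.0] * 5                      # made-up data, used only to expose the weights
print(apodize(flat, boxcar))
print([round(w, 3) for w in apodize(flat, happ_genzel)])
```

Tapering the ends of the record is what suppresses the side lobes, at the cost of somewhat broader lines than boxcar truncation gives.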
4. BACKGROUND SUBTRACTION
Because there needs to be a relative scale for the absorption intensity, a background spectrum must also be measured. This is normally a measurement with no sample in the beam. This can be compared to the measurement with the sample in the beam to determine the “percent transmittance.” This technique results in a spectrum which has all of the instrumental characteristics removed. Thus, all spectral features which are present are strictly due to the sample. A single background measurement can be used for many sample measurements because this spectrum is characteristic of the instrument itself.
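The comparison of the two measurements is a pointwise ratio, sketched below with made-up intensity values and hypothetical function names. The conversion to absorbance, A = −log10(T), is the standard definition and is included because the baseline step that follows works on absorbance.

```python
import math

def percent_transmittance(sample, background):
    """Pointwise ratio of the sample spectrum to the background spectrum, in percent."""
    if len(sample) != len(background):
        raise ValueError("spectra must have the same number of points")
    return [100.0 * s / b for s, b in zip(sample, background)]

def absorbance(percent_t):
    """Standard conversion from percent transmittance to absorbance."""
    return [-math.log10(t / 100.0) for t in percent_t]

background = [100.0, 100.0, 80.0]   # single-beam spectrum with no sample in the beam
sample = [50.0, 10.0, 80.0]         # single-beam spectrum with the sample in the beam
pt = percent_transmittance(sample, background)
print(pt)
print(absorbance(pt))
```

The third point shows why the ratio removes instrumental characteristics: the intensity there is lower in both measurements, so the transmittance is still 100 percent.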
5. BASELINE CORRECTION
In quantitative infrared spectroscopy it is usual to use a baseline joining the points of lowest absorbance on a peak, preferably in reproducibly flat parts of the absorption line. The absorbance difference between the baseline and the top of the band is then used. An example of baseline correction is shown in the figure.
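A minimal sketch of this procedure draws a straight baseline between two chosen points on either side of a band and measures the band height above it. The spectrum values and function names are made up for illustration, and the two baseline points are picked by hand here rather than found automatically.

```python
def baseline_at(x, left, right):
    """Value of the straight line through the points left=(x0, y0) and right=(x1, y1) at x."""
    (x0, y0), (x1, y1) = left, right
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

def band_height(wavenumbers, absorbances, left_index, right_index):
    """Largest absorbance above the baseline drawn between the two chosen points."""
    left = (wavenumbers[left_index], absorbances[left_index])
    right = (wavenumbers[right_index], absorbances[right_index])
    return max(
        absorbances[i] - baseline_at(wavenumbers[i], left, right)
        for i in range(left_index, right_index + 1)
    )

wavenumbers = [1000.0, 1010.0, 1020.0, 1030.0, 1040.0]   # cm^-1, illustrative
absorbances = [0.10, 0.15, 0.60, 0.25, 0.30]             # band sitting on a sloping baseline
print(band_height(wavenumbers, absorbances, 0, 4))
```

Here the baseline rises from 0.10 to 0.30 across the band, so the raw peak value of 0.60 corresponds to a corrected height of about 0.40.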
6. PATTERN RECOGNITION
Finally, pattern recognition algorithms are used to mathematically determine the presence or absence of a target analyte. The inputs to pattern recognition are the filtered spectra or interferograms, and the outputs are the predicted classifications of the patterns.
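The text does not name the pattern recognition algorithm used in the SSPS. As a stand-in, the sketch below uses a much simpler technique, correlation of the measured spectrum against a stored reference spectrum followed by a threshold, only to show the shape of the input (a spectrum) and the output (a present/absent decision). The threshold of 0.9, the data, and the function names are all assumptions.

```python
import math

def correlation(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("sequences must have the same length (at least 2)")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    dev_a = [x - mean_a for x in a]
    dev_b = [y - mean_b for y in b]
    denominator = math.sqrt(sum(x * x for x in dev_a) * sum(y * y for y in dev_b))
    if denominator == 0:
        return 0.0                      # a flat spectrum carries no pattern to match
    return sum(x * y for x, y in zip(dev_a, dev_b)) / denominator

def analyte_present(spectrum, reference, threshold=0.9):
    """Classify as present when the spectrum correlates strongly with the reference."""
    return correlation(spectrum, reference) >= threshold

reference = [0.0, 1.0, 4.0, 1.0, 0.0]                 # made-up absorption band of the target
with_target = [2.0 * v + 0.5 for v in reference]      # same band, different scale and offset
without_target = [1.0, 0.0, 0.0, 0.0, 1.0]            # no band at the expected position
print(analyte_present(with_target, reference))
print(analyte_present(without_target, reference))
```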
10.3 SOFTWARE IMPLEMENTATION
The majority of the SSPS software is written in C using Code Composer Studio (CCS), an integrated development environment that brings the software tools together. CCS includes tools for code generation, such as a C compiler, an assembler, and a linker. For optimum performance, some of the code has to be written in assembly language: the interrupt service routines that handle the interaction between the DSP and the ADC are written in assembly.
The C compiler compiles a C source file with extension .c to produce an assembly source file with extension .asm. The assembler assembles the .asm source file to produce a machine-language object file with extension .obj. The linker takes object files and object libraries as input and produces an executable file with extension .out. This executable file can be loaded and run directly on the C6713 processor.
An emulator is used for debugging the software. Attached to the JTAG port on the SSPS board, the emulator allows the software developer to examine all of the major components in the SSPS, including the C6713 registers, data and program memory, interface controller, and digital I/O controller. The emulator can also be used to load application software, and set breakpoints to halt execution when certain memory spaces are accessed.
11. ADVANTAGES OF FT-IR
Some of the major advantages of FT-IR over the dispersive technique include:
• Speed: Because all of the frequencies are measured simultaneously, most measurements by FT-IR are made in a matter of seconds rather than several minutes. This is sometimes referred to as the Fellgett Advantage.
• Sensitivity: Sensitivity is dramatically improved with FT-IR for many reasons. The detectors employed are much more sensitive, the optical throughput is much higher (referred to as the Jacquinot Advantage), which results in much lower noise levels, and the fast scans enable the co-addition of several scans in order to reduce the random measurement noise to any desired level (referred to as signal averaging).
• Mechanical Simplicity: The moving mirror in the interferometer is the only continuously moving part in the instrument. Thus, there is very little possibility of mechanical breakdown.
• Internally Calibrated: These instruments employ a HeNe laser as an internal wavelength calibration standard (referred to as the Connes Advantage). These instruments are self-calibrating and never need to be calibrated by the user.
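The signal averaging mentioned under Sensitivity can be illustrated numerically. For uncorrelated random noise, co-adding N scans lowers the noise level by a factor of the square root of N. The sketch below checks this on simulated Gaussian noise. The scan length, noise level, seeds, and function names are arbitrary choices for the illustration, and the scans contain noise only so that the noise level can be read directly.

```python
import math
import random

def coadd(scans):
    """Average several scans point by point."""
    n = len(scans)
    return [sum(column) / n for column in zip(*scans)]

def rms(values):
    """Root-mean-square level of a sequence."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def noisy_scans(count, points, sigma, seed):
    """Simulated scans containing only zero-mean Gaussian noise."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, sigma) for _ in range(points)] for _ in range(count)]

def expected_noise_reduction(n_scans):
    """Theoretical improvement factor for uncorrelated noise."""
    return math.sqrt(n_scans)

single = rms(coadd(noisy_scans(1, 2000, 1.0, seed=1)))
averaged = rms(coadd(noisy_scans(16, 2000, 1.0, seed=2)))
print(single / averaged, expected_noise_reduction(16))
```

With 16 co-added scans the measured ratio should come out close to the theoretical factor of 4, within the statistical scatter of the simulation.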
These advantages, along with several others, make measurements made by FT-IR extremely accurate and reproducible. Thus, it is a very reliable technique for positive identification of virtually any sample. The sensitivity benefits enable identification of even the smallest of contaminants. This makes FT-IR an invaluable tool for quality control or quality assurance applications, whether for batch-to-batch comparisons to quality standards or for analysis of an unknown contaminant. In addition, the sensitivity and accuracy of FT-IR detectors, along with a wide variety of software algorithms, have dramatically increased the practical use of infrared for quantitative analysis. Quantitative methods can be easily developed and calibrated and can be incorporated into simple procedures for routine analysis.
Thus, the Fourier Transform Infrared (FT-IR) technique has brought significant practical advantages to infrared spectroscopy. It has made possible the development of many new sampling techniques designed to tackle challenging problems that could not be solved with older technology. It has made the use of infrared analysis virtually limitless.
Figure 13 shows an application of the SSPS module in an FTIR spectrometer designed for a real-time application system. An embedded processor, PC104, is used as the central processing unit of this spectrometer. The PC104 controls the motion of the moving mirror and, at the same time, collects the interferogram sampled by the ADC device. Interferograms are then passed to the SSPS through the dual-port RAM, whose memory is shared by the C6713 and the PC104. The SSPS processes the data using the signal processing methods described above and reports whether the target analyte is present or absent. The C6713 runs at 200MHz in this system. When the system is running, the FTIR spectrometer collects 10 interferograms per second and puts them into the dual-port RAM. The SSPS processes the interferograms with the algorithms described above and sends the results to the controller. Here are some timing characteristics measured in experiments.
Figure 13: The SSPS module used in a spectrometer
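The real-time requirement implied by these figures can be worked out directly. This is not measured data, only arithmetic on the two numbers quoted above (10 interferograms per second and a 200MHz clock), with hypothetical function names.

```python
def frame_budget(frames_per_second, clock_hz):
    """Time and clock cycles available to process one interferogram."""
    period_s = 1.0 / frames_per_second
    cycles_per_frame = clock_hz / frames_per_second
    return period_s, cycles_per_frame

def keeps_up(processing_time_s, frames_per_second):
    """True if one interferogram is finished before the next one arrives."""
    return processing_time_s <= 1.0 / frames_per_second

period_s, cycles = frame_budget(10, 200e6)
print(period_s)   # seconds available per interferogram
print(cycles)     # clock cycles available per interferogram
```

All processing steps for one interferogram, from the Fourier transform to the final classification, therefore have to fit within 0.1 s, or 20 million clock cycles, for the system to keep pace with the spectrometer.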
This module can also be used as an effective signal processing platform for other real-time systems which need high-speed signal processing, such as
1. Speech recognition systems
2. Remote medical diagnostics
3. Audio systems
4. Radar systems
5. Imaging systems
In these cases, some signal processing algorithms in SSPS may be changed, and some other algorithms should be added to SSPS.
In this paper, the design and implementation of a novel FTIR spectral signal processing system are described. Both the hardware and the software of the SSPS module are designed. The use of a high-speed floating-point DSP in the module accounts for the overall increase in flexibility and performance of the FTIR spectrometer. A dual-port RAM is used in the system for high-volume data transmission. Three kinds of memory are chosen for the SSPS to store interferogram and spectral data. In order to process the interferogram data in real time, three steps of effective signal processing methods are designed for the SSPS. The software of the SSPS is written in C and assembly language, compiled with the CCS compiler, and stored in flash memory. The applications show that the proposed SSPS works quite well for spectral signal processing in an FTIR spectrometer.
This module can also be used as an effective signal processing platform for other real-time systems which need high-speed signal processing, such as speech recognition systems, remote medical diagnostics, audio, radar, imaging systems, etc. In these cases, some signal processing algorithms in SSPS may be changed, and some other algorithms should be added to SSPS.
1. Dong Da-ming, Fang Yong-hua, Xiong Wei, and Lan Tian-ge, “Design of a high speed spectral signal processing system with floating point DSP for FTIR spectroscopy,” The Ninth International Conference on Electronic Measurements and Instruments (ICEMI), vol. 1, pp. 35–39, 2009.
2. Paduart Johan, Schoukens Johan, and Rolain Yves, “Fast measurement of quantization distortions in DSP algorithms,” IEEE Transactions on Instrumentation and Measurement, vol. 5, pp. 1917–1923, October 2007.
3. Donatus, “DSP-based real-time implementation of a hybrid H infinity adaptive fuzzy tracking controller for servo-motor drives,” IEEE Transactions on Industry Applications, vol. 2, pp. 476–484, March/April 2007.