Michael Parker

Altera - Sr Technical Manager, DSP

My responsibility is to provide the direction and detailed technical requirements used to develop Altera's DSP silicon architectures, DSP and floating point toolflows, IP cores and application specific reference designs.

Software and Hardware Platforms Enable Over 1 TeraFlop Processing Rates

Floating point numerical processing has long been used in computing applications, including most CPU architectures, because it is mathematically superior and supports a wide dynamic range. Most embedded applications, however, have traditionally used fixed point processing. Although fixed point significantly increases development complexity (often tripling development time relative to floating point), fixed point microprocessors, DSPs, and FPGAs can generally provide lower power consumption, lower cost, and, in the case of FPGAs, much higher processing rates.

A new FPGA-based floating point design flow now allows floating point applications to achieve the same high processing rates as fixed point applications. A floating point co-processor that can be tightly coupled to the FPGA hardware is also newly available, so both hardware and software floating point processing can be used. Both of these capabilities still support high-throughput fixed point processing for the parts of the DSP datapath that do not need the dynamic range of floating point. The result is a processing platform that provides the advantages of both floating point and fixed point processing, with the flexibility to partition and optimize the implementation between hardware and software.

Parallelism is a key advantage of hardware solutions such as FPGAs, but it has often not been applied to floating point signal processing, because the long latencies of floating point operators make the data dependencies in algorithms such as matrix decomposition difficult to manage. As a result, FPGA-based floating point systems have offered poor performance and have been uncompetitive with other platforms such as GPUs or multi-core CPUs.

Altera has developed a floating point design flow that overcomes these issues. Rather than building a datapath from individual operators, the flow treats the entire datapath as a single function and factors out the redundancy between operators. Mantissas can be converted to a hardware-friendly two's complement representation, and mantissa widths can be extended to reduce how often normalization is needed. Elementary functions are implemented as far as possible using hard multipliers, which offer guaranteed internal routing and timing as well as low power and latency. New techniques can also be applied to matrix decompositions: the algorithms are restructured to remove most of the data dependencies, so that parallel (and therefore high latency) datapaths can be used for these computations.
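The benefit of carrying a wider mantissa through the datapath and rounding less often can be illustrated in software. The Python sketch below is only an analogy, not Altera's implementation: it uses IEEE-754 double precision as a stand-in for an extended internal mantissa and single precision as the external format, and the function names are hypothetical. One version rounds after every multiply and add, as a datapath built from individual operators would; the other keeps the wide format internally and rounds once at the output.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot_per_operator(a, b):
    """Dot product that rounds to single precision after every operator."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_f32(acc + to_f32(x * y))
    return acc


def dot_fused(a, b):
    """Dot product that keeps a wider internal format and rounds once."""
    acc = 0.0  # binary64 stands in for the extended mantissa (assumption)
    for x, y in zip(a, b):
        acc += x * y
    return to_f32(acc)


if __name__ == "__main__":
    a = [1e8, 1.0, -1e8]
    b = [1.0, 1.0, 1.0]
    print(dot_per_operator(a, b))  # the 1.0 is lost when added to 1e8
    print(dot_fused(a, b))         # the 1.0 survives until the final rounding
```

With these inputs the per-operator version loses the small term, because 1e8 + 1 is not representable in single precision, while the version with a wider internal format preserves it.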

This approach is known as “Fused Datapath,” and when combined with a new 28 nm Variable Precision DSP block architecture, it offers extremely high data processing capabilities, in excess of one TeraFLOPS on a single FPGA die. The Fused Datapath technology has been embedded in Altera’s DSP Builder design suite, which allows the full simulation and system design capabilities of MathWorks MATLAB and Simulink to be used. This innovation in high-performance FPGA floating point allows the advantages of a parallel hardware architecture to be applied in the very highest performance applications where the dynamic range of floating point is required.

An example of the matrix inversion processing capability with the latest floating point Cholesky matrix processing design is shown in Figure 1.

Figure 1  FPGA-based Floating Point Processing Throughput Example
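To make the data dependencies in a Cholesky decomposition concrete, here is a minimal textbook sketch in Python. It is not the restructured algorithm used in Altera's design, and the function name is hypothetical. It factors a symmetric positive definite matrix A into L times its transpose, column by column. The comments mark which steps must wait on earlier results and which are mutually independent, and so are candidates for parallel or deeply pipelined hardware.

```python
import math


def cholesky_by_columns(a):
    """Return lower-triangular L such that L * L^T equals a.

    `a` is a symmetric positive definite matrix given as a list of rows.
    """
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal element: depends on every earlier column of row j,
        # and the square root sits on the critical path.
        d = a[j][j] - sum(l[j][k] * l[j][k] for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        l[j][j] = math.sqrt(d)
        # Off-diagonal elements of column j: each needs l[j][j] and the
        # earlier columns, but none depends on another element of the same
        # column, so these dot products can be computed side by side.
        for i in range(j + 1, n):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = (a[i][j] - s) / l[j][j]
    return l


if __name__ == "__main__":
    a = [[4.0, 12.0, -16.0],
         [12.0, 37.0, -43.0],
         [-16.0, -43.0, 98.0]]
    for row in cholesky_by_columns(a):
        print(row)
```

Once L is available, solving or inverting the original system reduces to triangular substitutions, which is why the decomposition dominates the matrix inversion workload.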

For more information on Altera’s FPGA floating point design flow using the DSP Builder Advanced Blockset and MathWorks’ Simulink, refer to the recent BDTI white paper and toolflow evaluation available at http://www.altera.com/floatingpoint.

Most floating point applications are currently implemented in software. With this new FPGA design flow now offering extremely high processing rates, a new architecture can be conceived that uses a tightly coupled C-programmable engine as a co-processor to the FPGA, rather than just the reverse. The FPGA can implement the repetitive, high GFLOPS portions of the algorithm, while the co-processor can deal with the more complicated and data dependent algorithms. This approach would combine the performance advantages of hardware implementations with the ease of development of software implementations.

The new Anemone floating point processor from BittWare connects to the FPGA via high-rate, low latency link ports. All access to off-chip memory is through the FPGA, as are off-board interfaces such as PCIe backplanes or Ethernet ports. The Anemone processor is a multi-core design, currently offering sixteen cores per chip, all interconnected in a mesh network with a shared memory model. Each core has 32 Kbytes of local memory, supports IEEE-754 floating point processing, and is individually programmable in ANSI C. The sixteen-core Anemone chip offers 32 GFLOPS while consuming only two watts of total power. Four Anemone chips, providing 128 GFLOPS, are available on an FMC (VITA 57) standard daughter card for use on FPGA host boards in form factors such as AMC, PCIe, and VPX. These are available today with Altera's high-end Stratix IV FPGAs, as shown in Figure 2, and will be offered later this year with Stratix V FPGAs.

Figure 2  Anemone-Stratix High Performance Floating Point Processing System featuring an AAFM co-processing mezzanine on an S4-3U-VPX host board from BittWare

The Anemone-to-FPGA interface is made transparent to the application by BittWare’s ATLANTiS FrameWork, which connects directly to Altera’s Qsys FPGA system interconnect tool. This makes it easier to partition processing tasks optimally between the Anemone and the FPGA. With up to one TeraFLOPS of hardware floating point processing on Stratix V FPGAs and 128 GFLOPS of software floating point processing on Anemone, applications with extremely high computational rates can be implemented in a small form factor, low power platform.
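The throughput figures quoted in this post can be tied together with a little arithmetic. The Python sketch below uses only the numbers stated above (16 cores, 32 GFLOPS, and two watts per Anemone chip; four chips per FMC card; up to one TeraFLOPS on the FPGA). The function and constant names are hypothetical, and the card-level power is an assumption that simply multiplies the per-chip figure and ignores any board overhead.

```python
CORES_PER_CHIP = 16
GFLOPS_PER_CHIP = 32.0
WATTS_PER_CHIP = 2.0
CHIPS_PER_CARD = 4
FPGA_PEAK_GFLOPS = 1000.0  # "up to one TeraFLOPS" on Stratix V


def anemone_budget(chips: int = CHIPS_PER_CARD) -> dict:
    """Summarize peak throughput figures for a given number of chips."""
    gflops = chips * GFLOPS_PER_CHIP
    watts = chips * WATTS_PER_CHIP  # assumption: chip power only, linear scaling
    return {
        "gflops": gflops,
        "gflops_per_core": GFLOPS_PER_CHIP / CORES_PER_CHIP,
        "gflops_per_watt": gflops / watts,
        "platform_peak_gflops": FPGA_PEAK_GFLOPS + gflops,
    }


if __name__ == "__main__":
    for name, value in anemone_budget().items():
        print(f"{name}: {value}")
```

These are peak figures; sustained throughput depends on the algorithm and on how well the work is partitioned between the FPGA and the co-processor.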

More information on BittWare’s Anemone processor and COTS FPGA boards is available from BittWare.

An example application is a high-performance airborne radar system. The FPGA can implement the digital downconversion, beamforming, MTI filtering, Doppler FFT processing, pulse compression, and the matrix inversions needed in space-time adaptive processing (STAP). The Anemone processor is well suited to tasks that need fewer GFLOPS but are more complex, such as CFAR detection processing, computing beamforming coefficients, adapting and controlling radar modes, and transmit waveform generation. Low latency between the processing sub-systems is essential, and this requirement is not easily met with GPU or CPU architectures. The combination of Anemone and Stratix FPGAs offers a good balance of TeraFLOPS-class processing power, flexibility to partition across hardware and software implementations, high GFLOPS per watt, and a very compact form factor.
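As an illustration of the kind of data-dependent task that suits a C-programmable co-processor, here is a minimal sketch of cell-averaging CFAR detection, written in Python for brevity. It is a generic textbook variant, not BittWare's or Altera's implementation, and the function name, parameter names, and default values are assumptions chosen for the example. For each cell under test, the noise level is estimated from training cells on both sides, skipping the guard cells next to it, and a detection is declared when the cell exceeds a scaled noise estimate.

```python
def ca_cfar(power, guard=1, train=2, scale=4.0):
    """Return indices of cells whose power exceeds the local CFAR threshold.

    power: list of non-negative power samples along range (or Doppler).
    guard: guard cells on each side of the cell under test.
    train: training cells on each side, beyond the guard cells.
    scale: threshold multiplier applied to the mean training power.
    """
    hits = []
    half = guard + train
    # Only test cells that have a full window on both sides.
    for i in range(half, len(power) - half):
        leading = power[i - half : i - guard]
        lagging = power[i + guard + 1 : i + half + 1]
        noise = (sum(leading) + sum(lagging)) / (2 * train)
        if power[i] > scale * noise:
            hits.append(i)
    return hits


if __name__ == "__main__":
    samples = [1.0] * 10
    samples[5] = 20.0  # a single strong return in flat noise
    print(ca_cfar(samples))
```

The threshold adapts to the local noise estimate for every cell, which is the sort of branching, data-dependent logic the post assigns to the software side of the partition.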

This combination can also be ideal for any embedded application requiring high-performance computing power in military, medical imaging, wireless, or test equipment applications. Through the choice of FPGA and number of Anemone chips, the design can easily scale the level of processing power. The availability of Anemone-Stratix systems on BittWare’s COTS boards and systems supports rapid product development cycles.
