In design situations where optimum performance and/or power consumption are required, implementing digital signal processing functions in dedicated hardware rather than in software becomes an attractive proposition. An FPGA is a particularly compelling silicon platform for realizing this aspiration, because it conceptually combines the inherent hardware attributes of an ASIC with the flexibility and time-to-market advantages of the software alternative running on a CPU, GPU or DSP. As such, FPGAs are increasingly finding use as parallel processing engines for demanding digital signal processing applications. Benchmark results show that on highly parallelizable workloads, FPGAs can achieve very strong performance and performance/cost metrics compared to DSPs and CPUs.
However, to date, FPGAs have been used almost exclusively for fixed-point digital signal processing functions. Although FPGA vendors have long offered floating-point primitive libraries, the performance of FPGAs in floating-point applications has historically been very limited. The inefficiency of traditional floating-point FPGA designs is partially due to the deeply pipelined nature and wide arithmetic structures of the floating-point operators, which create large data path latencies and routing congestion. In turn, the latencies can create hard-to-manage problems in designs with high data dependencies. The final result is often a design with a low operating frequency. Moore's Law trends have benefitted FPGAs, leading to devices with more abundant logic (both generic and function-specific) and routing resources, made up of faster-switching and cooler-running transistors. These silicon improvements are only meaningful, though, if design software enables users to leverage them in a straightforward manner.
The traditional FPGA design flow, based on writing register-transfer-level hardware descriptions in Verilog or VHDL, is not well suited to implementing complex floating-point algorithms. Addressing this issue, Altera has developed a new design flow intended to simplify the process of implementing floating-point digital signal processing algorithms on the company's FPGAs, and to enable those designs to achieve higher performance and efficiency than previously possible. Rather than building a data path consisting of elementary floating-point operators (multiplication followed by addition followed by squaring, for example), the floating-point compiler included in Altera's design flow generates a fused data path that combines elementary operators into a single function or data path (Figure 1). In doing so, it eliminates the redundancies present in traditional floating-point FPGA designs. In addition, the Altera design flow is a high-level approach based on Simulink. Altera intends that FPGA designers will, by working at a high level of abstraction, be able to implement and verify complex floating-point algorithms more quickly than is possible with traditional HDL-based design.
Figure 1. The fused datapath compiler included in Altera's DSP Builder Advanced Blockset-based tool set eliminates the inter-operator redundancy of a traditional two-adder chain design (a), removing the need to de-normalize the output of the first adder and normalize the input of the second adder, thereby leading to reduced logic usage and higher performance (b).
Specifically, Altera's floating-point digital signal processing design flow incorporates the Altera DSP Builder Advanced Blockset, Altera’s Quartus II RTL tool chain, ModelSim simulator, and MathWorks’ MATLAB and Simulink tools. This approach allows the designer to work at the algorithmic behavioral level, within the Simulink environment. The tool chain combines algorithm modeling and simulation, RTL generation, synthesis, place-and-route, and design verification stages. This integration enables quick development and rapid design space exploration both at the algorithmic level and at the FPGA device level, ultimately reducing overall design effort. Once a designer has modeled and debugged the algorithm at a high level, design synthesis and targeting to any Altera FPGA device are then possible.
To evaluate the capabilities of this tool set, along with those of various target silicon implementation platforms, Altera used the DSP Builder Advanced Blockset development flow to develop a complex single-precision IEEE 754 floating-point Cholesky solver design example on two different Altera FPGAs (the high-end Stratix IV EP4SE360H29C2 and the mid-range Arria II EP2AGX125DF25I3). BDTI subsequently verified both implementations. Sets of linear equations of the form Ax = b arise in many applications: optimization of linear least squares, a Kalman prediction filter, and MIMO channel estimation, for example, all involve finding a numerical solution to such a system. When the matrix A is symmetric and positive definite, which is true for the covariance matrices used in these and related problems, the Cholesky decomposition and solver commonly find use. This algorithm determines the inverse of matrix A, thus solving for vector x in x = A⁻¹b.
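As a rough software illustration of what a Cholesky solver computes, the sketch below factors a symmetric positive definite matrix A into a lower-triangular L with A = L·Lᵀ, then solves Ax = b by forward and back substitution. It is a minimal real-valued, double-precision sketch in plain Python, not Altera's Simulink design (which is complex-valued, single-precision, and pipelined in hardware). The function names `cholesky_lower` and `cholesky_solve` and the 3x3 test matrix are assumptions chosen for illustration.

```python
import math


def cholesky_lower(a):
    """Return lower-triangular L such that L * L^T equals a.

    a must be a symmetric positive definite matrix given as a list of rows.
    """
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: subtract the squares already placed in row j.
        d = a[j][j] - sum(low[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        low[j][j] = math.sqrt(d)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = s / low[j][j]
    return low


def cholesky_solve(a, b):
    """Solve a x = b for x using the Cholesky factor of a."""
    n = len(a)
    low = cholesky_lower(a)
    # Forward substitution: solve L y = b.
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    # Back substitution: solve L^T x = y.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(low[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - s) / low[i][i]
    return x


# Hypothetical test system (not taken from the white paper).
A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
b = [1.0, 2.0, 3.0]

x = cholesky_solve(A, b)
# Largest absolute entry of A x - b, as a sanity check on the solution.
residual = max(abs(sum(A[i][j] * x[j] for j in range(3)) - b[i])
               for i in range(3))
print(x, residual)
```

Note that the sketch never forms A⁻¹ explicitly; the two triangular solves yield the same x as multiplying b by the inverse, with less work.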
Using the Cholesky solver implementation described in a just-published white paper, BDTI found the Altera Stratix IV EP4SE360H29C2 FPGA to be capable of performing 3,204 matrix inversions per second on matrices of size 240x240, running at a clock speed of 200 MHz. This result equates to over 45 billion floating-point multiplications per second at an accuracy that exceeds that of the single-precision IEEE 754 floating-point number representation. BDTI achieved it without resorting to manual floor planning or any other hand-optimization tweaks. Starting from a high-level block-based design in Simulink, the Altera-plus-MathWorks tool chain automatically pipelined, time-optimized and synthesized the design to achieve robust speeds and efficient resource utilization. However, this new block-based approach may also entail a significant learning curve, especially for a designer not familiar with MATLAB and Simulink. Designs using the DSP Builder Advanced Blockset are also currently limited to the elements it provides: elements from the standard DSP Builder Blockset are not optimized with the fused datapath compiler and cannot be mixed with the Advanced Blockset, nor can hand-coded HDL blocks be imported into the Advanced Blockset.
Altera will make the BDTI-verified Cholesky solver available as a design example packaged with the DSP Builder Advanced Blockset tool chain, beginning with toolset v11.1. For additional information on this project, including:
- A detailed description of the BDTI-verified Cholesky solver
- BDTI's in-depth usage impressions of the Altera DSP Builder Advanced Blockset-based tool set design flow
- Additional clock speed, throughput, and resource utilization results for other matrix sizes and for the mid-range Arria II EP2AGX125DF25I3 FPGA, and
- Error calculations for various single-precision floating-point Cholesky solver FPGA implementations as compared to double-precision floating-point references
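
The last item above compares single-precision results against a double-precision reference. As a rough illustration of that kind of measurement, the sketch below solves the same system twice in plain Python: once in native double precision, and once with every arithmetic result rounded to IEEE 754 single precision. It then reports the relative error between the two solutions. This is an assumption-laden sketch, not BDTI's methodology: the helper names (`to_single`, `exact`, `cholesky_solve`, `relative_error`), the 8x8 diagonally dominant test matrix, and the 2-norm error metric are all hypothetical, and the emulation models plain single-precision arithmetic rather than the fused datapath.

```python
import math
import struct


def to_single(value):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def exact(value):
    """Identity rounding, used for the double-precision reference run."""
    return value


def cholesky_solve(a, b, rnd):
    """Solve a x = b via Cholesky, applying rnd() after each arithmetic step."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(j, n):
            acc = rnd(a[i][j])
            for k in range(j):
                acc = rnd(acc - rnd(low[i][k] * low[j][k]))
            # The diagonal entry (i == j) is computed first in each column.
            low[i][j] = rnd(math.sqrt(acc)) if i == j else rnd(acc / low[j][j])
    y = [0.0] * n
    for i in range(n):  # forward substitution: L y = b
        acc = rnd(b[i])
        for k in range(i):
            acc = rnd(acc - rnd(low[i][k] * y[k]))
        y[i] = rnd(acc / low[i][i])
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution: L^T x = y
        acc = y[i]
        for k in range(i + 1, n):
            acc = rnd(acc - rnd(low[k][i] * x[k]))
        x[i] = rnd(acc / low[i][i])
    return x


def relative_error(approx, reference):
    """2-norm of (approx - reference) divided by the 2-norm of reference."""
    num = math.sqrt(sum((p - q) ** 2 for p, q in zip(approx, reference)))
    den = math.sqrt(sum(q ** 2 for q in reference))
    return num / den


# Hypothetical symmetric, strictly diagonally dominant (hence positive
# definite) test matrix and right-hand side.
n = 8
a = [[n + 1.0 if i == j else 1.0 / (1 + abs(i - j)) for j in range(n)]
     for i in range(n)]
b = [float(i + 1) for i in range(n)]

x_double = cholesky_solve(a, b, exact)
x_single = cholesky_solve(a, b, to_single)
rel_err = relative_error(x_single, x_double)
print("relative error of single vs. double:", rel_err)
```

Rounding each double-precision operation result to single precision reproduces correctly rounded single-precision arithmetic for addition, subtraction, multiplication, division, and square root, which is why the `struct` round-trip is enough for this emulation.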