Included in an article on FPGA benchmarking in the September 2011 edition of InsideDSP, BDTI wrote:
In design situations where optimum performance and/or power consumption is required, implementing digital signal processing functions in dedicated hardware versus software becomes an attractive proposition. A FPGA is a particularly compelling silicon platform for realizing this aspiration, because it conceptually combines the inherent hardware attributes of an ASIC with the flexibility and time-to-market advantages of the software alternative running on a CPU, GPU or DSP. As such, FPGAs are increasingly finding use as parallel processing engines for demanding digital signal processing applications. Benchmark results show that on highly parallelizable workloads, FPGAs can achieve very strong performance and performance/cost metrics compared to DSPs and CPUs.
However, it continued:
To date, FPGAs have been used almost exclusively for fixed-point digital signal processing functions. Although FPGA vendors have long offered floating-point primitive libraries, the performance of FPGAs in floating-point applications has historically been very limited. The inefficiency of traditional floating-point FPGA designs is partially due to the deeply pipelined nature and wide arithmetic structures of the floating-point operators, which create large data path latencies and routing congestion. In turn, the latencies can create hard-to-manage problems in designs with high data dependencies. The final result is often a design with a low operating frequency.
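To make the data-dependency point in the passage above concrete, here is a toy cycle-count model in Python. It is a sketch under stated assumptions, not a description of any Altera tool or device: the function name `cycles_to_accumulate`, the 10-cycle adder latency and the 1,000-sample workload are all hypothetical values chosen for illustration. It models one pipelined adder that can accept a new operation every cycle but takes `latency` cycles to produce a result, and compares a single running sum (each addition waits on the previous one) with several independent partial sums interleaved on the same adder.

```python
def cycles_to_accumulate(n_ops: int, latency: int, n_chains: int = 1) -> int:
    """Toy model: cycles needed to issue and complete n_ops additions on one
    pipelined adder, spread round-robin over n_chains independent accumulators.

    Assumptions (hypothetical, for illustration only):
      - the adder accepts at most one new operation per cycle;
      - each result appears `latency` cycles after its operation is issued;
      - an addition cannot issue until the previous result of the same
        accumulator is available (the data dependency).
    """
    if n_ops < 0 or latency < 1 or n_chains < 1:
        raise ValueError("n_ops must be >= 0; latency and n_chains must be >= 1")

    ready = [0] * n_chains  # cycle at which each accumulator's last result is ready
    next_issue = 0          # earliest cycle at which the adder is free to accept work
    finish = 0              # cycle at which the last result so far becomes available

    for i in range(n_ops):
        chain = i % n_chains
        start = max(next_issue, ready[chain])  # wait for the adder and the dependency
        ready[chain] = start + latency
        next_issue = start + 1
        finish = max(finish, ready[chain])
    return finish


if __name__ == "__main__":
    LATENCY = 10   # hypothetical floating-point adder pipeline depth
    N_OPS = 1000   # hypothetical number of samples to accumulate
    for chains in (1, 2, 5, 10):
        print(chains, cycles_to_accumulate(N_OPS, LATENCY, chains))
```

Under these assumptions a single dependent chain needs 10,000 cycles, while ten interleaved accumulators need 1,009. The model leaves out the final step of summing the partial results and says nothing about routing congestion or achievable clock frequency, which the passage identifies as separate problems.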
Specialized FPGA toolsets such as Altera's DSP Builder, which BDTI evaluated in advance of publication of the 2011 article (documented in a white paper) and again for a March 2013 follow-up writeup (documented in two white papers), strive to efficiently implement common floating-point DSP structures. But while Altera and its competitors have long included dedicated-function fixed-point DSP acceleration blocks in their FPGAs, floating-point operations have to date required the extensive use of generic programmable logic blocks to supplement the capabilities of the fixed-point acceleration blocks. The result, as Altera's recently published artwork acknowledges, is a sub-optimal implementation in terms of both performance and silicon area efficiency (Figure 1):
Figure 1. Altera's existing FPGA product families can support floating-point calculations, albeit with non-ideal performance (top) and silicon area (bottom) due to the necessity to employ generic logic and routing resources
Altera understandably hasn't emphasized such shortcomings in the past. What's changed? As a July 2013 InsideDSP article mentioned, the company's forthcoming 14 nm Stratix 10 FPGAs, due out sometime next year, will embed "hard" floating-point DSP blocks for the first time (Figure 2):
Of particular note to digital signal processing algorithm implementers, Altera forecasts that Stratix 10 will deliver more than 10x the single-precision floating-point throughput of Stratix V, ranging up to greater than 10 TFLOPS (and 100 GFLOPS/watt). In part, this speed boost is inherently due to the two-node process migration, which increases both the DSP blocks' clock rates and the number of them that can be cost-effectively integrated on a single sliver of silicon. However, Altera has also more fully "hardened" the DSP block architecture; in today's Stratix V, slower generic FPGA logic is leveraged to implement a portion of floating-point functions. Keep in mind that the above performance claims are theoretical maximum estimates; BDTI's recent Stratix V floating-point evaluation (PDF) achieved 162 x 10^9 (i.e., 1.6 x 10^11) FLOPS performance, versus Altera's 1 x 10^12 FLOPS estimate for Stratix V. Architecture improvements that boost theoretical maximum single-precision floating-point calculation rates, however, should also increase real-life speeds to a tangible degree.
Figure 2. The enhanced DSP block architecture found in upcoming Stratix 10 (and, now, available Arria 10) FPGAs implements single-precision IEEE 754 arithmetic support
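The gap between the measured and theoretical Stratix V figures quoted above is easy to quantify. The short Python sketch below does only that arithmetic; the two constants come from the quoted passage, while the function name `achieved_fraction` is a hypothetical helper introduced here. Because the vendor number is a theoretical maximum and the BDTI number is a measured result, the ratio is indicative only.

```python
# Stratix V single-precision figures quoted in the passage above.
MEASURED_FLOPS = 162e9     # BDTI's evaluation result (162 x 10^9 FLOPS)
VENDOR_PEAK_FLOPS = 1e12   # Altera's theoretical estimate (1 x 10^12 FLOPS)


def achieved_fraction(measured: float, peak: float) -> float:
    """Return measured throughput as a fraction of the theoretical peak.

    Hypothetical helper for illustration; it performs a plain division.
    """
    if peak <= 0:
        raise ValueError("peak must be positive")
    return measured / peak


if __name__ == "__main__":
    fraction = achieved_fraction(MEASURED_FLOPS, VENDOR_PEAK_FLOPS)
    print(f"Measured result as a share of the estimate: {fraction:.1%}")
    print(f"Estimate relative to measured result: {1 / fraction:.2f}x")
```

In other words, the measured result is roughly one sixth of the theoretical estimate, which is why the quoted passage cautions against reading peak figures as delivered performance.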
Altera believes this added capability will deliver multiple benefits to its military, high-performance computing, "cloud" storage and other customers that want flexible, high-performance floating-point facilities (Figure 3).
Figure 3. Fully hardware-implemented floating-point acceleration delivers performance (top), silicon area efficiency (middle) and time-to-market (bottom) advantages
And more recently, Altera announced that its hardened single-precision floating-point DSP blocks will be coming not only to next year's high-end Stratix 10 devices but also to its mainstream Arria 10 product line, fabricated on TSMC's 20 nm foundry process (Figure 4).
Figure 4. Speed estimates for under-development Stratix 10 devices are not yet public, but now-available Arria 10 samples notably accelerate the results achievable by Arria V and Stratix V precursors
This feature set expansion delivers advantages to Altera's current and potential new customers alike. If you're an existing Stratix V or Arria V user, with Arria 10 you’ll be able to leverage floating-point acceleration resources sooner and more cost-effectively than was previously feasible. And if you're planning to go into production with Stratix 10, the Arria 10 products' package and pinout compatibility will enable earlier hardware prototyping than Stratix 10's silicon schedule would otherwise allow (Figure 5).
Figure 5. Pinout and package compatibility enable you to do speed- and size-restricted design prototyping with Arria 10 now, with a subsequent migration to Stratix 10 for production next year
Speaking of silicon schedules, we now have a better (albeit still somewhat nebulous) idea of the status of both Arria 10 and Stratix 10. Last July's article noted:
Altera intends to have initial Stratix 10 test chips in hand by the end of this year, coinciding with Intel's planned production of 14 nm devices. Early-access development software support for Stratix 10 lead customers will follow sometime next year, ahead of sample availability. With Arria 10, early-access design software support is already available; initial product samples are currently forecasted to appear in early 2014.
April's press release from Altera indicates that Arria 10 FPGAs are indeed now sampling, although the floating-point design flow was still in development at that time:
Altera 20 nm Arria 10 FPGAs with hardened floating-point DSP blocks are available now. Floating-point design flows, including demonstrations and benchmarks, that target the hardened floating-point DSP blocks in Arria 10 devices will be available in the second half of 2014. Customers can start designing today with Arria 10 FPGAs using soft implementations of floating point and then seamlessly migrate to hardened floating-point implementation when the design flows are available.
Still unknown are sample and production availability dates for the planned Arria 10 SoC variants containing an embedded dual-core ARM Cortex-A9 processor subsystem. And as for Stratix 10, Figure 5 above indicates that it's coming sometime next year.
Altera deserves kudos for enhancing its FPGAs' digital signal processing "chops" by broadening the DSP blocks' capabilities to encompass full IEEE 754 single-precision floating-point support, for extending this support beyond high-end to mainstream product lines, and for accelerating silicon availability of this support in the process. And future FPGA product families from the company will carry forward these enhanced digital signal processing capabilities.