Altera's OpenCL SDK: High-Level Synthesis Done A Different Way

Submitted by BDTI on Tue, 02/12/2013 - 21:02

Within a technical article published in the August 2012 edition of InsideDSP, I wrote:

As FPGAs have evolved, the means by which engineers create FPGA designs have also evolved. In particular, design techniques employing increasingly higher levels of abstraction have been required to address the increasing chip capabilities. Initial FPGA design flows were schematic-based. These later gave way to HDLs (hardware description languages) such as VHDL and Verilog. And more recently, high-level synthesis has entered the mainstream, after years of research and development and early-adopter experimentation.

Effective high-level-language-based FPGA design flows have become desirable (by implementers) and sought after (by suppliers), since at least in theory they can enable big gains in design and verification productivity. C-language-based flows are particularly attractive in that they offer the potential for relatively straightforward hardware acceleration of functions that would alternatively run in software on a system processor. Enabling flexible (and rapid) movement of the hardware-versus-software partition in order to assess design alternatives becomes increasingly attractive once the FPGA fabric and the microprocessor share the same sliver of silicon.

And even for designers committed to existing HDLs, the associated tools must evolve to keep pace with growing device capacity and complexity, and with design methodologies that are evolving to address that growing capacity and complexity, such as the increasing reliance on design reuse and IP cores (in diverse formats from HDL source to pre-placed-and-routed blocks). In the face of intensifying time-to-market pressures in many industries, the speed and efficiency of FPGA design tools has become an increasingly critical consideration, including how those tools handle late-stage ECOs (engineering change orders).

At the time, I was writing about Xilinx's Vivado design tool suites, specifically about the HLS (high-level synthesis) tool included within the sub-$5,000 System Edition suite option. HLS, which Xilinx had obtained via the January 2011 acquisition of AutoESL and its AutoPilot product line, targets the direct implementation of C, C++, and SystemC behavioral descriptions into Xilinx FPGAs. But the overall trend towards increasingly higher levels of design abstraction is a vendor-independent phenomenon. And unsurprisingly, therefore, Xilinx's primary competitor Altera is also responding to the market need... albeit with a somewhat different solution: an OpenCL-based approach.

What is OpenCL? Here's what the website of the Khronos Group, an industry alliance which maintains the standard (along with others, such as the well-known OpenGL), says:

OpenCL is the first open, royalty-free standard for cross-platform, parallel programming of modern processors found in personal computers, servers and handheld/embedded devices. OpenCL (Open Computing Language) greatly improves speed and responsiveness for a wide spectrum of applications in numerous market categories from gaming and entertainment to scientific and medical software.

In short, OpenCL provides a means of developing, in a hardware-independent manner, code that can be easily partitioned among the various available processing platforms within a heterogeneous system...CPUs, GPUs, DSPs and (of course) FPGAs. For far more information, see BDTI Senior Engineer Shehrzad Qureshi's OpenCL presentation from the July 2012 Embedded Vision Alliance Member Summit (registration on the Embedded Vision Alliance website is required prior to accessing the video).
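To make the programming model concrete, here is a minimal sketch in plain Python (not OpenCL C, and not any vendor's API) of the idea behind an OpenCL kernel: a small function written from the point of view of a single work-item, which a runtime then invokes once per index in a range. The names `vector_add_kernel` and `run_over_range` are hypothetical, chosen only for illustration. A real OpenCL runtime would spread the work-items across the available compute hardware rather than looping over them serially.

```python
def vector_add_kernel(global_id, a, b, out):
    """Body of one work-item: it handles only the element at its own index."""
    out[global_id] = a[global_id] + b[global_id]


def run_over_range(kernel, global_size, *args):
    """Hypothetical stand-in for a runtime: call the kernel once per index.

    This loop is serial; the point of the model is that the iterations are
    independent, so real hardware is free to run them in parallel.
    """
    for global_id in range(global_size):
        kernel(global_id, *args)


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
out = [0] * len(a)
run_over_range(vector_add_kernel, len(a), a, b, out)
print(out)  # [11, 22, 33, 44]
```

Because the kernel says nothing about how many work-items run at once, the same source can in principle be mapped onto a handful of CPU cores, hundreds of GPU shader cores, or a pipeline built in FPGA fabric.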

Altera's involvement in OpenCL to date has been multi-year and incremental. According to Product Marketing Director Alex Grbic and Software and DSP Product Marketing Manager Albert Chang, the company joined the Khronos Group in 2010 and began internal development. In 2011, having completed development of an initial OpenCL SDK proof-of-concept, Altera began working with approximately 30 customers to refine it, publicizing its efforts in November of that same year.

In April 2012, the company trumpeted the success that one such early adopter partner, goHDR, had achieved with the toolset to date. And in August, Altera unveiled a formalized and expanded Early Access program, including an updated OpenCL SDK that became available in November. However, Altera is still being selective about who gets access to the SDK; the company plans to ramp up to supporting approximately 100 customers through the first half of this year, with widespread availability dependent on the outcome of this expanded access.

Altera's approach is oriented towards high-performance computing applications. In this context, OpenCL is a sensible choice of language, and Altera has built an innovative approach around it. By "freezing" many elements of the architecture, Altera has created a "sandbox" in which users can utilize the FPGA without doing any RTL (register-transfer level) design. This is different from the Xilinx approach, which is to use HLS for the algorithmic portions of the application, with RTL (i.e., HDL) design used for the remainder of the FPGA design.

Although FPGAs (and associated toolsets) from Xilinx and other programmable logic suppliers are clearly in Altera's gunsights with the OpenCL SDK, Altera also seems to be targeting the competition from GPUs, which are increasingly being used for non-graphics workloads in high-performance computing applications. This GPU focus is clear from two of the three case studies that Altera shared with BDTI during a recent briefing (the third being the above-mentioned goHDR example, wherein the customer was able to port relevant portions of its C code to the FPGA for hardware acceleration in less than a week's time).

The first example addressed the comparative performance and power consumption of executing Monte-Carlo Black-Scholes simulations, commonly used in the financial marketplace to calculate the value of trading options with multiple sources of uncertainty, on three different processing platforms (Table 1).
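For readers unfamiliar with the workload, the following is a minimal sketch of a Monte-Carlo option pricing loop under Black-Scholes assumptions. It is not Altera's benchmark code, which was not shared; it prices a single-asset European call (a simpler case than the multi-factor problems alluded to above), and the function names and parameter values are illustrative assumptions. A closed-form Black-Scholes price is included only as a sanity check on the estimate.

```python
import math
import random


def monte_carlo_call(spot, strike, rate, volatility, years, n_paths, seed=0):
    """Estimate a European call price by simulating terminal asset prices."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    drift = (rate - 0.5 * volatility ** 2) * years
    diffusion = volatility * math.sqrt(years)
    payoff_sum = 0.0
    for _ in range(n_paths):
        # Each path is independent of every other path, which is why this
        # workload maps well onto parallel hardware.
        z = rng.gauss(0.0, 1.0)
        terminal_price = spot * math.exp(drift + diffusion * z)
        payoff_sum += max(terminal_price - strike, 0.0)
    # Discount the average payoff back to today.
    return math.exp(-rate * years) * payoff_sum / n_paths


def black_scholes_call(spot, strike, rate, volatility, years):
    """Closed-form Black-Scholes price of the same European call."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * volatility ** 2) * years) / (
        volatility * math.sqrt(years)
    )
    d2 = d1 - volatility * math.sqrt(years)

    def norm_cdf(x):
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    return spot * norm_cdf(d1) - strike * math.exp(-rate * years) * norm_cdf(d2)


# Illustrative parameters only (assumed, not taken from Altera's case study).
estimate = monte_carlo_call(100.0, 100.0, 0.05, 0.2, 1.0, n_paths=100_000)
exact = black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0)
print(f"Monte-Carlo estimate: {estimate:.3f}, closed form: {exact:.3f}")
```

Each loop iteration corresponds to one "simulation" in the sense used in the throughput figures, and because the iterations do not depend on one another, they can be spread across many cores or replicated hardware pipelines.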

 

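|                        | Quad-core CPU | GPU           | Stratix IV FPGA |
|------------------------|---------------|---------------|-----------------|
| Number of cores        | 8             | 448           | Not applicable  |
| Simulations per second | 240 million   | 2,100 million | 2,200 million   |
| Power (W)              | 130           | 215           | 21              |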

Table 1. Monte-Carlo Black-Scholes simulation results (per Altera)
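Table 1 reports throughput and power separately; dividing one by the other gives an energy-efficiency figure (simulations per joule). The short sketch below does that arithmetic using only the numbers Altera supplied. It assumes the power and throughput figures were measured under the same conditions, which Altera did not confirm, so the ratios should be read as rough indications rather than measurements.

```python
# Figures from Table 1 (per Altera): (simulations per second, power in watts).
table1 = {
    "Quad-core CPU": (240e6, 130.0),
    "GPU": (2100e6, 215.0),
    "Stratix IV FPGA": (2200e6, 21.0),
}


def sims_per_joule(sims_per_second, watts):
    """Simulations per second divided by joules per second (watts)."""
    return sims_per_second / watts


efficiency = {name: sims_per_joule(*figures) for name, figures in table1.items()}

for name, value in efficiency.items():
    print(f"{name}: {value / 1e6:.1f} million simulations per joule")
# Quad-core CPU: 1.8 million simulations per joule
# GPU: 9.8 million simulations per joule
# Stratix IV FPGA: 104.8 million simulations per joule
```

On these numbers the FPGA's advantage over the GPU comes almost entirely from power (roughly a tenth of the GPU's draw) rather than from raw throughput, which differs by only about 5 percent.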

The results are seemingly impressive for the FPGA-based approach in contrast to the CPU and GPU alternatives, but keep a few qualifiers in mind. Altera was unwilling to share any details about the "quad-core CPU" aside from the total number of processor cores (eight: this fact presumably implies HyperThreading virtual core support in the CPU) and its power consumption (130 W). What was its clock speed? Architecture? Cache sizes? Supplier? Did it offer an integrated and OpenCL-compatible GPU core...and if so, was the GPU used? And was the quoted power consumption an average or peak value?

Similarly, we know nothing about the GPU aside from its shader core count (448) and power draw (215 W...peak or average?). Supplier? Product name? Clock speed? We also don't know which Stratix IV FPGA family member Altera was comparing the CPU and GPU against. And for all three silicon platforms, we don't know what the power consumption profile over time looked like. Speaking of which, a system using the FPGA-based approach will require a CPU, albeit one less powerful (therefore lower power) than otherwise needed.

Keep in mind, too, that the CPU and GPU suppliers also might have been able to create more efficient Monte-Carlo Black-Scholes algorithms for their respective processors than Altera did. And the same information insufficiencies hinder a full appreciation of Altera's other documented case study, which involved reviewing an incoming text stream of consecutive documents and best-match filtering each of them against a database of already-logged documents (Table 2).

 

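|                                                                   | Quad-core CPU | GPU  | Stratix IV FPGA |
|-------------------------------------------------------------------|---------------|------|-----------------|
| Number of cores                                                   | 8             | 448  | Not applicable  |
| Performance/watt (millions of transactions per second/Joules)     | 15.9          | 15.1 | 83.6            |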

Table 2. Text search/filtering results (per Altera)

None of these qualifiers is intended to cast definitive doubt on the Altera FPGA-based OpenCL implementation in each case, only to point out the information "holes" that preclude a comprehensive evaluation of the three processing approaches on these particular case studies, not to mention a reliable extrapolation of the results to other processor alternatives and other application scenarios.

As Figure 1 shows, the application, complete with OpenCL extensions, runs through an Altera SDK-augmented compiler that in parallel generates an executable for the system's x86 processor and a bitstream for the FPGA. However, given that OpenCL is a C-based scheme, one might reasonably wonder what implementation distinctions exist between Altera's approach and Xilinx's C-based HLS alternative. Altera points out that optimized HLS-based implementations often involve the insertion of Xilinx-proprietary "intrinsics" in the C source code, thereby not resulting in "portable" software. However, Altera's current OpenCL approach is also somewhat "rigid", although in a different way.

Figure 1. Altera's SDK works in conjunction with an OpenCL-compliant compiler to generate both executable software for the system CPU and a bitstream (including required memory and I/O function blocks) for the FPGA

As mentioned earlier, Altera’s OpenCL implementation is currently x86-specific from a system CPU standpoint. The interface between the CPU and FPGA is also currently restricted to PCI Express, and the FPGA portion of the design must also include particular memory controllers; the Altera SDK automatically creates these fixed function blocks in the process of generating the FPGA design's bitstream. Altera is working with the Khronos Group to broaden the Altera OpenCL SDK's range of supported CPUs, system I/O interfaces, and the like.

Given the company's current product line and announced future-product roadmap, the desire for enhanced standards flexibility is understandable. Currently, for example, the Altera OpenCL SDK only supports Altera's Stratix IV FPGA product line, whose family members range in resources up to the following specifications:

  • 1M logic elements
  • 3.9 billion transistors
  • 50 Mb of integrated memory
  • Variable-precision DSP blocks, and
  • High-speed serial transceivers

Altera's Chang and Grbic note that the bulk of Altera's current high-performance computing customers interested in OpenCL-based design methodologies are Stratix IV customers. However, as the company expands the program going forward, it will likely need to expand SDK support to other FPGA product families. This expansion potentially includes the Arria II GX FPGA within the Intel Atom Processor E6x5C Series product line, which is a particularly interesting silicon platform due to the OpenCL-compatible PCI Express links between the FPGA and processor in the product's multi-die package. Since the E6x5C is an Intel offering, to which Altera supplies programmable logic but is otherwise uninvolved, Altera declined to discuss OpenCL plans for it.

Figure 2. Altera's OpenCL plans encompass its integrated "hard" CPU-plus-FPGA products, the first of which (in recent times, at least) launched last month. But at the moment, company officials decline to provide schedule specifics.

Altera was a bit more forthcoming with its OpenCL aspirations for single-die ARM-plus-FPGA products such as the dual-core Cortex-A9-based Cyclone V SoCs that the company introduced in mid-December. As Figure 2, taken from Altera's OpenCL presentation, suggests, an OpenCL-based design approach has notable merit. However, it will require that Altera evolve the OpenCL SDK's support beyond its current x86 CPU-only and PCI Express interconnect-only foundations. As such, Chang and Grbic were unwilling to forecast when OpenCL SoC support might appear, either in a private beta or public form.

ramkumarkoppu Mon, 02/18/2013 - 08:31

It is an interesting article. I agree that very high-performance computing customers, such as financial institutions and physics and bio-medical simulation users, tend to use Altera's Stratix family of FPGAs, but I do feel that for other kinds of applications, like data compression, cryptography, and video/audio/image processing, low-cost Cyclone family FPGAs would suffice. These customers also require OpenCL support to accelerate their algorithms. They need not have PCIe, and may need USB, Ethernet and other interfaces as well.

Another possibility would be to use OpenCL on top of a modified version of an existing C-to-HDL framework like Vivado. I know a few changes are required in existing C-to-HDL tools (Vivado, ImpulseC...) to make this work, but these tools already support almost all families of Altera and Xilinx FPGAs, with proprietary extensions to the C language to support fine-grained parallelism, like Altera's OpenCL implementation. So these tools are better candidates to adapt for OpenCL on FPGA development. This would also create competition in the OpenCL FPGA market, as well as the GPU market, to provide better solutions to customers.

I would like to see a single OpenCL framework implementation to work on both Altera and Xilinx to support all families of their FPGAs.
