In May, Texas Instruments disclosed its first implementation of the ARM Cortex-A8 processor. Started in 2003 and reportedly involving a team of forty-five engineers in TI’s wireless handset chip unit, the implementation represents a massive effort considering the Cortex-A8 is just one element of the complex SoCs designed by TI for cell phones. Typically, ARM cores are implemented using logic synthesis exclusively. For the Cortex-A8, TI instead took the very labor-intensive path of hand crafting many parts of the implementation to squeeze more performance and energy efficiency out of the core. TI’s Cortex-A8 implementation will first be used in the company’s OMAP3 family of chips. Announced in February, OMAP3 is a family of application processors that mainly target multimedia-rich “smart phones.” With OMAP3, TI aims to boost performance on an array of smart phone applications such as video playback, gaming, and email.
In TI’s implementation of the Cortex-A8, three different design methodologies were used: random logic synthesis, structured data path, and full custom. Random logic synthesis is the most automated methodology and is the approach used exclusively for most processor core implementations on SoCs. With TI’s structured data path methodology, logic cell placement is defined by hand at the gate level. For full-custom blocks, circuits were designed at the transistor level with all routing specified by designers. TI also implemented a full-custom clock distribution grid for the Cortex-A8, with repeaters, clock buffers, and power switches allocated to specific bays interspersed throughout the chip.
Figure 1. A floorplan of TI’s Cortex-A8 implementation, showing which design methodology was used for each portion. Green and dark blue indicate random logic synthesis (RLS); light blue indicates full custom; white with blue tiling indicates structured data path (SDP) or a mix of SDP and RLS.
TI hasn’t released any detailed performance numbers, but claims the Cortex-A8 will generally have a clock rate roughly twice that of the ARM11 in OMAP2 chips. The latest OMAP2 parts clock in at 330 MHz, which translates to a clock rate of roughly 660 MHz for the Cortex-A8 in OMAP3. While the hand-crafted design approach contributes significantly to this higher clock rate, other contributing factors include microarchitectural changes in the Cortex-A8, such as its 13-stage pipeline, and the migration to a 65 nm process (OMAP2 parts use a 90 nm process). TI is also reporting a 1.5x improvement in instructions per cycle for a broad range of integer benchmarks and applications, and a 2x to 8x improvement on streaming media kernel benchmarks, fueled mainly by the processor’s NEON media-processing instruction set extensions. BDTI has not yet had the opportunity to do an in-depth analysis of the Cortex-A8, but has benchmarked the ARM1136. The ARM1136 BDTI DSP Kernel Benchmarks™ results show that the ARM1136 is about as fast as mid-range DSPs. This suggests that the Cortex-A8 is likely to be fast enough to compete with all but the fastest DSPs. For more on the Cortex-A8, see the October edition of Inside DSP.
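As a rough illustration of how TI's claims combine, the back-of-the-envelope sketch below multiplies the quoted clock-rate ratio by the quoted instructions-per-cycle gain. The combined integer speedup is our own estimate, not a figure TI has published, and it assumes that throughput scales as clock rate times instructions per cycle with nothing else (memory system, compiler, workload mix) changing. The function names are hypothetical and used only for this example.

```python
# Figures quoted in the article (TI claims, not measured results).
OMAP2_CLOCK_MHZ = 330   # clock rate of the latest OMAP2 (ARM11) parts
CLOCK_RATIO = 2.0       # Cortex-A8 clock is "roughly twice" the ARM11's
IPC_RATIO = 1.5         # reported instructions-per-cycle gain on integer code


def estimated_clock_mhz(base_mhz, clock_ratio):
    """Scale a baseline clock rate by a claimed ratio."""
    return base_mhz * clock_ratio


def estimated_speedup(clock_ratio, ipc_ratio):
    """Combine clock and IPC gains.

    Assumption (ours, not TI's): throughput is proportional to
    clock rate times instructions per cycle, all else being equal.
    """
    return clock_ratio * ipc_ratio


cortex_a8_clock = estimated_clock_mhz(OMAP2_CLOCK_MHZ, CLOCK_RATIO)
integer_speedup = estimated_speedup(CLOCK_RATIO, IPC_RATIO)

print(f"Estimated Cortex-A8 clock: {cortex_a8_clock:.0f} MHz")
print(f"Estimated integer speedup over ARM11: {integer_speedup:.1f}x")
```

Under these assumptions the estimate works out to roughly a 3x gain on integer code. The 2x to 8x streaming media figure is left out of the calculation because the article does not say whether it already includes the clock-rate increase.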
It is interesting to note that while the Cortex-A8 has the power to handle demanding multimedia apps, OMAP3 parts also include a powerful C64x+ DSP core as well as image and video accelerators intended to handle the most common multimedia tasks, such as audio and video decoding. In many applications the required multimedia processing can be done on these co-processors, leaving the multimedia capabilities of the Cortex-A8 and its NEON extensions unused. However, the image and video accelerators are not user-programmable. TI supplies co-processor software for the most common codecs, but expects its customers to use the Cortex-A8 for other codecs and other multimedia functions. Another application in which TI anticipates making heavy use of the Cortex-A8 is gaming: TI believes the Cortex-A8 is particularly suited to game physics computations.
By hand crafting the Cortex-A8 implementation, TI continues to invest in the OMAP platform to meet the needs of next-generation, multimedia-rich phones. TI will face stiff competition, however, from numerous multimedia application processor vendors. Freescale’s i.MX31 and i.MX31L chips, for instance, feature an ARM1136JF-S core running at speeds up to 665 MHz and contain a host of image and video acceleration units. While the ARM1136JF-S will not be as fast on signal processing tasks as the Cortex-A8 with NEON, it will operate at similar clock speeds and will benefit from a larger base of optimized code than is currently available for the Cortex-A8. The OMAP3 platform will most likely continue the success of OMAP2, but it remains to be seen how the extra performance offered by the hand-crafted Cortex-A8 will be utilized. The first OMAP3 part, the OMAP3430, is currently shipping, but is available only to high-volume customers.