BDTI has released independent benchmark results for the Cortex-A8, ARM’s highest-performance processor core, on the BDTI DSP Kernel Benchmarks™ and the BDTI Video Encoder and Decoder Benchmarks™. The results indicate that the Cortex-A8 is significantly faster than its predecessor, the ARM1176, giving it considerable horsepower for its targeted applications. Initially, the Cortex-A8 is being used in chips for high-performance cellular handsets; it also targets set-top boxes, printers, and automotive infotainment applications.
Due to the cost- and energy-sensitive nature of cellular handsets, the Cortex-A8 is intended to be implemented using either the typical logic synthesis methodology (commonly used with licensable processor cores) or a semi-custom design style. Initial licensees creating highly optimized implementations of the Cortex-A8 are using hand-crafted library cells and other physical-level optimizations (as Texas Instruments has done with its OMAP3430 chip) for improvements in both frequency and power over traditional synthesis methodologies. For this reason, BDTI’s benchmark results for the ARM Cortex-A8 do not include clock speed, silicon area, and power consumption data based on BDTI’s standardized conditions for processor cores, and caution should be used in interpreting the Cortex-A8 benchmark results and in comparing the Cortex-A8 to other BDTI-benchmarked cores. (All other BDTI benchmark results for licensable processor cores assume a TSMC CL013G process with ARM Artisan Sage-X library and worst-case temperature, process, and voltage variations.)
The Cortex-A8 achieves a BDTIsimMark2000™/MHz score of 7.6. The ARM1176 achieves a BDTIsimMark2000™ score of 1200 at 335 MHz, or 3.6 BDTIsimMark2000™/MHz. (The BDTIsimMark2000™ is a summary measure of digital signal processing speed, distilled from a processor’s results on the BDTI DSP Kernel Benchmarks™, a suite of 12 key DSP algorithms. A higher score indicates a faster processor.) In other words, the Cortex-A8 is roughly twice as fast as the ARM1176 on typical signal processing tasks at an equivalent clock speed. This boost in horsepower derives mainly from the NEON signal processing extensions, which allow the Cortex-A8 to execute up to four 16-bit multiply-accumulate operations per cycle (versus two for the ARM11). In addition, BDTI expects that, due to licensees’ use of more advanced fabrication processes and hand-optimized layouts, typical Cortex-A8 implementations will achieve somewhat higher clock speeds than typical implementations of other licensable cores, further boosting Cortex-A8 performance relative to the ARM11 and other BDTI-benchmarked cores on signal processing tasks. (See ARM’s and TI’s estimates for Cortex-A8 clock speed.)
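The per-MHz comparison above is simple arithmetic, and a short sketch makes it easy to check. The snippet below uses only the three figures quoted in this article; the variable and function names are illustrative and not part of any BDTI tool.

```python
# Figures quoted in the article.
ARM1176_SCORE = 1200            # BDTIsimMark2000 at the clock speed below
ARM1176_CLOCK_MHZ = 335
CORTEX_A8_SCORE_PER_MHZ = 7.6   # BDTIsimMark2000/MHz


def score_per_mhz(score, clock_mhz):
    """Normalize a benchmark score by the clock speed it was measured at."""
    return score / clock_mhz


arm1176_per_mhz = score_per_mhz(ARM1176_SCORE, ARM1176_CLOCK_MHZ)

# Ratio of the two per-MHz scores: the speed advantage at an equal clock.
ratio = CORTEX_A8_SCORE_PER_MHZ / arm1176_per_mhz

print(f"ARM1176: {arm1176_per_mhz:.1f} BDTIsimMark2000/MHz")
print(f"Cortex-A8 advantage at equal clock: {ratio:.1f}x")
```

Because the score is normalized by clock speed, any clock-speed advantage a given Cortex-A8 implementation achieves would multiply this ratio rather than add to it.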
On the BDTI Video Encoder and Decoder Benchmarks™ (see Table 1 and detailed results), the Cortex-A8 requires 114 MHz for QVGA decoding, less than half the 250 MHz required by the ARM1176. The benchmark results also indicate that, at the clock speeds projected by ARM and TI, the Cortex-A8 will be capable of D1 decode and QVGA encode on the BDTI Video Encoder and Decoder Benchmarks™. (The BDTI Video Encoder and Decoder Benchmarks™ are somewhat more demanding than typical H.264 Baseline Profile implementations.) DSPs tuned for video applications are likely to require even lower clock speeds, however, as indicated by results for NXP’s PNX4103, a system-on-chip that uses a TriMedia VLIW DSP core.
| Benchmark | Resolution, frame rate | ARM ARM1176 | ARM Cortex-A8 | NXP PNX4103 |
|---|---|---|---|---|
| BDTI Video Decoder Benchmark™ | QVGA, 30 fps | 250 MHz | 114 MHz | 67 MHz |
| BDTI Video Decoder Benchmark™ | D1 | N/A | 504 MHz | 290 MHz |
| BDTI Video Encoder Benchmark™ | QVGA, 30 fps | N/A | 421 MHz | 177 MHz |
Table 1. Cortex-A8 performance on BDTI’s Video Encoder and Decoder Benchmarks™
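To read the table in terms of processor loading, divide the required clock speed by the clock speed of the chip in question. The sketch below does this for the QVGA rows of Table 1. The 600 MHz clock is a hypothetical value chosen purely for illustration, not a figure from this article, and the function names are likewise illustrative.

```python
# Clock speed (MHz) each processor needs for the QVGA, 30 fps rows of Table 1.
# None marks an entry the table lists as N/A.
REQUIRED_MHZ = {
    "QVGA decode": {"ARM1176": 250, "Cortex-A8": 114, "PNX4103": 67},
    "QVGA encode": {"ARM1176": None, "Cortex-A8": 421, "PNX4103": 177},
}

# Hypothetical core clock, for illustration only; not taken from the article.
ASSUMED_CLOCK_MHZ = 600


def load_fraction(required_mhz, clock_mhz):
    """Fraction of the core's cycles a task consumes; above 1.0 it does not fit."""
    return required_mhz / clock_mhz


def relative_requirement(task, core, baseline):
    """Clock speed `core` needs for `task`, as a fraction of what `baseline` needs."""
    return REQUIRED_MHZ[task][core] / REQUIRED_MHZ[task][baseline]


decode_vs_arm11 = relative_requirement("QVGA decode", "Cortex-A8", "ARM1176")
print(f"Cortex-A8 needs {decode_vs_arm11:.0%} of the ARM1176's clock for QVGA decode")

for task, row in REQUIRED_MHZ.items():
    load = load_fraction(row["Cortex-A8"], ASSUMED_CLOCK_MHZ)
    print(f"{task}: {load:.0%} of an assumed {ASSUMED_CLOCK_MHZ} MHz Cortex-A8")
```

The same calculation applies to any row and any candidate clock speed; a result above 100% means the task would not fit on that implementation.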
Even discounting its potentially higher clock speed, the Cortex-A8 significantly outperforms earlier ARM cores on signal processing and multimedia tasks. This impressive performance will likely open up new markets for ARM. To enable effective use of the NEON extensions, ARM offers a vectorizing compiler and is also planning optimized NEON software component libraries.
To judge the Cortex-A8’s suitability for a new design, however, prospective users will need to weigh several factors carefully. Perhaps the most important of these is the methodology to be used to create a physical implementation of the core. Although a carefully tuned Cortex-A8 layout can yield impressive performance (as figures from ARM and TI suggest), it will require significant engineering effort and will not be appropriate for all designs. In addition, the implementation approach chosen will affect silicon area and energy efficiency, both crucial factors in evaluating a core’s suitability for an application. Finally, users need to assess how they will use the Cortex-A8’s NEON instructions, which supply much of the core’s DSP and multimedia horsepower.