ARM recently announced the Cortex-A9, a high-performance licensable application processor that extends ARM’s push into the multi-core arena. The Cortex-A9 provides support for multi-core implementations via ARM’s “MPCore” technology, which includes hardware for maintaining cache coherency and managing memory transfers. MPCore can be used with up to four Cortex-A9 cores in a symmetric multiprocessor (SMP) configuration. The Cortex-A9 is also available as a single processor (i.e., without the additional MPCore hardware). Target applications include mobile phones, consumer electronics, and automotive “infotainment” systems, among others.
Running multiple cores at a relatively low clock frequency is often more energy-efficient than running a single core at a higher frequency, so multi-core solutions can be attractive for embedded applications that require strong performance coupled with low power. ARM’s MPCore technology has previously been used with the ARM11 processor (licensees include Renesas and Nvidia, among others), and ARM envisions the Cortex-A9 as a lower-power, higher-performance alternative to the ARM11.
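The energy argument above follows from the standard approximation that dynamic (switching) power scales as P ≈ C·V²·f: a lower clock frequency usually permits a lower supply voltage, and the voltage term is squared. The Python sketch below works through that arithmetic. The capacitance, voltages, and frequencies are hypothetical values chosen for illustration, not ARM data, and the model assumes the workload splits evenly across two cores and ignores leakage power.

```python
def dynamic_power(capacitance, voltage, frequency_hz, cores=1):
    """Approximate switching power in watts: P = cores * C * V^2 * f."""
    return cores * capacitance * voltage ** 2 * frequency_hz


# Hypothetical values for illustration only (not ARM figures).
C_EFF = 1.0e-9  # effective switched capacitance per core, in farads

# One core at 1 GHz, assumed to need a 1.2 V supply.
single = dynamic_power(C_EFF, voltage=1.2, frequency_hz=1.0e9)

# Two cores at 500 MHz each, assumed to run at a reduced 0.9 V supply.
dual = dynamic_power(C_EFF, voltage=0.9, frequency_hz=0.5e9, cores=2)

print(f"single core: {single:.2f} W")
print(f"two cores:   {dual:.2f} W")
print(f"ratio:       {dual / single:.2f}")
```

Under these assumptions the two slower cores deliver the same aggregate clock cycles per second for roughly 56% of the dynamic power. The saving comes entirely from the voltage reduction, so it disappears if the lower frequency does not allow a lower supply voltage, or if the software cannot keep both cores busy.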
According to ARM, the Cortex-A9 is synthesizable and can execute at 1 GHz in a generic TSMC 65 nm process, though power-sensitive applications that use multiple cores will likely clock them at a lower rate. ARM has not disclosed the area required for a 1 GHz implementation of the core, but says that the core occupies 1.5 mm^2 in a lower-speed 65 nm implementation. (This area figure is for a single Cortex-A9 core only, and excludes caches and optional co-processors.)
A block diagram of a four-core Cortex-A9 MPCore implementation is shown in Figure 1.
Figure 1. Cortex-A9 MPCore processor structure, four-core implementation. The floating-point unit (FPU) and NEON co-processor are optional. (Figure courtesy of ARM.)
The Cortex-A9 is quite similar to the earlier Cortex-A8 (both cores implement the ARMv7 instruction set architecture), but there are a few differences. Like the A8, the A9 implements a dual-issue superscalar pipeline, but the A9’s pipeline is a shorter “dynamic length” 8-stage pipe (compared to 13 stages on the Cortex-A8). In addition, the new core’s pipeline supports speculative out-of-order execution. Out-of-order execution enables the processor to dynamically reorder instructions to improve performance (to avoid stalls due to instruction latencies, for example). The processor also predicts which branches will be taken and speculatively executes the code that is most likely to be chosen, which reduces branch penalties.
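To make the stall-avoidance idea concrete, here is a toy Python model that issues one instruction per cycle, either strictly in program order or out of order (oldest ready instruction first). It is only a sketch: it does not model the Cortex-A9’s dual-issue pipeline, branch prediction, or speculation, and the instruction names, registers, and latencies are invented for illustration. It also assumes each register is written by at most one instruction.

```python
from collections import namedtuple

# name: label, dest: register written, srcs: registers read,
# latency: cycles from issue until the result is available.
Instr = namedtuple("Instr", "name dest srcs latency")

# Hypothetical four-instruction program; r4 and r5 are inputs ready at cycle 0.
PROGRAM = [
    Instr("load r1",        "r1", (),           3),
    Instr("add  r2 <- r1",  "r2", ("r1",),      1),
    Instr("mul  r3 <- r4*r5", "r3", ("r4", "r5"), 2),
    Instr("sub  r6 <- r3",  "r6", ("r3",),      1),
]


def run(program, out_of_order):
    """Return the cycle at which the last result becomes available."""
    # Results of not-yet-issued instructions are unavailable (infinity).
    ready_at = {instr.dest: float("inf") for instr in program}
    pending = list(program)
    cycle = 0
    finish = 0
    while pending:
        # In order: only the oldest instruction may issue.
        # Out of order: any pending instruction may issue, oldest first.
        candidates = pending if out_of_order else pending[:1]
        for instr in candidates:
            if all(ready_at.get(src, 0) <= cycle for src in instr.srcs):
                ready_at[instr.dest] = cycle + instr.latency
                finish = max(finish, cycle + instr.latency)
                pending.remove(instr)
                break  # single issue: at most one instruction per cycle
        cycle += 1
    return finish


in_order_cycles = run(PROGRAM, out_of_order=False)
out_of_order_cycles = run(PROGRAM, out_of_order=True)
print("in-order:    ", in_order_cycles, "cycles")
print("out-of-order:", out_of_order_cycles, "cycles")
```

In this example the in-order machine stalls behind the add while it waits for the load, whereas the out-of-order machine uses those idle cycles to start the independent multiply. The benefit depends entirely on how much independent work sits near the stalled instruction, which is also why execution time becomes harder to predict.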
Out-of-order execution is unusual in embedded processors because of the silicon cost, and it’s unusual in processors used for real-time signal processing because it increases execution-time variability. But embedded core vendors are under pressure to bump up their processing horsepower, and ARM isn’t alone in going the out-of-order route: MIPS’s new high-performance superscalar 74K core uses this technique, too.
As with the Cortex-A8, the key to the Cortex-A9’s digital signal processing capabilities is the optional NEON co-processor, which adds DSP-oriented SIMD capabilities (such as quad 16-bit multiplications) along with floating-point capabilities. If floating-point is all that’s needed, the core can be configured to include just a floating-point unit (FPU), which provides single- and double-precision floating-point capabilities. ARM has not disclosed the area required by NEON or the FPU; NEON in particular is likely to significantly increase the area required by the core. ARM also has not disclosed apples-to-apples area information for the A8 and A9, though it appears that the A9 will be smaller; previous data for the Cortex-A8 indicated that it consumes “less than 3mm^2” in a 65 nm TSMC process, excluding caches.
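As a rough illustration of why a quad 16-bit multiply matters for DSP code, the Python sketch below models a single SIMD operation that multiplies four 16-bit lanes at once and uses it in a dot product, the core of filters and correlations. This is a conceptual model only: the function names are hypothetical, it does not correspond to any particular NEON instruction, and keeping the products at full width is an assumption made here for simplicity.

```python
def check_int16(values):
    """Raise if any value does not fit in a signed 16-bit lane."""
    for v in values:
        if not -32768 <= v <= 32767:
            raise ValueError(f"{v} does not fit in a signed 16-bit lane")


def quad_mul16(a, b):
    """Model one SIMD operation: four 16-bit x 16-bit multiplies at once.

    Products are returned at full width (assumption for this sketch).
    """
    if len(a) != 4 or len(b) != 4:
        raise ValueError("quad_mul16 operates on exactly four lanes")
    check_int16(a)
    check_int16(b)
    return [x * y for x, y in zip(a, b)]


def dot_product_simd(samples, coeffs):
    """Return (dot product, number of quad multiplies issued)."""
    if len(samples) != len(coeffs) or len(samples) % 4 != 0:
        raise ValueError("inputs must have equal length, a multiple of four")
    total = 0
    simd_ops = 0
    for i in range(0, len(samples), 4):
        total += sum(quad_mul16(samples[i:i + 4], coeffs[i:i + 4]))
        simd_ops += 1
    return total, simd_ops


samples = [1, 2, 3, 4, 5, 6, 7, 8]
coeffs = [8, 7, 6, 5, 4, 3, 2, 1]
result, simd_ops = dot_product_simd(samples, coeffs)
print("dot product:", result)
print("quad multiplies:", simd_ops, "vs scalar multiplies:", len(samples))
```

For eight taps, the modeled SIMD version issues two multiply operations where a scalar loop would issue eight. Real speedups are smaller than this 4:1 ratio suggests because loads, accumulation, and data alignment also cost cycles.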
BDTI has benchmarked the signal processing performance of the Cortex-A8 with NEON, but has not yet benchmarked the Cortex-A9. However, it’s likely that the per-cycle throughput of the two cores will be similar on DSP tasks. Thus, the Cortex-A9’s smaller size and support for multi-core implementations are likely to be the key differentiators relative to the Cortex-A8.
The Cortex-A9 is likely to be compared to MIPS’s 74K core, which is also superscalar, synthesizable, and expected to run at 1 GHz. According to MIPS, a 1 GHz 74K core occupies 1.7 mm^2 in a TSMC 65 nm process, excluding caches. This is slightly larger than the area ARM has reported for the base Cortex-A9 core, but such small differences are not particularly meaningful since memory usually takes up much more space on a chip than the core itself. (The addition of NEON, however, will probably increase the ARM core’s area considerably.) The MIPS 74K doesn’t have explicit hardware support for multi-core implementations, though MIPS has stated that it plans to add multi-threading capabilities to the 74K, as it did with the 34K.
As mentioned earlier, BDTI has not benchmarked the Cortex-A9 or 74K, but based on our benchmark results for the Cortex-A8 with NEON and MIPS’s estimates of the 74K’s signal processing horsepower, it appears that a Cortex-A9 with NEON will be significantly faster on typical DSP algorithms. It’s unclear how the two cores will compare if NEON is excluded.
Fundamentally, the success of the Cortex-A9 multi-core strategy depends on whether the embedded world is ready to embrace a multi-core SMP solution and accept the associated programming challenges. In applications where the workload is well-defined, well-understood, and fairly static, more traditional performance boosters (like application-specific co-processors) will probably be more energy- and area-efficient. For applications where the workload is less predictable, however, ARM’s SMP approach may prove quite attractive.