Can ARM11 Handle DSP?

Submitted by BDTI on Wed, 04/04/2007 - 15:00

ARM’s general-purpose processor cores have long been used alongside DSP processors in products like cell phones, where the ARM core typically handles tasks like packet processing, user interface, and overall control, and the DSP handles the computationally demanding signal processing.  But as ARM has gradually upgraded its cores with DSP-oriented features, more chip and system designers are considering whether to use an ARM core as a DSP engine. The question is, how much signal processing work can an ARM core handle?

In this article, we present our independent benchmark results for members of ARM’s ARM11 core family, and look at these cores’ performance on common DSP algorithms and on video decoding. We analyze and compare the ARM11's performance to that of earlier ARM cores and selected DSP processors.

ARM Moves Towards DSP

ARM cores are low-cost CPUs that are used in a huge range of embedded products. One of the key advantages of using ARM processors is that they are ubiquitous; lots of people know how to program them, and they have strong third-party support. An early ARM core, the ARM7, was poorly suited to signal processing; it didn't have a single-cycle multiplier, and it was based on a von Neumann memory architecture that didn't allow data and instructions to be retrieved simultaneously.  As a result, its signal processing performance was poor relative to DSP processors. This wasn't surprising, since the ARM7 wasn't designed to handle signal processing; it was designed as a pure CPU.  But it turns out that people have used the ARM7 for simple signal processing anyway. That's because it's readily available, it's already in products, and sometimes it's just easier to implement signal processing on the processor you've already got than to add a new one. Of course, using an ARM7 for signal processing makes sense only for applications with relatively low computational demands; the ARM7 just isn't powerful enough to handle even moderately demanding signal processing.

As signal processing has become increasingly important in embedded applications, ARM has responded by enhancing its architectures with DSP-oriented features. The ARM9 and ARM9E, for example, both incorporate architectural features that, along with their higher clock speeds, help to improve their signal processing capabilities relative to the ARM7—though these processors still offer only modest signal processing performance. 

The ARM11, one of ARM’s newer core families, represents a significant upgrade over earlier ARM architectures in terms of signal processing features. Of particular interest are the new 8-bit and 16-bit SIMD (single instruction, multiple data) instructions, which are intended to accelerate video and audio processing. (ARM refers to these instructions as media processing extensions.) The ARM11 includes a 64-bit data bus (vs. 32 bits on earlier ARM processors) to help make use of these additional computational capabilities. The ARM11 also has a deeper pipeline, enabling higher clock rates.  As we explain in more detail below, all of these factors combine to give the ARM11 significantly better signal processing performance than earlier ARM cores.

Assessing the ARM11

BDTI has assessed the digital signal processing performance of the ARM11 using two of its benchmark suites. The first results we'll present are for the BDTI DSP Kernel Benchmarks, which consist of 12 common DSP algorithms (such as FIR filters and FFTs). Each algorithm is carefully optimized on each target processor, mirroring how such functions are typically implemented in signal processing applications.  A processor's results from BDTI's 12 kernel benchmarks are used to evaluate its speed, energy efficiency, memory efficiency, and cost performance. These results are also used to generate the processor’s BDTImark2000™ score. The BDTImark2000 is an overall DSP speed metric, with a higher score indicating a faster processor. Figure 1 illustrates the ARM1176 and ARM1136 results on the BDTImark2000, alongside those of several other processors. (BDTImark2000 scores for additional processors are available on BDTI's web site, at /Services/Benchmarks/DKB.)
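
To give a sense of what a DSP kernel looks like, here is a minimal sketch of a fixed-point FIR filter in portable C. It is not BDTI benchmark code, and it is not optimized for any ARM core; the function name `fir_q15`, the Q15 data format, and the saturation behavior are our own illustrative assumptions. The inner loop is a chain of multiply-accumulate operations, which is why multiplier throughput and memory bandwidth matter so much for this class of algorithm.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative Q15 FIR filter (hypothetical helper, not BDTI code).
 * Computes y[n] = sum over k of h[k] * x[n - k] for one output sample.
 * The caller must ensure n >= ntaps - 1 so that every x[n - k] exists. */
static int16_t fir_q15(const int16_t *x, size_t n,
                       const int16_t *h, size_t ntaps)
{
    int64_t acc = 0;  /* wide accumulator so the sum cannot overflow */

    for (size_t k = 0; k < ntaps; ++k) {
        /* One multiply-accumulate per tap: Q15 * Q15 gives a Q30 product. */
        acc += (int32_t)h[k] * (int32_t)x[n - k];
    }

    /* Scale the Q30 sum back to Q15. This assumes an arithmetic right
     * shift for negative values, which mainstream compilers provide. */
    acc >>= 15;

    /* Saturate to the 16-bit output range. */
    if (acc > INT16_MAX) acc = INT16_MAX;
    if (acc < INT16_MIN) acc = INT16_MIN;
    return (int16_t)acc;
}
```

With two taps of 0.5 each (16384 in Q15), the filter averages adjacent samples, so inputs of 1000 and 3000 yield 2000.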

Figure 1: Certified BDTImark2000™ results. The BDTImark2000 is an overall measure of processors' signal processing speed, based on the BDTI DSP Kernel Benchmarks. BDTIsimMark2000 scores are measured on simulators rather than on hardware, and may use projected clock speeds.

As shown in Figure 1, the architectural changes and higher clock speed make the ARM11 cores significantly faster than the ARM9E, and slightly faster than the MIPS 24KEc (a DSP-enhanced MIPS core).  The ARM11 isn't the fastest core shown here; not surprisingly, the CEVA-X1620, which is a high-performance DSP core, is much faster.  The ARM11 is, however, within striking range of the speed of TI's low-cost, low-power DSP architecture, the TMS320C55x, which is commonly used in applications like cellular telephones.  And more generally, the ARM11 is fast enough that it's possible to use it as a stand-alone signal processing engine for moderately demanding applications.

The ARM11 has to run at a faster clock rate to achieve similar signal processing performance to the TI TMS320C55x, which suggests that the ARM11's energy efficiency will be lower.  And to match the speed of the CEVA DSP core, the ARM11 would have to run at roughly twice the CEVA's clock speed. Figure 2 shows the relative DSP energy efficiency for the ARM11 cores alongside that of a typical high-performance DSP core.

Figure 2: DSP energy efficiency, based on BDTI's DSP Kernel Benchmark results.

There are two sides to the benchmark results shown in Figures 1 and 2.  In ARM’s favor, it’s clear that the ARM11 cores have sufficient signal processing performance for a significant range of applications.  On the other hand, using a dedicated DSP core can yield higher digital signal processing performance and superior energy efficiency compared to running signal processing tasks on an ARM11.  Of course, factors other than performance and efficiency often play a central role in processor selection decisions. For example, as we described earlier, it's often the case that a CPU is required for non-signal-processing functions (as it is in a cell phone) and it may be desirable to recruit that processor to run the required signal processing tasks rather than adding a separate DSP processor.  Using one core instead of two means a single software development environment, simpler system design, and possibly a simpler programming model. (For many of the same reasons, it's sometimes desirable to replace a DSP core with a second CPU.) The BDTI DSP Kernel Benchmarks results indicate that an ARM11 core can play that dual role in some applications, since its overall signal processing speed is similar to that of low-cost DSP processors.

The ARM11 as Video Engine

The BDTI DSP Kernel Benchmarks results shown above evaluate the ARM11's capabilities for typical digital signal processing tasks, like speech and modem algorithms.  In addition, because video applications are increasingly common, and because CPUs are sometimes used for video tasks, BDTI has implemented its Video Encoder and Decoder Benchmarks on the ARM1176 core.

The BDTI Video Encoder and Decoder Benchmarks are based on modern video compression standards, and are representative of the video encoding and decoding workloads found in a wide variety of mobile, home, and surveillance applications. They are designed to model the computationally demanding aspects of video encoding and decoding while limiting complexity in order to reduce benchmark implementation and optimization effort.  (As a point of reference, the benchmark workload is somewhat more computationally demanding than H.264 Baseline Profile, while being much simpler to implement.)

To produce results that are relevant to real-world applications, BDTI has specified two “operating points” for measuring performance on the BDTI Video Encoder and Decoder Benchmarks:

* QVGA Operating Point. At this operating point the benchmarks process a video sequence at QVGA resolution with a frame rate of 30 fps. This is appropriate for mobile applications such as cell phones that have small LCD displays.

* D1 Operating Point. At this operating point the benchmarks process a video sequence at standard-definition television resolution (also known as “D1” resolution) with a frame rate of 30 fps. This is appropriate for applications such as portable media players (PMPs), digital surveillance equipment, and set-top boxes.
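
To put the two operating points in perspective, the short sketch below compares their raw pixel rates. The type and function names are hypothetical. QVGA is 320x240; for D1 we assume the common 720x480 (NTSC) frame size, which the benchmark description above does not spell out, so treat that figure as an assumption.

```c
#include <stdint.h>

/* Hypothetical description of a video operating point. */
typedef struct {
    uint32_t width;   /* pixels per line  */
    uint32_t height;  /* lines per frame  */
    uint32_t fps;     /* frames per second */
} operating_point;

/* QVGA at 30 fps, as stated for the QVGA operating point. */
static const operating_point QVGA_POINT = { 320, 240, 30 };

/* D1 at 30 fps. The 720x480 frame size is an assumption (NTSC D1). */
static const operating_point D1_POINT = { 720, 480, 30 };

/* Raw pixels that must be produced per second at an operating point. */
static uint64_t pixels_per_second(operating_point op)
{
    return (uint64_t)op.width * op.height * op.fps;
}
```

Under these assumptions the QVGA point works out to 2,304,000 pixels per second and the D1 point to 10,368,000, or 4.5 times as many. Pixel rate is only a rough proxy for decoder workload, but it indicates how much more demanding the D1 operating point is.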

BDTI Video Decoder Benchmark QVGA results for the ARM1176 and a high-performance media processor, the NXP PNX4103, are shown in Figure 3.

Figure 3: BDTI Video Decoder Benchmark™ Results for QVGA operating point. Utilization percentages are affected by both the amount of relevant work the processor can accomplish per cycle and by the clock speed.

The benchmark results for the BDTI Video Decoder Benchmark are presented in terms of processor utilization. The PNX4103 running at 350 MHz is roughly 20% loaded at the QVGA operating point, while the ARM1176 at 320 MHz is roughly 80% loaded. It's not surprising that the PNX4103 requires lower utilization to execute the benchmark than the ARM11; the NXP chip is a high-performance VLIW-based media processor that was designed for this type of workload. Clearly the NXP processor is capable of more demanding video processing than the ARM11. On the other hand, our benchmark results indicate that the ARM11 can handle QVGA-resolution real-time video tasks—a significant accomplishment for a CPU targeting cost-sensitive applications. (Full results for the BDTI Video Decoder Benchmark are available on BDTI's website, at http://www.BDTI.com/bdtimark/vedb.htm.)
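
Utilization figures are easier to compare once they are converted into clock cycles actually consumed. The sketch below does that arithmetic using the approximate numbers quoted above; the function names are hypothetical, and because the inputs are rounded ("roughly 20%" and "roughly 80%"), the outputs are estimates rather than measured results.

```c
/* Approximate processing load in MHz: the share of the clock that the
 * workload consumes. utilization_pct is a whole-number percentage. */
static double load_mhz(double clock_mhz, unsigned utilization_pct)
{
    return clock_mhz * utilization_pct / 100.0;
}

/* Clock capacity left over for other tasks at the same clock rate. */
static double headroom_mhz(double clock_mhz, unsigned utilization_pct)
{
    return clock_mhz - load_mhz(clock_mhz, utilization_pct);
}
```

With the figures above, `load_mhz(320.0, 80)` gives about 256 MHz for the ARM1176 and `load_mhz(350.0, 20)` gives about 70 MHz for the PNX4103. In other words, the ARM1176 spends between three and four times as many cycles on the same QVGA workload, and has roughly 64 MHz of headroom left at 320 MHz.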

As mentioned earlier, the ARM11 instruction set includes a number of SIMD media-oriented instructions that operate on 16- or 8-bit data; these instructions include add, subtract, sum of absolute differences, pack, and extend-and-add.  As described below, these instructions were of significant benefit in portions of the video benchmarks.
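
As an illustration of what an 8-bit SIMD operation does, the following portable C sketch models a four-lane sum of absolute differences on bytes packed into 32-bit words, in the spirit of the ARMv6 USAD8 instruction. The function names are hypothetical, the code is a behavioral model rather than the benchmark implementation, and on an ARM11 the four-lane step would be a single instruction rather than a loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Behavioral model of a 4-lane unsigned byte SAD on packed words:
 * returns |a0-b0| + |a1-b1| + |a2-b2| + |a3-b3|. */
static uint32_t usad8_model(uint32_t a, uint32_t b)
{
    uint32_t sum = 0;
    for (int lane = 0; lane < 4; ++lane) {
        int32_t x = (int32_t)((a >> (8 * lane)) & 0xFFu);
        int32_t y = (int32_t)((b >> (8 * lane)) & 0xFFu);
        sum += (uint32_t)(x > y ? x - y : y - x);
    }
    return sum;
}

/* SAD over two byte arrays, four pixels at a time.
 * nbytes must be a multiple of 4 in this simplified sketch. */
static uint32_t sad_block(const uint8_t *p, const uint8_t *q, size_t nbytes)
{
    uint32_t total = 0;
    for (size_t i = 0; i < nbytes; i += 4) {
        uint32_t wp, wq;
        memcpy(&wp, p + i, 4);  /* alignment-safe 32-bit loads */
        memcpy(&wq, q + i, 4);
        total += usad8_model(wp, wq);
    }
    return total;
}
```

Because each lane is compared with the matching lane of the other word, the result does not depend on byte order. Sum of absolute differences is a common building block in video motion estimation, which is why a packed-byte version is useful for media code.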

The BDTI Video Decoder Benchmark uses an inverse integer transform that is very computationally demanding. The transform is fundamentally a matrix-matrix multiplication, where pixel scaling, summing, differencing, and matrix transposition are required. A host of ARM11 SIMD instructions were used in this benchmark to maximize throughput on these operations, particularly the SIMD add and subtract instructions (e.g., SADD16, SHADD16, SSUB16, and SADDSUBX, which are all variations on 2-way 16-bit SIMD arithmetic).  The SADDSUBX instruction in particular is quite powerful; the ability to add and subtract operands in parallel results in fewer register data shuffles, and eases the task of keeping all transform coefficients in registers. 
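
The behavior of these 2-way 16-bit instructions can be modeled in portable C as shown below. This is a sketch of the arithmetic only, based on our reading of the ARMv6 instruction definitions: the `_model` function names are hypothetical, the real instructions also update the GE status flags (omitted here), and none of this is taken from the benchmark source. Each 32-bit word holds two signed 16-bit lanes, and results wrap around rather than saturate.

```c
#include <stdint.h>

/* Pack two signed 16-bit lanes into one word (lo in bits 15:0). */
static uint32_t pack16(int16_t lo, int16_t hi)
{
    return ((uint32_t)lo & 0xFFFFu) | (((uint32_t)hi & 0xFFFFu) << 16);
}

/* Sign-extend the 16-bit value held in the low bits of v. */
static int32_t sext16(uint32_t v)
{
    return (int32_t)((v & 0xFFFFu) ^ 0x8000u) - 0x8000;
}

static int32_t lane_lo(uint32_t w) { return sext16(w); }
static int32_t lane_hi(uint32_t w) { return sext16(w >> 16); }

/* SADD16 model: lane-wise add, each lane wraps modulo 2^16. */
static uint32_t sadd16_model(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0xFFFFu;
    uint32_t hi = ((a >> 16) + (b >> 16)) & 0xFFFFu;
    return lo | (hi << 16);
}

/* SSUB16 model: lane-wise subtract, each lane wraps modulo 2^16. */
static uint32_t ssub16_model(uint32_t a, uint32_t b)
{
    uint32_t lo = (a - b) & 0xFFFFu;
    uint32_t hi = ((a >> 16) - (b >> 16)) & 0xFFFFu;
    return lo | (hi << 16);
}

/* SADDSUBX model: the halfwords of b are exchanged, then the top lane
 * is added and the bottom lane is subtracted:
 *   result.hi = a.hi + b.lo,  result.lo = a.lo - b.hi */
static uint32_t saddsubx_model(uint32_t a, uint32_t b)
{
    uint32_t hi = ((a >> 16) + (b & 0xFFFFu)) & 0xFFFFu;
    uint32_t lo = ((a & 0xFFFFu) - (b >> 16)) & 0xFFFFu;
    return lo | (hi << 16);
}
```

For example, with `a = pack16(10, 20)` and `b = pack16(3, 4)`, `saddsubx_model(a, b)` yields lanes of 6 (low) and 23 (high). A sum and a difference come out of one operation with no separate swap step, which is the kind of butterfly pattern that transform code relies on.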

In the inverse integer transform, the matrix row-column transpositions make good use of the ARM11's 16-bit word pack operations (PKHTB and PKHBT).  These pack instructions, which incorporate left and right shifts, replace a sequence of masking, shifting, and logical ORing that would likely be required on previous-generation ARM processors. To make the best use of the ARM11's 8-bit SIMD capabilities, some of the arithmetic portions of the decoder algorithm were recast to enable use of 8-bit values instead of 16-bit. This essentially doubled the processor's computational throughput in some sections of the code. Overall, the ARM11 SIMD media extensions yielded meaningful performance improvements.
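
To show how the pack operations help with transposition, the sketch below models PKHBT and PKHTB in portable C and uses them to transpose a 2x2 block of 16-bit values held in two words. The `_model` names, the `pair32` type, and the 2x2 example are our own illustrative assumptions, not the benchmark code, and the shift amounts are restricted to keep the model simple.

```c
#include <stdint.h>

/* PKHBT model: bottom halfword from rn, top halfword from (rm << lsl).
 * lsl must be in the range 0..31. */
static uint32_t pkhbt_model(uint32_t rn, uint32_t rm, unsigned lsl)
{
    return (rn & 0x0000FFFFu) | ((rm << lsl) & 0xFFFF0000u);
}

/* PKHTB model: top halfword from rn, bottom halfword from rm shifted
 * right arithmetically by asr. This sketch supports asr in 1..31. */
static uint32_t pkhtb_model(uint32_t rn, uint32_t rm, unsigned asr)
{
    uint32_t shifted = rm >> asr;
    if (rm & 0x80000000u) {
        shifted |= ~(0xFFFFFFFFu >> asr);  /* replicate the sign bit */
    }
    return (rn & 0xFFFF0000u) | (shifted & 0x0000FFFFu);
}

/* Two packed words, each holding a pair of 16-bit elements. */
typedef struct {
    uint32_t col0;
    uint32_t col1;
} pair32;

/* Transpose a 2x2 block of 16-bit elements.
 * row0 = [a (low), b (high)], row1 = [c (low), d (high)]
 * col0 = [a (low), c (high)], col1 = [b (low), d (high)] */
static pair32 transpose2x2_16(uint32_t row0, uint32_t row1)
{
    pair32 t;
    t.col0 = pkhbt_model(row0, row1, 16);
    t.col1 = pkhtb_model(row1, row0, 16);
    return t;
}
```

Each model function is itself a mask, a shift, and an OR, which is roughly the sequence a core without pack instructions would have to execute. On the ARM11 each call corresponds to one instruction, so the 2x2 transpose takes two instructions in this sketch.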

Moderate Signal Processing, on a Budget

Based on our benchmark results, it’s clear that high-performance signal processors (such as the CEVA core or the NXP chip) can offer significant performance advantages over the ARM11 in terms of speed and energy efficiency.  But that doesn’t mean that such devices are always a better choice. In embedded systems, many other factors typically come into play in a processor selection decision. The ARM11 has proven itself capable of handling moderately demanding signal processing tasks, including some real-time video decoding with non-trivial frame sizes. This opens up a range of interesting possibilities for system designers, who may be able to (for example) add basic video capabilities to existing low-cost ARM11-based products without having to integrate a separate signal processing engine.  For some applications, the advantages of using a single core will be worth the speed and energy tradeoff.
