Inside DSP on Low Power: Processors for Low-Power Signal Processing

Submitted by BDTI on Tue, 06/01/2004 - 15:00

In many low-power applications, the processor is a major contributor to the overall system energy consumption. Hence, the processor typically plays a key role in determining a product's battery life. The choice of processor also affects many other critical aspects of the system, such as price and performance. In this article we explore processor options for low-power signal processing applications. We begin with a discussion of the criteria to consider when selecting a processor for a low-power signal processing application. Next, we highlight energy-efficient architectural approaches. We then describe the categories of processors used in low-power signal processing applications and explore the strengths and weaknesses of each category. Finally, we take a close look at a new category of high-performance, energy-efficient chips known as "application processors."

Selection criteria
Maximizing battery life is a top priority for most low-power applications. Typically, a processor's run-mode energy efficiency and its standby-mode power consumption play key roles in determining battery life. On-chip integration, particularly the amount of on-chip memory, also matters: off-chip transactions such as memory accesses consume far more power than equivalent on-chip transactions, so greater on-chip integration tends to improve energy efficiency. A processor must also provide the right speed and cost for the application. It is not enough for the processor itself to be inexpensive; it must also enable an inexpensive overall system design, which typically means the processor must include substantial on-chip memory and peripherals.

Energy-saving tactics
Processor designers use a wide variety of techniques to maximize battery life. In this section we briefly discuss three key techniques: processor parallelism, data type selection, and operating mode flexibility. A companion article provides further detail on these techniques and describes others not covered here.

Parallelism plays a particularly important role in determining processor energy efficiency. By using a higher level of parallelism, a processor can accomplish more work per clock cycle. Hence a higher level of parallelism allows a processor to operate at a lower clock speed, which in turn enables use of a lower supply voltage. As explained in "Designing Low-Power Signal Processing Systems," processors can achieve significant energy efficiency gains by operating at lower voltages. Therefore processors with high levels of parallelism tend to achieve better energy efficiency than processors with low levels of parallelism. Of course, higher parallelism only leads to better energy efficiency when the application can make use of the parallelism.
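
The chain of reasoning above (more work per cycle, lower clock, lower voltage, lower energy) can be made concrete with the textbook CMOS dynamic-power model, in which each operation switches a fixed effective capacitance and therefore costs energy proportional to the square of the supply voltage. The sketch below is illustrative only: the workload, the capacitance per operation, and the two voltages are hypothetical values, and the model ignores leakage and any overhead the extra parallel hardware adds.

```python
# Simplified CMOS dynamic-power model: every operation switches a fixed
# effective capacitance, so energy per operation = c_op * vdd**2.
# All numbers below are illustrative assumptions, not data for any real chip.

def clock_hz(workload_ops_per_s: float, ops_per_cycle: int) -> float:
    """Clock rate needed to sustain the workload at a given level of parallelism."""
    return workload_ops_per_s / ops_per_cycle


def dynamic_power_w(workload_ops_per_s: float, c_op_farads: float, vdd: float) -> float:
    """Power = operations per second * energy per operation (c_op * vdd**2)."""
    return workload_ops_per_s * c_op_farads * vdd ** 2


WORKLOAD = 100e6  # 100 million operations per second (hypothetical)
C_OP = 50e-12     # 50 pF switched per operation (hypothetical)

# Hypothetical operating points: the two-way parallel design needs only half
# the clock rate, which we assume lets it run from a lower supply voltage.
DESIGNS = {
    "serial (1 op/cycle)": (1, 1.5),
    "parallel (2 ops/cycle)": (2, 1.1),
}

for name, (ops_per_cycle, vdd) in DESIGNS.items():
    f = clock_hz(WORKLOAD, ops_per_cycle)
    p = dynamic_power_w(WORKLOAD, C_OP, vdd)
    print(f"{name}: {f / 1e6:.0f} MHz at {vdd} V -> {p * 1e3:.2f} mW")
```

Under these assumptions both designs do the same work per second, but the parallel one uses about (1.1/1.5)² ≈ 54% of the dynamic power, purely because of its lower supply voltage.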

Perhaps the simplest means to achieve higher parallelism is by using a processor architecture that can perform multiple operations per clock cycle. One particularly effective way to do this is to encode multiple parallel operations in each instruction. For example, most processors targeting signal processing applications provide a multiply-accumulate (MAC) instruction that performs both a multiply and an addition. A related approach is to design the processor to execute multiple instructions in parallel. This provides more flexibility in terms of the operations that can be performed in parallel, but it comes at the cost of increased energy consumption.
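
A finite impulse response (FIR) filter is the classic example of why a MAC instruction matters: every filter tap is one multiply followed by one add into a running sum. The sketch below is a plain reference implementation, not code for any particular processor; the comment marks the step that a DSP with a single-cycle MAC instruction would perform as one instruction instead of two.

```python
def fir_mac(samples, coeffs):
    """Compute y[n] = sum_k h[k] * x[n - k] for every fully overlapped window.

    Each inner-loop step is one multiply-accumulate (MAC). A processor with a
    MAC instruction can do this step as one instruction; otherwise it needs a
    separate multiply and add.
    """
    taps = len(coeffs)
    out = []
    for n in range(taps - 1, len(samples)):
        acc = 0
        for k in range(taps):
            acc += coeffs[k] * samples[n - k]  # one MAC per tap
        out.append(acc)
    return out


print(fir_mac([1, 2, 3, 4, 5], [1, 1, 1]))  # 3-tap moving sum: [6, 9, 12]
```

An N-tap filter therefore needs N MACs per output sample, which is why MAC throughput is a first-order figure of merit for signal processing.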

Another subtle but important way to increase parallelism is the use of "smart" peripherals such as direct memory access (DMA) controllers. A DMA controller can move data between various processor resources without intervention from the processor, which frees the processor to perform other tasks—or to remain in a low-power standby mode.
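
The benefit of a DMA controller can be estimated with a toy timing model. The sketch below assumes double ("ping-pong") buffering, in which the DMA controller fills one buffer while the processor works on the other, and assumes every transfer completes before its buffer is needed. The cycle counts and frame length are hypothetical.

```python
# Toy timing model (assumed cycle counts, not measurements): the CPU handles
# a frame of `blocks` input blocks. Without DMA it copies each block itself;
# with DMA and ping-pong buffering, the controller fills one buffer while the
# CPU processes the other, so only the compute cycles land on the CPU.

def cpu_busy_cycles(blocks: int, copy_cycles: int, compute_cycles: int, use_dma: bool) -> int:
    """CPU cycles spent per frame; assumes each DMA transfer finishes in time."""
    per_block = compute_cycles if use_dma else copy_cycles + compute_cycles
    return blocks * per_block


def sleep_fraction(busy_cycles: int, frame_cycles: int) -> float:
    """Fraction of the frame the CPU could spend in a low-power standby mode."""
    if busy_cycles > frame_cycles:
        raise ValueError("workload does not fit in the frame period")
    return 1 - busy_cycles / frame_cycles


FRAME = 20_000  # cycles available per frame (hypothetical)
for use_dma in (False, True):
    busy = cpu_busy_cycles(blocks=10, copy_cycles=400, compute_cycles=600, use_dma=use_dma)
    print(f"DMA={use_dma}: CPU busy {busy} cycles, free to sleep {sleep_fraction(busy, FRAME):.0%} of the frame")
```

With these numbers, offloading the copies raises the share of the frame the processor can spend in standby from 50% to 70%.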

In many low-power applications, the processing load varies dramatically over time. Consider the processing requirements of a cell phone: in standby mode, the phone only requires a modest amount of processing to "listen" for an occasional incoming call. In contrast, the phone requires significant processing speed in talk mode. Similarly, the demands on the peripherals and other processor resources may vary over time.

In recognition of these varying demands, processors targeting low-power applications typically offer multiple operating modes. Some processors offer only two modes: a fully-on "active" mode and an idle mode that disables the clock signal to the processor core but leaves the peripherals running. Other processors give the system designer more options. For example, some processors allow the system designer to enable or disable individual peripherals. Such flexibility is desirable because it lets system designers use only the processor resources needed at any given time. 
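
Because the load varies, battery life depends on the time-weighted average power across modes rather than on any single mode. The sketch below computes that average and an idealized battery life. The mode powers, time fractions, and battery capacity are hypothetical, loosely modeled on the cell phone example above, and the battery model ignores self-discharge and voltage droop.

```python
# Average power over time-varying operating modes. All powers, time fractions,
# and the battery capacity are hypothetical values for illustration.

def average_power_mw(modes):
    """modes: iterable of (fraction_of_time, power_mw) pairs summing to 1."""
    modes = list(modes)
    if abs(sum(frac for frac, _ in modes) - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return sum(frac * power_mw for frac, power_mw in modes)


def battery_life_hours(capacity_mwh, avg_power_mw):
    """Idealized battery life: ignores self-discharge and voltage droop."""
    return capacity_mwh / avg_power_mw


phone = [(0.95, 2.0),    # standby: listening for calls (hypothetical 2 mW)
         (0.05, 300.0)]  # talk mode (hypothetical 300 mW)
avg = average_power_mw(phone)
print(f"average {avg:.2f} mW, about {battery_life_hours(3000, avg):.0f} h from 3000 mWh")
```

Changing either entry shows how much each mode contributes; the ability to shut off unused resources lowers the power figure of whichever mode they would otherwise burden.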

Some processors also offer a "voltage scaling" feature, which allows the processor to run from a higher supply voltage at its top speed and from a lower voltage at lower speeds. Because the dynamic energy consumed per operation is proportional to the square of the supply voltage, voltage scaling can provide significant energy savings. (For more information on idle modes and voltage scaling, see "Designing Low-Power Signal Processing Systems.")
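
One way a system might exploit voltage scaling is to pick the slowest available operating point that still meets the current processing load, since a lower clock permits a lower voltage. The sketch below assumes a hypothetical table of frequency/voltage pairs (not taken from any datasheet) and reports dynamic energy per cycle relative to the top operating point, using the V² relationship described above.

```python
# Hypothetical frequency/voltage operating points (not from any datasheet),
# listed from slowest to fastest as (clock in MHz, core supply in volts).
OPERATING_POINTS = [
    (100, 0.9),
    (200, 1.0),
    (300, 1.1),
    (400, 1.2),
]


def pick_operating_point(required_mhz):
    """Return the slowest (and therefore lowest-voltage) point that meets the load."""
    for mhz, volts in OPERATING_POINTS:
        if mhz >= required_mhz:
            return mhz, volts
    raise ValueError("load exceeds the fastest operating point")


def relative_energy_per_cycle(volts, v_ref=OPERATING_POINTS[-1][1]):
    """Dynamic energy per cycle scales with V**2; normalized to the top point."""
    return (volts / v_ref) ** 2


mhz, volts = pick_operating_point(150)
print(mhz, volts, round(relative_energy_per_cycle(volts), 3))  # 200 1.0 0.694
```

In this example a 150 MHz load runs at the 200 MHz point and spends roughly 31% less dynamic energy per cycle than it would at the top voltage. The sketch ignores the time and energy needed to switch between operating points.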

The selection of data types is an underappreciated factor in determining processor energy efficiency. If the processor's native data width is larger than needed, the processor wastes energy operating on "extra" bits. On the other hand, a processor with a small data width requires extra processing steps to handle wide data types.
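
The cost of a data path that is too narrow can be seen by emulating multi-precision arithmetic. The sketch below adds two 32-bit values one 8-bit "limb" at a time, propagating the carry between limbs, and counts the additions. Each loop iteration stands in for one add-with-carry instruction; the function and its parameters are illustrative, not any processor's instruction set.

```python
def add_multiprecision(a, b, width_bits=32, path_bits=8):
    """Add two unsigned width_bits-wide integers on a path_bits-wide data path.

    Returns (sum modulo 2**width_bits, number of limb additions). Each loop
    iteration models one add-with-carry on the narrow data path.
    """
    if width_bits % path_bits:
        raise ValueError("width_bits must be a multiple of path_bits")
    mask = (1 << path_bits) - 1
    result = 0
    carry = 0
    steps = 0
    for shift in range(0, width_bits, path_bits):
        limb = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        result |= (limb & mask) << shift
        carry = limb >> path_bits  # carry into the next limb
        steps += 1
    return result, steps


print(add_multiprecision(0x00FFFFFF, 1))               # (16777216, 4)
print(add_multiprecision(0x00FFFFFF, 1, path_bits=32))  # (16777216, 1)
```

The same 32-bit addition takes four steps on an 8-bit path but one on a 32-bit path. The reverse inefficiency, spending energy on unused upper bits, appears when a wide path processes narrow data.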

Processor categories
In this article, we classify the many processors targeting low-power signal processing into a handful of types. Such classification allows us to make useful generalizations about each type of processor—and gives you a big-picture perspective that will help you zero in on the most appropriate processor types for your application.

Table 1 lists the five processor categories discussed in this article and presents an example processor from each category. Next we describe all five types of processors and discuss the strengths and weaknesses of each type.

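Table 1: The five processor categories for low-power signal processing, with an example processor from each category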

ASICs
To achieve the ultimate in energy efficiency, system designers can create application-specific integrated circuits (ASICs) that implement their algorithms directly in dedicated, fixed-function logic. Often such dedicated logic is accompanied by a microcontroller core to handle overall control and other miscellaneous functions. When dedicated logic is too inflexible or too time-consuming to design, which is often the case, ASIC designers can use more powerful processor cores, rather than dedicated logic, to handle their signal processing tasks. The most energy-efficient type of processor core is the "application-specific instruction processor" (ASIP). These processors are custom-designed for the application at hand. Traditionally, designing an ASIP was a labor-intensive manual process. Today, however, a few companies offer automated tools that generate ASIPs based on parameters supplied by the system designer.

ASIC designers can also achieve good energy efficiency by starting with a processor core and then customizing the core to the needs of their application. Although most licensable processor cores can be customized to a limited extent, the processor cores offered by ARC and Tensilica are specifically designed for customization by the system designer. Both companies' offerings allow the system designer to add custom instructions that can produce massive energy efficiency gains.

Alternatively, ASIC designers can use a processor architecture that has already been specialized for the needs of their application. For example, Philips' CoolFlux licensable DSP core is designed specifically for low-power audio applications such as hearing aids.

Unfortunately, designing an ASIC is typically an expensive process. As a result, ASICs are attractive options only for applications with very high volumes or loose cost constraints.

Microcontrollers
In general, microcontrollers (MCUs) are too slow and energy hungry for low-power signal processing applications. However, some MCUs offer features that make them attractive for applications with modest signal processing demands.

A number of factors limit MCUs' signal processing capabilities. First, many MCUs feature four-bit or eight-bit data paths, and most signal processing applications use data types that are wider than eight bits. Even when eight bits is enough—or when the MCU offers a 16-bit data path—MCUs tend to be inefficient at signal processing tasks. For example, most MCUs do not include a hardware multiplier. In addition, MCU clock speeds are typically limited to the low tens of megahertz.
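
Without a hardware multiplier, an MCU must build each multiplication from shifts, tests, and adds. The sketch below shows the standard shift-and-add method for unsigned operands; each loop iteration corresponds to several instructions on a typical small MCU, so a single 16 × 16-bit multiply costs dozens of instructions where a DSP would use one.

```python
def shift_add_multiply(a, b, bits=16):
    """Multiply two unsigned `bits`-wide integers using only shifts and adds.

    Returns (product, iterations). Each iteration tests one bit of b,
    conditionally adds the shifted multiplicand, and shifts both operands.
    """
    product = 0
    iterations = 0
    for _ in range(bits):
        if b & 1:
            product += a  # add the multiplicand for each set bit of b
        a <<= 1
        b >>= 1
        iterations += 1
    return product, iterations


print(shift_add_multiply(1234, 567))  # (699678, 16)
```

The iteration count depends only on the operand width, which is why multiply-heavy signal processing kernels run slowly on MCUs that lack a multiplier.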

MCUs are also relatively energy hungry. In active mode, typical energy-efficient parts operate at roughly 2 mW/BDTImark2000™. In addition, MCUs rarely offer features that allow fine-grained control of power consumption. For example, MCUs typically cannot disable individual peripherals.

Despite these disadvantages, MCUs are attractive for some energy-constrained signal processing applications. First, some MCUs offer miserly power consumption in standby mode. For example, Texas Instruments claims its MSP430F155 consumes only 3.5 µW at 2.2 V in standby mode and 0.2 µW at 2.2 V in the processor's "off" mode, which preserves the contents of on-chip RAM. In addition, some MCUs can operate over a range of voltages, which enables the processor to continue operating as the battery voltage decays over time.

Most MCUs offer fairly modest amounts of on-chip peripherals and memory. However, MCU families often contain dozens of derivatives. For applications that require only modest integration, this often makes it possible to find an MCU with just the right mix of on-chip integration. And a few MCUs feature DMA controllers, which can dramatically improve the performance of the MCU on signal processing tasks.

Given that MCUs are intended for low-speed, low-cost applications, their modest integration is often appropriate and sufficient. Indeed, this low level of integration allows MCUs to offer very low cost: some MCUs cost less than a dollar in high volumes.

Embedded GPPs
The term "general-purpose processor" (GPP) is commonly used to refer to all manner of microprocessors, ranging from microcontrollers to workstation CPUs. In this article, we focus on the subset of GPPs that target low-cost embedded applications. For example, we include most processors based on the ARM and MIPS architectures in the embedded GPP category, but do not include most processors based on the x86 instruction set.

Embedded GPPs are generally faster and more efficient than MCUs at signal processing tasks. Unlike MCUs, most embedded GPPs do include a hardware multiplier. Also unlike MCUs, most embedded GPPs use 32-bit data paths—and low-power signal processing applications rarely need more than 32 bits of precision. Unfortunately, many embedded GPPs cannot operate efficiently on data sizes less than 32 bits. This can create significant inefficiencies in signal processing applications that need only 16 or 24 bits of precision. Typical embedded GPP clock speeds range from about 70 MHz to about 500 MHz.

Some embedded GPPs are designed specifically for low-power applications. In standby mode, some embedded GPPs consume only a few tens of microwatts: more than the most miserly MCUs, but still quite low. More typical embedded GPPs consume tens of milliwatts in standby mode. In active mode, embedded GPPs typically operate in the range of 0.5 to 2.0 mW/BDTImark2000™, generally better than the roughly 2 mW/BDTImark2000™ of energy-efficient MCUs.

Most embedded GPPs do not allow fine-grained control of power consumption. For example, embedded GPPs typically cannot disable individual on-chip peripherals, and most embedded GPPs offer only one or two idle modes.

The level of on-chip integration varies widely among embedded GPPs. The most highly integrated embedded GPPs offer a wealth of on-chip peripherals. However, embedded GPPs tend to offer fairly modest amounts of on-chip memory, and few embedded GPPs include DMA controllers and other smart peripherals. As might be expected given the broad variation in speed and on-chip integration, pricing for embedded GPPs varies widely. Typical prices vary from well under ten dollars to the multiple-tens-of-dollars range.

DSPs
Many digital signal processors (DSPs) are designed specifically for low-power signal processing applications. As a result, many DSPs offer both good signal processing speed and good energy efficiency. Much of this efficiency comes from DSPs' relatively high levels of parallelism. For example, most DSPs can execute arithmetic operations in parallel with load and store operations. Clock speeds for energy-efficient DSPs range from about 50 MHz to about 750 MHz.

DSPs also offer advantages in terms of data sizes. Most DSPs offer 16-bit data paths that are good matches for the needs of communications applications. Some newer DSPs also offer good support for eight-bit data, which is particularly useful for video processing applications. A few DSPs such as Motorola's DSP563xx family offer 24-bit data paths that are well suited to audio applications.

DSPs often support multiple standby modes. In the lowest-power standby mode, some DSPs consume only a few tens of microwatts of power—on par with the lowest standby power levels found among embedded GPPs.

In active mode, DSPs typically operate in the range of 0.05 to 0.50 mW/BDTImark2000™—much better than the energy efficiency of MCUs and embedded GPPs.

Unlike GPPs, DSPs also tend to include numerous features that allow fine-grained control of power consumption. For example, some DSPs such as the Texas Instruments TMS320C55x allow the programmer to manually disable individual on-chip peripherals and other processor resources that are not in use. Some DSPs also automatically suppress on-chip clock signals in order to disable unused portions of the chip. A few DSPs, most notably the Analog Devices Blackfin family, also support voltage scaling. This feature allows the operating voltage to be raised or lowered in tandem with the operating frequency. This capability can lead to dramatic energy efficiency gains in some applications.

DSPs generally offer high levels of on-chip peripheral integration. Unlike embedded GPPs, most DSPs include DMA controllers and other smart peripherals, and many DSPs also incorporate dozens or even hundreds of kilobytes of on-chip memory. The prices for DSPs are similar to those for embedded GPPs: typical prices vary from about 5 dollars to about 40 dollars.

Application Processors
Application processors are a category of processor intended for use in cell phones, PDAs, and other portable multimedia devices. For an in-depth look at this category of processors, see the Application Processors section at the end of this article.

Which processor is best?
The good news for system developers is that they have more choices than ever before: processors ranging from low-cost microcontrollers to high-speed DSPs now offer strong energy-saving features. The bad news is that selecting the right processor from this expanding pool is a complex, multifaceted process.

As low-power signal processing applications continue to evolve, processors that offer ease of use and flexibility are likely to become particularly appealing choices. Such processors give system designers the ability to fine-tune their applications to achieve maximum battery life.

Assessing Energy Efficiency
In this article, we present energy efficiency in terms of milliwatts per BDTImark2000™; the BDTImark2000 is a summary measure of signal processing speed. All energy efficiency figures include the processor core, on-chip memory, and on-chip peripherals. Figures listed in this article do not include power for off-chip I/O. Although we present only a single number for each processor, a given processor's energy efficiency can vary significantly from application to application. For a detailed discussion of how to assess energy efficiency for a more complex scenario, see "Processor Power Consumption: Beyond the Data Sheet."
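
Because the figures are expressed as power per unit of signal processing speed, a first-order power estimate is simply the efficiency figure multiplied by the speed an application requires. The sketch below applies the active-mode ranges quoted in this article to a hypothetical requirement. It assumes power scales linearly with delivered speed, which is only an approximation, and it does not check whether a given part (for example, a slow MCU) can reach the required speed at all.

```python
# Rough power estimate from an efficiency figure in mW/BDTImark2000.
# Assumes power scales linearly with delivered signal processing speed,
# which is only an approximation; real figures vary by application.
EFFICIENCY_RANGES = {  # (best, worst) mW per BDTImark2000, as quoted in this article
    "DSP": (0.05, 0.50),
    "embedded GPP": (0.5, 2.0),
    "energy-efficient MCU": (2.0, 2.0),  # "roughly 2" in the text
}


def power_range_mw(category, required_score):
    """Return (best-case, worst-case) power in mW for a required BDTImark2000 score."""
    best, worst = EFFICIENCY_RANGES[category]
    return best * required_score, worst * required_score


for category in EFFICIENCY_RANGES:
    lo, hi = power_range_mw(category, 50)  # hypothetical 50-point requirement
    print(f"{category}: {lo:g} to {hi:g} mW")
```

The ranges overlap at their edges, but for the same required speed the DSP range sits well below the others, consistent with the comparison in the main text.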

Application Processors 
Application processors are a category of processor intended for use in cell phones, PDAs, and other portable multimedia devices. Application processors are intended to handle user applications, such as personal information management software and audio and video applications, but not communications processing. (For a discussion of trends in the cell phone and PDA markets, see "Long Live the Battery.") Application processors are capable of running "open" OSs such as Windows CE and the Palm OS.

In this section we divide application processors into five subcategories based on their approaches to handling audio and video tasks—the main signal processing tasks an application processor is likely to handle. We explore the architectural techniques used by each processor group and examine the strengths and weaknesses of each.

Most application processors are based on the ARM architecture. As a result, the following discussion assumes the ARM architecture is used as a starting point for all approaches. Table 2 presents the five subcategories of application processors and lists some vital statistics for an example processor from each.

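Table 2: The five subcategories of application processors, with vital statistics for an example processor from each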

Basic GPP 
The simplest application processor architectures don't add any other processing hardware to the ARM core. That is, they use the ARM core for all processing tasks. Like other embedded GPP architectures, the ARM architecture is not particularly efficient at signal processing tasks. Consequently, application processors that rely on the ARM core for multimedia tasks tend to be less energy efficient than other types of application processors.

Some of these processors achieve very high clock rates: for example, the Samsung S3C2xxx family operates at up to 533 MHz. However, even processors in this subcategory with high clock rates have unimpressive signal processing speed due to the inefficiency of their architectures on signal processing tasks.

Application processors that use a basic GPP architecture tend to offer little on-chip memory; typically, the on-chip memory system is limited to a few tens of kilobytes of cache memory. However, some processors in this subcategory are available in multi-chip packages where the processor and a memory chip are stacked together inside the same package. For example, some members of the Samsung S3C2xxx family are available in a multi-chip package that includes the processor, 32 megabytes of SDRAM, and 32 megabytes of flash ROM.

DSP-Enhanced GPP
Some application processors improve the energy efficiency of the GPP architecture by adding signal processing-oriented features. In some application processors, most notably Intel's new PXA27x family, these enhancements are extensive. The PXA27x family implements Intel's Wireless MMX extensions, which allow the PXA27x to (for example) perform four 16-bit multiply-accumulate operations with a single instruction. Such extensions help this class of processors achieve moderate energy efficiency on a range of signal processing tasks.
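
Extensions of this kind work by packing several narrow values into one wide register and operating on all of them with one instruction. The sketch below models that idea for four 16-bit lanes packed into a 64-bit word and uses it to compute a dot product. The 64-bit packing, lane order, and function names are assumptions made for illustration; they do not reproduce Intel's actual Wireless MMX instructions. Accumulators here are unbounded Python integers, whereas real hardware uses fixed-width accumulators.

```python
import struct


def pack4(values):
    """Pack four signed 16-bit values into one 64-bit integer (lane 0 lowest)."""
    return int.from_bytes(struct.pack("<4h", *values), "little")


def unpack4(word):
    """Split a 64-bit integer back into four signed 16-bit lanes."""
    return list(struct.unpack("<4h", word.to_bytes(8, "little")))


def simd_mac4(acc, x_word, y_word):
    """Model of one packed step: acc[i] += x[i] * y[i] for all four lanes."""
    xs, ys = unpack4(x_word), unpack4(y_word)
    return [a + x * y for a, x, y in zip(acc, xs, ys)]


def dot_product_simd(x, y):
    """Dot product using four-lane MAC steps; returns (result, packed steps)."""
    if len(x) != len(y) or len(x) % 4:
        raise ValueError("inputs must have equal length that is a multiple of 4")
    acc = [0, 0, 0, 0]
    steps = 0
    for i in range(0, len(x), 4):
        acc = simd_mac4(acc, pack4(x[i:i + 4]), pack4(y[i:i + 4]))
        steps += 1
    return sum(acc), steps


print(dot_product_simd([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8))  # (36, 2)
```

An eight-element dot product takes two packed steps instead of eight scalar MACs, which is the source of the speed and efficiency gains described above.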

Some processors with DSP-enhanced GPP architectures are quite speedy. For example, the PXA27x operates at up to 624 MHz. The combination of high clock rate and moderate parallelism gives the PXA27x impressive speed on signal processing tasks.

The memory systems of DSP-enhanced GPPs resemble the memory systems of the basic GPPs. Most DSP-enhanced application processors offer little on-chip memory. However, some DSP-enhanced processors are available in multi-chip packages that include many megabytes of stacked memory.

DSP coprocessors
Even with extensive enhancements, the ARM architecture is at best moderately energy efficient on signal processing tasks. In contrast, DSP architectures are designed for energy efficiency on signal processing tasks. Not surprisingly, some application processors, such as Texas Instruments' OMAP5910, combine an ARM processor core with one or more DSP processor cores. Processors in this subcategory tend to be more energy efficient than processors that use a single GPP core, and in some cases they are also faster.

Processors in this subcategory also tend to offer significant amounts of on-chip memory—a few hundred kilobytes is typical. In addition, some processors in this subcategory are available in a multi-chip package that includes many megabytes of stacked memory.

Although application processors with DSP coprocessors can be speedy and energy efficient, they can also be difficult to program. For example, it can be difficult to determine how to partition tasks across the processor cores.

Programmable accelerators
DSP coprocessors and programmable accelerators are conceptually similar. Both are programmable processing engines designed to accelerate signal processing tasks. However, most DSP coprocessors are designed to handle a wide range of signal processing tasks, while most programmable accelerators are designed to handle only a few specific tasks. As a result of their higher degree of specialization, programmable accelerators tend to offer better speed and energy efficiency than DSP coprocessors.

Although application processors with programmable accelerators can be speedy and efficient, they often use unusual and difficult-to-understand architectures. For example, the NeoMagic MiMagic 6 uses a programmable accelerator called the "Associative Processing Array." This accelerator is an array processor composed of a 512 × 160 matrix of 1-bit processing elements. Programming unusual architectures like the Associative Processing Array can be challenging. Processor vendors typically provide prepackaged multimedia software components to ease application development. However, even when such components are available, the application developer may still need to do some programming of the accelerator.

Perhaps more importantly, the specialization of programmable accelerators means that they have limited flexibility. For example, a programmable accelerator designed for audio tasks is unlikely to be useful for video tasks.

Hardwired accelerators
Hardwired accelerators implement key signal processing tasks directly in fixed-function hardware, rather than in software. As explained in the discussion of ASICs, dedicated hardware offers the ultimate in energy efficiency. Hence, application processors employing hardware accelerators typically offer better energy efficiency than other application processors. This approach also offers straightforward software development: using a hardware accelerator typically requires little more than passing the accelerator a few parameters.

The key drawback to this approach is inflexibility. For example, the Motorola i.MX21 includes hardwired accelerators for H.263 and MPEG-4 encoding and decoding. While these accelerators provide excellent energy efficiency and speed for MPEG-4 decoding, they are not useful for other video algorithms such as H.264, let alone non-video tasks such as audio decoding.
