Inside DSP on Video: Smart Processor Picks for Digital Video

Adam Lins contributed to this article.

Processor vendors offer a dizzying array of options for digital video applications. Selecting the right processor from the myriad options—and understanding the tradeoffs associated with each choice—is key to getting digital video products to market at the right time and at the right price.

In this article we introduce basic concepts needed to select a processor for a digital video application. We begin with a discussion of the processing needs of a typical video application. Next, we present a methodology for structured selection of a processor. We then present a guide to the processor options, and explore the strengths and weaknesses of each option.

Application requirements
To understand the processing needs of a digital video application, consider a typical consumer digital video product. A generalized software model for a consumer digital video product is shown in Figure 1.

A few key software components dominate the composition of a typical consumer digital video application. These components are a “player,” digital rights management (DRM) software, decoder(s) and/or encoder(s) (collectively known as “codecs”), pre-processing and/or post-processing, I/O software, and the operating system. The requirements for these components are summarized in Table 1 and are discussed in detail below.

Table 1

(For a look at the hardware components inside a typical consumer digital video product, see the Panasonic personal video recorder discussed in David Carey's "Under the Hood" column.)

Player
The player is the software module that links user, content, and codec. The player’s capabilities vary depending on the features of the application. In general, the player provides the user interface (perhaps via an on-screen menu system) and controls the DRM, codec, and relevant I/O modules.

The player is typically a highly complex piece of software with a large set of features. Consequently, the player is often one of the largest components in terms of both memory footprint and lines of code. Despite its complexity, the tasks and responsibilities of the player generally require only a modest amount of computation resources—typically less than 10% of the application's total computational load.

Due to their size and complexity—and their generally modest computational requirements—players are often implemented entirely in C or C++ and compiled without extensive optimization. This makes the availability of a high-quality compiler key to good player performance.

Digital rights management
The DRM module is concerned with protecting content, typically using an encryption scheme. Different DRM systems are used for different types of content and by different content distributors. For example, the type of copy protection used on DVDs differs from the type of copy protection used by Internet-based movie distribution services. DRM schemes generally use algorithms with relatively low computational demands, and they operate on the compressed-data side of the codecs where data rates are lower. DRM implementations have small footprints; a few kilobytes of memory for both instruction and data is typical. DRM algorithm providers do not want their algorithms to become known to the public—particularly to hackers who could break their copy-protection schemes. Consequently, DRM software is often provided as binary (that is, precompiled) code. Because binary code can only be used for one specific processor family, the availability of the required DRM software can be an important consideration when selecting a processor.

Video codec
Most modern video products make use of video compression—either decompression only, for playback devices, or compression and decompression, for recording-and-playback devices. Video codecs are typically among the most demanding tasks in a digital video product in terms of computational load, memory use, and bandwidth requirements. Consider the memory bandwidth requirements for MPEG-4 decoding shown in Table 2. Even at modest resolutions and frame rates, the output data requires a bandwidth of multiple megabytes per second. In practice, the total bandwidth requirements are much greater than the values shown in Table 2: for each pixel in the output video stream, multiple memory accesses are typically required during the decompression process.
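To make the output-bandwidth arithmetic concrete, the following Python sketch multiplies frame size, bytes per pixel, and frame rate. The two formats and the 1.5 bytes-per-pixel figure (8-bit YCbCr 4:2:0) are assumptions chosen for illustration; they are not taken from Table 2.

```python
def output_bandwidth_bytes_per_sec(width, height, fps, bytes_per_pixel=1.5):
    """Bytes per second needed just to write the decoded frames.

    bytes_per_pixel=1.5 assumes 8-bit YCbCr 4:2:0 output (an assumption;
    other pixel formats need a different value).
    """
    return width * height * bytes_per_pixel * fps


# Illustrative formats picked for this sketch, not values from Table 2.
formats = {
    "CIF at 30 fps": (352, 288, 30),
    "D1 (NTSC) at 30 fps": (720, 480, 30),
}

for name, (width, height, fps) in formats.items():
    mb_per_sec = output_bandwidth_bytes_per_sec(width, height, fps) / 1e6
    print(f"{name}: {mb_per_sec:.2f} MB/s for output frames alone")
```

Under these assumptions, a CIF stream at 30 frames per second needs roughly 4.6 megabytes per second for output alone, before counting the additional memory accesses made during decompression.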

As shown in Table 2, processing demands increase proportionally as output image resolution and frame rate increase. Even at moderate resolutions and frame rates, video codecs usually must be carefully optimized for best performance. Since video codec software is generally large and complex, the required optimization effort can be high. For this reason, system designers often purchase existing video codec implementations rather than creating their own. Therefore, the availability of off-the-shelf optimized implementations of the relevant video codecs is often a key consideration in processor selection.

Note that the computational load of most video codecs is asymmetric. That is, encoding is typically many times more computationally demanding than decoding. In addition, video encoders are much more complicated and require far more memory bandwidth than video decoders.

Video Pre/Post-Processing
Digital video applications often include "pre-processing" and "post-processing" components that process the video stream prior to compression and after decompression. These components are typically used to improve image quality (for example, removing visible artifacts of the compression process) and to convert the video stream to a different format (for example, converting from one color representation scheme to another). These processing steps are often extremely computationally demanding. In many applications, pre- and post-processing requires more computational power than the video codec does. Although they are computationally intensive, pre- and post-processing algorithms are generally very simple. As a result, these algorithms require little program memory.
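Color-space conversion illustrates why these steps are simple yet demanding. The Python sketch below converts one YCbCr pixel to RGB using the full-range BT.601 coefficients found in JFIF. That choice is an assumption for illustration; a real product would use the matrix required by its video standard and would work on whole frames with fixed-point arithmetic.

```python
def clamp8(value):
    """Round to the nearest integer and clamp to the 8-bit range 0..255."""
    return max(0, min(255, int(round(value))))


def ycbcr_to_rgb(y, cb, cr):
    """Convert one full-range 8-bit YCbCr pixel to RGB (JFIF coefficients)."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return clamp8(r), clamp8(g), clamp8(b)


def conversions_per_second(width, height, fps):
    """Number of pixel conversions needed each second for a video stream."""
    return width * height * fps


print(ycbcr_to_rgb(128, 128, 128))            # mid-gray input pixel
print(conversions_per_second(720, 480, 30))   # D1 at 30 fps (illustrative)
```

The per-pixel code is only a few lines long, which is why program memory needs are small, but at D1 resolution and 30 frames per second it must run more than ten million times per second.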

For more details on the requirements of video codecs and pre- and post-processing algorithms, see "Squeeze Play: How Video Compression Works".

Audio Codec Requirements
Most video applications include audio. As shown in Table 3, audio codecs are generally much less demanding than video codecs in terms of computational power and memory bandwidth requirements. Audio codecs also have smaller memory footprints and are easier to program than video codecs.

Table 3

As with video codecs, the computational load of most audio codecs is asymmetric: encoding is more computationally demanding than decoding, typically by a factor of two or more. In addition, audio encoders are much more complicated and require far more memory bandwidth than audio decoders.

Input/Output
The input/output (I/O) software component manages the movement of data between peripherals and the other software components. The processing load and memory footprint of I/O software is typically small. However, as digital video products grow more complex, so too do the protocols and standards for connecting them. This results in greater I/O hardware and software complexity. For example, the software "stack" for an Ethernet interface is far more complex than the stack for a simple three-wire serial port.

Operating System
The operating system (OS) provides and controls access to basic services and resources used by the application. Example services include inter-process communications and scheduling of tasks. Example resources include memory (supported via memory management functionality) and I/O devices (supported via device drivers).

An OS is expected to contribute only a small percentage of the overall computational load of a complete application. However, as the complexity, capability, and flexibility of an OS increases, so too does its memory footprint. The memory footprint can range from a few kilobytes to several tens of megabytes.

As with DRM software, availability is a key consideration for OSs. It is usually impractical for a system designer to port an OS to a new processor. If system designers wish to use a particular OS, they must consider whether candidate processors already support this OS.

Processor Selection Methodology
For many digital video product developers, development schedules are short and profit margins are thin. Selecting a processor that best fits the product's needs is critical to the success of the product. Processor selection is difficult, though, because the field of processor options is large and rapidly changing.

The first step to narrowing the options is to consider key features required of the product (for example, decoding a D1-resolution MPEG-2 stream in real time), key product attributes (for example, cost and size), and product development issues (such as the time and manpower available for development).

Processor selection criteria should then be created based on these considerations and ranked or weighted according to their relative importance to the product. An initial set of candidate processors can then be selected and assessed in detail based on the most important selection criteria. The selection can then be fine-tuned using the less critical selection criteria.
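One simple way to apply ranked or weighted criteria is a weighted scoring matrix. The Python sketch below is a minimal illustration; the criteria, weights, candidate names, and scores are all hypothetical placeholders, and a real evaluation would derive them from the product's requirements.

```python
def rank_candidates(weights, scores):
    """Return (name, weighted average score) pairs, best candidate first.

    weights: criterion -> relative importance
    scores:  candidate -> {criterion -> score}, higher is better
    """
    total_weight = sum(weights.values())
    ranked = []
    for name, candidate_scores in scores.items():
        weighted = sum(weights[c] * candidate_scores[c] for c in weights)
        ranked.append((name, weighted / total_weight))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Hypothetical weights (importance) and 1-to-5 scores for two candidates.
weights = {"speed": 5, "cost": 3, "energy": 2}
scores = {
    "Processor A": {"speed": 4, "cost": 2, "energy": 3},
    "Processor B": {"speed": 3, "cost": 5, "energy": 4},
}

for name, score in rank_candidates(weights, scores):
    print(f"{name}: {score:.2f}")
```

A matrix like this is only a screening aid. Hard requirements, such as decoding a particular stream in real time, are better treated as pass/fail filters before any weighting is applied.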

Speed
Because video applications are often computationally demanding, one of the first steps in processor selection is to determine whether a candidate processor has the necessary horsepower. The most obvious way to evaluate the performance of a processor on a particular algorithm, such as MPEG-2 video decoding, is to analyze existing implementations of the algorithm on that processor, provided such implementations are available. But be careful: the performance of a particular algorithm on a particular processor can vary widely depending on many factors, ranging from the particular video stream being processed to the level of optimization applied by the software developer.
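When cycle counts for an existing implementation are available, a first-order load estimate is straightforward. The Python sketch below uses hypothetical cycle counts and a hypothetical clock rate; it shows why the worst-case frame, and not only the average, should be checked against the processor's capacity.

```python
def cpu_utilization(cycles_per_frame, fps, clock_hz):
    """Fraction of the processor consumed by one task (1.0 = fully loaded)."""
    return cycles_per_frame * fps / clock_hz


# Hypothetical measurements for a decoder on a candidate processor.
typical_cycles = 4_000_000   # cycles for an average frame
worst_cycles = 9_000_000     # cycles for the most demanding frame observed
fps = 30
clock_hz = 300_000_000

for label, cycles in (("typical", typical_cycles), ("worst case", worst_cycles)):
    load = cpu_utilization(cycles, fps, clock_hz)
    print(f"{label}: {load:.0%} of the processor")
```

With these made-up numbers the decoder looks comfortable on average but leaves little headroom for the player, DRM, audio, and OS on demanding frames.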

If the key algorithms are not already implemented on the candidate processor, a sense of the performance requirements for a given algorithm can be gained by analyzing the resource requirements for implementations of the algorithm on other processors. But due to architecture differences, different processors often perform quite differently on the same algorithm. As a result, this type of analysis requires a detailed understanding of the algorithm and architectural differences between processors.

Alternatively, performance can be predicted by studying appropriate algorithm kernel benchmarks. But this approach has pitfalls, too. In particular, focusing too narrowly—for example, only considering a processor's performance on an 8-bit discrete cosine transform (DCT)—can yield misleading results since the processor's performance on other kinds of important operations or different data types may not be accounted for.

Memory Bandwidth
In addition to sufficient computation speed, the processor must also have adequate memory and I/O bandwidth. Many low-cost embedded processors have very small on-chip data caches—too small to hold critical blocks of data for video decompression algorithms—and these can become a serious bottleneck to performance. Small instruction caches can also be a bottleneck if the application code is large and complex.

Energy Consumption
The energy consumption requirements for a line-powered device such as a digital surveillance system differ markedly from the requirements of a portable device such as a portable video player. Obviously, processor energy efficiency is critical in portable applications, where it plays a major role in battery life. Selecting an energy-efficient processor can have other subtle benefits for both line-powered and portable devices. These benefits include smaller, cheaper packaging, simplified cooling systems, and lower overall system cost.

When evaluating a processor, it is important to look at more than its peak power consumption. If a processor can execute the workload without running flat-out, then energy consumption can be reduced using a lower clock rate or periodic sleep cycles. The processor may also provide dynamic energy conservation features such as voltage scaling. On-chip integration and memory use also affect energy efficiency. For example, if a processor requires less frequent access to off-chip memory, this can also lower system energy consumption.
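The effect of clock and voltage scaling can be approximated with the standard first-order model for CMOS dynamic power, P = C × V² × f. The Python sketch below applies that model; the capacitance, voltages, and clock rates are hypothetical, and the model deliberately ignores leakage and other static power.

```python
def dynamic_power(c_eff, voltage, freq_hz):
    """First-order CMOS dynamic power in watts: P = C_eff * V^2 * f."""
    return c_eff * voltage ** 2 * freq_hz


C_EFF = 1e-9  # hypothetical effective switched capacitance, in farads

full_speed = dynamic_power(C_EFF, voltage=1.2, freq_hz=300e6)
half_clock = dynamic_power(C_EFF, voltage=1.2, freq_hz=150e6)
half_clock_low_v = dynamic_power(C_EFF, voltage=1.0, freq_hz=150e6)

print(f"full speed:              {full_speed:.3f} W")
print(f"half clock:              {half_clock:.3f} W")
print(f"half clock, lower volts: {half_clock_low_v:.3f} W")
```

In this model, halving the clock halves dynamic power, and lowering the supply voltage as well reduces it further because power depends on the square of the voltage. This is why a processor that can run the workload well below its maximum speed may be far more efficient than its peak power figure suggests.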

Cost
System designers are very mindful of cost when selecting a processor. But it's important to recognize that system cost and processor cost are related in subtle ways. On-chip integration plays a particularly important role in determining system cost. For example, highly integrated processors that specifically target low-cost consumer video applications may include many specialized on-chip peripherals and I/O interfaces as well as large on-chip memories. These features can reduce the number of supporting chips required, thereby lowering costs, improving energy consumption, and simplifying hardware design.

Packaging and Roadmap
Processor packaging should not be overlooked, especially when concerns of system size, cost, and manufacturability are paramount. For example, for pocket-sized consumer products like portable video players, the physical size of the processor must be minimized.

Consideration should be given to the vendor's roadmap for the processor. Will the vendor continue to make improvements to the processor and the tools? This question is particularly important if the system developer anticipates developing derivative products in the future. In such cases, the ability to re-use existing software can dramatically lower development costs.

Software Development
Indeed, the ease of developing software for a processor is a critical consideration. Typically, reference software written in a high-level language such as C or C++ exists for some or all of the tasks in a digital video product. However, simply compiling the reference code usually creates very inefficient code. To create efficient code, the reference code usually must be optimized (often using assembly language for key inner loops), which can be a time-consuming process.

Fortunately, chip vendors and third-party developers often provide off-the-shelf optimized video decoders and other software components. Using software modules from the chip vendors and third-party software developers can greatly reduce development risk, time, and cost. At the same time, using off-the-shelf modules places an additional evaluation burden on the system developer, since it is critical to confirm that these modules meet the functional, performance, and resource-use needs of the application before deciding to rely on them.

To increase the availability and utility of off-the-shelf software components, processor vendors increasingly provide software frameworks, standards, and certification processes for creation of interoperable software components. Using these frameworks and components shifts the emphasis in product development from software design and development to integration, but this integration is itself a significant challenge. Most digital video products require a significant effort to design the overall software architecture, develop any missing components, integrate components from different sources, and add features.

Development becomes more difficult if the processor uses a complex, sophisticated architecture. For example, more and more video-oriented processors use heterogeneous multiprocessor architectures, with a variety of processor cores integrated onto a chip. With such devices, one of the key challenges is determining how to partition the application functionality across the various processor cores. Such partitioning must take into account not only the capabilities of each core, but also the capabilities of the available inter-processor communication mechanisms.

Processors are becoming more complex and so is the software that runs on them. As a result, good development tools are essential. Effective development tools must include features for code generation, integration, testing, debugging, and evaluating application performance. For example, cycle-accurate profiling and non-intrusive debugging features are critical for optimizing and debugging real-time video software. And development boards need support for moving large quantities of test data into and out of the processor to enable extensive real-time testing.

A Guide to the Contenders
The field of processor options for video applications is dauntingly large. There are hundreds of processors available to choose from today, and the number is steadily increasing as new architectures and new variants on existing architectures are introduced. To help sort through the options, we classify the hundreds of processors available in the market today into a handful of types. Classification by processor type allows us to make useful generalizations about each type of processor, and it gives you a big-picture perspective that will help you zero in on the most applicable processor types for your application.

Table 4 lists six of the seven processor categories discussed in this article and shows an example processor from each category. (Table 4 does not include ASICs.) In the following text, we describe all seven types of processor and discuss the strengths and weaknesses of each processor type.

Table 4

ASICs
An application-specific IC (ASIC) is a custom chip designed by the system developer. ASICs for video applications typically include a variety of processing elements, such as processor cores and hard-wired accelerators, as well as a variety of memory blocks and I/O resources. By employing highly customized and targeted hardware, ASICs can readily meet the extreme computational and memory bandwidth demands of digital video applications. At the same time, ASICs can have very low production costs in high volume.

Although ASICs can have low production costs, developing an ASIC can be a multi-million dollar, multi-year process. In consumer applications where time-to-market pressures are intense and where requirements can change quickly, long development cycles and inflexibility mean that ASICs often aren't practical. In addition, as the cost of creating an ASIC increases, the production volumes required in order to justify that cost increase. Thus, fewer and fewer products ship in sufficient volumes to justify the development cost of an ASIC.
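The relationship between development cost and required volume can be expressed as a simple break-even calculation. The Python sketch below is a rough illustration; the development cost and unit costs are hypothetical, and it ignores schedule risk, re-spins, and the cost of capital.

```python
import math


def break_even_units(asic_dev_cost, asic_unit_cost, alt_unit_cost):
    """Units that must ship before an ASIC's lower unit cost repays its
    one-time development cost, compared with an off-the-shelf alternative."""
    saving_per_unit = alt_unit_cost - asic_unit_cost
    if saving_per_unit <= 0:
        raise ValueError("ASIC must be cheaper per unit to ever break even")
    return math.ceil(asic_dev_cost / saving_per_unit)


# Hypothetical figures: development cost in dollars, unit costs in dollars.
for dev_cost in (5_000_000, 10_000_000):
    units = break_even_units(dev_cost, asic_unit_cost=5.0, alt_unit_cost=15.0)
    print(f"development cost ${dev_cost:,}: break even at {units:,} units")
```

Doubling the development cost doubles the break-even volume when the per-unit saving stays the same, which is the trend described above.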

ASSPs
Some digital video applications have sufficiently stable requirements and sufficiently broad acceptance that a number of companies have developed fixed-function "application-specific standard product" chips (ASSPs). These ASSPs provide a "black box" solution for the major subsystems of the product. Figure 2 shows an example of a video-oriented ASSP.

ASSPs are similar to ASICs, except that instead of being designed by the system developer, they are designed by a chip vendor and sold to multiple system developers. Hence, ASSPs tend to share many of the strengths of ASICs. In addition, unlike ASICs, ASSPs can be used for lower-volume applications, since multiple systems can use the same ASSP and effectively share the ASSP's development costs. ASSPs also have an advantage in that they are generally bundled with key software components such as the video decoder.

ASSPs also share some of the weaknesses of ASICs. In particular, ASSPs are prone to the same inflexibility problems as ASICs. This inflexibility is especially problematic in ASSPs because it means system designers have less opportunity to differentiate their products from other products that use the same ASSP. ASSPs also pose a risk of roadmap divergence: your ASSP vendor's product roadmap may not be a good match for the way you want to evolve your digital video product in the future.

FPGAs
Field-programmable gate arrays (FPGAs) are not typically thought of as "processors," but they are increasingly employed to do video processing. An FPGA contains an array of reconfigurable logic, programmable interconnect resources, I/O blocks, and (in some cases) specialized fixed-function blocks. An FPGA is like an ASIC in that it can be configured to meet the requirements of an application and in that it can provide massive computational power and memory bandwidth. Unlike ASICs, FPGAs are very flexible, and FPGA-based designs can be readily upgraded to implement new features or adapt to emerging standards. Unfortunately, FPGAs' flexibility limits their speed, energy-efficiency, and cost-efficiency. For example, FPGAs are typically far less energy-efficient than ASICs or ASSPs.

While reconfigurable logic can be a great match for video algorithms, an instruction-set processor is usually needed to run things like a DRM module and an OS. In addition, implementing a video algorithm on an FPGA can require substantially more effort than implementing the same algorithm on one of the other types of processors discussed in this article. For these reasons, FPGAs are typically used in conjunction with one or more supporting instruction-set processors.

Media Processors
Media processors are a specialized class of programmable processors designed for high performance in audio and video applications. Media processors are based on powerful, highly-parallel processor cores that are very efficient at video-processing tasks. In addition, media processors often include multiple video accelerators in the form of programmable coprocessors or fixed-function hardware. Media processors also tend to have multiple blocks of on-chip memory and high memory bandwidth. Figure 3 shows an example media processor.

While the many specialized features of a media processor can boost performance, they also complicate the programming model, making software development more difficult compared to other programmable processors. To address this disadvantage, media processor vendors often provide optimized software component libraries.

Media processor vendors often structure their development tool chain to require the use of C or C++ compilers and prohibit the use of assembly language. This strategy relies on the quality of the compiler to achieve good results—and compiler quality can be mixed. Developers may still need to invest considerable effort hand-tuning their software for best performance.

Digital Signal Processors
Digital signal processors (DSPs) are designed to meet the needs of signal processing applications such as digital video equipment. DSPs resemble media processors in many ways: many DSPs feature powerful, highly-parallel architectures, multiple blocks of on-chip memory, and high memory bandwidth. In addition, some DSPs include video accelerators. However, DSPs are less specialized than media processors, and DSPs tend to be less efficient than media processors at video-processing tasks.

Traditionally, DSP architectures have been a compiler writer's worst nightmare, and many high-performance DSPs have complex architectures that make it difficult to hand-optimize software. Fortunately, some new DSPs are notably compiler-friendly, and it has become easier to program DSPs using C or C++ rather than using hand-crafted assembly language. In addition, the tools for most DSPs have many features that aid development of video applications.

DSPs traditionally did not maintain software compatibility from generation to generation, which made code re-use difficult. However, DSP vendors are increasingly making backwards-compatibility a feature of their new architectures.

DSPs typically support "real-time" OSs like VxWorks but do not support "full-featured" OSs like the Palm OS. Consequently, many products that require a full-featured OS often use a DSP to handle video processing and a general-purpose processor to run the OS.

Embedded General-Purpose Processors
Until recently, 32-bit embedded general-purpose processors (GPPs) were only fast enough to handle low-end video processing tasks. Today, increasing clock speeds are enabling embedded GPPs to take on more demanding digital video applications. In addition, embedded GPPs are beginning to gain increased parallelism and to add specialized video features. For example, the latest version of the ARM architecture includes instructions for accelerating video compression. Embedded GPPs also excel at control-oriented tasks, such as running an operating system.

Embedded GPPs are already common in many types of products. If a system designer wants to add digital video features to an existing product, it may be possible to implement these features on the embedded GPP without adding much new hardware. And embedded GPPs often have an advantage even for all-new products: they're typically backed by a sophisticated software development infrastructure and legions of programmers. Indeed, embedded GPPs are generally easier to program than the other processor classes discussed here.

The roadmaps for embedded GPP architectures are generally clearer than the roadmaps of the other processor classes discussed here, with backwards compatibility taken almost for granted. In addition, any given embedded GPP architecture is usually available from multiple vendors. In contrast, most other types of architectures are available from only a single vendor.

High-Performance General-Purpose Processors
High speed, high-performance GPPs originally developed for personal computers are increasingly used in digital video products. High-performance GPPs used in digital video applications offer clock speeds in excess of 1 GHz and highly parallel architectures. In addition, high-performance GPPs have excellent software development tools, and they are supported by an enormous number of programmers and off-the-shelf software providers. Backwards compatibility has long been important in the desktop computer market, and this also benefits other applications that use high-performance GPPs, preserving development effort across generations of products.

High-performance GPPs have disadvantages in the areas of size, energy consumption, and cost. Video products that use high-performance GPPs almost always require multi-chip solutions, where the high-performance GPP is paired with a chipset that provides interfaces to memory and peripherals. This multi-chip approach increases system cost. Energy consumption is generally high, and high-performance GPPs can require the use of cooling fans. These tradeoffs make high-performance GPPs unattractive for very cost-sensitive and energy-sensitive applications. However, high-performance GPPs can be attractive in the realm of high-end, non-portable video products, such as set-top boxes and personal video recorders.

Which is Best?
There is a dizzying array of processor designs available for digital video applications, and the options continually expand and evolve. Selecting the right processor from the myriad options is a challenging task. To make the right choice, system designers must carefully evaluate the demands of their applications and the capabilities and tradeoffs associated with candidate processors.

No single option is "best" for all digital video applications, but programmable processors are clearly gaining ground. Even fairly low-end embedded GPPs are now able to handle some real-time video processing. As digital video applications rapidly evolve, processors that offer the most flexibility and strongest application development infrastructure—including the relevant off-the-shelf software components—will become ever more attractive.
