DSP on General-Purpose Processors
Copyright © 1997 Berkeley Design Technology, Inc.

5. Processor Characteristics Relevant to DSP

In this chapter, we provide an introduction to, and an overview of, those aspects of general-purpose processors that are most significant for implementors of DSP applications. We identify those facets of general-purpose processors likely to complicate the job of developing DSP applications, as well as those likely to aid DSP application development. We explain how general-purpose processor features contribute to, or detract from, a processor's performance in DSP applications. In addition, we contrast general-purpose processors with DSP processors and explain the implications of the key differences between them.

While we refer to specific processors to illustrate points introduced here, detailed discussions of individual processors are deferred to later chapters. Specifically, Chapter 6 consists of in-depth analyses of processor architectures and capabilities.

This chapter is not intended to be a tutorial on general-purpose processors. For readers seeking an introductory treatment of general-purpose processor topics, a wide variety of books is available. Several of these are listed in the References of this report.

In our work analyzing processors' DSP capabilities, including developing and optimizing DSP benchmark software, execution predictability has emerged as a significant difference between general-purpose processors and DSP processors. By execution predictability, we mean a programmer's ability to predict the exact timing that will be associated with the execution of a specific segment of code. Execution predictability is affected by the details of a processor's architecture, memory system, and tools.

Execution predictability is important in many DSP applications because of the need to predict performance, to ensure real-time behavior, and to thoroughly optimize code for maximum execution speed.

Many DSP applications are inherently real-time in nature.
In these applications, such as audio compression and modems, failure to consistently meet real-time deadlines causes malfunctions ranging from a reduction in signal quality to loss of data to failure of a communications link. Such applications are sometimes said to be subject to hard real-time constraints. Ensuring real-time performance requires the ability to predict execution timing with certainty.

Not all DSP applications are real-time applications, however. For example, an application designed to remove noise from old analog recordings may be able to run in batch mode, processing a recording "off-line" for later play-back. But lack of execution predictability may cause complications even in non-real-time applications. This is because lack of execution predictability means difficulty estimating performance prior to processor selection and difficulty optimizing code; both are often critical steps for performance-hungry DSP applications.

Of course, both DSP processors and general-purpose processors are fundamentally deterministic in terms of code execution timing. That is, given adequate information, it is possible to predict the exact number of clock cycles required to execute a specific segment of object code. In practice, DSP processors have straightforward architectures and are supported by tools that help the programmer determine a code fragment's execution time. Some general-purpose processors, in contrast, have extremely complex architectures (for example, superscalar architectures that dynamically select instructions for parallel execution) and lack tool support to aid programmers in predicting execution time. These factors mean that it can be extremely difficult to predict execution timing of general-purpose processor code. This creates challenges for DSP applications.

In many cases, programmers writing real-time DSP applications for general-purpose processors can execute their code on the target processor and measure the run-time performance.
Unfortunately, the execution timing of a specific segment of code is likely to change depending on the code that preceded it, and depending on the locations of the code and its associated data in memory. In some cases, execution timing may be data dependent.

Where performance is not critical, developers make use of high-level language compilers to quickly generate application code. But the complexity of DSP algorithms coupled with high data rates means that in many DSP applications, most programmers directly write assembly code in order to obtain maximum performance from the processor. In such cases it is the application programmer, rather than the compiler writer, who must understand the intricacies of the processor's architecture, including execution timing, in order to effectively select a processor, predict performance, and optimize code.

High-end general-purpose processors such as the Intel Pentium and Motorola/IBM PowerPC 604 have many features that complicate predicting execution timing and optimizing DSP code. We describe these features in the remainder of this section.

To keep up with their high processing throughput, high-performance general-purpose processors use on-chip and off-chip instruction and data caches. Typically, on-chip caches account for a significant portion of the silicon area of the chip. If the needed instructions and data are contained in the on-chip caches, then the processor executes at full speed. Otherwise, the processor is stalled while the code and data are loaded into the caches, either from main memory or from a second-level, off-chip cache.

For general computing applications, the probabilistic nature of cache operation can work well: frequently accessed code and data are more likely to be in the on-chip caches when needed, improving performance for the most frequently accessed portions of programs. For real-time applications, however, such probabilistic behavior can be problematic.
This is because many real-time applications must satisfy real-time constraints in every instance, rather than on average or in most instances. In recognition of this fact, some general-purpose processors allow programmers to manually control portions of their caches, thereby ensuring that critical code and data are present in the caches when needed, at the cost of degraded performance for accesses to other sections of code and data. General-purpose processor caches are discussed further in Section 5.4 later in this chapter.

To manage costs, many systems based on general-purpose processors rely on dynamic RAM (DRAM) devices for their main memory. Depending on the type of DRAM devices chosen and the details of the system design, accesses to DRAM-based main memory may also introduce complications for execution predictability, due to factors such as refresh cycles and increased access times when crossing memory page boundaries. Similarly, systems that rely on virtual memory further complicate execution predictability.

On deeply pipelined processors, branches can be very costly. An increasingly common approach to mitigating the cost of branches is to provide hardware in the processor that attempts to predict the outcome of upcoming branches. Branch prediction can be very effective for DSP software, which often contains small loops with high iteration counts. However, like caches, branch prediction is a probabilistic mechanism, and it complicates determining and optimizing the execution timing of a segment of object code. Branch prediction is discussed further in Section 5.7 later in this chapter.
Like DSP processors, high-performance general-purpose processors rely heavily on increased parallelism to achieve performance gains. In general-purpose processors, increased parallelism is often implemented via superscalar execution. This means that the processor dynamically selects sequential instructions for parallel execution, depending on the available execution units and on dependencies between instructions. For example, the Motorola/IBM PowerPC 604 can begin executing (issue) up to four instructions in parallel in a single clock cycle. The Intel Pentium can issue up to two instructions in parallel in a single clock cycle.

In superscalar processors, there are rules governing whether a given group of instructions can be issued in parallel. These rules can be quite complex, and can include dependencies on the execution history of the instructions in question. For example, on the Intel Pentium, two instructions that are otherwise eligible for simultaneous issue will be issued together only if the second instruction is one byte in length, or if both instructions have previously executed from the on-chip instruction cache and have not subsequently been overwritten in the cache.

Run-time scheduling of instructions in superscalar processors can be quite complex, and naturally complicates both predicting the execution timing of code segments and optimizing code for maximum speed. Dynamic instruction scheduling is discussed further in Section 5.7 and Section 5.8 later in this chapter.
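To show why issue rules make timing sensitive to instruction order, the following sketch models a deliberately simplified in-order, dual-issue processor. Its single pairing rule (two adjacent instructions issue together only if the second neither reads nor writes the first one's destination register) and its one-cycle-per-issue timing are assumptions made for illustration; they are far simpler than the actual rules of the Pentium or PowerPC 604, and the register names are hypothetical.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dest: str     # register written by the instruction
    srcs: tuple   # registers read by the instruction


def cycles_dual_issue(program):
    """Count issue cycles for an idealized in-order, dual-issue machine.

    Assumed rule: instruction i+1 issues in the same cycle as instruction i
    only if it neither reads nor writes i's destination register.
    """
    cycles = 0
    i = 0
    while i < len(program):
        first = program[i]
        paired = False
        if i + 1 < len(program):
            second = program[i + 1]
            paired = (first.dest not in second.srcs
                      and first.dest != second.dest)
        i += 2 if paired else 1
        cycles += 1
    return cycles


load_a = Instr("a", ("x",))
mul_b = Instr("b", ("a", "c"))   # depends on load_a
load_d = Instr("d", ("y",))
mul_e = Instr("e", ("d", "c"))   # depends on load_d

dependent_order = [load_a, mul_b, load_d, mul_e]
interleaved_order = [load_a, load_d, mul_b, mul_e]

print(cycles_dual_issue(dependent_order))    # 3 cycles
print(cycles_dual_issue(interleaved_order))  # 2 cycles
```

The same four operations take three cycles in one order and two in the other. Even in this idealized model the programmer must know the pairing rule to predict timing, and real processors add rules that depend on instruction length, execution units, and cache history.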
Some general-purpose processors exhibit data-dependent instruction execution times. This means that the execution times of some instructions depend on the operand values being processed. This is distinct from the more tractable and more common case where instruction execution time depends on the size of the operands (for example, a double-word multiply vs. a single-word multiply).

When key DSP-related instructions exhibit data-dependent execution times, the job of estimating code execution time and optimizing performance is made more complicated. For example, the Advanced RISC Machines ARM7TDMI core completes an integer multiplication in two to five cycles depending on the operand values.

A conservative approach to predicting code performance in the presence of data-dependent execution times is to assume the maximum execution time for each instruction. This can result in unrealistically pessimistic estimates in some applications, however. In some cases it may be straightforward to determine bounds on the ranges of certain operands, and thereby obtain more realistic execution time estimates. Unfortunately, in many applications this approach isn't practical.

As mentioned above, the complexity of predicting code execution times on sophisticated general-purpose processors means that programmers typically execute their code on the target processor to determine its execution time. This is especially hazardous on processors with data-dependent instruction execution times, since the measured timing may not reflect the worst case that can be expected in the application. In fact, it may be virtually impossible to determine the true worst case execution times that can be expected in an application, due to complex relationships between signal magnitudes throughout the application. This may force the programmer to resort to the assumption that every instruction consumes its maximum execution time in every instance.
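The conservative estimate described above amounts to simple arithmetic, sketched below. Only the two-to-five-cycle multiply range comes from the text; the filter length, the per-tap overhead, the clock rate, and the sample rate are hypothetical values chosen to make the example concrete.

```python
MUL_MIN_CYCLES = 2  # ARM7TDMI integer multiply, best case (from the text)
MUL_MAX_CYCLES = 5  # ARM7TDMI integer multiply, worst case (from the text)


def cycle_bounds(num_multiplies, other_cycles):
    """Return (best, worst) cycle counts for a code segment containing
    num_multiplies multiplies plus other_cycles of fixed-time work."""
    best = other_cycles + num_multiplies * MUL_MIN_CYCLES
    worst = other_cycles + num_multiplies * MUL_MAX_CYCLES
    return best, worst


def meets_deadline(cycles, clock_hz, sample_rate_hz):
    """True if the per-sample work fits in one sample period.

    Uses integer arithmetic: cycles / clock_hz <= 1 / sample_rate_hz.
    """
    return cycles * sample_rate_hz <= clock_hz


# Hypothetical workload: a 64-tap filter, one multiply per tap, and an
# assumed 3 cycles of loads, adds, and loop overhead per tap.
TAPS = 64
best, worst = cycle_bounds(num_multiplies=TAPS, other_cycles=3 * TAPS)

# Hypothetical system: 20 MHz clock, 48 kHz sample rate.
print(best, meets_deadline(best, 20_000_000, 48_000))    # 320 True
print(worst, meets_deadline(worst, 20_000_000, 48_000))  # 512 False
```

With these assumed numbers the best case fits in the sample period and the worst case does not, so the all-maximum assumption would reject a design that might in fact be adequate. Deciding which is true requires bounds on the operand values.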
Data-dependent instruction execution times are discussed further in Section 5.3 later in this chapter.

There are a variety of real-time operating systems available for most general-purpose processors. Real-time operating systems are designed to aid developers in ensuring real-time behavior of applications. However, many DSP applications targeted to general-purpose processors in computers and workstations run under non-real-time operating systems, such as Microsoft Windows and various versions of UNIX, since such operating systems are already present on these platforms. So while it isn't necessarily the case that DSP applications on general-purpose processors have to contend with non-real-time operating systems, as a practical matter this is common.

It may be possible to coax real-time behavior from a non-real-time operating system, especially if the operating system includes extensions to help in this regard. Developers of non-real-time operating systems are increasingly paying heed to the needs of real-time applications. However, we expect that implementing real-time applications on non-real-time operating systems will remain a significant challenge for DSP application developers for some time. Operating systems are discussed further in Section 5.15 later in this chapter.

Software and hardware development tools are essential for efficient application development in general, but are especially important for performance-critical, real-time applications. In such applications, developers need the ability to analyze and predict performance in detail, to debug in real time, and to thoroughly optimize critical sections of code.

DSP processors have historically had fairly limited tool support in comparison to popular general-purpose processors. For example, debuggers and high-level language compilers for DSP processors have generally lagged behind their general-purpose counterparts by at least one generation in terms of capabilities and ease of use.
However, the tools and on-chip debugging facilities of DSP processors generally outshine those of general-purpose processors in supporting the development and debugging of real-time systems. For example, virtually all DSP processor vendors provide clock-cycle-accurate instruction set simulators for their processors. Such simulators allow programmers to view the details of code execution on the processor for purposes of performance analysis, optimization, and debugging. DSP processors typically also provide on-chip hardware to support real-time debugging. This allows a system to run at full speed until a breakpoint condition is triggered.

Such tools and capabilities are generally not available for general-purpose processors. This is especially problematic for high-end processors because of the complexity of analyzing and optimizing code for these processors. There are signs that general-purpose processor vendors are awakening to the need for tools to support real-time development. For example, Intel recently introduced the VTune tool suite, which provides detailed views of the execution of code fragments on Pentium processors. Today, however, such tools for general-purpose processors are the exception rather than the rule. Development tools are discussed further in Section 5.15 later in this chapter.

The issues outlined in this section will be relevant for some applications but not for others. Similarly, these issues will be significant with some processors and not with others. In some cases, these factors can combine to create serious challenges for the developer. A system that is theoretically deterministic in its performance may exhibit apparently stochastic or chaotic timing behavior. Before embarking on a processor selection, application development, or system design effort, engineers should carefully evaluate the application's needs and the capabilities of the processors being considered.
Each of the issues outlined in this section is explored further later in this chapter. In addition, the individual processor analyses in Chapter 6 describe each processor's features and capabilities in detail.
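The measurement-based approach mentioned in this section, running a code segment on the target and timing it, can be sketched as follows. This is a minimal illustration under stated assumptions: the `fir` routine merely stands in for whatever segment is being timed, `time_runs` is a hypothetical helper, and the standard-library `time.perf_counter_ns` clock reports elapsed wall-clock time rather than processor clock cycles.

```python
import time


def fir(samples, coeffs):
    """Direct-form FIR filter; a stand-in for the code segment under test."""
    taps = len(coeffs)
    return [
        sum(coeffs[k] * samples[n - k] for k in range(taps))
        for n in range(taps - 1, len(samples))
    ]


def time_runs(func, runs):
    """Call func repeatedly and return each run's duration in nanoseconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        func()
        durations.append(time.perf_counter_ns() - start)
    return durations


# Hypothetical input data and coefficients.
samples = [float(i % 7) for i in range(2000)]
coeffs = [0.25, 0.5, 0.25]

durations = time_runs(lambda: fir(samples, coeffs), runs=50)
print(f"fastest run: {min(durations)} ns, slowest run: {max(durations)} ns")
```

Identical code and data usually produce a spread of timings, reflecting cache contents, operating system activity, and other state left by earlier execution. The slowest observed run is only a lower bound on the true worst case, which is why measurement alone cannot guarantee that a hard real-time deadline will always be met.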
The above is a six-page excerpt from the 154-page Chapter 5 of DSP on General-Purpose Processors. For more information on the rest of Chapter 5, please see the Table of Contents of the report.