DSP on General-Purpose Processors

Copyright © 1997 Berkeley Design Technology, Inc.
The following is a five-page excerpt from the fifteen-page PowerPC 604e analysis in DSP on General-Purpose Processors.


6.6 Motorola/IBM PowerPC 604/604e

Introduction

The PowerPC 604 and PowerPC 604e are four-issue superscalar RISC processors from IBM Microelectronics and Motorola. The processors are targeted at general-purpose desktop computing and have found design wins in the Apple Macintosh line of personal computers and in Macintosh clones. The fastest version of the PowerPC 604 operates at a clock speed of 180 MHz with a 3.3-volt supply. The PowerPC 604e is an enhanced version of the PowerPC 604 and can operate at a clock speed of 225 MHz with a 2.5-volt supply for the processor core and a 3.3-volt supply for I/O. PowerPC 604 processors are manufactured and sold by Motorola and IBM Microelectronics and are being licensed to other vendors.

The PowerPC 604 and PowerPC 604e are implementations of the PowerPC architecture specification, jointly developed by Apple, IBM, and Motorola. The PowerPC architecture specification is based on the POWER architecture, defined by IBM in the late 1980s. POWER was the first RISC architecture designed specifically for superscalar implementation. The PowerPC architecture specification has seen a number of implementations from both IBM Microelectronics and Motorola. These different implementations target different application areas, including desktop computing, automotive and industrial control, communications, and other embedded systems.

Although it lacks many DSP-oriented features, the PowerPC 604e achieves excellent performance on floating-point DSP applications through its multiply-add instructions, four-way superscalar architecture, and high instruction cycle rate. Even with the high prices of the fastest PowerPC 604e family members, the processor is competitive with high-end floating-point DSPs on a cost/performance basis.

Optimizing DSP code for the PowerPC 604e is very challenging, even by DSP processor standards. In addition, the processor's many dynamic features make it very difficult to guarantee the execution time of DSP software, complicating real-time development.

Different implementations of the PowerPC architecture specification have essentially identical programming models but can vary significantly in implementation and performance. The analysis below applies only to the PowerPC 604 and 604e and does not reflect the performance of other PowerPC variants such as the PowerPC 603. In this report, the term PowerPC 604 refers to both the 604 and 604e unless otherwise noted.

The availability of the PowerPC 604 from two major vendors, with licensing to others under way, is an advantage for this processor.

Architecture

The PowerPC 604 uses a superscalar RISC architecture and can dispatch and complete up to four instructions in a single clock cycle. The processor operates on 32-bit instructions and integer data and on 64-bit double-precision or 32-bit single-precision floating-point data.

The PowerPC 604 architecture consists of a program control unit, two simple integer ALUs, one complex integer unit, a floating-point unit, a load/store unit, a branch unit, and instruction and data caches. The PowerPC 604e includes one additional functional unit called the CRU. This unit performs logical operations on the condition register. Figure 6.6-1 illustrates the PowerPC 604 architecture.

[Figure: PowerPC 604 architecture diagram]
Figure 6.6-1. Simplified PowerPC 604 architecture.

Speculative and out-of-order execution improve the utilization of the PowerPC 604's various functional units. Instruction reordering is facilitated by register renaming.

Data Path

The PowerPC 604 has independent floating-point and integer data paths.

The presence of a floating-point unit is a significant advantage in many applications.

The PowerPC 604 has limited support for converting between floating-point and fixed-point numerical formats. While the floating-point data path is often better suited to DSP algorithm computations than the integer data path on this processor, input and output data in many DSP applications must be in a fixed-point format. Conversions between fixed- and floating-point formats can be performed in three or seven instructions, depending on the direction of the conversion. However, the conversions in both directions include memory accesses that may cause pipeline stalls even if no cache misses occur. The time required to perform these conversions is therefore difficult to predict. This can be a drawback in many DSP applications.
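At the source level, these conversions usually appear as casts with scaling at the input and output boundaries of an algorithm. The following is a minimal sketch in C, assuming 16-bit Q15 samples at the interfaces; the function names and the choice of Q15 are illustrative assumptions, and the instruction sequence a compiler actually generates for the PowerPC 604 is not shown.

```c
#include <stdint.h>

/* Hypothetical helper: convert a Q15 fixed-point sample to a float
 * in the range [-1.0, 1.0). */
float q15_to_float(int16_t x)
{
    return (float)x / 32768.0f;
}

/* Hypothetical helper: convert a float back to Q15, rounding to the
 * nearest value (ties away from zero) and saturating out-of-range input. */
int16_t float_to_q15(float x)
{
    float scaled = x * 32768.0f;

    if (scaled != scaled)          /* NaN: return zero rather than cast it */
        return 0;
    if (scaled >= 32767.0f)
        return INT16_MAX;
    if (scaled <= -32768.0f)
        return INT16_MIN;
    return (int16_t)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f));
}
```

Converting a whole block of samples once on input and once on output, rather than sample by sample inside the inner loop, keeps these conversions out of the most time-critical code.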

The floating-point data path consists of a fully IEEE-754 compliant floating-point unit and thirty-two 64-bit floating-point registers. The floating-point unit is capable of operating on either 64-bit (double-precision) or 32-bit (single-precision) operands with no difference in speed, except for division operations which take longer for 64-bit operands. Since the PowerPC 604 uses a strict load/store architecture, all floating-point input operands come from the floating-point register set, and all floating-point results are stored back to floating-point registers.

The floating-point unit is pipelined and performs all operations with a latency of three clock cycles and a throughput of one clock cycle. Division operations are an exception, and take 18 clock cycles for a single-precision division and 31 clock cycles for a double-precision division. Division operations stall the floating-point unit's pipeline until the division is complete. Certain conditions such as overflow, underflow, and other conditions related to rounding and normalization of floating-point results may cause the floating-point unit to stall for one clock cycle. Additionally, when storing a single-precision floating-point number to memory, a penalty of up to 23 clock cycles may be incurred if the number is non-zero but small enough that it needs to be denormalized to fit in a single-precision representation. The PowerPC 604 provides a non-IEEE mode in which the processor avoids some of these data-dependent penalties but does not fully comply with the IEEE-754 standard for floating-point arithmetic. In non-IEEE mode denormalized numbers are simply truncated to zero.

The fact that data-dependent conditions can affect the floating-point unit's instruction timing results in unpredictable execution times for DSP algorithms. Additionally, it can be extremely difficult to predict the worst-case execution times for some algorithms. This can be a drawback in applications with real-time constraints.

The penalty incurred when a single-precision floating-point store instruction must denormalize the number being stored can be especially problematic. Although it occurs infrequently in most applications, this penalty can be quite severe, and can therefore make the worst-case execution time of an algorithm much larger than its average execution time. This problem is alleviated when the processor is in non-IEEE mode.
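To make the denormalization case concrete, the sketch below shows in C which single-precision values are denormalized and what replacing them with zero looks like. It is a software illustration only, assuming `float` is IEEE-754 binary32; it does not enable or model the processor's non-IEEE mode, and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if x is a nonzero single-precision denormal: exponent field
 * all zeros and mantissa nonzero. Assumes IEEE-754 binary32 floats. */
int is_denormal_f32(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return (bits & 0x7F800000u) == 0 && (bits & 0x007FFFFFu) != 0;
}

/* Narrow double-precision results to single precision, replacing any value
 * that would be stored as a denormal with zero. Returns how many values
 * were replaced, which is useful for judging how often the case occurs. */
size_t store_flush_to_zero(float *dst, const double *src, size_t n)
{
    size_t flushed = 0;
    for (size_t i = 0; i < n; i++) {
        float f = (float)src[i];
        if (is_denormal_f32(f)) {
            f = 0.0f;
            flushed++;
        }
        dst[i] = f;
    }
    return flushed;
}
```

Counting the replaced values on representative data gives a rough indication of whether the worst-case store penalty is likely to matter for a particular algorithm.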

The PowerPC 604's floating-point unit supports multiply-add and multiply-subtract operations.

Although the floating-point unit is capable of performing MACs with single-cycle throughput, only one input operand can be loaded from memory per clock cycle, compared to two loads per cycle on most programmable DSPs. Therefore, a throughput of one cycle per MAC can only be sustained when algorithmic transformations are used to reduce the memory bandwidth requirement from two loads per MAC to only one. This is not always possible, and it requires unrolling loops, which increases code size. When such transformations are not possible, even a throughput of two cycles per MAC can only be sustained using loop unrolling. However, the PowerPC's clock rate is significantly higher than that of most DSPs, so the PowerPC 604's performance is competitive with DSP processors despite this limitation.
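One transformation of this kind computes two filter outputs at once, so that each loaded value feeds two MACs. The sketch below is an illustration in C under stated assumptions: it uses the indexing y[n] = sum of h[k] * x[n + k] for simplicity, the function names are hypothetical, and whether a given compiler schedules the loop at one MAC per cycle on the PowerPC 604 is not guaranteed.

```c
#include <stddef.h>

/* Straightforward form: every MAC needs two loads, h[k] and x[n + k].
 * x must hold n_out + n_taps - 1 samples. */
void fir_simple(const float *x, const float *h, float *y,
                size_t n_out, size_t n_taps)
{
    for (size_t n = 0; n < n_out; n++) {
        float acc = 0.0f;
        for (size_t k = 0; k < n_taps; k++)
            acc += h[k] * x[n + k];
        y[n] = acc;
    }
}

/* Two outputs per pass: each inner iteration performs two loads and two
 * MACs, so the average is one load per MAC. The input sample used for
 * y[n + 1] is kept in a register and reused for y[n] one tap later. */
void fir_two_outputs(const float *x, const float *h, float *y,
                     size_t n_out, size_t n_taps)
{
    size_t n = 0;
    for (; n + 1 < n_out; n += 2) {
        float acc0 = 0.0f, acc1 = 0.0f;
        float x0 = x[n];
        for (size_t k = 0; k < n_taps; k++) {
            float hk = h[k];           /* load 1 */
            float x1 = x[n + k + 1];   /* load 2 */
            acc0 += hk * x0;           /* MAC for y[n]     */
            acc1 += hk * x1;           /* MAC for y[n + 1] */
            x0 = x1;                   /* reuse, no reload */
        }
        y[n] = acc0;
        y[n + 1] = acc1;
    }
    if (n < n_out) {                   /* leftover output when n_out is odd */
        float acc = 0.0f;
        for (size_t k = 0; k < n_taps; k++)
            acc += h[k] * x[n + k];
        y[n] = acc;
    }
}
```

The second version is longer and needs a separate tail for an odd number of outputs, which illustrates the code-size cost mentioned above.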

The integer data path consists of two simple integer ALUs, a complex integer unit, and a set of thirty-two 32-bit general-purpose registers. The simple integer ALUs perform simple arithmetic operations such as addition and subtraction, as well as logic operations. Each simple integer ALU also includes a barrel shifter for shift and rotate operations. The complex integer unit is used for multiplication, division, and string functions. It can perform multiplications with a latency of three clock cycles and a throughput of one cycle as long as one operand is 16 bits or less in length. Full 32-bit by 32-bit multiplications have a latency of four clock cycles and a throughput of two cycles. The three integer execution units are independent and operate in parallel.

Since the complex integer unit can perform multiplications with single-cycle throughput (assuming at least one operand is 16 bits or less in length) and the simple integer units can perform additions in the same clock cycle, single-cycle integer MAC throughput is theoretically possible on the PowerPC 604. However, the same memory bandwidth restrictions discussed above for the floating-point data path also apply to the integer data path.
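A typical integer MAC loop that meets the 16-bit operand condition is a dot product of 16-bit samples accumulated into a 32-bit sum. The following is a minimal sketch in C; the function name is hypothetical, and which multiply instruction a compiler selects on the PowerPC 604 is not guaranteed.

```c
#include <stddef.h>
#include <stdint.h>

/* Dot product of 16-bit samples with a 32-bit accumulator. Both multiply
 * operands fit in 16 bits. The caller is assumed to keep n and the data
 * range small enough that the 32-bit accumulator cannot overflow. */
int32_t dot_q15(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}
```

As written, each MAC still needs two loads, so the loop is subject to the same one-load-per-cycle limit as the floating-point case.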

In addition to basic arithmetic and logical operations, the integer execution units provide some powerful bit-manipulation operations, such as rotate-mask-insert.

The bit-manipulation instructions on the PowerPC 604 can be a strong advantage in some applications.
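The effect of a rotate-mask-insert operation can be sketched in portable C. The text above names the operation only in general terms; the sketch below is modelled on the PowerPC `rlwimi` instruction (rotate left word immediate then mask insert), uses PowerPC bit numbering in which bit 0 is the most significant bit, and uses hypothetical function names.

```c
#include <stdint.h>

/* Rotate a 32-bit value left by sh bits (sh is taken modulo 32). */
uint32_t rotl32(uint32_t v, unsigned sh)
{
    sh &= 31u;
    return sh ? (v << sh) | (v >> (32u - sh)) : v;
}

/* Mask with ones from bit mb through bit me inclusive, where bit 0 is the
 * most significant bit. If mb > me the mask wraps around the word. */
uint32_t ppc_mask(unsigned mb, unsigned me)
{
    mb &= 31u;
    me &= 31u;
    uint32_t from_mb = 0xFFFFFFFFu >> mb;          /* bits mb..31 */
    uint32_t to_me   = 0xFFFFFFFFu << (31u - me);  /* bits 0..me  */
    return (mb <= me) ? (from_mb & to_me) : (from_mb | to_me);
}

/* Rotate src left by sh, then insert the bits selected by the mask into
 * dst, leaving all other bits of dst unchanged. */
uint32_t rotate_mask_insert(uint32_t dst, uint32_t src,
                            unsigned sh, unsigned mb, unsigned me)
{
    uint32_t m = ppc_mask(mb, me);
    return (rotl32(src, sh) & m) | (dst & ~m);
}
```

For example, `rotate_mask_insert(0xAABBCCDD, 0x12, 8, 16, 23)` replaces the second-lowest byte of the destination and yields `0xAABB12DD`. Expressed with separate shifts, ANDs, and ORs, the same field insertion would take several operations.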

Memory System

The PowerPC 604 has a single 32-bit address space. Separate 16 Kbyte instruction and data caches are available on-chip. Twice as much cache RAM is available on the PowerPC 604e, which provides separate 32 Kbyte instruction and data caches. Accesses to the instruction cache are 128 bits wide, providing the processor with four 32-bit instructions in a single cycle if a cache miss does not occur. Accesses to the data cache are 64 bits wide. However, since general-purpose registers on the PowerPC 604 are only 32 bits wide, the processor can only take advantage of the full 64-bit data cache access width when fetching or storing 64-bit double-precision floating-point variables.

The PowerPC 604 supports virtual memory via separate instruction and data TLBs for fast address translations. Cache and TLB parameters are listed in Table 6.6-1.
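The cache parameters in Table 6.6-1 determine how many sets each cache has and which addresses compete for the same set. The sketch below works through that arithmetic in C. It assumes conventional set-associative indexing (line number modulo number of sets), which the text does not confirm for this processor, and the type and function names are hypothetical.

```c
#include <stdint.h>

/* Cache parameters as given in Table 6.6-1. */
typedef struct {
    uint32_t size_bytes;   /* 16 KBytes on 604, 32 KBytes on 604e */
    uint32_t ways;         /* 4-way set associative */
    uint32_t line_bytes;   /* 32-byte lines */
} cache_geometry;

/* Number of sets = total size / (associativity * line size). */
uint32_t cache_num_sets(cache_geometry g)
{
    return g.size_bytes / (g.ways * g.line_bytes);
}

/* Set selected by an address, assuming conventional indexing. */
uint32_t cache_set_index(cache_geometry g, uint32_t addr)
{
    return (addr / g.line_bytes) % cache_num_sets(g);
}
```

With these parameters the PowerPC 604 caches have 128 sets and the PowerPC 604e caches have 256. Under the indexing assumption above, addresses 4 KBytes apart on the 604 (8 KBytes apart on the 604e) fall in the same set, so more than four buffers spaced at that distance would evict one another.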

As on most general-purpose processors, the PowerPC 604's memory space is byte-addressable. The PowerPC 604 supports both big-endian and little-endian byte ordering, but lacks support for misaligned little-endian accesses. The PowerPC 604e adds support for misaligned little-endian accesses.

The PowerPC 604 allows both the instruction and data caches to be locked. Each cache can be locked individually, but each cache must be locked as a whole. That is, it is not possible to lock only a portion of a cache. The PowerPC 604 also includes cache control instructions, including instructions that initiate pre-loading of a cache block, invalidate a cache block, set a cache block to zero, and flush a cache block. Support for maintaining coherency between caches on multiple processors is provided.

The ability to pre-load and lock the caches can be particularly important in some DSP applications. Locking the caches can guarantee real-time performance of critical sections of code, at the cost of a severe reduction in performance for portions of the application that require instructions or data that are not in cache. The ability to lock the caches is an advantage, but is not as flexible as the ability to lock only certain portions of the caches.

Since the PowerPC 604 has only one load/store unit, only one load or store can be performed per clock cycle. The PowerPC 604 uses a four-entry load buffer and a six-entry store buffer to reduce stalls. When a load or store misses the on-chip data cache, it is posted to the appropriate buffer. Other loads or stores can be executed out of order while the cache misses in the load and store buffers are processed. This allows several cache misses to occur before the load/store unit is completely stalled.

                        Instruction                           Data
Cache size              16 KBytes on 604; 32 KBytes on 604e   16 KBytes on 604; 32 KBytes on 604e
Cache associativity     4 way                                 4 way
Access width            128 bits                              64 bits
Line size               32 bytes                              32 bytes
Write policy            n/a                                   write-back/write-through
TLB page entries        128                                   128
TLB associativity       2 way                                 2 way
TLB block entries       4                                     4
Block size              128 KBytes - 8 MBytes                 128 KBytes - 8 MBytes
Virtual address space   52 bits                               52 bits

Table 6.6-1. PowerPC 604 on-chip cache and TLB parameters summary.



