Buyer's Guide to DSP Processors
BDTI

Copyright © 1997 Berkeley Design Technology, Inc.
The following is a six-page excerpt from the thirty-one-page Texas Instruments TMS320C62xx analysis from the third (1997) edition of Buyer's Guide to DSP Processors.

7.17 Texas Instruments TMS320C62xx Family

    Contents
  • Introduction
  • Architecture
    1. Data Path
    2. Memory System
    3. External Memory Interface
    4. Address Generation Units
    5. Pipeline
  • Instruction set
    1. Assembly Language Format
    2. Parallel Move Support
    3. Orthogonality
    4. Execution Times
    5. Instruction Set Highlights
  • Execution Control
    1. Clocking
    2. Hardware Looping
    3. Interrupts
    4. Stack
    5. Bootstrap Loading
  • Peripherals
  • On-Chip Debugging Support
  • Power Consumption and Management
  • Benchmark Performance
    1. Execution Performance
    2. Memory Usage
  • Cost
  • Fabrication Details
  • Development Tools
  • Applications Support
  • Advantages
  • Disadvantages

Introduction

The TMS320C62xx is the latest family of fixed-point DSP processors from Texas Instruments. The TMS320C62xx is based on a completely new architecture compared to previous DSP processor families from Texas Instruments. The processor contains eight execution units that include two multipliers and four ALUs. Using these eight execution units, the processor can execute up to eight 32-bit RISC-like instructions in a single clock cycle, enabling it to achieve a high level of parallelism. Instructions operate on 16-, 32-, or 40-bit data. The TMS320C62xx family is targeted at high-performance applications, such as wireless base stations, digital subscriber loops, multi-line modems, and ISDN modems.

Because the TMS320C62xx can execute up to eight instructions per clock cycle, the term ``instruction cycle'' is potentially ambiguous when discussing this processor. As used here, ``instruction cycle'' means the time required to execute a single group of one to eight parallel instructions. On the TMS320C62xx, one instruction cycle is equal in length to one master clock cycle. Additionally, since TMS320C62xx instructions often perform fewer operations than typical instructions on other DSPs, a MIPS comparison between the TMS320C62xx and other DSPs is not meaningful. Therefore, instead of MIPS, we use the number of MACs per second as a shorthand performance metric in this analysis.

The first member of the TMS320C62xx family, the TMS320C6201, was announced in February 1997. Currently, only advance release samples of the TMS320C6201 are available. These samples incorporate a limited set of peripherals, use a 2.5-volt core supply (with 3.3-volt I/O), and execute up to 240 million MACs per second when operating at 120 MHz. According to Texas Instruments, the full-speed production version, the 200 MHz TMS320C6201, will be available in late 1997 and will include a wider array of on-chip peripherals. (This part is referred to as a ``1,600 MIPS'' processor by Texas Instruments, since it is projected to execute a maximum of eight RISC-like instructions per clock cycle at 200 MHz. It will be capable of executing 400 million MACs per second when running at 200 MHz.) The analysis presented here is based on the advance release version of the TMS320C6201 except where noted. Table 7.17-1 shows the characteristics of the advance release version of the TMS320C6201.

TABLE 7.17-1 TMS320C6201 characteristics
Part    Operating     Speed (Millions of   On-Chip       On-Chip    Notes
        Voltage (V)   MACs per Second)     Program RAM   Data RAM
C6201   2.5/3.3*      240                  16Kx32        32Kx16     Two-channel DMA, 16-bit host port interface†

* The core operates at 2.5 volts while the peripherals are 3.3-volt compatible.
† According to Texas Instruments, the production version of the TMS320C62xx, slated for release in late 1997, will include two TDM-capable buffered serial ports and two timers. The production version is projected to operate at 400 million MACs per second at 200 MHz.
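
The peak figures quoted above follow directly from the clock rate, the two multipliers, and the eight-instruction issue width. The short sketch below reproduces that arithmetic; the function names are illustrative only and are not taken from any Texas Instruments tool.

```python
def peak_macs_per_second(clock_hz, multipliers=2):
    """Peak MAC rate, assuming each multiplier starts one multiply per cycle."""
    return clock_hz * multipliers


def peak_instructions_per_second(clock_hz, issue_width=8):
    """Peak instruction rate, assuming all eight execution units are busy."""
    return clock_hz * issue_width


advance_macs = peak_macs_per_second(120_000_000)          # advance release samples
production_macs = peak_macs_per_second(200_000_000)       # projected production part
production_mips = peak_instructions_per_second(200_000_000) / 1e6

print(advance_macs, production_macs, production_mips)
```

These are upper bounds; sustained rates depend on how many execution units an algorithm can actually keep busy.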

By using an architectural approach similar to those of VLIW (very long instruction word) processors, the TMS320C62xx achieves a high level of parallelism with a simple architecture. This is done by avoiding the need for complex instruction scheduling and dispatch hardware in the processor. Instead, the burden of instruction scheduling is shifted to the code generation tools or the assembly language programmer. This results in a simpler and faster processor architecture compared to processors with dynamic instruction scheduling. VLIW architectures typically suffer from several disadvantages, such as high program memory usage and complexity in designing efficient compilers. The TMS320C62xx architecture includes several features designed to reduce program memory requirements and alleviate other disadvantages typically associated with VLIW architectures. These features include instruction packing, conditional execution for all instructions, and variable-length instructions, all of which are discussed below. Despite these features, the TMS320C62xx consumes more program memory than other fixed-point DSPs, as detailed in our discussion of benchmark results, below.

Architecture

The core architecture of the TMS320C62xx family consists of two fixed-point data paths, a program control unit (including program fetch, instruction dispatch, and instruction decode units), and program and data memory interfaces. Figure 7.17-1 illustrates the TMS320C62xx family architecture as typified by the TMS320C6201.


FIGURE 7.17-1. TMS320C6201 processor architecture. Dashed blocks indicate peripherals that are to be added in the production release of the part, according to Texas Instruments.

Data Path

The TMS320C62xx has two nearly identical data paths. As illustrated in Figure 7.17-2, each data path has a set of four execution units, a general-purpose register file, and paths for moving data between memory and the data path. The execution units in each data path consist of L, S, M, and D units. Typically each unit operates on 32-bit operands, but the L and S units can also operate on 40-bit (``long'') operands. As described below, each execution unit is capable of performing a dedicated set of operations.


FIGURE 7.17-2. TMS320C62xx data paths. Each data path includes four execution units (L, S, M, and D), described in the text. The arrow between the data paths denotes the cross-paths that allow each data path to access the register file of the other data path.

  • The L units (L1 for data path one, L2 for data path two) each contain a 40-bit integer ALU. They are used for 32/40-bit arithmetic and compare operations, 32-bit logical operations, normalization, and bit count operations. The L units support saturated arithmetic for 32/40-bit operands via dedicated saturation instructions. All L-unit operations execute in a single instruction cycle.
  • The S units (S1 for data path one, S2 for data path two) each contain a 32-bit integer ALU and a 40-bit shifter. The S units are used to perform 32-bit arithmetic, logical and bit field operations, and 32/40-bit shifts. In addition, they are used for branching, constant generation, and register transfers to and from control registers. S-unit operations execute in a single instruction cycle, with the exception of branch instructions, which have single-cycle throughput but introduce five delay slots.
  • Multiplications are performed by the M units (M1 for data path one, M2 for data path two), which are capable of performing 16x16->32-bit multiplications. Multiplier operands may come from the higher or lower 16-bit portions of any 32-bit general-purpose register. Special multiply instructions support multiplication of lower and/or higher 16-bit register portions, enabling the use of pairs of 16-bit operands packed into 32-bit registers. Multiply operations have a latency of two instruction cycles. However, the multiply step is pipelined, making it possible to issue one multiply operation for each multiplier per instruction cycle. The multipliers support integer multiplications for signed, signed/unsigned, and unsigned operands. In addition, fractional multiplication is supported for signed operands.

The two-cycle latency of the multiplier complicates programming and forces the use of software pipelining in typical DSP algorithm implementations (e.g., convolution).

  • The D units (D1 for data path one, D2 for data path two) each contain a 32-bit adder/subtracter. They are used for address generation including linear and circular address calculations. Because each data path has one D unit, the processor can perform a total of two address calculations in one instruction cycle.
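
The M-unit behavior described above, multiplying selected 16-bit halves of 32-bit registers, can be modeled in a few lines. This is a minimal sketch of signed 16x16->32-bit multiplication only; the helper names are hypothetical and do not correspond to TMS320C62xx mnemonics.

```python
def to_signed16(x):
    """Interpret the low 16 bits of x as a 2's complement value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def pack16(high, low):
    """Pack two 16-bit values into one 32-bit register image."""
    return ((high & 0xFFFF) << 16) | (low & 0xFFFF)


def low16(reg):
    return to_signed16(reg)


def high16(reg):
    return to_signed16(reg >> 16)


def mul_low_low(a, b):
    """Signed product of the lower halves of two register images."""
    return low16(a) * low16(b)


def mul_high_high(a, b):
    """Signed product of the upper halves of two register images."""
    return high16(a) * high16(b)


a = pack16(3, -2)    # two samples in one register
b = pack16(-4, 5)    # two coefficients in one register
print(mul_low_low(a, b), mul_high_high(a, b))
```

Packing two 16-bit operands per register means one 32-bit load can feed two multiplies, which is what makes the packed forms attractive in inner loops.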

In the best case, all units operate in parallel, and the processor performs four arithmetic operations, two multiplications, and two address calculations in one instruction cycle.

Due to the processor's high clock speed and its ability to perform a maximum of eight operations in parallel, the TMS320C62xx offers a significant performance improvement over other DSP processors. This enables the use of a programmable DSP processor in applications that previously required multiple processors or custom hardware.

However, in many cases the processor's maximum level of parallelism cannot be achieved due to application constraints and resource limitations. For example, on the BDTI Benchmarks(TM), the typical level of parallelism achieved in algorithm kernels is six to seven instructions per cycle.

Note that the TMS320C62xx execution rate of eight RISC-like instructions per instruction cycle, i.e., ``1,600 MIPS'' at 200 MHz, cannot be directly compared to the performance of typical DSP processors. For example, a MAC instruction consisting of one multiplication, one addition, and two parallel moves is implemented as a single instruction on conventional DSP processors, but as three instructions on the TMS320C62xx.
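
Because a MAC is split into separate multiply and add instructions, and the multiply has a two-cycle latency, a dot product is normally software pipelined: each cycle issues a new multiply while accumulating a product issued earlier. The sketch below is a simplified cycle model under stated assumptions: a single multiplier, a product usable two cycles after issue, and operand loads ignored. It is not a description of actual TMS320C62xx scheduling.

```python
from collections import deque

MUL_LATENCY = 2  # assumption: product is usable two cycles after issue


def pipelined_dot(x, y):
    """Return (dot product, cycle count) for a software-pipelined loop."""
    in_flight = deque()  # entries are (cycle when ready, product)
    acc = 0
    cycle = 0
    n = len(x)
    while cycle < n or in_flight:
        # Add stage: consume the oldest product once it is ready.
        if in_flight and in_flight[0][0] <= cycle:
            acc += in_flight.popleft()[1]
        # Multiply stage: issue one new multiply per cycle while data remains.
        if cycle < n:
            in_flight.append((cycle + MUL_LATENCY, x[cycle] * y[cycle]))
        cycle += 1
    return acc, cycle


result, cycles = pipelined_dot([1, 2, 3, 4], [5, 6, 7, 8])
print(result, cycles)
```

In this model the loop takes the number of elements plus the multiply latency, rather than several cycles per element, at the price of explicit prolog and epilog phases that the programmer or the tools must manage.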

The TMS320C62xx provides two register files, A and B, each containing 16 32-bit general-purpose registers. These registers can be used for storing addresses or data. The registers are labeled A0-A15 for data path one and B0-B15 for data path two. To support 40-bit arithmetic, pairs of adjacent registers can be used to hold 40-bit (``long'') data. In this case the 32 LSBs are stored in an even-numbered register and the 8 MSBs are stored in the 8 LSBs of the next (odd-numbered) register. The remaining bits of the odd-numbered register are zero filled.
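
The register-pair layout for 40-bit data can be illustrated as follows. This is a sketch of the storage format described above, treating the 40-bit value as 2's complement; the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
MASK40 = (1 << 40) - 1


def split_long(value):
    """Split a signed 40-bit value into (even register, odd register) images."""
    if not -(1 << 39) <= value < (1 << 39):
        raise ValueError("value does not fit in 40 bits")
    raw = value & MASK40
    even = raw & MASK32          # 32 LSBs go to the even-numbered register
    odd = (raw >> 32) & 0xFF     # 8 MSBs go to the 8 LSBs of the odd register
    return even, odd             # upper 24 bits of the odd register stay zero


def join_long(even, odd):
    """Rebuild the signed 40-bit value from a register pair."""
    raw = ((odd & 0xFF) << 32) | (even & MASK32)
    return raw - (1 << 40) if raw & (1 << 39) else raw


print([hex(r) for r in split_long(-1)])
print(join_long(*split_long(-123456789012)))
```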

The TMS320C62xx implements a load-store architecture: operands must be loaded into the registers before they can be used by the execution units. Generally, the execution units of data path one operate on registers in register file A and the units of data path two operate on registers in register file B. However, the register files are interconnected to the opposite data path's functional units via cross paths. This allows both data paths to fetch one 32-bit operand per instruction cycle from the register file of the other data path.

In each data path, each execution unit has its own read and write ports to its register file. Thus, all the execution units in each data path can access the local register file simultaneously. This means that in an ideal situation all execution units in both data paths operate independently and eight simultaneous operations can be performed. However, some restrictions apply, the most significant of which are:

  • Only one 40-bit (long) result can be written to each register file per instruction cycle.
  • A 40-bit (long) register read cannot be issued in the same instruction cycle as a memory write from the same register file.
  • Two simultaneous memory accesses cannot use registers of the same register file as address pointers.
  • No more than four reads of the same register can be performed in one instruction cycle.

These limitations are minor considering the complex architecture of the processor and should not cause any major problems in most applications. However, selecting 40-bit arithmetic reduces the number of available registers and restricts parallelism. For example, only one 40-bit result can be written to each register file per instruction cycle and a 40-bit register read cannot be issued in the same instruction cycle as a memory write from the same register file.
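
The four restrictions listed above lend themselves to a mechanical check over one group of parallel instructions. The sketch below uses a hypothetical dictionary representation of each instruction (the keys `long_write`, `long_read`, `store_from`, `addr_file`, and `reads` are invented for illustration) and is not how Texas Instruments' tools represent or validate code.

```python
from collections import Counter


def check_parallel_group(ops):
    """Return a list of restriction violations for one instruction cycle."""
    violations = []

    # Rule 1: only one 40-bit result may be written to each register file.
    long_writes = Counter(op["long_write"] for op in ops if op.get("long_write"))
    for reg_file, count in sorted(long_writes.items()):
        if count > 1:
            violations.append(f"{count} long writes to register file {reg_file}")

    # Rule 2: no 40-bit read together with a memory write from the same file.
    long_reads = {op["long_read"] for op in ops if op.get("long_read")}
    stores = {op["store_from"] for op in ops if op.get("store_from")}
    for reg_file in sorted(long_reads & stores):
        violations.append(f"long read and memory write share register file {reg_file}")

    # Rule 3: two memory accesses may not take address pointers from one file.
    addr_files = Counter(op["addr_file"] for op in ops if op.get("addr_file"))
    for reg_file, count in sorted(addr_files.items()):
        if count > 1:
            violations.append(f"{count} address pointers from register file {reg_file}")

    # Rule 4: at most four reads of the same register.
    # Assumption: every operand occurrence counts as one read.
    reads = Counter(reg for op in ops for reg in op.get("reads", ()))
    for reg, count in sorted(reads.items()):
        if count > 4:
            violations.append(f"register {reg} read {count} times")

    return violations


group = [
    {"addr_file": "A", "reads": ["A4"]},
    {"addr_file": "A", "reads": ["A5"]},
]
print(check_parallel_group(group))
```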

Overflow protection is supported on the TMS320C62xx via saturation logic and 40-bit arithmetic. Saturation is supported by the L units via special instructions, such as add and subtract with saturation (SADD and SSUB). These instructions perform the indicated arithmetic operation and, in case of overflow, saturate the result to the largest or smallest value that can be represented using 2's complement arithmetic. The saturated result is either a 32- or 40-bit value, depending on the width of the destination register. The L, S, or M unit automatically sets the saturation bit in the control status register when saturation occurs; this bit can only be cleared via an explicit instruction. The L and S units can operate on 40-bit operands, which corresponds to having a 32-bit register with eight guard bits. A dedicated saturate instruction can be used to convert a 40-bit value to 32 bits and saturate the result.
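
The saturation behavior described above, including the sticky saturation bit, can be sketched as a small model. The class and method names are hypothetical; only the clamping to the 2's complement range and the set-until-cleared flag are taken from the text.

```python
INT32_MAX = (1 << 31) - 1
INT32_MIN = -(1 << 31)


class SaturatingUnit:
    """Toy model of saturating arithmetic with a sticky saturation flag."""

    def __init__(self):
        self.sat_flag = False

    def _saturate(self, value, bits):
        lowest, highest = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
        clamped = min(max(value, lowest), highest)
        if clamped != value:
            self.sat_flag = True  # stays set until explicitly cleared
        return clamped

    def add_sat(self, a, b, bits=32):
        return self._saturate(a + b, bits)

    def sub_sat(self, a, b, bits=32):
        return self._saturate(a - b, bits)

    def long_to_word(self, value40):
        """Convert a 40-bit value to 32 bits, saturating on overflow."""
        return self._saturate(value40, 32)

    def clear_flag(self):
        self.sat_flag = False


unit = SaturatingUnit()
print(unit.add_sat(INT32_MAX, 1), unit.sat_flag)
print(unit.add_sat(1, 2), unit.sat_flag)
```

The eight guard bits of the 40-bit format allow a long accumulation to overflow 32 bits temporarily, with a single saturating conversion applied at the end.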

Besides the bit that indicates the occurrence of saturation, no other status bits (carry, negative, etc.) are provided by the TMS320C62xx data paths. The TMS320C62xx does not provide hardware rounding.

The lack of common status bits is not a problem on the TMS320C62xx; due to its parallel architecture and conditional instruction execution, status bits can be implemented (emulated) in software without a significant performance penalty.
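
As an illustration of how status bits can be emulated, the sketch below derives carry, negative, and zero indications from an ordinary 32-bit add using comparisons, then uses the emulated carry to build a 64-bit add. This shows the general technique only; it is not taken from TMS320C62xx code, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def add32_with_flags(a, b):
    """Return (result, carry, negative, zero) for a 32-bit add."""
    a &= MASK32
    b &= MASK32
    result = (a + b) & MASK32
    carry = result < a              # unsigned compare stands in for a carry bit
    negative = bool(result >> 31)   # sign bit of the 32-bit result
    zero = result == 0
    return result, carry, negative, zero


def add64(a_high, a_low, b_high, b_low):
    """64-bit add built from 32-bit adds and the emulated carry."""
    low, carry, _, _ = add32_with_flags(a_low, b_low)
    high = (a_high + b_high + (1 if carry else 0)) & MASK32
    return high, low


print(add32_with_flags(0xFFFFFFFF, 1))
print(add64(0, 0xFFFFFFFF, 0, 1))
```

On a processor with spare execution units, the extra compare can often run in parallel with other work, and a conditionally executed instruction can consume its result.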

