Fixed-Point DSP Processors
BDTI

Lucent DSP16xxx Pushes the Boundaries of Traditional DSP Architecture

Introduction

In September 1997, Lucent Technologies introduced their latest fixed-point DSP processor family: the DSP16xxx. Rather than branching out in a completely new direction (as Texas Instruments has chosen to do with the VLIW-like TMS320C62xx), Lucent decided to extend the traditional DSP architectural style. The '16xxx is based on Lucent's earlier '16xx family of DSPs, but Lucent expanded the data bus from 16 to 32 bits, and included lots of extras in the data path to boost the performance of the DSP16xxx in communications applications. Like older DSPs, the DSP16xxx sacrifices ease of programming and generality to achieve strong performance in DSP applications.

The first member of the DSP16xxx family, the DSP16210, is expected to begin sampling at 100 MIPS and 3.3 volts in December 1997, with volume shipments planned for mid-1998. Pricing for the 144-pin TQFP and 132-pin BQFP packages in quantities of 1,000 pieces is projected to be around $100. The DSP16210 is pin-compatible with the DSP1620, and Lucent plans to supply an assembly-code translator to help users port '16xx code to the '16xxx. These two facts suggest that Lucent intends to leverage the success of the DSP16xx in the communications market. Lucent has garnered a 30% share of the DSP market (according to DSP market research firm Forward Concepts) by making highly specialized communications-oriented DSPs and selling them to big communications equipment vendors, and it looks like the DSP16xxx is poised to continue the tradition.

Architecture

Although the DSP16xxx data path is designed to process 16-bit data, the processor uses 32-bit data buses internally, as shown in Figure 1. The 32-bit data buses allow the DSP16xxx to process both 16-bit and 32-bit instruction words. (They also let the DSP16xxx operate on pairs of adjacent 16-bit data words, but we'll talk about that later.)

Figure 1: DSP16210 Architecture

The 32-bit data buses are the first improvement you'll notice over the older DSP16xx, which uses 16-bit buses. To see the other major improvements, we need to take a look at the data path.

Data Path

So, what exactly are the data path "extras" I mentioned? First, the DSP16xxx includes not one, but two independent 16x16->32 multipliers. In addition, the '16xxx sports an extra 3-input 40-bit adder (separate from the ALU) and a bit manipulation unit, as shown in Figure 2. Each of the units is designed to operate with single-cycle throughput, and certain combinations of operations can be encoded in a single instruction and executed in parallel. The DSP16xxx is able to use its 32-bit data path to perform single-instruction, multiple-data (SIMD)-style additions and subtractions in the ALU by splitting 32-bit data inputs into pairs of 16-bit operands. With its two multipliers and separate ALU and adder, the DSP16xxx can produce two MAC results per instruction cycle, a powerful feature shared by only a select few high-performance DSPs. The DSP16xxx also includes what Lucent calls a "traceback encoder" to facilitate Viterbi decoding. The bit manipulation unit performs arithmetic and logical barrel shifting, bit field insert and extract, and exponent detection and normalization. Eight 40-bit accumulators (each providing 8 guard bits) help support the many execution units.
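To make the two-MACs-per-cycle idea concrete, here is a minimal Python sketch of a dot product that retires two 16x16->32 multiply-accumulates per loop iteration into two 40-bit accumulators. This is not Lucent code: the function names, the even/odd split of the data, and the wrap-to-40-bits model of the accumulators are all assumptions made for illustration.

```python
def wrap40(value):
    """Wrap an integer to a signed 40-bit value (32 bits plus 8 guard bits)."""
    value &= (1 << 40) - 1
    return value - (1 << 40) if value & (1 << 39) else value


def dual_mac_dot(x, y):
    """Dot product of two 16-bit sequences, two MACs per loop iteration.

    Returns (result, iterations). Each iteration stands in for one
    instruction cycle on a dual-multiplier data path (an assumption of
    this sketch, not a cycle-accurate model).
    """
    assert len(x) == len(y) and len(x) % 2 == 0
    a0 = a1 = 0  # two hypothetical 40-bit accumulators
    for i in range(0, len(x), 2):
        p0 = x[i] * y[i]            # first 16x16->32 multiplier
        p1 = x[i + 1] * y[i + 1]    # second 16x16->32 multiplier
        a0 = wrap40(a0 + p0)        # accumulate in the ALU
        a1 = wrap40(a1 + p1)        # accumulate in the separate adder
    return wrap40(a0 + a1), len(x) // 2
```

With 256 full-scale products (32767 x 32767), the sum is far larger than a 32-bit accumulator could hold, but it still fits in 40 bits, which is the job the 8 guard bits are there to do.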

The '16xxx data path has been liberally sprinkled with limited-capability shifters and hardware saturation units, designed to ease implementation of bit-exact communications algorithms such as the EFR-GSM standard. Shifting and saturation are controlled by mode bits, as are a number of other processing parameters. In fact, the DSP16xxx uses lots of mode bits, mainly because this is an effective method of containing code size. A side effect of this technique is increased programming difficulty; the programmer must know not only which instruction is being executed, but also the status of all the mode bits in order to determine exactly how the processor will respond to a given instruction.
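The point about mode bits can be illustrated with a short Python sketch. The same addition gives different results depending on a saturation flag that lives outside the instruction. The `saturate` parameter below is a hypothetical stand-in for such a mode bit; it does not correspond to any named DSP16xxx control register field.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)


def add_with_mode(acc, operand, saturate):
    """Add operand to acc; clamp to the 32-bit range only if saturate is set.

    'saturate' models a hypothetical mode bit. The "instruction" (the add)
    is identical in both cases; only the mode differs.
    """
    total = acc + operand
    if saturate:
        return max(INT32_MIN, min(INT32_MAX, total))
    return total  # overflow spills into the guard bits instead


# Same operation, two different answers:
clamped = add_with_mode(INT32_MAX, 1, saturate=True)
unclamped = add_with_mode(INT32_MAX, 1, saturate=False)
```

Reading the add alone does not tell you which of the two results you will get, which is exactly the difficulty the mode bits create for the programmer.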

Figure 2: DSP16xxx Data Path

Memory

The DSP16xxx uses a modified Harvard memory architecture, with dual on-chip data and address buses. One bus set processes instructions and constant data (no writes allowed); the other processes non-constant data. Address buses are 20 bits wide; data buses are 32 bits wide. On-chip memory consists of a 60Kx16 dual-port RAM and 8Kx16 dual-port ROM on the DSP16210. Words are 16 bits wide, but can be accessed as 32-bit double-words if they are stored as pairs in memory. Such pairs are not required to be aligned on 32-bit boundaries, but if they aren't, the core may have to wait an extra instruction cycle for the data. A single-cycle penalty is also incurred if both bus sets attempt to access the same 1Kx16 block of on-chip memory.
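The two penalty conditions described above can be written as simple address checks. The Python sketch below assumes addresses count 16-bit words and that "aligned on a 32-bit boundary" means an even word address; both are assumptions of this illustration, and the helper names are invented here.

```python
BLOCK_WORDS = 1024  # size of one 1Kx16 on-chip memory block, from the text


def is_misaligned_double(word_addr):
    """True if a 32-bit double-word access starts on an odd 16-bit word.

    Assumes word addressing and that aligned pairs start at even addresses.
    """
    return word_addr % 2 == 1


def same_block(addr_a, addr_b):
    """True if two word addresses fall in the same 1Kx16 block."""
    return addr_a // BLOCK_WORDS == addr_b // BLOCK_WORDS
```

A programmer laying out buffers would want double-word data to start at addresses where `is_misaligned_double` is false, and would want the two operand streams of an inner loop placed so that `same_block` is false for simultaneous accesses.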

Addressing modes include register indirect with post-modification options, indexed, and modulo addressing (to support circular buffering). The DSP16xxx also provides good support for immediate data.
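As a rough illustration of modulo addressing with post-modification, the Python sketch below returns the address used for the current access and then steps the pointer, wrapping it inside a circular buffer. The function name and the base/length parameters are hypothetical; on a real DSP the buffer bounds would come from dedicated registers.

```python
def post_modify_modulo(pointer, step, base, length):
    """Return (address_used, next_pointer) for a circular buffer.

    The current pointer is used for the access, then advanced by 'step'
    and wrapped so it stays within [base, base + length).
    """
    address_used = pointer
    next_pointer = base + (pointer - base + step) % length
    return address_used, next_pointer
```

For a 4-word buffer starting at address 4, a post-increment from address 7 wraps back to 4, so a filter's delay line can be walked indefinitely without any explicit bounds check in the loop.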

The DSP16xxx includes a 31-word instruction cache, which can contain either 16-bit or 32-bit instructions. The cache is loaded via the DO instruction, which invokes a zero-overhead hardware-assisted loop for a specified number of repetitions. When instructions are executed from cache, both bus sets can be used to transfer data, increasing the effective memory bandwidth of the processor. The on-chip memory bandwidth for the 100 MIPS DSP16210 (assuming words are arranged as pairs in memory and instructions are executed from cache) is 400 million 16-bit reads per second. The cache does not support nested hardware loops, which complicates programming and detracts from its performance in some applications.

The two internal bus sets are multiplexed onto a 16-bit data and 16-bit address external memory interface. Several chip select lines are also generated and sent off-chip. The maximum off-chip memory bandwidth of the 100 MIPS DSP16210 is 100 million 16-bit reads per second. This is only a quarter of the rate possible using on-chip memory, so it's especially important to make efficient use of on-chip memory on the DSP16xxx.
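The on-chip and off-chip bandwidth figures follow from simple arithmetic, sketched below in Python. The breakdown into "bus sets" and "16-bit words per access" is our reading of the text (two 32-bit bus sets on-chip, one 16-bit interface off-chip), and the rates are per second at 100 million instruction cycles per second.

```python
def reads_per_second(cycles_per_second, bus_sets, words_per_access):
    """Peak number of 16-bit reads per second."""
    return cycles_per_second * bus_sets * words_per_access


CYCLE_RATE = 100_000_000  # 100 MIPS DSP16210

# On-chip: both bus sets busy, each moving a 32-bit pair (two 16-bit words).
on_chip = reads_per_second(CYCLE_RATE, bus_sets=2, words_per_access=2)

# Off-chip: one multiplexed interface, one 16-bit word per cycle.
off_chip = reads_per_second(CYCLE_RATE, bus_sets=1, words_per_access=1)

ratio = off_chip / on_chip
```

Here `on_chip` comes to 400 million reads per second and `off_chip` to 100 million, so `ratio` is 0.25, the "quarter" mentioned above.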

Programming

The DSP16xxx uses a C-like assembly language syntax that is very similar to that of its predecessor, the DSP16xx. Instruction words can be 16 or 32 bits wide. Many of the 16-bit instructions available on the DSP16xxx are similar (or even identical) to those found on the '16xx, but the two processors are not software compatible. As mentioned earlier, Lucent Technologies plans to supply an assembly-code translator to help users port '16xx code to the '16xxx. The DSP16xxx 32-bit instructions typically allow more operations to be executed in parallel, and provide better support for parallel data moves. Let's look at a sample 32-bit instruction:
a0=a4+p0 a1=a5+p1 p0=xh*yl p1=xl*yh x=*pt0++ y=*r1++

This instruction directs the DSP16xxx to perform two additions, two multiplications (using the high and low 16-bit halves of the multiplier input registers) and two data transfers (with address register post-increment). The operations within the instruction are executed from left to right; e.g., p0 is added to a4 before p0 receives the result of xh*yl.
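The left-to-right behavior of the sample instruction can be mimicked in a few lines of Python. This is only a behavioral sketch: registers are held in a dictionary, x and y are modeled as (high, low) pairs of 16-bit values, and each memory list entry holds one such pair so the pointers advance by one per load. All of those modeling choices are assumptions made here, not details of the real hardware.

```python
def execute_sample(state, x_mem, y_mem):
    """Model: a0=a4+p0 a1=a5+p1 p0=xh*yl p1=xl*yh x=*pt0++ y=*r1++"""
    s = dict(state)  # work on a copy

    # Additions first: they consume the OLD products p0 and p1.
    s["a0"] = s["a4"] + s["p0"]
    s["a1"] = s["a5"] + s["p1"]

    # Multiplications next: they consume the OLD contents of x and y.
    xh, xl = s["x"]
    yh, yl = s["y"]
    s["p0"] = xh * yl
    s["p1"] = xl * yh

    # Data moves last: load new operands, then post-increment the pointers.
    s["x"] = x_mem[s["pt0"]]
    s["pt0"] += 1
    s["y"] = y_mem[s["r1"]]
    s["r1"] += 1
    return s


before = {"a0": 0, "a1": 0, "a4": 10, "a5": 20, "p0": 1, "p1": 2,
          "x": (3, 4), "y": (5, 6), "pt0": 0, "r1": 0}
after = execute_sample(before, x_mem=[(7, 8)], y_mem=[(9, 10)])
```

In this run, a0 receives 10 + 1 (the old p0), and only afterward does p0 become 3 * 6. Each stage works on values produced by an earlier instruction, which is what lets the whole line act as one step of a software-pipelined loop.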

The DSP16xxx uses a relatively simple three-stage pipeline with few hazards to ambush the unwary programmer. The pipeline is not interlocked, but the assembler automatically inserts NOPs to avoid code sequences that could generate incorrect results. All instructions (except branches) have single-cycle latency.

The DSP16xxx allows some combinations of operations to be executed in parallel, but there are severe limitations on the combinations that are allowed. For example, you can specify a round operation together with a shift operation, but you can't specify a round operation with a multiply (even though the two operations use different execution units). These restrictions mean that the DSP16xxx is not especially straightforward to program; the army of mode bits doesn't help, either. Like most DSPs, the '16xxx also places restrictions on which registers can be used for what. Lucent could have used the 32-bit instructions to implement a very regular and easy instruction set; instead, they opted to go for a higher level of parallelism than is typically found on traditional DSPs and put up with the resulting programming difficulty. DSP16xx programmers will probably be comfortable programming the DSP16xxx, both because of the similarity in instructions and because the '16xx wasn't especially easy to program, either.

Peripherals

The DSP16210 will include an 8-bit bit I/O port (each bit can be individually configured as an input or output), two timers, two serial ports, and a parallel host port. The host port and one of the serial ports have DMA capabilities.

Performance

The next obvious question to be addressed is, how does the DSP16xxx perform on typical DSP applications?

The DSP16xxx has been benchmarked using BDTI's suite of 11 DSP algorithms, which include such DSP functions as FIR filtering, IIR filtering, and an FFT. On each benchmark, BDTI measures five quantities: cycle counts, execution time, cost-execution time (a combined figure of merit), energy consumption, and memory usage. Our results indicate that the DSP16xxx per-cycle efficiency (i.e., the amount of work it is able to accomplish in a single instruction cycle) has been significantly increased over that of the older DSP16xx. In fact, it has been doubled. Hence, the DSP16210 at 100 MIPS is a substantially faster processor than the 120 MIPS DSP1620. The DSP16210 is also faster than several other traditional DSP processors, such as TI's TMS320VC549 and Motorola's DSP56303, but it can't keep up with the 150 MHz TMS320C6201. On the other hand, the BDTI Benchmarks™ have shown that the '16xxx doesn't require nearly as much memory or energy as the 'C6201, making it more attractive than the TI processor for applications that are sensitive to these considerations. Lucent's goal was to balance speed, power consumption, and memory use, and the DSP16xxx reflects the results of those trade-offs.
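As a back-of-the-envelope check on the DSP16210 versus DSP1620 comparison, the Python sketch below multiplies instruction rate by relative work per cycle. It takes the "doubled" per-cycle efficiency at face value as a factor of exactly 2, which is a simplification; actual benchmark results vary by algorithm, and the function name is ours.

```python
def relative_throughput(mips, work_per_cycle):
    """Instruction rate times relative work done per cycle (arbitrary units)."""
    return mips * work_per_cycle


dsp16210 = relative_throughput(mips=100, work_per_cycle=2)  # doubled efficiency
dsp1620 = relative_throughput(mips=120, work_per_cycle=1)   # baseline

speedup = dsp16210 / dsp1620
```

Under that simplification, `speedup` works out to 200/120, or roughly 1.67, despite the DSP16210's lower instruction rate.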

BDTI makes the results of its composite benchmark score, called the BDTImark™, publicly available on the World Wide Web. This score measures a processor's overall speed on the 11 BDTI Benchmarks. Below, we present the BDTImark results for the DSP16210 and a variety of other DSP processors as of September 1997. Note that, except for the DSP16210, processor speeds shown here are the fastest that are currently available; the TMS320C6201, for example, is expected to begin sampling at 200 MHz in the near future. Results for the DSP16210 are projected. For a more detailed description of the BDTImark, please refer to our white paper.


A complete analysis of this processor, including full BDTI Benchmark™ results, is contained in BDTI's report, Buyer's Guide to DSP Processors, 2001 Edition.

About the Author

Jennifer Eyre is a DSP Analyst and Manager of Analysis and Publications with Berkeley Design Technology, Inc. Ms. Eyre received her B.S. and M.S. degrees in electrical engineering from UCLA, and is co-author of several BDTI technical reports, including Inside the Lucent DSP16000.
