

# Inside the Siemens TriCore

A Technical Evaluation

by the staff of

Berkeley Design Technology, Inc.

Excerpted and abridged for reprint.

The following is an excerpt from BDTI's report, *Inside the Siemens TriCore*.

Contents of this excerpt include:

- *Introduction*
- *Scope*
- *The TriCore Processor Core*
- *Benchmark Performance on:*
  - *Execution Time*
  - *Memory Usage*
- *Summary*

The complete report may be ordered from BDTI; details are in the order form at the end of this excerpt.

## Introduction

*Inside the Siemens TriCore* is a technical evaluation published by BDTI (Berkeley Design Technology, Inc.) in February 1999. It provides an in-depth technical analysis of TriCore, the processor core from Infineon (formerly Siemens Semiconductor).

TriCore is a hybrid DSP/microcontroller, designed to support both DSP- and control-oriented tasks. In *Inside the Siemens TriCore*, the technical staff of BDTI evaluates the DSP capabilities of TriCore and explores how TriCore's architecture addresses the needs of DSP applications. The report includes both a detailed qualitative analysis of TriCore's architecture and a quantitative evaluation based on TriCore's performance on a series of DSP benchmarks developed by BDTI.

At the time the report was written in early 1999, initial production devices based on TriCore were expected to run at speeds of 66-80 MHz using a 2.5-volt supply. The core is targeted at high-performance disk drives, digital imaging, communications systems, and automotive engine management systems.

TriCore will mainly be used in Infineon-designed application-specific chips with varying configurations of on-chip memory and peripherals. In addition, Infineon offers TriCore for licensing.

The TriCore architecture was introduced in September 1997, and the first version of the processor core, the TriCore 1, was announced in March 1998.

## Scope

*Inside the Siemens TriCore* is intended for anyone interested in understanding the DSP performance and capabilities of TriCore. It assumes a basic knowledge of DSP processor concepts and terms, both of which are covered in BDTI's text, *DSP Processor Fundamentals*. *Inside the Siemens TriCore* is especially useful for electronic system designers, hardware and software engineers, processor designers, engineering managers, and product marketing managers. It will aid in the assessment of TriCore's suitability for a given application, and will allow engineers and systems designers to make informed decisions when considering TriCore for their latest designs.

For comparison purposes, the report includes brief analyses of several other processors. These processors were chosen to provide meaningful comparisons with TriCore. Texas Instruments' TMS320C27xx, Motorola's DSP568xx, ARM's ARM7TDMI/Piccolo, and Hitachi's SH-DSP are similar to TriCore in that they are all designed to address both DSP and microcontroller tasks. Lucent's DSP16xxx is similar to TriCore in that it is a dual-MAC architecture. TI's TMS320C54x is well-known to many DSP product developers, and is included as a performance reference.

## About BDTI

Berkeley Design Technology, Inc. (BDTI) was founded in 1991 to assist companies in creating, selecting, and using DSP technology. The technical staff of BDTI has extensive experience in the development of DSP-intensive software and hardware for commercial applications. BDTI offers a variety of technical products and services, including:

- Published reports on DSP processors and technology
- DSP software development services
- Technical advisory services
- Training

## The TriCore Processor Core

TriCore is a superscalar, hybrid DSP processor/microcontroller core with features to support both DSP applications and control-oriented applications. The report focuses on the DSP capabilities of version 1.1 of TriCore, and is based on preliminary information provided by Infineon.

TriCore's superscalar architecture is built around a 32-bit fixed-point data path, a load/store unit, and a program-control unit. TriCore can execute up to three instructions per cycle: one data-path instruction, one load/store instruction, and one loop instruction. Thus, achieving the peak instruction execution rate requires data-path, load/store, and loop instructions to be carefully grouped.

TriCore's data path supports SIMD (single-instruction multiple-data) operations; the processor can split its data path to process two 16-bit words or four 8-bit bytes. Using its SIMD capabilities, TriCore can execute two 16-bit multiplies with single-cycle throughput—twice the multiplication throughput of today's midrange DSP processors.
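The split data path described above can be modeled in a few lines of Python. This is only an illustrative sketch of two independent 16-bit multiplies on packed 32-bit words; the helper names (`to_signed16`, `pack16`, `dual_mul16`) are hypothetical and do not correspond to TriCore instructions.

```python
def to_signed16(x):
    """Interpret the low 16 bits of x as a two's-complement value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def pack16(hi, lo):
    """Pack two signed 16-bit values into one 32-bit word (hypothetical helper)."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def dual_mul16(a, b):
    """Multiply the high and low 16-bit lanes of a and b independently.

    Returns (high-lane product, low-lane product). On a SIMD data path
    both products would be produced by a single instruction.
    """
    a_hi, a_lo = to_signed16(a >> 16), to_signed16(a)
    b_hi, b_lo = to_signed16(b >> 16), to_signed16(b)
    return a_hi * b_hi, a_lo * b_lo


a = pack16(3, -2)
b = pack16(-5, 7)
products = dual_mul16(a, b)  # (-15, -14)
```

A scalar data path would need two multiply instructions to produce the same pair of results, which is the source of the throughput advantage noted above.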

### Mixed-Width Instruction Set

TriCore uses a mixed 16/32-bit instruction set. DSP-oriented instructions are 32 bits wide; 16-bit instructions are aimed at control-oriented processing. TriCore programmers can freely intermix the two instruction widths without the need to set mode bits.

Although TriCore's instruction set is more regular and orthogonal than that of the average DSP, optimizing assembly code on a SIMD superscalar processor carries its own difficulties. For example, the programmer must expend significant effort to pair data path and load/store instructions so execution units aren't kept waiting, and effective use of SIMD often requires special optimization and data-organization techniques that take some getting used to.

### Pipeline

TriCore uses a four-stage pipeline consisting of fetch, decode, execute, and write-back stages. The pipeline depth is comparable to that of most mainstream DSP processors and considerably shorter than that of some of the latest high-end DSPs, such as TI's TMS320C6xxx. On many DSP processors, certain combinations of instructions can cause pipeline stalls, and instructions must be carefully arranged to avoid them. Compared with most DSP processors, TriCore experiences fewer such pipeline stalls, since nearly all TriCore instructions (except taken branches and multiplications) complete in a single instruction cycle.

### Addressing

To generate the data-access patterns common in digital signal processing, TriCore supports a number of DSP-oriented addressing modes, including register-indirect with pre- and post-increment, indexed addressing, circular (modulo) addressing, and bit-reversed addressing (useful for unscrambling the inputs or outputs of some FFT algorithms). TriCore also supports zero-overhead hardware looping.
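For readers less familiar with these access patterns, the following Python sketch shows the index sequences that circular and bit-reversed addressing produce. It models only the address arithmetic in software; the function names are hypothetical, and on TriCore the equivalent updates are performed by the addressing hardware.

```python
def circular_next(index, step, length):
    """Advance index by step, wrapping within a buffer of `length` entries."""
    return (index + step) % length


def bit_reverse(index, bits):
    """Reverse the low `bits` bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


# Step through a 4-entry delay line five times: the index wraps back to 0.
order = []
i = 0
for _ in range(5):
    order.append(i)
    i = circular_next(i, 1, 4)
# order == [0, 1, 2, 3, 0]

# Bit-reversed access order for an 8-point FFT (3 address bits).
fft_order = [bit_reverse(n, 3) for n in range(8)]
# fft_order == [0, 4, 2, 6, 1, 5, 3, 7]
```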

### Data Access

Data loads and stores are performed on single bits, bytes, or 16-, 32-, or 64-bit words. With the exception of bit and byte accesses, which can be performed at any memory location, all data accesses must address 16-bit-aligned memory locations. TriCore's support for single-bit accesses is mainly of interest in control-oriented code, and it is not found on any commercially available DSP processors. TriCore can perform either one load or one store per instruction cycle; on a 66-MHz TriCore, the maximum on-core data-memory bandwidth is thus 528 MBytes/s.
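The bandwidth figure follows from one 64-bit (8-byte) access per cycle at 66 MHz. A minimal sketch of the arithmetic, using a hypothetical helper name:

```python
def peak_bandwidth_mbytes(clock_mhz, bits_per_access):
    """Peak data bandwidth in MBytes/s, assuming one access per clock cycle."""
    return clock_mhz * bits_per_access / 8


bandwidth = peak_bandwidth_mbytes(66, 64)  # 528.0
```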

TriCore's limitation of one data access per clock cycle contrasts with most DSP processors, which can typically retrieve at least two separate data words while executing instructions within a tight loop. Although TriCore can retrieve up to four contiguous 16-bit data words in a single data access, the four words are routed to a single register rather than to separate registers. Programmers may need to unroll loops because TriCore does not allow the programmer to specify an offset into a 64-bit register to access and operate on its constituent 16-bit words.
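To make the packing concrete, the sketch below splits one 64-bit word into its four 16-bit constituents in Python. The lane order (lowest lane first) and the function name are assumptions made for illustration; the sketch does not model TriCore's actual instructions, only the fact that each 16-bit word occupies a fixed position that the code must address explicitly.

```python
def unpack4x16(word64):
    """Split a 64-bit word into four signed 16-bit values, lowest lane first."""
    lanes = []
    for lane in range(4):
        value = (word64 >> (16 * lane)) & 0xFFFF
        lanes.append(value - 0x10000 if value & 0x8000 else value)
    return lanes


word = 0x0004_FFFD_0002_FFFF
samples = unpack4x16(word)  # [-1, 2, -3, 4]
```

Because each lane sits at a fixed position, a loop that consumes one 16-bit word per iteration typically has to be written out as several iterations' worth of code per 64-bit load, which is the unrolling mentioned above.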

### Coprocessor

TriCore includes a coprocessor, directly connected to its register file, which is capable of performing powerful bit-interleaving functions. For example, in BDTI's convolutional-encoder benchmark (based on the IS-54 standard for digital cellular phones), TriCore is able to perform the needed bit interleaving in one instruction cycle per 32-bit word. In contrast, most DSP processors require 40-50 cycles to perform the same operation. Although it is not considered an integral part of the core, Infineon states that the coprocessor will be included in every TriCore-based chip and is also available to core licensees. Infineon has alluded to the possibility of providing other coprocessors in the future.
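The excerpt does not specify the exact interleaving pattern the coprocessor implements. As a rough illustration only, the Python sketch below interleaves the bits of two 16-bit words into one 32-bit word, one bit at a time; the pattern and the name `interleave16` are assumptions, not a description of the TriCore coprocessor.

```python
def interleave16(a, b):
    """Interleave the bits of two 16-bit words into one 32-bit word.

    Bit i of a lands at bit 2*i of the result; bit i of b at bit 2*i + 1.
    """
    out = 0
    for i in range(16):
        out |= ((a >> i) & 1) << (2 * i)
        out |= ((b >> i) & 1) << (2 * i + 1)
    return out


even = interleave16(0xFFFF, 0x0000)  # 0x55555555
odd = interleave16(0x0000, 0xFFFF)   # 0xAAAAAAAA
```

A software loop of this kind performs several shift, mask, and OR operations per bit, which suggests why processors without dedicated hardware can spend tens of cycles per word on such a task.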

## Benchmark Performance

### About the BDTI Benchmarks™

*The BDTI Benchmarks are a set of DSP software functions that BDTI has independently designed to provide an objective basis for comparing processor performance characteristics such as speed and memory use for DSP applications. The BDTI Benchmark functions are implemented in assembly language to allow a realistic assessment of processors' DSP performance. The resulting software is then verified for functional correctness, optimality, and adherence to the BDTI Benchmark specifications. Benchmark performance results are obtained through manual analysis and careful, detailed simulation, or by measurement on sample devices.*

### Execution Time

The execution time for a BDTI Benchmark function is defined as the amount of time required by the processor to complete the benchmark's initialization, kernel, and termination sections. To determine the execution time of a particular benchmark on a given processor, the number of instruction cycles the processor requires to execute the benchmark is multiplied by the processor's instruction cycle time. *Inside the Siemens TriCore* includes tables and charts illustrating the number of cycles required by each processor to execute each benchmark, and uses these results to generate corresponding tables and charts of execution times at a specified clock speed.
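The calculation can be sketched as follows. The cycle counts here are hypothetical values chosen for illustration, not BDTI Benchmark results, and the sketch assumes one instruction cycle per clock cycle.

```python
def execution_time_us(cycles, clock_mhz):
    """Execution time in microseconds: cycle count times cycle time (1 / clock)."""
    return cycles / clock_mhz


# Hypothetical cycle counts, not BDTI results.
t_a = execution_time_us(1000, 66)   # about 15.15 microseconds
t_b = execution_time_us(1400, 100)  # 14.0 microseconds
```

In this made-up case the second processor needs 40% more cycles yet finishes slightly sooner because of its higher clock rate, which is why the report presents both cycle counts and execution times.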

For each benchmark, BDTI provides quantitative results and an analysis of why the processors perform as they do.

*Note: For packaged processors, the instruction cycle times used in the subsequent analysis are for the fastest version of each processor available in sample quantities as of October 1998, according to the manufacturer. For the ARM7TDMI/Piccolo, BDTI used the fastest (to our knowledge) speed implemented at the time of the analysis, in January 1999. For the TMS320C2700 core and the TriCore core, BDTI used clock speeds projected by Texas Instruments and Infineon, respectively, for mid-1999. (Both of these processors have since achieved these clock speeds.) Actual core speeds vary with implementation. BDTI has not independently verified maximum device speed information.*

*Readers should be aware that device speeds are constantly increasing, and for this reason, we strongly recommend contacting manufacturers for updated information about maximum device speeds.*

### Sample Benchmark Results

The execution time results for BDTI's block FIR filter benchmark are shown in the accompanying figure. As illustrated by this benchmark result, TriCore's performance at 66 MHz is quite strong, approaching that of Texas Instruments' 100-MHz DSP, the TMS320VC549. It is interesting to note, however, that TriCore's performance relative to the TMS320VC549 is somewhat less than might be expected given TriCore's dual-MAC capabilities (the TMS320VC549 can execute only one MAC per cycle). The disparity arises in this case because TriCore incurs a performance penalty as a result of its data alignment restrictions. TriCore's superscalar architecture and SIMD capabilities give it an edge over the other hybrids shown here in terms of per-cycle efficiency, providing a reminder that higher clock speeds (or MIPS ratings) don't necessarily mean better performance. Like all processors with SIMD capabilities, TriCore exhibits better performance in algorithms with high data parallelism.

### Overall Execution Time Results

In addition to providing execution time results on each individual benchmark, the report also provides an overall speed measure that is generated by normalizing the results on each benchmark and summing them together. Normalization effectively applies a uniform weighting to each benchmark for this analysis; without the normalization step, benchmarks that require more time would tend to be weighted more heavily in the overall results.
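The excerpt does not state which reference value is used for normalization. The Python sketch below assumes, purely for illustration, that each benchmark's results are divided by the mean execution time across processors before summing; the processor names, benchmark names, and timings are hypothetical.

```python
def overall_scores(times):
    """Sum per-benchmark normalized execution times for each processor.

    times: {benchmark: {processor: execution_time}}
    """
    scores = {}
    for per_processor in times.values():
        # Assumed reference: mean time across processors on this benchmark.
        reference = sum(per_processor.values()) / len(per_processor)
        for processor, t in per_processor.items():
            scores[processor] = scores.get(processor, 0.0) + t / reference
    return scores


# Hypothetical timings, not BDTI results.
times = {
    "short_benchmark": {"P1": 1.0, "P2": 3.0},
    "long_benchmark": {"P1": 300.0, "P2": 100.0},
}
scores = overall_scores(times)  # {"P1": 2.0, "P2": 2.0}
```

Summing the raw times instead would give 301 for P1 and 103 for P2, so the long benchmark would dominate the comparison; after normalization each benchmark contributes equally.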

### Memory Usage

Speed is often the first metric designers use to compare processors. Memory use is also of interest, however, for several reasons. For example, memory use may have a significant impact on overall system cost. Memory use can also affect processors' performance; if application software and data cannot fit entirely in on-chip memory, a significant performance degradation may occur on many processors. Because of these and other factors, memory use is an important metric for processor selection.

For each benchmark, BDTI reports each processor's program, constant data, and non-constant data memory use. With one exception, the BDTI Benchmarks are optimized first for maximum speed, then for minimum memory usage, because this is usually the order of priorities in DSP applications. The exception to this rule is the finite state machine benchmark, described below.

### Finite State Machine Benchmark

The BDTI Benchmarks include one benchmark function specifically designed to evaluate memory use for control-oriented programs. Control-oriented code usually takes up the bulk of a DSP application's memory requirements but only a fraction of the application's processing time. Thus in control-oriented code, memory use is usually a more serious concern than execution speed.

BDTI's finite state machine (FSM) benchmark is designed to be representative of control-oriented code. The primary goal for programmers implementing the FSM benchmark is minimum memory use, and the secondary goal is execution speed. Optimizations that decrease the benchmark's execution time but increase the benchmark's memory use are prohibited.

Note that memory-use results on the FSM benchmark are not necessarily indicative of processors' memory use in signal-processing-intensive code.

### Sample Benchmark Results

The memory usage results for BDTI's FSM benchmark are shown in the accompanying figure. As would be expected, TriCore's instruction set provides efficient support for control-oriented tasks by including a wide range of logical and bit-manipulation operations, several of which are not normally seen on DSP processors. These features, combined with TriCore's mixed-width instruction set, serve to keep code size small. In general, TriCore's code density can be expected to be comparable to that of other hybrids and better than that of most DSP processors.

### Overall Memory Usage Results

In addition to providing the memory usage results for each benchmark, BDTI provides an overall normalized memory usage result. This result allows readers to compare the memory efficiency of processors over a range of benchmarks. Because BDTI's benchmarks include both DSP-oriented benchmarks and a control-oriented benchmark, the overall result provides a good estimate for processors' relative memory efficiency in real DSP applications.

## Summary

TriCore appears well equipped to tackle the computing requirements of applications with even moderate-to-heavy DSP requirements, and its single-core, single-tool-chain programming model may be appealing to system developers who would prefer to avoid using separate DSP and microcontroller software development environments.

TriCore's strong DSP performance gives it the potential for success in DSP-intensive applications; it remains to be seen, however, whether the architecture will fulfill that potential and become widely adopted.

## Order Form

### *Inside the Siemens TriCore: A BDTI Technical Evaluation*

Mail this form along with a check or fax with purchase order to:

Berkeley Design Technology, Inc.  
2107 Dwight Way, Second Floor  
Berkeley, CA 94704 USA

Tel: (510) 665-1600  
Fax: (510) 665-1680  
Email: [info@BDTI.com](mailto:info@BDTI.com)

Name .....

Title/Division .....

Company .....

Address.....

City/State/Zip/Country .....

Tel: ..... Fax: .....

Email:.....

| Description                               | Qty        | Price         |
|-------------------------------------------|------------|---------------|
| First copy                                | x          | \$950 = ..... |
| Additional .....                          | x .....    | * .....       |
| Tax (for CA orders)                       |            | = .....       |
| Shipping & Handling (for int'l, add \$60) |            | .....         |
| **TOTAL**                                 |            | .....         |

\*Additional copies may be purchased at a substantial discount.

#### PAYMENT

International orders must be prepaid in US dollars.

- Check enclosed, payable to Berkeley Design Technology, Inc.
- Purchase order attached, number: ..... (credit approval required)
- Wire transfer (contact BDTI for instructions)