Getting Better DSP Code Out of Your Compiler

Submitted by BDTI on Wed, 01/03/2007 - 17:00

Compiling digital signal processing application code is not a push-button process—at least, not unless you're willing to settle for inefficient code. Signal processing algorithms (and the processors commonly used to run them) have specialized characteristics, and compilers usually can't generate efficient code for them without some level of programmer intervention.

Learning how to coax efficient signal processing object code out of a compiler is an important skill, and can reduce (or eliminate) the amount of time you'll spend optimizing at the assembly level. In this article, we'll explain how to get the best performance out of whatever compiler you're using, and how to avoid getting blindsided by common compiler pitfalls.

Learning by disassembling

A useful tool for understanding compilers' strengths and weaknesses is the disassembler. This tool takes object code and generates the corresponding sequence of assembly language instructions, allowing you to see exactly how the compiler implemented your code. You'll be able to tell whether it did a good job of using specialized processor features and parallelism, and whether the resulting code looks more or less as you expected.

You'll often find surprising results; it's not uncommon to find compilers generating incorrect code, or overlooking seemingly obvious optimizations. In some cases, even hand-written assembly is transformed by the assembler, which you may not realize unless you use a disassembler to inspect the final object code. This can happen if, for example, you unknowingly use a pseudo-instruction that the assembler expands into a sequence of several native instructions.

DSP processors (and many general-purpose processors) have specialized hardware or instructions to speed up common signal processing algorithms (such as filters and FFTs). These include, for example, single-cycle multiply-accumulates (MACs), specialized addressing modes (such as modulo and bit-reversed addressing), zero-overhead loops, and saturation.

If you're compiling signal processing application code, you'll need to figure out which (if any) of these instructions and hardware the compiler is capable of using, and under what circumstances. This will allow you to write your C code in a way that helps the compiler recognize opportunities to use specialized hardware features.

You can experiment with the C code and use the disassembler to observe the effect on the compiler's ability to create efficient object code. Each compiler has its own quirks, and it's worth the effort to spend some time learning how to help it do a good job.
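
As a concrete example, consider the inner loop of an FIR filter. Written as a plain multiply-accumulate loop with a compile-time trip count, many DSP compilers can map it onto a single-cycle MAC and a zero-overhead hardware loop. The sketch below is generic C (it assumes a C99 <stdint.h>, and the function name and tap count are chosen purely for illustration):

    #include <stdint.h>

    #define NTAPS 32  /* compile-time trip count helps hardware loops */

    /* Inner loop of an FIR filter, written so the multiply-accumulate
       pattern is obvious to the compiler: a 16x16 multiply widened to
       a 32-bit accumulator, with unit-stride array accesses. */
    int32_t fir_sample(const int16_t coeffs[NTAPS],
                       const int16_t delay[NTAPS])
    {
        int32_t acc = 0;
        int i;
        for (i = 0; i < NTAPS; i++)
            acc += (int32_t)coeffs[i] * delay[i];
        return acc;
    }

Disassembling the compiled output shows immediately whether the compiler used a MAC instruction and a hardware loop, or fell back on separate multiply, add, and branch instructions.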

Be careful with data types

If you're defining data types in C (rather than in assembly) it's important to understand how the compiler will implement them on your target processor, because this can have a significant effect on the efficiency of the compiled code.

The C standard defines several data types—but the exact sizes of these types are implementation-defined and differ from processor to processor. From a code performance perspective, the key thing to understand is that the size used by the compiler won't necessarily provide a good fit for the native data word width of the processor. For this reason, if you use the wrong data type in C, you may incur a huge penalty in the compiled code. If your processor only supports 16-bit integers, for example, you don't want to define data in your inner loop as 64-bit doubles.

The C data types are as follows:

  • int is the primary data type for indexing and counting.
  • long provides at least 32 bits (a minimum mandated by the C standard). On processors whose native word is narrower than 32 bits, "long" arithmetic typically requires library support.
  • long long provides at least 64 bits. This type is not supported on most 16-bit processors.
  • short is 16 bits on many processors, but not all.
  • char is the smallest addressable unit. Many C programs assume that a char is 8 bits, which can be problematic because on many DSP processors it is not. Also, note that C's sizeof operator returns sizes in units of char, which, again, may not be 8 bits.
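
One way to keep sizes explicit when moving code between targets is to use the fixed-width types from C99's <stdint.h>, where your tool chain provides it (older DSP compilers often supply equivalent typedefs of their own):

    #include <stdint.h>

    /* Explicit widths make data-size assumptions visible. Note that the
       exact-width types are optional: on a DSP where char is 16 bits,
       int8_t may not exist at all, while the "least" types are always
       defined. */
    int16_t       sample;       /* exactly 16 bits, if the target has it */
    int_least16_t coeff;        /* at least 16 bits, always available    */
    int32_t       accumulator;  /* exactly 32 bits, if the target has it */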

Table 1 and Table 2 show char and int sizes, respectively, for a selection of DSP processors.

char size    Processor
8            ADI Blackfin, TI 'C6x
16           ADI '21xx, TI 'C54, 'C55
24           Freescale 56x
32           ADI SHARC, TigerSHARC

Table 1. char data type sizes on common DSP processors.


int size     Processor
16           ADI '21xx, TI 'C54, 'C55
24           Freescale 56x
32           ADI Blackfin, TI 'C6x, ADI SHARC, TigerSHARC

Table 2. int data type sizes on common DSP processors.

To further complicate matters, signal processing code implemented on fixed-point processors typically relies heavily on fractional data types, such as Q.15, in which a 16-bit word represents a fractional value between -1 and just under +1. DSP processors are designed for efficient operations on fractional data, but ANSI C doesn't recognize fractional types. If you stick with ANSI C, you're likely to use integer data types and shifts to implement fractional arithmetic; when the compiler encounters this, the resulting code can be extremely inefficient. To address this issue, many DSP processor compilers support fractional data types via C-language extensions (discussed further below).
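
As an illustration of the problem, here is Q.15 multiplication written in portable ANSI C using integers and shifts; compilers frequently translate this shift-and-cast pattern into a multi-instruction sequence rather than the processor's single fractional multiply:

    #include <stdint.h>

    /* Q.15 multiply in portable C: a 16x16-bit multiply produces a Q.30
       result in 32 bits, and shifting right by 15 renormalizes it to
       Q.15. This simple version does not saturate, so multiplying -1.0
       by -1.0 (0x8000 * 0x8000) wraps around to -1.0 instead of
       saturating at the maximum positive value. */
    static int16_t q15_mul(int16_t a, int16_t b)
    {
        int32_t product = (int32_t)a * (int32_t)b;  /* Q.30 */
        return (int16_t)(product >> 15);            /* Q.15 */
    }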

Signal processing algorithms are often initially developed using floating-point data types, and then ported to fixed-point processors. If you specify a floating-point data type and the target processor doesn't natively support floating-point operations (as is true of most DSP processors), then the compiled code will emulate floating-point math in software—which is extremely slow. 

C language DSP extensions

Some compilers support DSP-oriented C language extensions. Typical extensions include support for DSP-oriented data types, such as fractional and complex data, and support for specifying multiple memory spaces, such as separate instruction and data memories. They may also support common DSP processor features such as modulo addressing.

Unfortunately, as yet these extensions are not standardized across vendors, so if you use them you sacrifice portability. (ISO has developed DSP-oriented extensions to C as part of "Embedded C," but Embedded C has not yet been widely adopted.) And in general, you'll need to supervise the compiler very closely to verify that the extensions behave as expected.
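
For example, Embedded C (ISO/IEC TR 18037) adds native fixed-point types. On a compiler that implements them (support varies widely, so treat this as a sketch), the shift-and-cast idiom shown earlier collapses into ordinary arithmetic:

    #include <stdfix.h>

    /* "fract" maps to the target's native fractional format, typically
       Q.15 on a 16-bit DSP, and the "sat" qualifier requests saturating
       arithmetic. The compiler is free to implement the multiply with
       the processor's native fractional-multiply instruction. */
    sat fract q15_mul(sat fract a, sat fract b)
    {
        return a * b;
    }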

Using optimization switches

It's common in signal processing applications to find that optimizations improving speed come at the cost of additional memory use. As a result, the programmer or compiler typically must decide how to trade off speed against memory use. Most compilers provide switches that let the programmer govern how aggressively the code is optimized, and whether to optimize for maximum speed or minimum code size.

Compiler switches are quite useful, but don't use them blindly. Directing the compiler to speed-optimize the entire application may speed up small sections of code viewed in isolation, yet slow down overall performance. How is that possible? If the compiler's optimizations increase code size to the point where key portions no longer fit in L1 memory, the cost of repeatedly paging in the needed sections (or thrashing the cache) may outweigh any localized gains. In practice, the programmer must profile the application and select compiler optimization levels on a file-by-file basis, balancing localized speed-ups against overall image size relative to available memory.
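
The mechanics are tool-chain specific, but as one illustration, GCC-based compilers allow per-function control, so a hot loop can be built for speed inside a file that is otherwise compiled for size (the attribute below is a GCC extension, not portable C):

    /* GCC extension: optimize this one function for speed even when the
       rest of the translation unit is compiled with -Os to keep the
       overall image, and therefore the L1/cache footprint, small. */
    __attribute__((optimize("O3")))
    void scale_samples(short *dst, const short *src, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            dst[i] = src[i] >> 1;  /* halve each sample */
    }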

Intrinsics

Intrinsics are meta-instructions that are embedded within C code and are translated by the compiler into a predefined sequence of assembly instructions. (Most intrinsics translate into a single assembly instruction, but occasionally you'll come across one that requires multiple instructions.) Using intrinsics gives the programmer a way to access specialized processor features without actually having to write assembly code.

Many compilers support intrinsics. If you use intrinsics it's important to verify that the compiler's output is as expected. For example, with one compiler, using a single intrinsic resulted in three assembly instructions: one add, and two instructions that set mode bits. Though the latter two instructions only needed to be performed once, they were placed within an inner loop alongside the add—resulting in a serious performance penalty.
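
As a sketch of the style, the fragment below uses a saturating-add intrinsic; the name _sadd follows TI 'C6x convention, and other compilers use different names and headers, so check your tool chain's manual:

    /* Saturating accumulation via an intrinsic. In portable C, each
       saturating add needs an explicit compare-and-clamp, which can
       compile into several instructions; the intrinsic is intended to
       map onto the processor's single saturating-add instruction. */
    int saturated_sum(const int *a, const int *b, int n)
    {
        int i;
        int acc = 0;
        for (i = 0; i < n; i++)
            acc = _sadd(acc, _sadd(a[i], b[i]));
        return acc;
    }

As the mode-bit example above shows, this is exactly the kind of code worth disassembling, to confirm that each intrinsic really compiled to a single instruction inside the loop.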

Inline assembly

Many compilers support the use of inline assembly code via the asm() construct within a C program. This feature causes the compiler to insert the specified assembly code into its assembly-language output. Inline assembly is a good way to get access to specialized processor features, and it may execute faster than calling assembly code in a separate function. However, in some circumstances the use of inline assembly may adversely affect the performance of the surrounding C code.

The problem is that compiler optimizations often depend on the compiler "understanding" the intent of the code, and inline assembly can interfere with that process. For example, an inserted assembly instruction might store data to memory, so the compiler may have to assume that all variables could be modified by the inline code. This can interfere with the compiler's ability to keep variables in registers.

Many compilers don't optimize code contained within an asm() statement. Furthermore, on some compilers, use of inline assembly disables most optimization in the C code surrounding the asm() statement.

Programmers sometimes insert inline assembly code that uses specific processor registers, and assume that this won't conflict with the compiler's register use. If a conflict does arise, however, it will almost never be detected by the compiler, and can introduce pernicious bugs. Unfortunately, interactions between inline assembly and the compiler are not always well-documented.
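
Some tool chains address this by letting you declare the inline code's effects. GCC's extended asm() syntax, for instance, takes explicit output, input, and clobber lists so the compiler knows exactly which registers and variables the fragment touches (the syntax and the mnemonic below are illustrative and GCC-specific):

    /* GCC extended inline assembly: the constraint lists tell the
       optimizer what the instruction reads and writes, so it does not
       have to assume every variable in scope was modified. */
    static inline int asm_add(int a, int b)
    {
        int result;
        __asm__("add %1, %2, %0"        /* illustrative mnemonic */
                : "=r"(result)          /* output: any register  */
                : "r"(a), "r"(b));      /* inputs: any registers */
        return result;
    }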

Accessing parallelism

Because many DSP algorithms are highly parallel, most processors intended for DSP can execute multiple operations or instructions in parallel. Unfortunately, C is by nature a sequential language, and as a result, compilers often have a difficult time recognizing opportunities for parallelizing operations.

Many DSP processors and high-performance general-purpose processors also provide SIMD (single instruction, multiple data) operations to improve parallelism. Although SIMD is effective for speeding up signal processing code, it is difficult for compilers to use SIMD features well. Most compilers don't even try, instead leaving it to the programmer to use assembly code, intrinsics, or off-the-shelf software components for the inner loops where SIMD tends to be most useful.
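
When a compiler does attempt automatic vectorization, you can improve its odds with simple unit-stride loops and C99 restrict pointers, which promise that the arrays never overlap; aliasing doubts are one of the most common reasons vectorization fails. Whether the sketch below actually vectorizes depends entirely on the target and the tool chain:

    void vector_add(short * restrict c,
                    const short * restrict a,
                    const short * restrict b, int n)
    {
        int i;
        /* restrict guarantees a, b, and c do not alias, and the
           unit-stride loop lets the compiler pack several 16-bit
           samples into one SIMD operation where the hardware allows. */
        for (i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }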

Know your compiler

Using a compiler to create signal processing software requires a different set of skills than using a compiler for other types of code. Getting the compiler to produce good, efficient signal processing code requires a solid understanding of the compiler, its DSP-oriented extensions, and the target processor architecture—and how these things interrelate. For the best results, spend some time with a disassembler to get a feel for the compiler's capabilities, be aware of potential compiler pitfalls, and don't take anything for granted! 
