Jeff Bier’s Impulse Response—Making Benchmarks Useful

Submitted by Jeff Bier on Wed, 07/13/2005 - 16:00

A few weeks ago I participated in a panel discussion on benchmarking. The theme of the panel was how to benchmark multi-threaded and multi-core processors. In my view, this theme highlights a key problem with many benchmark approaches: too many benchmarks are designed to exercise hardware features, rather than to provide information that system developers need.

In most embedded applications, system developers care about high-level system attributes such as low cost, long battery life, and high throughput. System developers generally don’t care how a solution delivers these top-level attributes; they only care about how well it does so. Therefore, benchmarks should be built from the top down, based on application requirements—and not from the bottom up, based on preconceived notions of what sorts of hardware will be used in the application.

More generally, benchmark designers should avoid making assumptions about the hardware whenever possible. In many embedded applications—such as the signal processing applications my company focuses on—system designers have great latitude to select among very different kinds of processing engines. A benchmark designed with one of these classes of hardware in mind is unlikely to give valid results for the others. Therefore, a benchmark that makes unnecessary assumptions about the hardware will have limited utility.

In fact, the need for flexibility extends beyond accommodating all of the hardware options. To be truly relevant, a benchmark must also accommodate the full range of implementation techniques used in the application. For example, developers of signal processing applications almost never build an entire application with plain C code. Even the best compilers sometimes produce very inefficient code, and when they do, a skilled programmer can often make vast improvements with modest effort—perhaps by modifying the C code, or perhaps by replacing portions of it with assembly code. As a result, a benchmarking approach that relies on plain C code is unlikely to produce useful performance data for signal processing applications.
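
To make the point concrete, consider the kind of kernel a signal processing benchmark might specify. The sketch below is illustrative and not from the original column—the function names and the particular restructuring are assumptions—but it shows a plain-C FIR filter alongside one modest, C-level rewrite of the sort a skilled programmer might try before reaching for assembly or vendor intrinsics.

```c
#include <stddef.h>

/* Plain-C FIR filter: the kind of kernel a signal processing benchmark
 * might specify. Portable and simple, but on a DSP with multiply-
 * accumulate hardware a compiler may leave much of the performance
 * on the table. (x must hold n_out + n_taps - 1 samples.) */
void fir_plain(const float *x, const float *h, float *y,
               size_t n_out, size_t n_taps)
{
    for (size_t i = 0; i < n_out; i++) {
        float acc = 0.0f;
        for (size_t k = 0; k < n_taps; k++)
            acc += h[k] * x[i + k];
        y[i] = acc;
    }
}

/* One example of modest hand-tuning: restructure the C so the compiler
 * can keep two independent accumulators in flight (assumes n_taps is
 * even). On some processors this kind of rewrite—or an equivalent in
 * assembly—recovers much of what the plain version loses. */
void fir_tuned(const float *x, const float *h, float *y,
               size_t n_out, size_t n_taps)
{
    for (size_t i = 0; i < n_out; i++) {
        float acc0 = 0.0f, acc1 = 0.0f;
        for (size_t k = 0; k < n_taps; k += 2) {
            acc0 += h[k]     * x[i + k];
            acc1 += h[k + 1] * x[i + k + 1];
        }
        y[i] = acc0 + acc1;
    }
}
```

Whether such a rewrite helps at all depends on the compiler and the target processor—which is exactly why results measured on plain C alone say little about what a tuned implementation would achieve.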

Another problem with using C—or any other programming language, for that matter—is that this approach isn't appropriate for solutions like FPGAs, which do not run "software" in the traditional sense. Of course, there are many applications where the best approach is to use an FPGA or another solution that doesn't rely on software. Hence, benchmarks should not narrowly specify the implementation methodology, lest they exclude relevant hardware options.

In summary, setting out to create benchmarks for a specific implementation approach—such as using multi-threaded processors or using plain C code—is going about things the wrong way. Instead, benchmarks should model the application requirements, and leave the implementation approach (multiprocessing, multithreading, reconfigurable hardware, hardwired solutions, or what have you) to the benchmark implementer—just as actual applications allow for many different implementation approaches.

Benchmark developers who find themselves designing benchmarks to show off particular features of particular kinds of hardware should ask themselves whether the results are really going to be meaningful. And system designers need to understand the design of a benchmark before accepting and using the results that the benchmark produces. If a benchmark assumes a specific hardware feature or a specific implementation methodology, system designers should proceed with caution.
