The phrase "garbage in, garbage out" is usually associated with writing and using computer programs. In fact, the concept originated with Charles Babbage, inventor of the first computer. Amusingly, Babbage wrote:
On two occasions I have been asked, 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?'...I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
The concept of "garbage in, garbage out" applies to benchmarks as well. My colleagues and I at BDTI have been evaluating and comparing processors for 20 years. When briefing us about a new processor, suppliers often show us benchmark results as evidence of their product's superiority. Too often, though, these results don't stand up to scrutiny.
Sometimes bad benchmarks are deliberately chosen – or good benchmarks are misused – to support a point of view. A common pitfall in this category is "cherry picking" – selecting a few benchmarks that show the product besting competitors, while ignoring others that show the opposite. Frequently, though, bad benchmarks result from misunderstandings about how benchmarks should work.
If your objective is to create (or select) a good benchmark, I find it's helpful to think of benchmarking as a type of scientific experiment. Scientists testing a hypothesis take great care to design experiments that yield valid, reproducible results. Similarly, engineers seeking meaningful benchmark results must thoughtfully design their methodology. While it's tempting to simply grab some readily available code and start making measurements, this rarely yields valid (and credible) results.
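As a minimal sketch of what treating a benchmark like an experiment might look like in practice, the Python snippet below runs a workload several times after a discarded warm-up run and reports the median and the spread instead of a single number. The names `measure` and `sample_workload`, the repeat count, and the workload itself are illustrative assumptions, not part of any particular benchmark methodology.

```python
import statistics
import time


def measure(workload, repeats=7, warmup=1):
    """Time `workload` several times and summarize the results.

    `repeats` and `warmup` are illustrative defaults, not recommendations.
    """
    # Warm-up runs are discarded so one-time costs do not skew the timings.
    for _ in range(warmup):
        workload()

    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)

    return {
        "runs": len(timings),
        "median_s": statistics.median(timings),
        "min_s": min(timings),
        "max_s": max(timings),
    }


def sample_workload():
    # Hypothetical stand-in for a real benchmark kernel: a dot product.
    a = [float(i) for i in range(10_000)]
    b = [float(i) * 0.5 for i in range(10_000)]
    return sum(x * y for x, y in zip(a, b))


result = measure(sample_workload)
print(result)
```

Reporting the minimum and maximum alongside the median makes run-to-run variation visible, which is one simple way to judge whether a measurement is reproducible.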
The most obvious aspect of benchmark design is the selection of workloads – the applications or functions that make up the benchmarks. Needless to say, the benchmarks must be chosen to be representative of the target applications. But this is often easier said than done. For example, there's a natural tension between the desire to make benchmarks small, so they're easier to use, and making them large enough to be truly representative of real applications. Similarly, there's a natural tension between selecting a small number of benchmarks – again, to reduce effort – and using a sufficient number of benchmarks to represent all of the important aspects of the target applications.
But beyond the selection of workloads, there are many other critical aspects to the design of good benchmarks. Some of these are fairly obvious, while others can be quite subtle. It's fairly obvious, for example, that if benchmarks are written in a high-level language, careful selection of compilers and compiler settings is an important aspect of the methodology. If there are multiple compilers available for a given processor, which should be used? Should the latest version of each compiler always be used? Should the compiler settings be uniform for all compilers, or should they be set to obtain the best results from each compiler?
Some particularly sticky benchmark questions arise when designing benchmarks for digital signal processing applications. Digital signal processing, of course, consists of algorithms. In product development, DSP engineers often modify algorithms (or even substitute different algorithms) in order to obtain code that runs efficiently on a particular processor. In light of this, it would seem that DSP benchmarks should allow (or even require) benchmark implementers to modify benchmark algorithms in order to get the best performance from a processor. But this can be a Pandora's box: if the benchmark methodology permits algorithm changes, how far can those changes go? Can a time domain algorithm be substituted for a frequency domain algorithm if they yield similar results?
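To make the time domain versus frequency domain question concrete, here is a small sketch in Python that computes the same convolution both ways and compares the outputs. The input signal, the filter coefficients, and the function names are hypothetical, and the naive DFT is used only to keep the example self-contained; a real implementation would use an FFT.

```python
import cmath


def convolve_time(x, h):
    """Direct (time domain) linear convolution."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def dft(seq, inverse=False):
    """Naive O(n^2) discrete Fourier transform, kept simple for clarity."""
    n = len(seq)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(
            seq[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
            for t in range(n)
        )
        out.append(acc / n if inverse else acc)
    return out


def convolve_freq(x, h):
    """Frequency domain convolution: transform, multiply, transform back."""
    n = len(x) + len(h) - 1
    # Zero-pad both inputs so circular convolution equals linear convolution.
    xp = list(x) + [0.0] * (n - len(x))
    hp = list(h) + [0.0] * (n - len(h))
    product = [a * b for a, b in zip(dft(xp), dft(hp))]
    return [value.real for value in dft(product, inverse=True)]


x = [1.0, 2.0, 3.0, 4.0]   # hypothetical input signal
h = [0.5, 0.25, 0.125]     # hypothetical filter coefficients

y_time = convolve_time(x, h)
y_freq = convolve_freq(x, h)
max_error = max(abs(a - b) for a, b in zip(y_time, y_freq))
print(max_error)
```

The two results agree only to within floating-point rounding, not bit for bit, so a methodology that permits this kind of substitution also has to state how close "similar" must be.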
The correct answers to questions like these depend entirely on the goals of the benchmarking process – analogous to the hypothesis being tested by a scientific experiment. It can be tricky to answer such questions correctly, but doing so is critical: a poorly designed benchmark is more likely to confuse than inform.
Have you seen examples of particularly bad – or good – benchmarks? I'd love to hear about them. Post a comment here or send me your feedback at http://www.BDTI.com/Contact.