QDSP6 V4: BDTI Benchmark Results and Implementation Details Of Qualcomm's DSP Core

The article, "QDSP6 V4: Qualcomm Gives Customers and Developers Programming Access to its DSP Core," which appeared in the June 2012 edition of InsideDSP, showcased Qualcomm’s decision to open up access to its DSP core via a software development kit. This decision corresponded with the release of the fourth version (V4) of the sixth generation (QDSP6, aka "Hexagon") of the company's proprietary DSP architecture, found in the company's 28 nm-based Snapdragon S4 SoCs.

To be clear, this broadened access applies only to instances of the QDSP6 core used as a coprocessor for multimedia tasks. QDSP6 cores used to implement cellular modem functions remain programmable only by Qualcomm. Nonetheless, Qualcomm's estimates documented in the June 2012 article suggested that substantial performance-per-watt improvements could be accrued by handing off appropriate portions of relevant multimedia algorithms to the DSP core. And BDTI has already leveraged its expertise as a QDSP Access Program premier member to provide software optimization services to multiple clients.

In the earlier article, I wrote, "To date, Qualcomm has not provided extensive public disclosure on QDSP6 architectural details." And in conclusion, I noted, "BDTI is currently running benchmark studies on QDSP6 V4 and will analyze the comparative results versus already-published QDSP6 V2 metrics in an upcoming issue of InsideDSP. It will be interesting to see how the V2-versus-V4 results on BDTImark2000 stack up against the comparative performance claims that Qualcomm has already published for these QDSP6 generations." That time is now. And along with the benchmark results, I'll discuss specifics of the Hexagon architecture design, both in an absolute sense and as it has evolved over time.

Table 1 summarizes these generational jumps:

| Version | Process lithography | Peak number of simultaneous threads | Per-thread clock speed (initial instantiation) | Cumulative per-core clock speed (initial instantiation) |
|---|---|---|---|---|
| V2 | 65 nm | 6 | 100 MHz | 600 MHz |
| V3 | 45 nm | 6 | 67 MHz | 400 MHz |
| V3 (2nd gen.) | 45 nm | 4 | 100 MHz | 400 MHz |
| V4 | 28 nm | 3 | 167 MHz | 500 MHz |

Table 1. QDSP6 version key attributes
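The two clock columns in Table 1 are related by simple arithmetic. The following minimal Python sketch assumes that the per-thread figure is the cumulative core clock divided evenly among the hardware threads and rounded to the nearest MHz; the names `versions` and `per_thread_mhz` are illustrative and do not come from Qualcomm or BDTI.

```python
# Threads and cumulative core clock (MHz) for each row of Table 1.
versions = {
    "V2": (6, 600),
    "V3": (6, 400),
    "V3 (2nd gen.)": (4, 400),
    "V4": (3, 500),
}


def per_thread_mhz(threads, cumulative_mhz):
    """Assumption: the core clock is shared evenly among the threads."""
    return round(cumulative_mhz / threads)


per_thread = {name: per_thread_mhz(*row) for name, row in versions.items()}

for name, mhz in per_thread.items():
    print(f"{name}: {mhz} MHz per thread")
```

Under that assumption the sketch reproduces the per-thread column of Table 1 (100, 67, 100 and 167 MHz).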

The V2-to-V3 version transition predominantly involved a lithography shrink; the DSP core remained essentially unchanged aside from comparatively minor microarchitecture improvements to optimize power consumption. (A four-thread V3 variant, running at 100 MHz per thread and 400 MHz cumulative, was also offered later.) The broader V2-to-V4 transition, in contrast, involved not only a two-generation process migration, but also a halving of the number of threads supported by the core. This more notable differentiation explains why BDTI and Qualcomm focused their benchmarking attention on QDSP6 V2 and V4, bypassing the V3 intermediary.

Hexagon's multithreaded nature is a key differentiator from competing DSP architectures developed by suppliers such as CEVA and Tensilica. Each QDSP6 thread is fully supported in hardware, with a distinct register file, and the threads execute in round-robin fashion, on a cycle-by-cycle basis. For example, here's the best-case execution sequence for the three-thread QDSP6 V4 core:

  • First clock (cycle): First instruction for first thread
  • Second clock: First instruction for second thread
  • Third clock: First instruction for third thread
  • Fourth clock: Second instruction for first thread
  • Fifth clock: Second instruction for second thread
  • Sixth clock: Second instruction for third thread
  • Etc.
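The sequence above can be modeled in a few lines of Python. This is only an idealized sketch of best-case, stall-free round-robin issue, not a description of the actual Hexagon pipeline; the function name `round_robin_issue` is hypothetical.

```python
from itertools import count, islice


def round_robin_issue(num_threads):
    """Yield (clock, thread, instruction) for an idealized, stall-free core.

    Clocks, threads and instructions are all numbered from 1 to match
    the list in the text.
    """
    for clock in count(1):
        thread = (clock - 1) % num_threads + 1
        instruction = (clock - 1) // num_threads + 1
        yield clock, thread, instruction


# First six clocks of a three-thread core such as QDSP6 V4.
schedule = list(islice(round_robin_issue(3), 6))
for clock, thread, instruction in schedule:
    print(f"clock {clock}: instruction {instruction} of thread {thread}")
```

Calling `round_robin_issue(6)` instead gives the corresponding best-case sequence for a six-thread core, in which each thread issues only once every six clocks.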

Threading in QDSP6 V2 is very similar to V4, but V2 sequentially executes six threads. Why did Qualcomm decrease the multithreading potential of QDSP6 from six to three in the V2-to-V4 migration? According to Willie Anderson, VP of Engineering at Qualcomm, "This makes it easier for the programmer to use the full performance bandwidth of the QDSP6." In subsequent conversation, Anderson explained that application code was rarely able to leverage all six threads' worth of execution potential in QDSP6 V2, a scenario likely also familiar to programmers attempting to make optimum use of multi-core processors. Therefore, Qualcomm decided that with QDSP6 V4, fewer threads, each running at a higher effective speed, made more sense, particularly considering that the emphasis on power efficiency put a cap on the DSP core's peak clock rate.

Anderson explained that whereas in QDSP6 V2, Qualcomm used the six-thread architecture to effectively "hide" six cycles' worth of pipeline latency, in V4, the company decided to conceal only the latency of the execution unit (among other things, handling interlock conditions where one instruction is waiting for a previous instruction's output), not that of loads, branches, or instruction decodes. All other things being equal, increasing the (non-hidden) pipeline latency will ordinarily introduce stalls and therefore make code run slower, at least in cycle count terms. However, Anderson claimed that the QDSP6 compiler is able to efficiently work around such pipeline latencies, thereby enabling Qualcomm to expose additional potential latency in V4 versus V2 without slowing down the resultant code.

"We were hiding more of the pipeline than we needed to in order to achieve back-to-back, single-cycle instruction execution," he said. "In V2, we had more threads than it turned out we needed, so we reduced the number of threads in V4. If a block of code runs in N cycles on V2, it will run in <=N cycles on V4." Key to the de-emphasis of "hiding" control instruction latency was V4's decreased clock cycle count penalty for L1 cache "misses", and therefore L2 cache accesses. And BDTI's benchmark testing of both QDSP6 V2 and V4 (PDF) confirms the latter's more efficient handling of control code in particular.

Table 2 documents the benchmarking results summary:

| Processor family | Clock rate (min–max) | BDTImark2000™, BDTIsimMark2000™ scores (min–max) |
|---|---|---|
| QDSP6 V2 (one thread) | 67–100 MHz (per thread) | 1040–1550 (one thread) |
| QDSP6 V2 (six threads) | 67–100 MHz (per thread) | 6240–9300 (projected best case for 6 threads) |
| QDSP6 V4 (one thread) | 100–233 MHz (per thread) | 1810–4220 (one thread) |
| QDSP6 V4 (three threads) | 100–233 MHz (per thread) | 5430–12660 (projected best case for 3 threads) |

Table 2. BDTImark2000™ and BDTIsimMark2000™ Speed Metric Scores for Fixed-Point Packaged Processors (Higher is Better)

Note that BDTI does not freely publish individual scores for each of the code kernels (which run entirely out of L1 cache) that in aggregate form the BDTImark2000 benchmark result. With that qualifier noted, the BDTI engineer who conducted the benchmarking study comments on the outcome, "As can be calculated from the results, V4 is about 17% more efficient than the V2 per-thread at the same clock speed." This improvement is mainly due to more efficient handling of control code. For the 12 kernels in this benchmark suite:

  • Four kernels have the same cycle counts on QDSP6 V2 and V4
  • One kernel is significantly better on V4, due to the addition of a new V4 execution unit instruction
  • The rest of the kernels are between ~5% and ~25% better on V4 than V2, mostly due to the improved handling of control code
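The "about 17%" figure can be checked against the single-thread rows of Table 2. The sketch below assumes that each score endpoint corresponds to the matching clock endpoint (so 1550 is the V2 score at 100 MHz and 1810 is the V4 score at 100 MHz) and that scores scale linearly with clock rate; both are assumptions made for illustration, and `score_per_mhz` is a hypothetical helper.

```python
def score_per_mhz(score, clock_mhz):
    """Benchmark score normalized by per-thread clock rate."""
    return score / clock_mhz


# Single-thread scores from Table 2 at a common 100 MHz per-thread clock.
v2_efficiency = score_per_mhz(1550, 100)
v4_efficiency = score_per_mhz(1810, 100)

# Relative per-thread gain of V4 over V2 at the same clock speed.
gain = v4_efficiency / v2_efficiency - 1
print(f"V4 per-thread gain over V2: {gain:.1%}")
```

Under those assumptions the gain works out to just under 17%, in line with the engineer's comment.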

In Table 2, note that the six-thread (V2) and three-thread (V4) results are projected, not measured (that is, they are not "official"), and assume full utilization of the core's multi-thread potential. From Anderson's earlier comments, we already know that such high utilization is more easily achievable with QDSP6 V4's three-thread approach than with the six-thread V2 precursor. Even with three threads, though, using 100% of the chip’s performance by fully loading all three may be challenging.

Note, too, the higher per-thread effective clock rate on V4 versus V2. Whereas Table 1 lists the per-thread effective clock speed of the initial core instantiation (that is, the first version of the core in a shipping Qualcomm chip), Table 2's clock speed ranges encompass the proliferation of instantiations over time, across numerous chips, ranging as low as 67 MHz (V2) and as high as 233 MHz (V4). The BDTImark2000 results suggest that the V2-to-V4 implementation tradeoff that Qualcomm made (fewer threads, but each running at a much higher effective clock rate) produced a net gain in overall DSP core performance.
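The projected multi-thread rows of Table 2 are consistent with multiplying each single-thread score by the thread count. The sketch below reproduces them under that assumption (perfect scaling, every thread fully loaded) and compares the top of the V4 range with the top of the V2 range; `projected_score` is a hypothetical helper, not a BDTI formula.

```python
def projected_score(single_thread_score, threads):
    """Best-case projection: assumes every thread is fully utilized."""
    return single_thread_score * threads


# (min, max) single-thread scores and thread counts from Table 2.
v2_projected = tuple(projected_score(s, 6) for s in (1040, 1550))
v4_projected = tuple(projected_score(s, 3) for s in (1810, 4220))

print("V2, six threads:", v2_projected)
print("V4, three threads:", v4_projected)

# Ratio of the best-case scores at the highest clock rate of each version.
top_end_ratio = v4_projected[1] / v2_projected[1]
print(f"V4 / V2 at the top of each range: {top_end_ratio:.2f}")
```

The ratio compares only the fastest instantiation of each version; like the projections themselves, it assumes full utilization, which the article notes is hard to achieve in practice.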

Qualcomm is, of course, not done innovating with Hexagon. The June 2012 InsideDSP article uncovered evidence of an upcoming QDSP6 V5, which the company officially unveiled at the Consumer Electronics Show last month within its newest Snapdragon 800 Series SoCs. QDSP6 V5 expands on the V4 foundation in several notable areas:

  • Byte-vector operations, a particularly appealing enhancement in embedded vision and imaging applications, which generally require less data precision than audio (for example) and benefit from more operations per second, and
  • Floating-point support, which helps with high dynamic range audio and location (GPS, etc.) processing applications, among others, and which also simplifies the initial porting of floating point-based software that originates with PC-developed code.

Stay tuned for more on QDSP6 V5 in a future edition of InsideDSP, commensurate with BDTI's publication of benchmarking results on the new architecture generation.