The December 2012 edition of InsideDSP included the article "Texas Instruments' Latest KeyStone II SoCs: Is A Special-Purpose Server Strategy Feasible?," which discussed TI's 66AK2Hx SoCs for specialized server applications. Based on the company's KeyStone II architecture, 66AK2Hx chips include the same ARM Cortex-A15 cores (one, two or four per chip) plus C66x DSP cores (zero, one, four or eight) as other KeyStone II family devices such as the cellular base station-tailored C6636 introduced earlier that same year. However, befitting the 66AK2Hx's intended use in high-end networking (routers, switches, etc.), enterprise and industrial, and special-purpose server applications, the 66AK2Hx's integrated peripheral mix is different; it includes Ethernet packet acceleration engines, along with 1 GbE and 10 GbE MACs, and USB 3.0 and other computer-oriented system interfaces (Figure 1).
Figure 1. Texas Instruments' 66AK2Hx KeyStone II SoCs target high-end networking, specialty server and other digital signal processing-centric applications.
At the time, BDTI was cautious about the products' chances, writing:
The DSP cores’ limitation to 32-bit single-cycle operations likely won’t be a major concern for the applications that Texas Instruments is targeting; "cloud"-based multimedia processing, gaming, video analytics, radar, and the like. However, the KeyStone II SoCs' 32-bit CPU cores may be a bigger concern. Currently, conventional servers typically use 64-bit AMD and Intel CPUs, running 64-bit operating systems (Windows, Linux/Unix, and Mac OS X) and applications. The transition from the x86 to the ARM instruction set will be challenging enough for system developers and users alike. A further downgrade from 64-bit to 32-bit processing may be an unacceptable tradeoff, even considering the power consumption savings.
Two years later, Texas Instruments has secured a notable design win at a systems supplier, which has itself captured a prominent end customer. Hewlett-Packard, the current market share leader in servers according to at least some analysts' reports, has since early 2013 been unveiling versions of its latest-generation Moonshot blade server architecture. Highly modular in comparison to past HP system designs, Moonshot encompasses diverse subsystem options: CPUs, memory, mass storage, I/O, power supplies, cooling, etc. As explained by the company's VP of Server Engineering Tom Bradicich and Engineering Manager Harvey White during a recent briefing, Moonshot is intended for special-purpose workload optimization, versus the general-purpose functions handled by HP's ProLiant "mother" brand.
Even within the Moonshot line, most of the product options you'll find are powered by either AMD or Intel x86 CPUs, the latter in both Atom and Xeon variants. Using the book The Long Tail, by former Wired editor-in-chief Chris Anderson, as inspiration, Bradicich and White equate HP's mainstream server business to the McDonald's fast food chain (i.e., the "head"): few product options and low per-unit profit margin, but huge business volumes. But, they feel, an opportunity also exists for a server business analogous to Subway sandwich shops (the "tail"): per-customer customization and lower volumes, but higher per-unit margin.
HP intends to service this latter segment of the server market with specialty products, enabled by Moonshot's modularity. And the company announced two of them, both ARM-based, at the end of September. The first, the ProLiant m400, addresses BDTI's earlier concerns about 32- vs. 64-bit processing. The ProLiant m400 is based on Applied Micro Circuits' ARMv8 64-bit X-Gene processor, along with Canonical's Ubuntu operating system.
The second, the ProLiant m800, is the primary focus of this article. It is based on the Texas Instruments 66AK2H12, which incorporates four ARM Cortex-A15 cores and eight C66x DSP cores, and it runs Canonical Ubuntu (Figure 2). Each ProLiant m800 blade (which HP calls a "cartridge") contains four 66AK2H12s; a proprietary form factor "4.3U" (7.5" tall) chassis holds up to 45 cartridges. This translates into up to 180 66AK2H12 SoCs per chassis, corresponding to 720 ARM CPU cores and 1,440 TI DSP cores per chassis. Each rack assembly can hold up to 10 chassis.
Figure 2. Each HP Moonshot 1500 chassis (top), when fully populated with ProLiant m800 cartridges, contains 180 Texas Instruments 66AK2H12 SoCs comprising 720 ARM Cortex-A15 CPU cores and 1,440 TI C66x DSP cores.
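The per-chassis totals quoted above follow directly from the per-SoC and per-cartridge figures. As a quick check, the short Python sketch below reproduces that arithmetic; the constant and variable names are illustrative only, and every number comes from the article text.

```python
# Figures quoted in the article; the names are illustrative, not HP or TI terms.
SOCS_PER_CARTRIDGE = 4        # 66AK2H12 SoCs on each ProLiant m800 cartridge
CARTRIDGES_PER_CHASSIS = 45   # maximum cartridges in one Moonshot chassis
ARM_CORES_PER_SOC = 4         # Cortex-A15 cores in a 66AK2H12
DSP_CORES_PER_SOC = 8         # C66x cores in a 66AK2H12

# Per-cartridge totals
arm_per_cartridge = SOCS_PER_CARTRIDGE * ARM_CORES_PER_SOC
dsp_per_cartridge = SOCS_PER_CARTRIDGE * DSP_CORES_PER_SOC

# Per-chassis totals
socs_per_chassis = SOCS_PER_CARTRIDGE * CARTRIDGES_PER_CHASSIS
arm_per_chassis = socs_per_chassis * ARM_CORES_PER_SOC
dsp_per_chassis = socs_per_chassis * DSP_CORES_PER_SOC

print(f"Per cartridge: {arm_per_cartridge} ARM cores, {dsp_per_cartridge} DSP cores")
print(f"Per chassis:   {socs_per_chassis} SoCs, "
      f"{arm_per_chassis} ARM cores, {dsp_per_chassis} DSP cores")
```

The per-cartridge results (16 ARM cores and 32 DSP cores) match the figures PayPal cites later in the article.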
What will customers do with all of this CPU and DSP processing power? Bradicich and White point, for example, to Swoop Search, a search and analytics engine that "allows users to comb for relevant information, either from data or from the web, in a graphical and intuitive manner which helps uncover hidden relationships and insights that would have previously been missed." See below for a TI-supplied video that describes Swoop Search in greater detail:
Bradicich and White note that while some customers will harness the DSP cores using software provided by specialized HP partners like Swoop, others will write their own DSP software. One notable example of this latter approach is PayPal, which is using the TI 66AK2H12-powered ProLiant m800 to tackle real-time system fault detection and response as part of the company's Systems Intelligence initiative, with other applications (such as fraud detection) in the planning stages. The following video captures a presentation on Systems Intelligence from PayPal Advanced Technology Group architects Arno Kolster and Ryan Quick, delivered at the mid-September 2014 HPC User Forum in Seattle, Washington.
Kolster and Quick use the analogy of a music concert to describe what Systems Intelligence accomplishes. Just as your mind's signal processing capabilities are able to immediately detect a poorly played note in the music performance, PayPal's intention with Systems Intelligence is to use TI's digital signal processing to detect (and appropriately respond to) a system fault scenario. In implementing this aspiration, PayPal combines numerous data sets sourced from its worldwide server network:
- Information about just-"pushed" firmware, operating system and application updates to various servers
- Dynamic server statistics: operating temperature and power consumption, CPU, memory and mass storage loading, etc., and
- Location-specific social media feeds (Twitter, Facebook, etc.) that might alert PayPal to problems via customer complaints
Systems Intelligence, according to Kolster and Quick, accomplishes this real-time monitoring and response objective by, in effect, transforming each text data stream into a sine wave-based signal (tagged to enable subsequent back-reference to the appropriate text source), and then simultaneously processing all of the signals on the C66x DSPs in order to detect anomalies (analogous to the earlier-described "bad note"). The entire video is fascinating and highly recommended. Note, particularly, the following segments:
- At 12:10, where Quick tells the story of when he and Kolster got the initial brainstorm for Systems Intelligence's signal-based real-time data processing approach in the midst of an HP Moonshot briefing which, ironically, had de-emphasized the ProLiant m800 product, and
- At 17:25, where Quick reveals the system's performance results: 55W/cartridge power consumption (each cartridge again containing four 66AK2H12s comprising 16 ARM Cortex-A15s and 32 C66x DSPs), and 11.2 GFLOPS/W of aggregate system performance.
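To make the signal-based idea more concrete, here is a minimal Python sketch of one way text-derived data could be turned into tagged sine waves and screened for a "bad note." It is not PayPal's implementation, which the presentation does not detail at the code level. It assumes each stream has already been reduced to an event count per time window, uses a distinct sine frequency as each stream's tag, recovers each stream's amplitude from the mixed signal with a single DFT bin, and flags outliers with a simple z-score. All names, stream labels, sample data and thresholds are hypothetical.

```python
import math
from statistics import mean, pstdev

SAMPLES_PER_WINDOW = 64  # hypothetical: samples used to encode one time window


def encode(counts, cycles_per_window):
    """Encode per-window event counts as a sine wave whose amplitude is the count.

    cycles_per_window serves as the stream's tag: each stream gets its own
    frequency so it can be separated again after the signals are summed.
    """
    signal = []
    for count in counts:
        for n in range(SAMPLES_PER_WINDOW):
            phase = 2 * math.pi * cycles_per_window * n / SAMPLES_PER_WINDOW
            signal.append(count * math.sin(phase))
    return signal


def amplitude(window, cycles_per_window):
    """Recover one stream's amplitude from a mixed window using a single DFT bin."""
    re = im = 0.0
    for n, x in enumerate(window):
        phase = 2 * math.pi * cycles_per_window * n / SAMPLES_PER_WINDOW
        re += x * math.cos(phase)
        im += x * math.sin(phase)
    return 2 * math.hypot(re, im) / SAMPLES_PER_WINDOW


def find_anomalies(mixed, tags, threshold=3.0):
    """Return (stream name, window index) pairs whose amplitude is an outlier."""
    n_windows = len(mixed) // SAMPLES_PER_WINDOW
    flagged = []
    for name, cycles in tags.items():
        amps = []
        for w in range(n_windows):
            start = w * SAMPLES_PER_WINDOW
            amps.append(amplitude(mixed[start:start + SAMPLES_PER_WINDOW], cycles))
        mu, sigma = mean(amps), pstdev(amps)
        for w, a in enumerate(amps):
            # z-score test: how many standard deviations from the stream's mean?
            if sigma > 0 and abs(a - mu) / sigma > threshold:
                flagged.append((name, w))
    return flagged


# Hypothetical per-window event counts for two text-derived streams.
errors = [5, 6, 5, 4, 5, 6, 5, 4, 5, 6, 5, 40, 5, 6, 5, 4]        # spike in window 11
tweets = [10, 11, 10, 9, 10, 11, 10, 9, 10, 11, 10, 9, 10, 11, 10, 9]

# Distinct integer frequencies keep the streams separable within each window.
tags = {"error_log": 3, "tweets": 7}

# Sum the encoded streams into one mixed signal, then screen all streams at once.
mixed = [a + b for a, b in zip(encode(errors, tags["error_log"]),
                               encode(tweets, tags["tweets"]))]
anomalies = find_anomalies(mixed, tags)
print(anomalies)
```

Because each flagged result carries the stream name and window index, it can be traced back to the original text source, which mirrors the back-reference tagging that Kolster and Quick describe. A production system would presumably run this kind of frequency-domain processing on the C66x DSP cores rather than in Python.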
It's difficult, particularly at this early stage, to imagine DSP-accelerated servers becoming mainstream. Reflective of this reality, HP has positioned the ProLiant m800 as a boutique specialty offering. Then again, two years ago when TI briefed BDTI on the 66AK2Hx product line, it was difficult to imagine a 32-bit ARM-plus-DSP SoC gaining even this much computing traction. PayPal is to be commended for its clever use of the chip's real-time signal processing facilities; it'll be interesting to see what Kolster and Quick (as well as their engineering equivalents at other companies) come up with in the future. And it'll be equally interesting to see what HP has up its sleeve in terms of future Moonshot offerings, both on the CPU front (32- and 64-bit ARM vs. x86) and with regard to algorithm acceleration via DSPs, FPGAs, GPUs and other more specialized processors.