In recent months, evidence has continued to mount that artificial neural networks of the "deep learning" variety are significantly better than previous techniques at a diverse range of visual understanding tasks.
For example, Yannis Assael and colleagues from Oxford have demonstrated a deep learning algorithm for lip reading that is dramatically more accurate than trained human lip readers, and much more accurate than the best previously published algorithms.
Meanwhile, Andre Esteva, Brett Kuprel and colleagues at Stanford described a deep learning algorithm for diagnosing skin cancer that is as accurate as typical dermatologists (who, in the U.S., complete 12 years of post-secondary education before they begin practicing independently).
Even for tasks where classical computer vision algorithms have been successful, deep learning is raising the bar. Examples include optical flow (estimating motion in a sequence of video frames) and stereo matching (matching features in images captured by a pair of stereo cameras).
It's now clear that deep learning is a critical technique for visual understanding. And, as I've written in past columns, I believe that visual understanding will become a key capability for many (perhaps most) devices and systems – enabling them to be safer, more autonomous, more secure and more capable.
For me, this raises a key question: Is it feasible to deploy deep learning in cost- and power-sensitive systems? The question is particularly pertinent given the huge computation requirements of typical deep learning algorithms.
For some applications, the cloud is a natural solution, and many providers of cloud compute services – including Amazon, Google and Microsoft – offer APIs for tasks like object classification.
But for many applications, relying on the cloud isn't feasible. For example, some applications need maximum reliability and minimum latency (think automotive safety applications), while in others, the cost of moving video data to the cloud is prohibitive (think security systems for small businesses).
Today, for applications that require local, embedded implementation of visual understanding, it can be quite challenging to implement deep learning algorithms with acceptable performance and power consumption. But this is changing quickly, due to three factors.
First, many processor designers are creating processors specialized for deep learning. Indeed, a recent BDTI survey turned up over 40 such companies! Given the repetitive and highly parallel structure of deep learning algorithms, specialized processor architectures can provide big gains in performance and efficiency.
Second, now that the vast opportunity for deploying deep neural networks is coming into focus, algorithm developers are beginning to work in earnest on ways to make these algorithms less computationally demanding. Their techniques include using smaller data types (down to one or two bits per coefficient in some cases) and modifying network training procedures to yield less resource-intensive networks.
Finally, there's been rapid progress in software libraries and frameworks to facilitate efficient implementation of deep neural networks. NVIDIA was a pioneer in this space, and many others have followed suit.
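To make the second factor more concrete, here is a minimal sketch of what "smaller data types" can mean in practice. It assumes a simple symmetric, uniform quantization scheme, which is only one of several approaches and not necessarily the one any particular research group uses. The function names and the sample weights are hypothetical and chosen purely for illustration.

```python
def quantize(weights, bits):
    """Map float weights to signed integer codes that fit in `bits` bits.

    Illustrative symmetric uniform quantization: one scale factor is
    shared by all weights, and each weight is rounded to the nearest step.
    """
    if bits < 2:
        raise ValueError("this sketch needs at least 2 bits (sign plus magnitude)")
    levels = 2 ** (bits - 1) - 1            # 127 for 8 bits, 1 for 2 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / levels if max_abs > 0 else 1.0
    # Clamp defensively so every code stays inside [-levels, levels].
    codes = [max(-levels, min(levels, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    weights = [0.82, -0.41, 0.05, -0.97, 0.33]   # made-up coefficients
    for bits in (8, 4, 2):
        codes, scale = quantize(weights, bits)
        approx = dequantize(codes, scale)
        worst = max(abs(w - a) for w, a in zip(weights, approx))
        print(f"{bits} bits: codes={codes} worst-case error={worst:.4f}")
```

Each coefficient stored in 8 bits instead of a 32-bit float takes a quarter of the memory, and at two bits this sketch keeps only the values -1, 0 and +1. The price is approximation error that grows as the bit width shrinks, which is why modified training procedures matter as well.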
It's quite conceivable (I would say likely) that in the course of the next year or two, we'll see roughly a 10X improvement in cost-performance and energy efficiency for deep learning algorithms at each of these three layers – algorithms, software techniques, and processor architecture. Because improvements at different layers multiply rather than add, the combined gain could be on the order of 1,000X. So, tasks that today require hundreds of watts of power and hundreds of dollars' worth of silicon will soon require less than a watt of power and less than one dollar's worth of silicon.
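The arithmetic behind that estimate is easy to check. The sketch below assumes the three gains are independent and therefore multiply, which is an idealization, since real-world improvements often overlap. The starting figures of 300 watts and 300 dollars are hypothetical stand-ins for "hundreds."

```python
from math import prod

# Projected gain at each layer, per the rough 10X estimate above.
gains = {
    "algorithms": 10,
    "software techniques": 10,
    "processor architecture": 10,
}

# Assumption: the gains are independent, so they compound multiplicatively.
combined_gain = prod(gains.values())

# Hypothetical starting point standing in for "hundreds" of watts and dollars.
watts_today = 300.0
dollars_today = 300.0

watts_projected = watts_today / combined_gain
dollars_projected = dollars_today / combined_gain

if __name__ == "__main__":
    print(f"combined gain: {combined_gain}X")
    print(f"power: {watts_today} W -> {watts_projected} W")
    print(f"silicon cost: ${dollars_today} -> ${dollars_projected}")
```

If the three gains overlap rather than compound fully, the combined figure will be smaller, so treat 1,000X as an order-of-magnitude estimate.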
This will be world-changing, enabling even very cost-sensitive devices, like toys, to incorporate sophisticated visual perception. If this sounds farfetched, consider that a few decades ago, digital audio similarly required expensive, specialized equipment, while today, birthday cards incorporate digital audio chips.
Given the enormous potential for deployable deep learning, I'm very excited that several leaders in efficient implementation of deep learning, including Google's Pete Warden, Purdue University's Eugenio Culurciello, and Magic Leap's Andrew Rabinovich, will be sharing their expertise at the 2017 Embedded Vision Summit, taking place May 1-3, 2017 in Santa Clara, California. My colleagues and I at the Embedded Vision Alliance are putting together a fascinating program of presentations and demonstrations, with emphasis on deep learning, 3D perception, and energy-efficient implementation. If you're interested in implementing visual intelligence in real-world devices, mark your calendar and plan to be there. As a bonus, if you register for the Summit on the Alliance website by March 15, you can save 15% with discount code NLID0216. I look forward to seeing you in Santa Clara!