Convolutional neural networks (CNNs) and other "deep learning" techniques are finding increasing use in a variety of detection and recognition tasks: identifying music clips and speech phrases, for example, and finding human faces and other objects in images and videos. As a result, we’ve been covering deep learning concepts and implementations regularly in InsideDSP columns and news articles. Chris Rowen, Chief Technology Officer of Cadence Design Systems' IP Group, will be speaking about the future of neural networks at the upcoming Embedded Vision Summit. BDTI sat down with Rowen to hear his perspective on the role that deep learning will play in digital signal processing. Rowen (who is an IEEE Fellow, was a co-founder of MIPS, and was the founder of Tensilica) has a long history in digital signal processing.
BDTI began by asking Rowen to compare and contrast traditional digital signal processing with emerging deep learning approaches, particularly in the context of computer vision applications. He answered by first giving a high-level definition of digital signal processing: a system of computation based on linear algebra in various forms. Such a definition works well for the image-processing portion of a computer vision pipeline, as well as for some vision processing functions, such as image enhancement to bring out various features, or optical flow. But what about, for example, a Haar cascade used to pick out a particular feature? While it can be written in MATLAB, such an algorithm is a step (or a few steps) away from a linear algebra frame of mind. Rather than a classical signal-processing problem, it's more of a general computing problem: a program that is run, a procedure, not just an equation.
The distance from classical digital signal processing increases, according to Rowen, when we move beyond "classical" computer vision algorithms to newer deep learning approaches. When we examine the raw computations comprising a convolutional neural network, they look very much like digital signal processing. However, it's difficult to explain in digital signal processing terms why one deep learning network performs better than another. While equations clearly describe what CNNs are doing, there's not much theory to explain why they work. Recognition is not a well-defined mathematical concept; you can't reliably ascertain solely via mathematics how many operations per second are needed for a given-sized input to deliver an answer of sufficient quality. Human cognition, after which deep learning is loosely modeled, is similar: "what makes a dog" cannot be mathematically modeled in a closed-form equation.
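To make the "looks very much like digital signal processing" point concrete, here is a minimal sketch of the arithmetic inside one convolutional layer, written as explicit multiply-accumulate (MAC) loops. It is an illustration rather than anything Rowen presented: the function name `conv2d_valid`, the 4x4 image, and the 3x3 kernel are all hypothetical. Note that the "convolution" in a CNN is, strictly speaking, a cross-correlation (the kernel is not flipped), which is what the code computes.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with no padding and a stride of 1.

    Returns the output feature map and the number of MACs performed.
    """
    in_h, in_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = in_h - k_h + 1, in_w - k_w + 1
    output = [[0.0] * out_w for _ in range(out_h)]
    macs = 0
    for y in range(out_h):
        for x in range(out_w):
            acc = 0.0
            for i in range(k_h):
                for j in range(k_w):
                    # One multiply-accumulate, the same primitive used in an FIR filter.
                    acc += image[y + i][x + j] * kernel[i][j]
                    macs += 1
            output[y][x] = acc
    return output, macs


# Hypothetical input: a 4x4 image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical hand-picked kernel that responds to left-to-right brightening.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map, mac_count = conv2d_valid(image, kernel)
print(feature_map, mac_count)
```

In a classical pipeline an engineer would choose the kernel values by hand, as above. In a CNN the loops are identical, but the kernel values are learned from training data, which is where the "why does it work" question becomes hard to answer with signal processing theory alone.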
So then, BDTI asked, what traditional digital signal processing tasks will prove to be better solved using deep neural networks? Will the traditional diversity of highly customized algorithms (for speaker identification and speech recognition, for example, or various object recognition tasks) end up being replaced by neural networks? Rowen answered by first categorizing deep learning as the "shiny new tool" that engineers are attempting to apply to all sorts of problems, much as a four-year-old given a hammer will use it to pound on everything around him or her. With that said, Rowen acknowledged, deep learning so far seems to be remarkably effective in a variety of applications; it is an impressive hammer. He used the example of weather forecasting: give a neural network lots of historical weather data, ask it to solve for tomorrow's weather, and it'll probably do a pretty good job even though it doesn't comprehend fluid dynamics, heat transfer, or any other physics. Physics gives us great tools, but complex systems like the weather require significant simplification in order to use these tools, and valuable information is discarded in the process.
Given the limitations of scientific models (of weather, visual perception, etc.), should we forget about these models and be satisfied with techniques that can predict, without explaining underlying mechanisms? Definitely not, said Rowen. There are plenty of things we know, especially at the "micro" level, that are highly predictable via physics even if they seem complex to a casual observer and are difficult to capture just by observation. Take electromagnetic radiation, for example; you might learn simple things about light and shadow solely from a set of example images, but concepts such as diffraction gratings, the interactions between waves, and the ability to communicate information at a certain rate cannot be fully comprehended solely in this manner. Instead, you need to leverage physics at the micro level in order to get to core understandings of such behavior. And of course, micro behaviors have both micro and macro effects. Neural networks, by contrast, begin with macro-level observations in attempting to understand underlying mechanisms. Generally, Rowen believes that what we'll see going forward is not the discarding of 2500 years' worth of physics insights, but a partitioning of any particular task to leverage both classical and deep learning techniques. Deep learning and classical digital signal processing are not opponents but, ultimately, partners.
Rowen used the example of AlphaGo, the Google-developed CPU- and GPU-based system that recently dominated a five-game Go competition against Lee Sedol (one of the top human Go players in the world), to further illustrate the likely future interplay between classical techniques and neural networks. At its heart, AlphaGo is a neural network, which was trained by feeding it a set of recorded historical Go games comprising approximately 30 million moves in aggregate. However, the algorithm used encompasses both machine learning and Monte Carlo tree search techniques, the latter assigning values to various positions and moves. AlphaGo's developers were able to break the Go problem down in this way because they knew something of the underlying structure of the decision process.
And how does deep learning affect processor architecture and usage? Fortunately, Rowen responded, neural networks are pretty predictable in their patterns of computation and memory access. In general, he believes we'll see evolutionary adaptation of existing processor architectures to incrementally optimize for deep learning needs; more revolutionary architecture changes are more difficult and take longer to accomplish. With that said, Rowen doesn't believe that convolutional neural networks (which are well suited for today's processor architectures) will necessarily continue to be the sole approach in the future.
Neural networks are quite far from being standardized, said Rowen, leading to a long-term need for programmable processors. Standardization is necessary when transmitter and receiver compatibility is essential, such as in multimedia encoding and wireless communications. Conversely, computer vision (for example) encompasses a diverse set of experiences on a diverse set of data, and evolution is quite fast. Therefore, the idea of standardizing on a single fixed neural network is implausible. And without standardized algorithms, highly specialized, hard-wired processing engines don’t make sense.
Regardless of whether computer vision is implemented using fixed-function or programmable hardware, however, it will grow to be a dominant consumer of processor cycles, energy, memory bandwidth, and programmer labor in the future, for two reasons. First, vision is inherently more computationally demanding than comparable tasks on other types of data. When doing speech phrase recognition with neural networks, for example, 1 million MACs per frame is considered a significant processor load. For imaging and vision, on the other hand, 1 billion MACs per frame is a low-end processing load. Second, Rowen believes that computer vision can be applied to many problems, and the resulting experiences can be quite compelling. (He used the example of Google Translate, which he recently tried for the first time and found very impressive.)
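The gap between those two figures can be sanity-checked with back-of-the-envelope arithmetic. The sketch below counts MACs per frame for two hypothetical networks; the layer sizes are assumptions chosen only to be plausible (a small fully connected speech network, and two 3x3 convolutional layers operating on a 224x224 image) and do not come from Rowen. The helper names `dense_layer_macs` and `conv_layer_macs` are likewise illustrative.

```python
def dense_layer_macs(n_inputs, n_outputs):
    """MACs for a fully connected layer: one per weight."""
    return n_inputs * n_outputs


def conv_layer_macs(out_h, out_w, in_channels, out_channels, k_h, k_w):
    """MACs for a convolutional layer: one kernel's worth per output value."""
    return out_h * out_w * out_channels * in_channels * k_h * k_w


# Hypothetical speech network: 400 input features, two hidden layers of 512
# units, and 500 output classes, evaluated once per audio frame.
speech_macs = (
    dense_layer_macs(400, 512)
    + dense_layer_macs(512, 512)
    + dense_layer_macs(512, 500)
)

# Hypothetical vision network: only the first two 3x3 convolutional layers,
# applied to a 224x224 RGB image with the output size kept at 224x224.
vision_macs = (
    conv_layer_macs(224, 224, 3, 64, 3, 3)
    + conv_layer_macs(224, 224, 64, 64, 3, 3)
)

print(f"speech: {speech_macs:,} MACs per frame")
print(f"vision: {vision_macs:,} MACs per frame")
print(f"ratio:  {vision_macs / speech_macs:,.0f}x")
```

Under these assumptions the entire speech network needs a little under one million MACs per frame, while just two early layers of the vision network already need close to two billion, consistent with the roughly thousandfold difference described above.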
For more insights on deep learning, plan to attend the Embedded Vision Alliance's Embedded Vision Summit, taking place May 2-4 in Santa Clara, California. The Embedded Vision Summit is an educational forum for product creators interested in incorporating visual intelligence into electronic systems and software, and Rowen's presentation "The Road Ahead for Neural Networks: Five Likely Surprises" is part of the Deep Learning Day on May 2. Register now, as space is limited and seats are filling up!