Last summer, I wrote that the time was ripe for deployment of neural networks in mass-market applications.
Last week, Google validated this point of view by announcing that it has developed a specialized processor for neural networks (dubbed the "Tensor Processing Unit," or TPU), and that these processors have been deployed in Google's data centers for more than a year.
What's the significance of this? I believe that Google's recent statements validate three important points that help to explain why deep learning will have an enormous impact in the next decade.
First, Google's disclosures make it clear that deep learning has made the transition from an exotic research technology to a deployed, commercial technology. Google has stated that it is relying extensively on deep learning for commercial products, including Google Translate, Google Photos, and Google search rankings. This is a key part of why it makes business sense for Google to invest the tens of millions of dollars required to develop a new chip.
Second, Google's disclosure of the Tensor Processing Unit validates that it is possible to gain an order of magnitude in processing efficiency (throughput per dollar and throughput per watt) for deep learning by using specialized processors. This is no surprise; processor designers have shown time and again over the past several decades that – for workloads that are regular and parallelizable – specialized architectures yield big efficiency gains. And neural network algorithms are very regular and very parallelizable. (Until recently, system designers could count on processors becoming more efficient simply due to improvements in silicon fabrication technology. Today, those gains are diminishing, and as a result, improvements in processor efficiency due to specialized parallel architectures are becoming more important.)
Third, Google has demonstrated that deep learning is a very versatile technology, applicable to diverse problems from speech recognition to web search ranking to image recognition. In his keynote presentation at the recent Embedded Vision Summit, Jeff Dean stated that the number of Google products and projects using deep learning has grown exponentially in the past three years, recently exceeding 1,000.
These three points set the stage for the emergence of a "virtuous circle" for deep learning technology – one that reminds me of the role that the fast Fourier transform (FFT) has played in digital signal processing. When specialized digital signal processors emerged in the 1980s, the FFT was known to be an important algorithm, and these processors included instructions and addressing modes that accelerated FFT implementations. The fact that these processors executed FFTs efficiently motivated engineers to find new ways to use the FFT. And this more widespread use of the FFT then drove processor designers to make additional FFT-specific improvements in their processors (some even added dedicated FFT co-processors). Today, the FFT is ubiquitous in audio, communications, and many other signal processing applications.
I believe we're at the start of a similar constructive feedback loop for deep learning. Now that deep learning has been shown to be valuable for many important commercial applications, specialized processors are emerging that provide a big boost in efficiency on deep learning algorithms. This, in turn, will spur use of deep learning in new applications. And the cycle will continue.
And, now that the commercial relevance of deep learning is clear, we can also expect to see significant improvements in the efficiency of the algorithms themselves. When deep learning was mainly a research topic, little effort was invested in finding efficient ways to deploy these algorithms. Now that deep learning is being deployed at scale, many companies are developing more efficient deep learning algorithms. A simple example of this is the transition from 32- or 64-bit floating-point math (the dominant data types in deep learning research) to 8-bit integer math (the focus of Google's new processor). Many other algorithmic improvements are emerging, and in aggregate, algorithm improvements will yield at least an order of magnitude gain in the efficiency of deployed deep learning applications – independent of the improvements at the processor level.
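To make the floating-point-to-integer idea concrete, here is a minimal sketch of one common approach: symmetric linear quantization with a single scale factor per vector. This is an illustration only, not a description of how Google's processor or software works. The function names (`quantize`, `int_dot`) and the sample values are made up for the example.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers that share one scale factor.

    Illustrative symmetric scheme (an assumption for this sketch):
    the largest magnitude in `values` maps to the largest integer.
    """
    qmax = 2 ** (num_bits - 1) - 1              # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def int_dot(qa, qb):
    """Dot product that uses integer multiplies and adds only."""
    return sum(x * y for x, y in zip(qa, qb))


# Hypothetical weights and inputs for a single neuron.
weights = [0.42, -1.30, 0.07, 0.88]
inputs = [1.50, 0.25, -0.60, 2.00]

q_w, s_w = quantize(weights)
q_x, s_x = quantize(inputs)

exact = sum(w * x for w, x in zip(weights, inputs))
approx = int_dot(q_w, q_x) * s_w * s_x          # rescale once, at the end

print(f"float result: {exact:.4f}  8-bit result: {approx:.4f}")
```

The inner loop touches only small integers, which is the part that specialized hardware can make cheap. The two floating-point scale factors are applied once per dot product rather than once per multiply, at the cost of a small rounding error in the result.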
Combining algorithm improvements and processor improvements, we can expect to see at least a 100x improvement in the efficiency of deep learning implementations in the near future. This will surely enable new applications, which in turn will drive further improvements in chips and algorithms.
The modern FFT was published in 1965. Thirty years later, it had become an indispensable technology. How long will it take for deep learning to achieve the same status?