In many applications of digital signal processing – such as speech recognition and computer vision – the essential objective is to distinguish objects of interest, such as words or faces. This can be very challenging in real-world situations where objects of interest are distorted (e.g., a person is speaking with an accent, or a face is turned at an angle from the camera) or obscured (for example, a voice is competing with background noise, or a face is partially covered by a hand).
Over the past 50 years, in academia and in industry, hundreds of thousands of man-years have been invested in developing solutions to these kinds of problems. Overall progress on many of these problems has been significant, but the pace has often seemed very slow. For example, 20 years after the first commercial speech recognition products emerged, speech recognition remains a niche technology which most people use very rarely, if at all.
Recently, however, researchers and product developers have been achieving rapid progress on many of these recognition problems using artificial neural networks. Artificial neural networks themselves are not new, tracing their conceptual origins to the 1950s and '60s. But several factors have converged to make artificial neural networks a practical – and in some cases compelling – technology today.
The first factor leading to the growing prominence of neural networks is a common enabler in such shifts: computing power. It turns out that, to solve difficult problems, artificial neural networks often must be quite large – encompassing perhaps a million neurons. Executing such large networks at speeds sufficient to enable real-time responses requires a level of processor performance that has only recently become available, particularly in embedded and mobile devices. And even greater computing power is required for training neural networks. Training is typically done today using dedicated high-performance server farms incorporating large numbers of graphics processing units (GPUs) – not for generating graphics, but for performing parallel math calculations.
The centrality of the training process is one of the distinctive characteristics of neural networks. As I wrote in my earlier column on neural networks last year:
"Such networks do not execute programs. Instead, their behavior is governed by their structure (what is connected to what), the choice of simple computations that each node performs, and a training procedure. So rather than trying to distinguish dogs from cats via a recipe-like series of steps, for example, a convolutional neural network would be taught how to categorize images through being shown a large set of example images."
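To make the idea of learning from examples rather than from a recipe a little more concrete, here is a deliberately tiny sketch in Python. It is a toy, not a convolutional network: a single artificial "neuron" that computes a weighted sum and a threshold, trained with the classic perceptron rule on four labeled examples of the logical AND function. The function names, the learning rate, and the number of passes are all assumptions chosen for illustration.

```python
# Toy illustration only: one artificial "neuron" that learns the logical AND
# function from labeled examples instead of from hand-written rules.

def neuron(weights, bias, inputs):
    """The node's simple computation: a weighted sum, then a hard threshold."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0

def train(examples, epochs=20, learning_rate=1):
    """Perceptron rule: after each wrong answer, nudge the weights and bias
    in the direction that would have made the answer more correct."""
    weights = [0, 0]
    bias = 0
    for _ in range(epochs):
        for inputs, target in examples:
            error = target - neuron(weights, bias, inputs)
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

# The "training set": every input pair with its desired output.
AND_EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train(AND_EXAMPLES)
predictions = [neuron(weights, bias, inputs) for inputs, _ in AND_EXAMPLES]
print(predictions)  # [0, 0, 0, 1], matching the targets
```

Nothing in the code says what AND means; the behavior comes entirely from the structure, the node's computation, and the examples shown during training. Real networks differ mainly in scale and in using smoother computations and more sophisticated training procedures.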
This training process often requires huge amounts of data. For example, for a network being trained to recognize vehicles, tens or hundreds of thousands of images would be used, showing different types of vehicle, different perspectives, different settings, different lighting, etc. The ready availability of large quantities of example images (or speech samples, etc.) on the Internet is a second key factor enabling the emergence of artificial neural networks as a practical technology today.
But even with the massive amounts of data available today, designers of neural networks hunger for more data – because more data enables more training, which creates more effective networks.
In a recently published paper, Ren Wu and his team at Baidu describe a clever technique they developed to augment available training images: they created numerous modified versions of each available image (for example, by rotating, scaling, and changing the color cast). In this way, they multiplied the number of images available for training their network by a factor of 10,000 or more. This has helped Ren's team to create leading-edge neural networks for image recognition.
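The general idea of multiplying a training set through modified copies can be sketched in a few lines of Python. This is only an illustration of the concept, not the Baidu team's method: it assumes an image is a plain list of rows of (r, g, b) pixels, limits rotation to quarter turns, omits scaling, and uses made-up color gains.

```python
def rotate90(image):
    """Rotate an image (a list of rows of (r, g, b) pixels) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def color_cast(image, gains):
    """Scale each color channel by its gain, clamping results to at most 255."""
    return [
        [tuple(min(255, int(c * g)) for c, g in zip(pixel, gains)) for pixel in row]
        for row in image
    ]

def augment(image, gain_options):
    """Return one variant for every pairing of a quarter-turn rotation
    (0, 90, 180, 270 degrees) with a color cast."""
    variants = []
    rotated = image
    for _ in range(4):
        for gains in gain_options:
            variants.append(color_cast(rotated, gains))
        rotated = rotate90(rotated)
    return variants

# A 2x2 stand-in for a real training image.
image = [[(200, 100, 50), (10, 20, 30)],
         [(0, 0, 0), (255, 255, 255)]]

# Hypothetical color casts: unchanged, slightly warmer, slightly cooler.
GAINS = [(1.0, 1.0, 1.0), (1.1, 1.0, 0.9), (0.9, 1.0, 1.1)]

variants = augment(image, GAINS)
print(len(variants))  # 12 variants from one image (4 rotations x 3 casts)
```

Because the transformations combine multiplicatively, adding finer rotation angles, several scales, and more color variations quickly pushes the number of variants per image into the thousands.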
Ren and his team are not only at the forefront of designing more sophisticated neural networks; they are also deploying these networks in the cloud to enable functions like image search for consumers. And they are pioneering the implementation of neural networks on mobile devices. Given the huge potential impact of this work, I am thrilled that Ren will be one of the keynote speakers at the Embedded Vision Summit conference that I'm organizing, which will take place on May 12 in Santa Clara, California.
The Embedded Vision Summit is a unique conference for product developers who are creating more intelligent products and applications by integrating computer vision into their designs. Join us at the Summit to hear Ren's story about how Baidu is advancing image recognition both at the conceptual level and in practical deployments. And at the Summit, you'll have the opportunity to hear dozens of other top-notch presentations providing expert insights and practical know-how for integrating visual intelligence into your products. I hope to see you there!
Please visit BDTI's Computer Vision Design Services page to learn more about our convolutional neural network (CNN) expertise and capabilities.