In 2011, I spoke with Professor Jitendra Malik, a distinguished computer vision researcher and teacher at U.C. Berkeley. Professor Malik, tongue-in-cheek, remarked that he is frequently frustrated when trying to explain his work to non-technical people. In their minds, his research often sounds like an awful lot of effort just to enable a computer to approach the object-recognition capabilities of a toddler.
Indeed, this is one of the paradoxes of computer vision. In some cases, vision systems can exceed the capabilities of humans (for example, detecting a person's heart rate by looking at their face). In many other cases, however, it is extraordinarily difficult to create algorithms that match human visual capabilities (such as distinguishing dogs from cats).
As I wrote in my previous column, improvements in processors and sensors are enabling system developers to incorporate sophisticated computer vision into a rapidly expanding range of products. But creating algorithms that robustly extract meaning from pixels remains a daunting challenge in many applications, in part due to the variety of different images that an object (like a cat) can present.
Classical object recognition approaches often attempt to identify objects by first detecting small features (like edges or corners), then assembling collections of these small features to identify larger features (such as an eye), and then reasoning about these larger features to deduce the identity of objects of interest (like a face). Such approaches can work very well when the objects of interest are uniform and the imaging conditions are favorable (for example, inspecting bottles on an assembly line to ensure the correct labels are properly affixed).
But these approaches often struggle when conditions are more challenging, such as when the objects of interest are deformable, when there can be significant variation in appearance from individual to individual, and when illumination is poor. With recent improvements in processors and sensors, a case can be made that good algorithms are now the bottleneck in creating effective "machines that see."
In light of this grand challenge in computer vision, I was very excited to hear the outstanding morning keynote presentation delivered by Yann LeCun of New York University (who is also now the Director of Artificial Intelligence at Facebook) at the recent Embedded Vision Summit. LeCun made a convincing case for a very different approach to object recognition. In essence, LeCun advocates that instead of "telling" our machines how to recognize objects ("first look for edges, then look for edges that might make circles, …"), we should "show" (or perhaps "train") them.
More specifically, LeCun described the use of convolutional neural networks: massively parallel algorithms made up of layers of simple computation nodes (or "neurons"). Such networks do not execute programs. Instead, their behavior is governed by their structure (what is connected to what), the choice of simple computations that each node performs, and a training procedure.
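To make the idea concrete, here is a minimal sketch (in plain Python, and in no way LeCun's actual implementation) of the kind of simple computation each layer performs: every output "neuron" computes the same weighted sum over a small patch of its input, followed by a nonlinearity. In a real convolutional network the kernel weights are learned from example images during training; the hand-picked edge-detecting kernel below merely stands in for learned weights.

```python
def conv2d(image, kernel):
    """Valid 2-D convolution of a small grayscale image with a kernel.

    Each output value is a weighted sum of one kernel-sized patch of
    the input -- the "simple computation" performed by each neuron.
    """
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    out = []
    for r in range(h - kh + 1):
        row = []
        for c in range(w - kw + 1):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += kernel[i][j] * image[r + i][c + j]
            row.append(acc)
        out.append(row)
    return out


def relu(feature_map):
    """A common per-neuron nonlinearity: keep only positive responses."""
    return [[max(0.0, v) for v in row] for row in feature_map]


# A 4x4 "image" containing a vertical edge, and a kernel that responds
# strongly wherever such an edge appears.
image = [[0, 0, 1, 1]] * 4
vertical_edge_kernel = [[-1, 0, 1],
                        [-1, 0, 1],
                        [-1, 0, 1]]

feature_map = relu(conv2d(image, vertical_edge_kernel))
# Every position in the 2x2 output responds to the edge equally strongly.
```

Stacking many such layers, with the weights set by training rather than by hand, is what lets a network discover its own edge, eye, and face detectors instead of being told what to look for.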
So rather than trying to distinguish dogs from cats via a recipe-like series of steps, for example, a convolutional neural network would be taught how to categorize images through being shown a large set of example images. (Which, come to think of it, is pretty similar to how those remarkably capable human toddlers learn to make such distinctions.)
Will convolutional neural networks become the dominant approach to object recognition? It may be too early to say, but LeCun's keynote made a very convincing case. His talk, which was described by several attendees as "mind-expanding," has certainly changed my thinking. If you didn't get to see it live, you can now view Yann's presentation online. It is well worth your time.
Please visit BDTI's Computer Vision Design Services page to learn more about our convolutional neural network (CNN) expertise and capabilities.