By imbuing computers with the ability to see, we significantly change the way we live. Just take a look at the autonomous vehicle revolution, or the sophisticated satellite imagery we now use to track deforestation. All of this is thanks to machine learning and computer vision.
What follows is a primer on computer vision machine learning algorithms.
What Is Computer Vision?
Computer vision is a field of artificial intelligence that aims to emulate human-like vision in computers. Research on computer vision started back in the 1950s, so it’s quite a relic in computer science terms. However, it wasn’t until recent algorithmic and hardware advancements, as well as vast amounts of publicly accessible data, that the field really took off. Today, it’s one of the most exciting areas of research that’s constantly pushing the boundaries of possibility.
In the following sections, we’ll cover some of the most common applications of computer vision machine learning algorithms.
Face Detection
Face detection is perhaps the most ubiquitous computer vision machine learning application. Not too long ago, this technology seemed like something out of a sci-fi movie. Now, we regularly rely on face detection when using our smartphones, whether as a security feature or to help our cameras find the right focus when taking photos.
A face detection system needs to be robust because the images or video streams that it analyzes are often far from perfect. Images and streams often contain faces that are not front-facing or are obstructed by glasses, a scarf or some other item of clothing. Additionally, there’s always unavoidable image noise due to movement and constantly changing lighting conditions. In addition to overcoming these challenges, a face detection system needs to be quick enough to be used in real-time.
Below, we’ll go over Haar Cascades — the first face detection algorithm used in real-time applications.
Haar Cascades
Haar Cascades is a machine learning "classifier" — a series of rules used by a machine to classify data. The "cascade" in the name refers to the algorithm consisting of a cascade of classifiers applied one after another. Each classifier looks for a particular visual feature, and finding that feature is the precondition for applying the next classifier in the chain.
For example, the first classifier might look for an eye. If no eye is found anywhere in the image, the algorithm concludes that the image does not contain a face, and there's little point in executing subsequent classifiers that might look for cheeks or an eyebrow ridge. Classifiers are applied sequentially so that later-stage classifiers can detect increasingly complex features and provide greater confidence that a photo indeed contains a face.
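The early-rejection logic described above can be sketched in a few lines of plain Python. This is only a schematic of the cascade idea, not OpenCV's actual implementation; the stage functions below are hypothetical stand-ins for real Haar-feature tests.

```python
def run_cascade(window, stages):
    """Apply classifier stages in order; reject on the first failure."""
    for stage in stages:
        if not stage(window):
            return False  # early exit: no point running later stages
    return True  # every stage passed: likely a face


# Hypothetical toy stages standing in for real Haar-feature tests.
has_eye_like_region = lambda w: w.get("eye", False)
has_cheek_like_region = lambda w: w.get("cheek", False)

stages = [has_eye_like_region, has_cheek_like_region]

print(run_cascade({"eye": True, "cheek": True}, stages))  # True
print(run_cascade({"eye": False}, stages))                # False
```

The early exit is what makes the cascade fast enough for real-time use: the vast majority of image windows contain no face and are discarded after the cheapest first test.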
While this description might be an oversimplification, you're welcome to check out OpenCV's implementation and a more detailed description of the algorithm. It's worth gaining a deeper understanding of how this algorithm works because, despite predating deep learning, it has proven so effective that it's still in widespread use.
Image Classification
Have you ever seen your smartphone create a "memory" slideshow by grouping photos you've taken and assigning them labels like "Four-legged friends," "Beach" or "Food"? What about social media's uncanny ability to recognize and tag your friends in your photos? This is accomplished through computer vision machine learning algorithms for image classification.
Image classification is the task of assigning images to categories based on what they represent, such as “dog,” “cat” or “trumpet.” This has remained a complex task for some time because it’s difficult to map our instinctive knowledge of how something looks into rules, such as colors or shapes that can be applied to a group of pixels. Some of the earliest attempts at image classification did just that, but their solution proved to be too naive as the rules for classifying one particular object would often not transfer to another image of the same object if the photo had been taken from a slightly different angle.
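To see why hand-written rules are so brittle, consider a deliberately naive "banana detector" that classifies by average yellowness. The rule and its threshold are purely illustrative, not taken from any real system, but the failure mode is the real one: the same object under dimmer lighting no longer satisfies the rule.

```python
import numpy as np


def is_banana(rgb_image, threshold=0.5):
    """Naive rule: an image is a banana if it is, on average, yellow.

    Yellowness here = high red and green, low blue (values in [0, 1]).
    """
    r, g, b = rgb_image[..., 0], rgb_image[..., 1], rgb_image[..., 2]
    yellowness = (r.mean() + g.mean()) / 2 - b.mean()
    return yellowness > threshold


bright = np.tile([0.9, 0.9, 0.1], (8, 8, 1))  # banana in good light
dim = bright * 0.4                            # same banana, dim light

print(is_banana(bright))  # True
print(is_banana(dim))     # False: the hand-written rule breaks
```

A rule tuned for one photo fails on another photo of the very same object; this is exactly the gap that learned features closed.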
Image classification in particular is an area where deep learning has shown spectacular results. Thanks to advancements in computer vision machine learning, today's smartphones can easily classify objects and people regardless of the angle at which they're photographed.
Convolutional Neural Networks
The success of deep learning in computer vision is thanks both to publicly accessible image datasets like ImageNet, as well as advances in artificial neural network architectures like convolutional neural networks (CNNs).
In contrast to the earlier naive solutions, which looked directly for high-level features, convolutional neural networks analyze images from the bottom up. CNNs are composed of layers that first process an image by detecting its more rudimentary components like lines and curves, before slowly building up more complex representations. Each layer passes its findings on to the next, which learns something new on top of the representation it received. Stacking multiple layers together results in the deeper layers learning increasingly complex features. For example, layers in the middle may learn to detect colors and abstract two-dimensional shapes, whereas layers towards the end of the network may look for a dog's ears or a tail.
The best part about CNNs is that once a layer learns to recognize a dog's ear, it can recognize that ear regardless of whether it appears in the top-left corner or in the middle of an image. This is thanks to the convolution operation, which is just a fancy way of saying that learned filters are slid across the entire image from left to right and top to bottom.
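The sliding-filter operation can be written out in a few lines of NumPy. This is a minimal sketch (strictly speaking a cross-correlation, as in most deep learning libraries, with no padding or stride options); the edge filter is a hand-picked example rather than a learned one.

```python
import numpy as np


def convolve2d(image, kernel):
    """Slide a filter across the image, left to right and top to bottom."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Response = sum of elementwise products at this position.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


# A vertical-edge filter responds wherever its pattern appears,
# regardless of where in the image that pattern sits.
edge_filter = np.array([[-1.0, 1.0]])
image = np.zeros((4, 5))
image[:, 2:] = 1.0  # a vertical edge between columns 1 and 2

response = convolve2d(image, edge_filter)
print(response)  # peaks in the column where the edge is
```

The same filter, applied everywhere, fires only at the edge location — which is exactly the translation-invariance property described above.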
Compared to the naive solution that would come up with rules defining how a dog might look, CNNs have a clear advantage: The “rules” learned by CNNs are less constrained and more generalizable, not to mention that CNNs learn which features are useful without needing direct human supervision.
Convolutional neural networks are the gold standard for image classification, and architectures for most other computer vision tasks incorporate CNNs in some way. As a computer vision practitioner, you'll have to be comfortable with some of the most popular ones, like ResNet and AlexNet.
Image Captioning
An especially exciting area of research is one that connects vision with other modalities. One example is neural image captioning systems that combine vision with language.
Image captioning is a demanding task because it goes beyond simply detecting and naming all objects in an image. Instead, a successful captioning system is able to sort through a variety of objects and determine which are the most salient and important enough to describe, and which ones can be ignored. Then, the system needs to identify the relationships among these objects before finally expressing its findings in natural language, all while respecting the applicable rules of syntax.
Let’s take a look at how this might be implemented.
How Does Image Captioning Work?
Image captioning systems use the so-called “encoder-decoder” architecture. The architecture consists of two parts:
- The encoder acts as a feature extractor. It takes an image in the form of raw pixels and outputs the features obtained by performing transformations on the input.
- The decoder provides us with a mapping between visual features and words.
The encoder part of the network is a CNN. A popular choice is the ResNet network mentioned above. The encoder's job is essentially to extract visual features from an image. These features are not always easily understood, nor do they always make visual sense, but they often correspond to objects, body parts or fragments of architecture and nature.
By contrast, the decoder is a type of neural network pioneered in natural language processing called the Recurrent Neural Network (RNN). It operates on the encoder’s output and generates a description of the image that’s both adequate and grammatical.
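The encoder-decoder control flow can be sketched as a simple loop. Everything below is a schematic stand-in: in a real system `encode` would be a CNN such as ResNet and `step` an RNN cell; here they are toy functions (with a hard-coded, hypothetical word table) so the structure is runnable.

```python
def encode(image):
    """Stand-in for a CNN encoder; a real one outputs a feature vector."""
    return {"dog": True, "grass": True}


def step(features, prev_word):
    """Stand-in for an RNN decoder step: predict the next word."""
    transitions = {"<start>": "a", "a": "dog", "dog": "on",
                   "on": "grass", "grass": "<end>"}
    return transitions[prev_word]


def caption(image, max_len=10):
    """Greedy decoding: feed each predicted word back in until <end>."""
    features = encode(image)
    words, word = [], "<start>"
    while len(words) < max_len:
        word = step(features, word)
        if word == "<end>":
            break
        words.append(word)
    return " ".join(words)


print(caption(None))  # "a dog on grass"
```

The key design point is the feedback loop: the decoder consumes its own previous output at every step, which is what lets it produce a grammatical sequence rather than an unordered list of detected objects.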
Image captioning is closely related to other computer vision and language tasks such as scene understanding and visual question answering. This research is still in its infancy but has the potential to transform our lives, for example by helping the visually impaired better navigate their surroundings.
Want To Become a Computer Vision Expert?
In this article, we explained some of the most common computer vision machine learning applications and the algorithms that power them.
In our Computer Vision Nanodegree program, we cover the theoretical underpinnings of computer vision algorithms and take you through practical exercises to help you build a portfolio of computer vision projects.