Many modern computer vision applications rely on deep learning algorithms. Selecting and using datasets to train those algorithms is therefore an essential skill for any computer vision engineer.

In this article, we’ll take a look at how computer vision datasets have advanced the field of object recognition and contributed to high-tech inventions like self-driving cars and the latest smartphones.

What is Computer Vision?

Computer vision is a branch of artificial intelligence that aims to replicate the human visual system so that computers can process the content of images and videos much the way humans do.

How might you employ computer vision? Computer vision is what enables self-driving cars to stop for pedestrians and red lights, automates the detection of cancerous tumors in medical imaging, and allows you to unlock your iPhone by simply looking at your screen.

To better grasp computer vision’s potential, let’s back up and examine what that field was like when machine learning was in its infancy. Well before deep learning algorithms, developers would design small applications that could detect patterns from images. They would then call on statistical learning algorithms — algorithms based on linear regression, decision trees, and the like — to classify objects from those images.

Although these software applications would often outperform human experts, building them required teams of engineers and was time-intensive. Enter deep learning, a data-driven approach to machine learning that proved transformative across every field.

Indeed, the main drivers of computer vision’s recent growth are the field’s reliance on deep-learning algorithms and the increase in generated data on which these algorithms are based. Given the sheer amount of labeled data we have today, computer vision accuracy rates have skyrocketed. Let’s explore the importance of these datasets to computer vision and deep learning in general.

What is the Role of Datasets in Deep Learning?

At the heart of deep learning is the concept of neural networks: collections of connected nodes loosely modeled on the neurons of the human brain. Providing a neural network with examples of labeled data enables it to detect patterns, based on which it can then adjust the weights (also called parameters) by which unlabeled data is classified.

To make these models as accurate as possible, engineers will ideally feed them thousands of pieces of labeled data. In other words, if we expect deep learning algorithms to accurately classify unlabeled data, we must feed them massive labeled datasets. Let’s look at a few ways such data is labeled, as well as how we might acquire these datasets for our own algorithms.
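The weight-adjustment idea described above can be sketched in miniature. The toy classifier below is a single artificial neuron trained with the classic perceptron update rule on a handful of labeled 2-D points; real computer vision models use millions of weights, but the principle of nudging each weight in proportion to the prediction error is the same. The function names and toy data are illustrative, not from any specific framework.

```python
def train(examples, labels, epochs=20, lr=0.1):
    """Train a single neuron with the perceptron update rule."""
    n = len(examples[0])
    weights = [0.0] * n
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            # Predict 1 if the weighted sum crosses the threshold.
            pred = 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0
            # Adjust each weight in proportion to the error on this labeled example.
            error = y - pred
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

def predict(weights, bias, x):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

# Toy "labeled dataset": 2-D points labeled by whether they lie past a boundary.
data = [(0.0, 0.0), (0.2, 0.3), (0.9, 0.8), (1.0, 1.0)]
labels = [0, 0, 1, 1]
w, b = train(data, labels)
print([predict(w, b, x) for x in data])  # → [0, 0, 1, 1]
```

Note that `epochs` and `lr` (learning rate) are not learned from the data at all; they are choices made by the engineer, a point we return to below.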

Hand-Labeling

Freelancers and data scientists often hand-label data, especially for hard-to-identify objects in images. For example, MIT and IBM research scientists hired freelancers from Amazon Mechanical Turk to create the images behind ObjectNet, a collection of images containing tipped-over objects or ones photographed from abnormal perspectives.

User-Generated Labels

If you’ve ever completed a CAPTCHA to convince a website you’re not a robot, you’ve likely performed data labeling yourself. Those Google reCAPTCHA 3×3 grids that prompt you to select all images containing a stop sign help train Google’s artificial intelligence algorithms. By labeling data this way, you are not only helping classify objects that traditional algorithms cannot identify, but you might also be training the algorithms behind technologies like self-driving cars. So when an autonomous vehicle stops at a stop sign, give yourself a pat on the back.

Already Labeled Data

Another way we can acquire vast amounts of data is to repurpose already available data. For example, Fashion-MNIST is a dataset of Zalando’s grayscale images of various fashion items. Comprising a training set of 60,000 examples and a test set of 10,000 examples, it is an alternative to the original MNIST dataset of handwritten digits. 

Now that we’ve gone over different ways to label and acquire data, let’s look at how computer vision algorithms intake this data and learn from it.

How Computer Vision Algorithms Learn From Labeled Data

Let’s say you want to create an algorithm based on a neural network to identify a particular dog breed. You would first train your neural network on thousands of images of dogs, each tagged with the pictured dog’s breed. By accounting for these tags, the neural network becomes more proficient at distinguishing one canine breed from another. The more labeled dogs it sees, the more precise the algorithm’s “vision” becomes.

Let’s now assume the algorithm already exists and you simply want to create an app that allows each of your three dogs to unlock your new smartphone. To create a breed-recognition feature, you would only need to select an algorithm that can distinguish different breeds and train it on the specific dogs it must detect: your lab, husky, and chihuahua. After ingesting your dogs’ faces from a variety of angles, the neural network will be able to recognize them without further input or feature verification.
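One common way such an “enroll your own dogs” feature works in practice is to store one reference vector (an embedding) per enrolled dog and match each new photo to the closest stored vector, rather than retraining the whole network. The sketch below illustrates that matching step with hand-made stand-in embeddings; in a real system, an existing pretrained network would compute them from the photos.

```python
import math

def cosine_similarity(a, b):
    """Similarity between two embedding vectors (1.0 means identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def recognize(query, enrolled):
    """Return the enrolled name whose embedding is most similar to the query."""
    return max(enrolled, key=lambda name: cosine_similarity(query, enrolled[name]))

# One reference embedding per enrolled dog (illustrative values, not real output).
enrolled = {
    "lab":       [0.9, 0.1, 0.0],
    "husky":     [0.1, 0.9, 0.1],
    "chihuahua": [0.0, 0.1, 0.9],
}

# A new photo's embedding: slightly noisy, but closest to the husky's.
query = [0.2, 0.8, 0.2]
print(recognize(query, enrolled))  # → husky
```

Because only the stored reference vectors change when you enroll a new dog, this design avoids retraining entirely, which is why such features can be set up on-device in seconds.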

Creating a powerful algorithm requires more than providing it with labeled data. Experts must also tune hyperparameters: settings such as the number of layers in the neural network, or the number of times (epochs) the algorithm works through the entire computer vision dataset.
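A common way to tune such hyperparameters is a grid search: train once per combination of settings and keep the combination with the best validation score. A minimal sketch, with a fake scoring function standing in for “train a network and measure validation accuracy” (the parameter names are illustrative, not from any specific framework):

```python
import itertools

def grid_search(train_and_score, grid):
    """Try every hyperparameter combination; return the best one and its score."""
    best_score, best_params = float("-inf"), None
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = train_and_score(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Stand-in for real training: this fake score simply rewards more layers
# and more epochs, capped at 0.95, so the search has something to find.
def fake_train_and_score(num_layers, num_epochs):
    return min(num_layers * 0.1 + num_epochs * 0.01, 0.95)

grid = {"num_layers": [2, 4, 8], "num_epochs": [5, 10, 20]}
params, score = grid_search(fake_train_and_score, grid)
print(params, score)
```

In practice each call to the scoring function is a full training run, which is why hyperparameter search is one of the most computationally expensive parts of building a model.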

Three Datasets for Computer Vision

We know the importance of computer vision datasets to creating strong deep-learning algorithms. However, data labeling is an expensive, time-consuming activity. Could you imagine how long it would take someone to label 10,000 images? Here are three present-day examples of datasets for computer vision.

MNIST and Fashion MNIST

Let’s take a look at the MNIST and Fashion MNIST databases. The original MNIST dataset, composed of handwritten digits, often serves as a benchmark for new algorithms developed by members of the AI/ML/Data Science communities. 

The creators of Fashion-MNIST sought to offer an alternative to MNIST, citing the latter’s inability to represent modern CV tasks, its overuse by the community, and its oversimplification (most pairs of digits could be distinguished by just one pixel).

Each example in Fashion-MNIST shares the same 28×28 image size that is standard for MNIST, as well as the structure of training and testing splits. Developers can therefore use the Fashion-MNIST dataset as a drop-in replacement in any machine-learning benchmark that uses the MNIST dataset.
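The “drop-in replacement” claim can be made concrete: any pipeline written against the shared layout (28×28 grayscale pixels, 10 classes, a 60,000/10,000 train/test split) runs unchanged on either dataset, with only the data source swapped. A sketch using synthetic stand-in records in place of the real files:

```python
def normalize(images):
    """Scale 0-255 pixel values to the 0.0-1.0 range models typically expect."""
    return [[px / 255.0 for px in img] for img in images]

def check_mnist_format(images, labels):
    """Validate the layout shared by MNIST and Fashion-MNIST."""
    assert all(len(img) == 28 * 28 for img in images), "images must be 28x28"
    assert all(0 <= lbl <= 9 for lbl in labels), "labels must be one of 10 classes"
    return True

# Synthetic batch standing in for either dataset's records: four flat
# 784-pixel images at mid-gray, with labels drawn from the 10 classes.
images = [[128] * (28 * 28) for _ in range(4)]
labels = [0, 3, 7, 9]

check_mnist_format(images, labels)
scaled = normalize(images)
print(scaled[0][0])  # → 128/255, a value between 0 and 1
```

Since both `check_mnist_format` and `normalize` depend only on the shared shape, pointing them at real MNIST or Fashion-MNIST records (for example via a library loader) requires no code changes.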

ImageNet and ObjectNet

ImageNet, the crowdsourced photo database popular in the artificial intelligence community, contains 14 million images, each attached to a labeled node that describes the image’s contents. Each node has an average of 500 photos linked to it.

However, the relatively straightforward images in ImageNet can’t help enhance modern image-detection algorithms, which have already ingested everything ImageNet has to offer. When these algorithms started operating in the real world, their performance noticeably dropped compared to how well they did with the ImageNet dataset. The percentage of correctly identified objects dropped most significantly for images of objects taken at unusual angles. To close this performance gap, a team of MIT and IBM researchers set out to create a very different kind of object-recognition dataset. 

Called ObjectNet, this dataset contains about 50,000 images of tipped-over objects, as well as photos of “right-side-up” objects taken from unusual angles.

Labeled Faces in the Wild (LFW)

How have facial recognition technologies learned to accurately detect faces? Labeled Faces in the Wild, a database of face photographs, was designed to resolve unconstrained face recognition problems. This computer vision dataset has over 13,000 facial images, each labeled with the name of the person it pertains to. A fraction of those pictured (around 1,680) have more than one distinct photo in the dataset.

This dataset’s only major issue is that its images were filtered by a single detection technology: the Viola-Jones face detector. Because this detector served as a filter when the database was created, the dataset contains few photos taken from the side or from above and below.

Conclusion

In this article, we described computer vision and detailed the importance of datasets when working with computer vision. We also covered three labeled datasets available today. If this field of work interests you, consider enrolling in our Computer Vision Nanodegree program to become proficient in computer vision!

Start Learning