Over the course of an hour, an unsolicited email skips your inbox and goes straight to spam, a car next to you auto-stops when a pedestrian runs in front of it, and an ad for the product you were thinking about yesterday pops up on your social media feed. What do these events all have in common? It’s artificial intelligence that has guided all these decisions. And the force behind them all is machine-learning algorithms that use data to predict outcomes.
Now, before we look at how machine learning aids data analysis, let’s explore the fundamentals of each.
What is Machine Learning?
Machine learning is the science of designing algorithms that learn on their own from data and adapt without human correction. As we feed data to these algorithms, they build their own logic and, as a result, create solutions relevant to aspects of our world as diverse as fraud detection, web searches, tumor classification, and price prediction.
In deep learning, a subset of machine learning, programs discover intricate concepts by building them out of simpler ones. These algorithms work by exposing multilayered (hence “deep”) neural networks to vast amounts of data. Applications for machine learning, such as natural language processing, dramatically improve performance through the use of deep learning.
What is Data Analysis?
Data analysis involves manipulating, transforming, and visualizing data in order to infer meaningful insights from the results. Individuals, businesses,and even governments often take direction based on these insights.
Data analysts might predict customer behavior, stock prices, or insurance claims by using basic linear regression. They might create homogeneous clusters using classification and regression trees (CART), or they might gain some impact insight by using graphs to visualize a financial technology company’s portfolio.
Until the final decades of the 20th century, human analysts were irreplaceable when it came to finding patterns in data. Today, they’re still essential when it comes to feeding the right kind of data to learning algorithms and inferring meaning from algorithmic output, but machines can and do perform much of the analytical work itself.
Why Machine Learning is Useful in Data Analysis
Machine learning constitutes model-building automation for data analysis. When we assign machines tasks like classification, clustering, and anomaly detection — tasks at the core of data analysis — we are employing machine learning.
We can design self-improving learning algorithms that take data as input and offer statistical inferences. Without relying on hard-coded programming, the algorithms make decisions whenever they detect a change in pattern.
Before we look at specific data analysis problems, let’s discuss some terminology used to categorize different types of machine-learning algorithms. First, we can think of most algorithms as either classification-based, where machines sort data into classes, or regression-based, where machines predict values.
Next, let’s distinguish between supervised and unsupervised algorithms. A supervised algorithm provides target values after sufficient training with data. In contrast, the information used to instruct an unsupervised machine-learning algorithm needs no output variable to guide the learning process.
For example, a supervised algorithm might estimate the value of a home after reviewing the price (the output variable) of similar homes, while an unsupervised algorithm might look for hidden patterns in on-the-market housing.
As popular as these machine-learning models are, we still need humans to derive the final implications of data analysis. Making sense of the results or deciding, say, how to clean the data remains up to us humans.
Machine-Learning Algorithms for Data Analysis
Now let’s look at six well-known machine-learning algorithms used in data analysis. In addition to reviewing their structure, we’ll go over some of their real-world applications.
At a local garage sale, you buy 70 monochromatic shirts, each of a different color. To avoid decision fatigue, you design an algorithm to help you color-code your closet. This algorithm uses photos of each shirt as input and, comparing the color of each shirt to the others, creates categories to account for every shirt. We call this clustering: an unsupervised learning algorithm that looks for patterns among input values and groups them accordingly. Here is a GeeksForGeeks article that provides visualizations of this machine-learning model.
You can think of a decision tree as an upside-down tree: you start at the “top” and move through a narrowing range of options. These learning algorithms take a single data set and progressively divide it into smaller groups by creating rules to differentiate the features it observes. Eventually, they create sets small enough to be described by a specific label. For example, they might take a general car data set (the root) and classify it down to a make and then to a model (the leaves).
As you might have gathered, decision trees are supervised learning algorithms ideal for resolving classification problems in data analysis, such as guessing a person’s blood type. Check out this in-depth Medium article that explains how decision trees work.
Imagine you’re en route to a camping trip with your buddies, but no one in the group remembered to check the weather. Noting that you always seem dressed appropriately for the weather, one of your buddies asks you to stand in as a meteorologist. Judging from the time of year and the current conditions, you guess that it’s going to be 72°F (22°C) tomorrow.
Now imagine that everyone in the group came with their own predictions for tomorrow’s weather: one person listened to the weatherman; another saw Doppler radar reports online; a third asked her parents; and you made your prediction based on current conditions.
Do you think you, the group’s appointed meteorologist, will have the most accurate prediction, or will the average of all four guesses be closer to the actual weather tomorrow? Ensemble learning dictates that, taken together, your predictions are likely to be distributed around the right answer. The average will likely be closer to the mark than your guess alone.
In technical terms, this machine-learning model frequently used in data analysis is known as the random forest approach: by training decision trees on random subsets of data points, and by adding some randomness into the training procedure itself, you build a forest of diverse trees that offer a more robust average than any individual tree. For a deeper dive, read this tutorial on implementing the random forest approach in Python.
Have you ever struggled to differentiate between two species — perhaps between alligators and crocodiles? After a long while, you manage to learn how: alligators have a U-shaped snout, while crocodiles’ mouths are slender and V-shaped; and crocodiles have a much toothier grin than alligators do. But on a trip to the Everglades, you come across a reptile that, perplexingly, has features of both — so how can you tell the difference? Support-vector machine (SVM) algorithms are here to help you out.
First, let’s draw a graph with one distinguishing feature (snout shape) as the x-axis and another (grin toothiness) as the y-axis. We’ll populate the graph with plenty of data points for both species, and then find possible planes (or, in this 2D case, lines) that separate the two classes.
Our objective is to find a single “hyperplane” that divides the data by maximizing the distance between the dividing plane and each class’s closest points — called support vectors. No more confusion between crocs and gators: once the SVM finds this hyperplane, you can easily classify the reptiles in your vacation photos by seeing which side each one lands on.
SVM algorithms can only be used on categorical data, but it’s not always possible to differentiate between classes with 2D graphs. To resolve this, you can use a kernel: an established pattern to map data to higher dimensions. By using a combination of kernels and tweaks to their parameters, you’ll be able to find a non-linear hyperplane and continue on your way distinguishing between reptiles. This YouTube video does a clear job of visualizing how kernels integrate with SVM.
If you’ve ever used a scatterplot to find a cause-and-effect relationship between two sets of data, then you’ve used linear regression. This is a modeling method ideal for forecasting and finding correlations between variables in data analysis.
For example, say you want to see if there’s a connection between fatigue and the number of hours someone works. You gather data from a set of people with a wide array of work schedules and plot your findings. Seeking a relationship between the independent variable (hours worked) and the dependent variable (fatigue), you notice that a straight line with a positive slope best models the correlation. You’ve just used linear regression! If you’re interested in a detailed understanding of linear regression for machine learning, check out this blog pos from Machine Learning Mastery.
While linear regression algorithms look for correlations between variables that are continuous by nature, logistic regression is ideal for classifying categorical data. Our alligator-versus-crocodile problem is, in fact, a logistic regression problem. Whereas the SVM model can work with non-linear kernels, logistic regression is limited to (and great for) linear classification. See this in-depth overview of logistic regression, especially good for lovers of calculus.
In this article, we looked at how machine learning can automate and scale data analysis. We summarized a few important machine-learning algorithms and saw their real-life applications.
While machine learning offers precision and scalability in data analysis, it’s important to remember that the real work of evaluating machine learning results still belongs to humans. If you think this could be a career path for you, check out Udacity’s Become a Machine Learning Enginee course.