Over the past decade, big data has had a tremendous impact on business — yet we’re still just scratching the surface. To move forward, we need the right tools in place to harness big data’s potential. In this article, we’ll present Hadoop as one solution to the problem of storing and analyzing big data.
What Is Hadoop?
Hadoop is an open-source software framework for storing, processing and accessing large datasets. By “large,” we mean datasets that cannot be conveniently stored on a single computer and instead need to be spread across multiple machines.
If you’ve written a SQL query before, you know that wrangling data is no easy task even when working on a single database. Working with multiple databases stored across many machines quickly becomes difficult, not to mention the amount of effort that’s required to build and maintain a distributed system infrastructure.
Hadoop provides a layer of abstraction that takes care of all these things, allowing programmers to focus on the data rather than worry about the low-level details of implementation. Today, Hadoop powers thousands of companies, including Twitter. A Twitter blog post reports that the social media giant was already using Hadoop in 2015 to store a mind-boggling 300-plus petabytes of data across tens of thousands of servers!
But before we get to explaining the technology that powers Hadoop, we’ll briefly cover its history and conception within a pioneering internet search engine project, as well as explain why such a tool was needed in the first place.
A Short History of Hadoop
As the internet grew rapidly in the 1990s, there was an increasing need to access its content in an effective and structured way. In the pre-Web internet era, searching the internet was not a user-friendly experience. Various attempts were made to make searches simpler and more efficient, including “The Gopher Project,” which envisioned the internet as a centralized directory unifying content that was otherwise inaccessible unless you knew its exact location. This was, in a way, a brochure of the internet: website owners would contact the brochure’s creators if they wanted their website added to the list.
Although this was a big leap towards making the internet more searchable, a single system of human-curated resources had obvious limitations. When the total number of websites quickly grew into the millions, it became obvious that this was no task for humans. There was a clear opportunity for automation, and one of the first internet search projects, Apache Nutch, did just that.
The idea of Apache Nutch was to create a web search engine that consisted of two components. The first component was a crawler that combed through the internet looking for web pages and indexing them, while the second was a unit that processed the resulting data and returned the websites most relevant to the search query. In this way, the novel tool increased the breadth of internet search beyond human-curated lists.
However, a problem soon emerged: Not only was a single computer unable to store and process such large amounts of data quickly, but it also became evident that the Apache Nutch system would not scale as the internet continued expanding. To circumvent this problem, the team came up with a way of distributing computation across multiple computers, resulting in a significantly faster engine.
The Nutch project was later split in two. The web crawler kept the name “Nutch,” while the portion related to distributed computing became what is now Hadoop, named after a stuffed elephant belonging to the son of co-creator Doug Cutting.
The Need For Distributed Computing
We’ve covered why distributing computation across multiple computers shortens computation times, but it can also reduce storage access times.
Storage devices have seen quite the evolution, from the floppy disks of the 1990s with an average capacity of 1.4 MB, to similarly sized modern hard drives capable of storing terabytes. However, although storage capacity has greatly increased, it has not been matched by faster read and write speeds — these have seen only moderate increases. So, while we now have the benefits of disks that can store huge amounts of data, access speed is a bottleneck in how useful this data may be.
Distributed computing is a very efficient solution to this problem. Say we wanted to read a 1 GB file from a single disk. If that disk were to have a read speed of 100 MB/s, it would take us 10 seconds to read it. By contrast, if in a distributed scenario this file were saved across 10 smaller disks, it would take only one second to read the 10 chunks — and then we’d only need a little extra time to combine them.
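The arithmetic above can be sketched as a quick calculation, using the idealized numbers from the example (a uniform 100 MB/s read speed and an even 10-way split; real disks and network overhead would change the picture):

```python
# Idealized read-time comparison: one 1 GB file on a single disk
# vs. the same file split evenly across 10 disks read in parallel.

FILE_SIZE_MB = 1000      # ~1 GB
READ_SPEED_MB_S = 100    # per-disk read speed
NUM_DISKS = 10

# Single disk: the whole file is read sequentially.
serial_seconds = FILE_SIZE_MB / READ_SPEED_MB_S

# Ten disks: each holds one chunk, and all chunks are read at once,
# so the elapsed time is that of reading a single chunk.
chunk_mb = FILE_SIZE_MB / NUM_DISKS
parallel_seconds = chunk_mb / READ_SPEED_MB_S

print(serial_seconds)    # 10.0
print(parallel_seconds)  # 1.0
```

Note that this ignores the “little extra time” needed to combine the chunks, which in practice depends on network bandwidth between the machines.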
How Does Hadoop Work?
Computer scientists call problems such as the one above “embarrassingly parallel”: problems that can be split into multiple independent, essentially identical subtasks, such as reading chunks of data from separate disks. Because the subtasks are handled in parallel by multiple computers rather than sequentially by one, computation is much faster.
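A minimal single-machine sketch of this idea, using Python’s standard-library multiprocessing module (the per-chunk summing here is a stand-in for whatever independent work each node would do, such as reading its chunk from disk):

```python
from multiprocessing import Pool

def process_chunk(chunk):
    """Stand-in for one independent subtask, e.g. reading and summing a chunk."""
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    # Split the work into 10 identical, independent subtasks.
    chunks = [data[i::10] for i in range(10)]
    # Handle the subtasks in parallel across worker processes.
    with Pool(processes=10) as pool:
        partial_sums = pool.map(process_chunk, chunks)
    # Combine the partial results into the final answer.
    total = sum(partial_sums)
    print(total == sum(data))  # True
```

Because each subtask touches only its own chunk, no coordination is needed until the final combining step — exactly the property that makes such problems easy to distribute.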
This, in a nutshell, is Hadoop’s core approach: It splits work across multiple nodes and then coordinates the gathering and aggregation of results. This is achieved through Hadoop’s architecture, which consists of three main layers:
- A storage layer that uses the Hadoop Distributed File System (HDFS) to store and retrieve data;
- A processing layer that uses the MapReduce model for parallel computation; and,
- A resource management layer called Yet Another Resource Negotiator (YARN) that schedules tasks and manages clusters.
In the following section, we’ll cover MapReduce, the powering engine behind Hadoop.
What Is Hadoop’s MapReduce?
MapReduce is a programming model for parallel data processing. Hadoop is one of the most popular implementations of MapReduce, but there are many different implementations across various languages. MapReduce works by separating computation into two steps: the map step and the reduce step. The map step breaks down (or maps) problems into subproblems, while the reduce step combines (or reduces) the solutions to subproblems to form a solution to the main problem.
MapReduce is generally great for tasks that involve huge amounts of data but relatively simple computation — in other words, tasks that are parallelizable. One such task is web search, the original task for which Google created MapReduce. A web search engine takes in a user’s search query and matches it against many documents in parallel before returning a ranking of the best-matching documents to the user.
Those familiar with functional programming will immediately notice the similarities between functional programming’s map and reduce functions. However, MapReduce’s key contribution is not in combining these two functions but rather in orchestrating them in a way that’s scalable and parallelizable.
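That similarity is easy to see in Python, where the built-in map and functools.reduce express the same two-step shape. This sketch runs on one machine, with none of the distribution and orchestration that MapReduce itself adds:

```python
from functools import reduce

documents = [
    "hadoop stores big data",
    "mapreduce processes big data",
]

# Map step: transform each document into an intermediate result
# (here, its word count).
word_counts = map(lambda doc: len(doc.split()), documents)

# Reduce step: combine the intermediate results into one answer.
total_words = reduce(lambda a, b: a + b, word_counts)

print(total_words)  # 8
```

In Hadoop, the same map and reduce functions would run on many machines at once, with the framework handling data placement, scheduling and fault tolerance between the two steps.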
Should You Use Hadoop?
In the following sections, we’ll present arguments for and against using Hadoop.
Big or Small Data?
Hadoop is used for processing big data — this means at least terabytes of data in diverse formats, something that a single computer cannot process. The effort of maintaining a Hadoop cluster for small datasets is not worth it. Not only is it unnecessarily expensive and complicated, but there are many faster and more efficient (and free!) methods for analyzing small data. A general rule of thumb is that if your data fits into memory, you should analyze it locally.
Hadoop owes its quick processing speed to its distributed computing nature — data is stored, accessed and processed by leveraging multiple CPUs that work in parallel. Should your needs change, you can add or remove nodes at any point. Hadoop is scalable both horizontally (by adding more nodes) and vertically (by making existing nodes more powerful).
All this is to say that Hadoop is designed for the heavy parallelized processing of big data. If your task does not require you to scale your cluster beyond one node, you’re not realizing Hadoop’s benefits, and tradeoffs such as administrative overhead may not be worth it. In such cases, you might be better off replacing Hadoop with your laptop — you might be surprised how far Excel, Python and the command line can get you.
Real-Time or Batch Processing?
Hadoop is a viable solution for many big data tasks, but it’s not a panacea. Although Hadoop is great when you need to quickly process large amounts of data, it’s not fast enough for those who need real-time results. Hadoop processes data in batches rather than in streams, so if you need real-time data analysis, tools like Kinesis or Spark might be a better choice.
Because it processes data in batches, Hadoop is also not a great fit if you frequently need to perform ad hoc queries. In such cases, a traditional database or a data warehouse might be a better solution. On the other hand, Hadoop works well for populating and updating those databases and warehouses.
Is Hadoop Appropriate for Your Task?
Hadoop’s main purpose is to perform extract, transform and load (ETL) jobs. You may use Hadoop for all kinds of processing tasks such as cleaning, preprocessing and transforming your raw data into a useful format, but Hadoop does not provide analyses — this is still up to the analysts.
Another important consideration is determining whether MapReduce might be the optimal solution to your problem. For example, MapReduce would work well for a task like counting word frequencies in a huge corpus. To perform this task, MapReduce would split the corpus into multiple segments, count the word occurrence for each segment and combine the solutions by summing word counts. However, if you wanted to run an algorithm that iterates over your corpus multiple times (think clustering), MapReduce would not be the most efficient solution, because such a task is better performed iteratively and its success depends on having access to the whole dataset.
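The word-frequency task described above can be sketched on a single machine as follows — a hypothetical stand-in for what Hadoop would do across many nodes, with collections.Counter doing the combining:

```python
from collections import Counter
from functools import reduce

corpus = ("the quick brown fox jumps over the lazy dog "
          "the dog barks and the fox runs").split()

# Split the corpus into segments (Hadoop would distribute these across nodes).
num_segments = 4
segments = [corpus[i::num_segments] for i in range(num_segments)]

# Map: count word occurrences within each segment independently.
partial_counts = [Counter(segment) for segment in segments]

# Reduce: combine the partial solutions by summing the word counts.
total_counts = reduce(lambda a, b: a + b, partial_counts)

print(total_counts["the"])  # 4
```

Each segment is counted with no knowledge of the others, which is why the task distributes so cleanly — unlike an iterative algorithm such as clustering, where every pass needs a view of the whole dataset.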
Set Up Your Own Hadoop Cluster
In this article, we’ve covered what Hadoop is, its use cases and the general technology behind it. Enroll in our Introduction to Programming Nanodegree to acquire the skills required to manage your own Hadoop cluster.