Even within the context of other hi-tech technologies, Hadoop went from obscurity to fame in a miraculously short about of time. It had to… the pressures driving the development of this technology were too great. If you are not familiar with Hadoop, let’s start by looking at the void it is trying to fill.
Companies, up until recently—say the last five to ten years or so—did not have the massive amounts of data to manage as they do today. Most companies only had to manage the data relating to running their business and managing their customers. Even those with millions of customers didn’t have trouble storing data using your everyday relational database like Microsoft SQL Server or Oracle.
But today, companies are realizing that with the growth of the Internet and with self-servicing (or SaaS) Web sites, there are now hundreds of millions of potential customers that are all voluntarily providing massive amounts of valuable business intelligence. Think of storing something as simple as a Web log that provides every click of every user on your site. How does a company store and manipulate this data when it is generating potentially trillions of rows of data every year?
Generally speaking, the essence of the problem Hadoop is attempting to solve is that data is coming in faster than hard drive capacities are growing. Today we have 4 TB drives available which can then be assembled on SAN or NAS devices to easily get 40 TB volumes or maybe even 400 TB volumes. But what if you needed a 4,000 TB or 4 Petabytes (PB) volume? The costs quickly get incredibly high for most companies to absorb…until now. Enter Hadoop.
One of the keys to Hadoop’s success is that it operates on everyday common hardware. A typical company has a backroom with hardware that has since past its prime. Using old and outdated computers, one can pack them full of relatively inexpensive hard drives (doesn’t need to be the same total capacity within each computer) and use them within a Hadoop cluster. Need to expand capacity? Add more computers or hard drives. Hadoop can leverage all the hard drives into one giant volume available for storing all types of data, from web logs to large video files. It is not uncommon for Hadoop to be used to store rows of data that are over 1GB per row!
The file system that Hadoop uses is called the Hadoop Distributed File System or HDFS. It is a highly fault tolerant file system that focuses on high availability and fast readabilities. It is best used for data that is written once and read often. It leverages all the hard drives in the systems when writing data because Hadoop knows that bottlenecks stem from writing and reading to a single hard drive. The more hard drives are used simultaneously during the writing and reading of data, the faster the system operates as a whole.
The HDFS file system operates in small file blocks which are spread across all hard drives available within a cluster. The block size is configurable and optimized to the data being stored. It also replicates the blocks over multiple drives across multiple computers and even across multiple network subnets. This allows for hard drives or computers to fail (and they will) and not disrupt the system. It also allows Hadoop to be strategic in which blocks it accesses during a read. Hadoop will choose to read certain replicated blocks when it feels it can retrieve the data faster using one computer over another. Hadoop analyses which computers and hard drives are currently being utilized, along with network bandwidth, to strategically pick the next hard drive to read a block. This produces a system that is very quick to respond to requests.
Despite the relatively odd name, MapReduce is the cornerstone of Hadoop’s data retrieval system. It is an abstracted programming layer on top of HDFS and is responsible for simplifying how data is read back to the user. It has a purpose similar to SQL in that it allows programmers to focus on building intelligent queries and not get involved in the underlying plumbing responsible for implementing or optimizing the queries. The “Map” part of the name refers to the task of building a map on the best way to sort and filter the information requested and then to return it as a pseudo result set. The “Reduce” task summarizes the data like the counting and summing of certain columns.
These two tasks are both analyzed by the Hadoop engine and then broken into many pieces or nodes (a divide and conquer model) which are all processed in parallel by individual workers. This result is the ability to process Petabytes of data in a matter of hours.
MapReduce is an open source project originally developed by Google and has been now ported over to many programming languages. You can find out more on MapReduce by visiting http://mapreduce.org.
In my next post, I’ll take a look at some of the other popular components around Hadoop, including advanced analytical tools like Hive and Pig. In the meantime, if you’d like to learn more about Hadoop, check out our new course.
Apache Hadoop, Hadoop, Apache, the Apache feather logo, and the Apache Hadoop project logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and other countries.
About the Author
Martin Schaeferle is the Vice President of Technology for LearnNowOnline. Martin joined the company in 1994 and started teaching IT professionals nationwide to develop applications using Visual Studio and Microsoft SQL Server. He has been a featured speaker at various conferences including Microsoft Tech-Ed, DevConnections and the Microsoft NCD Channel Summit. Today, he is responsible for all product and software development as well as managing the company’s IT infrastructure. Martin enjoys staying on the cutting edge of technology and guiding the company to produce the best learning content with the best user experience in the industry. In his spare time, Martin enjoys golf, fishing, and being with his wife and three teenage children.