Tag Archives: Hadoop

Hadoop…Pigs, Hives, and Zookeepers, Oh My!

If there is one aspect of Hadoop that I find particularly entertaining, it is the naming of the various tools that surround Hadoop. In my 7/3 post, I introduced Hadoop, the reasons for its growing popularity, and the core framework features. In this post, I will introduce you to the many different tools, and their clever names, that augment Hadoop and make it more powerful. And yes, the names in the title of this blog are actual tools.

Pig
The power behind Pig is that it provides developers with a simple scripting language that generates rather complex MapReduce jobs under the covers. Pig was originally developed by a team at Yahoo and named for its ability to devour any amount and any kind of data. The scripting language (yes, you guessed it, called Pig Latin) provides the developer with a set of high-level commands to do all kinds of data manipulation like joins, filters, and sorts.

Unlike SQL, Pig is a more procedural, script-oriented query language; SQL, by design, is more declarative. The benefit of a procedural design is that you have more control over the processing of your data. For example, you can inject user code at any point within the process to control the flow.
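To give you a feel for Pig Latin, here is a minimal sketch of a Java program that runs a few Pig Latin statements through Pig's PigServer API. The input file, field names, and output directory are all hypothetical, and a real job would typically run in MapReduce mode against a cluster rather than in local mode.

```java
import java.io.IOException;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws IOException {
        // Run Pig in local mode for this sketch; "mapreduce" mode would target a real cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Hypothetical input file and fields: load click records, keep recent ones,
        // group by page, and count the hits per page.
        pig.registerQuery("clicks = LOAD 'clicks.tsv' AS (user:chararray, page:chararray, year:int);");
        pig.registerQuery("recent = FILTER clicks BY year >= 2013;");
        pig.registerQuery("byPage = GROUP recent BY page;");
        pig.registerQuery("counts = FOREACH byPage GENERATE group AS page, COUNT(recent) AS hits;");

        // Writing the alias out triggers the underlying MapReduce work.
        pig.store("counts", "page_counts");
    }
}
```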

Hive
To complement Pig, Hive provides developers a declarative query language similar to SQL. For many developers who are familiar with building SQL statements for relational databases like SQL Server and Oracle, Hive will be significantly easier to master. Originally developed by a team at Facebook, it has quickly become one of the most popular methods of retrieving data from Hadoop.

Hive uses a SQL-like language called HiveQL, or HQL. Although it doesn't strictly conform to the SQL-92 standard, it does provide many of the same commands; the key limitation relative to the standard is that there is no transactional support. Hive also offers ODBC and JDBC drivers, so developers can work with it from many different programming languages like Java, C#, PHP, Python, and Ruby.
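Because Hive ships a JDBC driver, querying it from Java feels much like querying SQL Server or Oracle. Here is a minimal sketch; the server address, credentials, table, and columns are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExample {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer2 JDBC driver; the host, port, and database are hypothetical.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hadoop-master:10000/default", "demo", "");

        // HiveQL looks and feels like SQL: a filter plus a GROUP BY aggregation.
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits FROM clicks " +
                "WHERE year >= 2013 GROUP BY page");

        while (rs.next()) {
            System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
        }
        conn.close();
    }
}
```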

Oozie
To tie these query languages together for complex tasks requires an advanced workflow engine. Enter Oozie, a workflow scheduler for Hadoop that allows multiple queries from multiple query languages to be assembled into a convenient, automated, step-by-step process. With Oozie, you have total control over the flow to perform branching, decision-making, joining, and more. It can be configured to run at specific times or intervals, and it reports logging and status information back to the system. Oozie workflows can also accept user input parameters for additional control, which allows developers to tweak the flow based on changing states or conditions of the system.
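As a rough sketch of what driving Oozie from code might look like, the snippet below uses Oozie's Java client to submit a workflow that is assumed to already be defined in a workflow.xml stored in HDFS. The server URL, HDFS paths, and parameter names are hypothetical.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieExample {
    public static void main(String[] args) throws Exception {
        // The Oozie server URL is hypothetical.
        OozieClient oozie = new OozieClient("http://hadoop-master:11000/oozie");

        // Point at a workflow definition (workflow.xml) already stored in HDFS,
        // and pass in user parameters the workflow can reference.
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://hadoop-master:8020/user/demo/etl-workflow");
        conf.setProperty("inputDir", "/data/raw/2013-07-03");
        conf.setProperty("outputDir", "/data/processed/2013-07-03");

        // Submit and start the job, then check its status.
        String jobId = oozie.run(conf);
        WorkflowJob job = oozie.getJobInfo(jobId);
        System.out.println("Workflow " + jobId + " is " + job.getStatus());
    }
}
```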

Sqoop
When deploying a Hadoop solution, one of the first steps is populating the system with data. Although data can come from many different sources, the most likely is a relational database like Oracle, MySQL, or SQL Server. For moving data to and from relational databases, Apache's Sqoop is a great tool to use. The name is derived from combining "SQL" and "Hadoop," signifying the connection between SQL and Hadoop data.

Part of Sqoop's power comes from the intelligence built in to optimize the transfer of data on both the SQL side and the Hadoop side. It can query the SQL table's schema to determine the structure of the incoming data, translate it into a set of intelligent data classes, and configure MapReduce to import the data efficiently into a Hadoop data store like HBase. Sqoop also gives the developer more granular control over the transfer by allowing subsets of the data to be imported; for example, Sqoop can be told to import only specific columns within a table instead of the whole table.
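Sqoop is normally driven from the command line, so the sketch below simply shells out to a hypothetical "sqoop import" from Java to show the kind of options involved, including importing only a subset of columns. The connection string, table, and column names are made up for illustration.

```java
import java.io.IOException;

public class SqoopImportExample {
    public static void main(String[] args) throws IOException, InterruptedException {
        // The connection string, table, and columns are hypothetical; the flags are
        // standard "sqoop import" options for pulling a subset of a relational table.
        ProcessBuilder sqoop = new ProcessBuilder(
                "sqoop", "import",
                "--connect", "jdbc:mysql://dbserver/sales",
                "--username", "etl_user",
                "--table", "customers",
                "--columns", "id,name,email",       // only these columns, not the whole table
                "--target-dir", "/data/customers",  // destination directory in HDFS
                "--num-mappers", "4");              // parallel import tasks

        sqoop.inheritIO();                          // show Sqoop's own progress output
        int exitCode = sqoop.start().waitFor();
        System.out.println("Sqoop import finished with exit code " + exitCode);
    }
}
```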

Sqoop was even chosen by Microsoft as their preferred tool for moving SQL Server data into Hadoop.

Flume
Another popular data source for Hadoop, outside of relational data, is log or streaming data. Web sites, in particular, have a propensity to generate massive amounts of log data, and more and more companies are discovering how valuable this data is for understanding their audience and its buying habits. So another challenge for the Hadoop community to solve was how to move log-based data into Hadoop. Apache tackled that challenge and released Flume (yes, think of a log flume).

The flume metaphor reflects the fact that this tool deals with streaming data, like water rushing down a river. Unlike Sqoop, which typically moves static data, Flume must manage constant changes in data flow and be able to adjust to handle very busy times; for example, Web data may arrive at an extremely high rate during a promotion. Flume is designed to scale itself to handle these changes in rate. It can also receive data from multiple streaming sources, even beyond Web logs, and does so with guaranteed delivery.
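To illustrate the streaming side, here is a minimal sketch that uses the Flume SDK's RPC client to push a single log event to a running Flume agent. The agent host and port are hypothetical, and the agent's sources, channels, and sinks would be configured separately.

```java
import java.nio.charset.StandardCharsets;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeExample {
    public static void main(String[] args) throws EventDeliveryException {
        // Connect to a Flume agent's Avro source; host and port are hypothetical.
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-agent", 41414);

        // Send one log line as an event; in practice this would sit inside the
        // application's logging path and stream events continuously.
        Event event = EventBuilder.withBody(
                "GET /products/42 200 12ms", StandardCharsets.UTF_8);
        client.append(event);   // delivery is confirmed or an exception is thrown

        client.close();
    }
}
```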

Zookeeper
There are many more tools I could cover, but I'm going to wrap it up with one of my favorite tool names: Zookeeper. This tool comes into play when dealing with very large Hadoop installations. At some point in the growth of the system, as more and more computers are added to the cluster, there will be an increasing need to manage and optimize the various nodes involved.

Zookeeper collects information about all the nodes and organizes them into a hierarchy, similar to how your operating system creates a hierarchy of all the files on your hard drive to make them easier to manage. The Zookeeper service runs in memory, making it extremely fast, although it is limited by available RAM, which can affect its scalability. It replicates itself across many of the nodes in the Hadoop system so that it maintains high availability and does not become a single point of failure.

Zookeeper becomes the main hub that client machines connect to in order to obtain health information about the system as a whole. It constantly monitors all the nodes and logs events as they happen. With Zookeeper's organized map of the system, what could be a cumbersome task of checking on and maintaining each node individually becomes a far more manageable experience.
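As a small taste of the client side, the sketch below uses Zookeeper's Java API to register a worker in that hierarchy and list the nodes that are currently alive. The connection string and paths are hypothetical, and the /workers parent node is assumed to already exist.

```java
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZookeeperExample {
    public static void main(String[] args) throws Exception {
        // Connect to the Zookeeper ensemble; the host list is hypothetical.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 3000, event -> {});

        // Register this worker as an ephemeral node: it disappears automatically
        // if the machine dies, which is how clients learn about failures.
        // The /workers parent node is assumed to exist already.
        String path = zk.create("/workers/worker-", "host=node42".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("Registered as " + path);

        // Ask Zookeeper which workers are currently alive.
        List<String> workers = zk.getChildren("/workers", false);
        System.out.println("Live workers: " + workers);

        zk.close();
    }
}
```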

Summary
I hope this gives you a taste of the many support tools that are available for Hadoop, as well as illustrating the community's commitment to this project. As technology goes, Hadoop is in the very early stages of its lifespan, and its components and tools are constantly changing. For more information about these and other tools, be sure to check out our new Hadoop course.

About the Author

Martin Schaeferle is the Vice President of Technology for LearnNowOnline. Martin joined the company in 1994 and started teaching IT professionals nationwide to develop applications using Visual Studio and Microsoft SQL Server. He has been a featured speaker at various conferences including Microsoft Tech-Ed, DevConnections and the Microsoft NCD Channel Summit. Today, he is responsible for all product and software development as well as managing the company’s IT infrastructure. Martin enjoys staying on the cutting edge of technology and guiding the company to produce the best learning content with the best user experience in the industry. In his spare time, Martin enjoys golf, fishing, and being with his wife and three teenage children.

The Power of Hadoop

Even by high-tech standards, Hadoop went from obscurity to fame in a remarkably short amount of time. It had to… the pressures driving the development of this technology were too great. If you are not familiar with Hadoop, let’s start by looking at the void it is trying to fill.

Until recently, say the last five to ten years or so, companies did not have the massive amounts of data to manage that they do today. Most companies only had to manage the data relating to running their business and managing their customers. Even those with millions of customers didn’t have trouble storing data using an everyday relational database like Microsoft SQL Server or Oracle.

But today, companies are realizing that with the growth of the Internet and of self-service, software-as-a-service (SaaS) Web sites, there are now hundreds of millions of potential customers, all voluntarily providing massive amounts of valuable business intelligence. Think of storing something as simple as a Web log that records every click of every user on your site. How does a company store and manipulate this data when it is generating potentially trillions of rows of data every year?

Generally speaking, the essence of the problem Hadoop is attempting to solve is that data is coming in faster than hard drive capacities are growing. Today we have 4 TB drives available, which can be assembled in SAN or NAS devices to easily reach 40 TB volumes, or maybe even 400 TB volumes. But what if you needed a 4,000 TB, or 4 petabyte (PB), volume? The costs quickly become too high for most companies to absorb…until now. Enter Hadoop.

Hadoop Architecture
One of the keys to Hadoop’s success is that it operates on everyday, commodity hardware. A typical company has a back room with hardware that has long since passed its prime. Those old and outdated computers can be packed full of relatively inexpensive hard drives (the total capacity doesn’t need to be the same in each computer) and used within a Hadoop cluster. Need to expand capacity? Add more computers or hard drives. Hadoop can leverage all the hard drives into one giant volume available for storing all types of data, from web logs to large video files. It is not uncommon for Hadoop to be used to store rows of data that are over 1 GB per row!

The file system that Hadoop uses is called the Hadoop Distributed File System, or HDFS. It is a highly fault-tolerant file system that focuses on high availability and fast reads, and it is best used for data that is written once and read often. It leverages all the hard drives in the system when writing data, because Hadoop knows that bottlenecks stem from writing to and reading from a single hard drive. The more hard drives used simultaneously during the writing and reading of data, the faster the system operates as a whole.

HDFS stores data in small file blocks which are spread across all the hard drives available within a cluster. The block size is configurable and can be optimized for the data being stored. HDFS also replicates the blocks over multiple drives, across multiple computers, and even across multiple network subnets. This allows hard drives or computers to fail (and they will) without disrupting the system. It also allows Hadoop to be strategic about which blocks it accesses during a read: it will choose a particular replica of a block when it determines it can retrieve the data faster from one computer than another. Hadoop analyzes which computers and hard drives are currently being utilized, along with network bandwidth, to strategically pick the next hard drive to read a block from. The result is a system that is very quick to respond to requests.
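To make the write-once, read-often pattern concrete, here is a minimal sketch using Hadoop's FileSystem Java API. The NameNode address and file paths are hypothetical; block placement and replication happen behind the scenes exactly as described above.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // The NameNode URI is hypothetical.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://hadoop-master:8020"), conf);

        // Write once: create a file; HDFS splits it into blocks and replicates
        // them across drives, machines, and subnets behind the scenes.
        Path logFile = new Path("/data/logs/clicks-2013-07-03.log");
        FSDataOutputStream out = fs.create(logFile);
        out.writeBytes("user1\t/products/42\t2013\n");
        out.close();

        // Read often: HDFS serves each block from whichever replica it deems fastest.
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(logFile)));
        System.out.println(in.readLine());
        in.close();

        fs.close();
    }
}
```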

MapReduce
Despite the relatively odd name, MapReduce is the cornerstone of Hadoop’s data retrieval system. It is an abstracted programming layer on top of HDFS, responsible for simplifying how data is read back to the user. It serves a purpose similar to SQL in that it allows programmers to focus on building intelligent queries and not get involved in the underlying plumbing responsible for implementing or optimizing the queries. The “Map” part of the name refers to the task that sorts and filters the requested information and returns it as a set of intermediate results; the “Reduce” task then summarizes that data, for example by counting or summing certain columns.

These two tasks are analyzed by the Hadoop engine and broken into many pieces (a divide-and-conquer model), which are processed in parallel by individual worker nodes. The result is the ability to process petabytes of data in a matter of hours.
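To make the Map and Reduce tasks concrete, here is a minimal sketch of a Hadoop MapReduce job in Java that counts hits per page in a web log. The tab-delimited input layout and the paths passed on the command line are assumptions for illustration.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageHitCount {

    // Map: for each log line (assumed tab-delimited: user, page, year), emit (page, 1).
    public static class HitMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length >= 2) {
                context.write(new Text(fields[1]), ONE);
            }
        }
    }

    // Reduce: sum the counts emitted for each page.
    public static class HitReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text page, Iterable<LongWritable> counts, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable count : counts) {
                total += count.get();
            }
            context.write(page, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "page hit count");
        job.setJarByClass(PageHitCount.class);
        job.setMapperClass(HitMapper.class);
        job.setReducerClass(HitReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /data/logs
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /data/page-counts
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```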

MapReduce is a programming model originally developed at Google and since implemented in many languages and frameworks, including Hadoop’s own open source implementation. You can find out more about MapReduce by visiting http://mapreduce.org.

In my next post, I’ll take a look at some of the other popular components around Hadoop, including advanced analytical tools like Hive and Pig. In the meantime, if you’d like to learn more about Hadoop, check out our new course.

Apache Hadoop, Hadoop, Apache, the Apache feather logo, and the Apache Hadoop project logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and other countries.

Hadoop and Power Pivot courses in the works

Doug Ortiz, Power Pivot instructor

We here in the LearnNowOnline production department have been busy with back-to-back shoots for two popular technologies that help you manage and analyze your data.

Barry Solomon was here covering Hadoop. Hadoop is an open source framework used to store and manage large data files, and by large I mean data files that get into terabytes in size. With so many new ways of collecting data these days, it becomes more and more important that companies have a way to manage these large amounts of data. Hadoop works from a VM, which is then used to connect to the data files. The course we’re working on will cover the “why” and “how” for Hadoop.

Last week we began production on our Power Pivot fundamentals courses with Doug Ortiz. Once you have all that data mentioned above, how do you use it effectively? Power Pivot helps a company analyze that data. It is an add-in for Excel and can also be run in a SharePoint environment. Power Pivot gives users the ability to import millions of rows of data from several database sources. Once the data is in Excel, a company can use Excel’s tools to sort the data and take advantage of analytical capabilities such as Data Analysis Expressions (DAX).

Our new courses for Hadoop and Power Pivot are scheduled to be released in July. Watch our web site for more information.

About the Author

Brian Ewoldt is the Project Manager for LearnNowOnline. Brian joined the team in 2008 after 13 years of working in the computer gaming industry as a producer/project manager. Brian is responsible for all production of courses published by LearnNowOnline. In his spare time, Brian enjoys being with his family, watching many forms of racing, racing online, and racing Go-Karts.