Monday, January 10, 2011

Distributed Computing, Part 1


You've read or heard about them. Distributed computing frameworks or technologies like Hadoop or Google File System (GFS) and Google MapReduce. This series of posts aims to shed some light on when, why and how you would implement distributed computing techniques. If most of the systems you've encountered don't deal with storing and processing large amounts of data (terabytes and up) on a daily basis, or your company chooses to purchase 3rd party data storage and analysis services, then you may not have realized the critical advances that have been made this past decade in the distributed computing field. 

Since this is part 1 of the series, I'll touch on the three most important problems that distributed systems aim to solve. In subsequent posts I’ll dive deep into what sparked it all - Google File System and Google MapReduce. Then we’ll explore an open source distributed computing platform based off of GFS and MapReduce. This platform called Hadoop, was created by Doug Cutting and spearheaded by Yahoo!. Hadoop is used by many technology giants such as Facebook, Twitter, Amazon and IBM to name a few. We’ll eventually take Hadoop out of the lot for a test drive.

Let’s get our feet wet.

Problem 1 - High-end servers are expensive

No company likes shelling out thousands of dollars for high end servers. 10 years ago, some could say that servers like these were a necessary evil since data collection is the life blood of any organization and all data must be collected at all costs. Enterprises today continue to intake an enormous amount structured and unstructured data per day and it's growing exponentially.

How do you begin to store massive amounts of data? If we take a look back to the early 2000’s, companies were spending some serious coin to either build massive fault tolerant storage systems from scratch or enlisting companies like IBM, HP, SAS etc. to help manage and analyze their data. Either way, you needed high-end servers that come with all the bells and whistles. These servers come with a very high total cost of ownership. It costs a fortune to acquire, a fortune to maintain and like always hardware will fail often and need expensive replacement. 

On top of that, as data capacity limits red line more hardware needs to be acquired and added to the system. Sure, these systems are built for scalability but it's not easy explaining to your CIO that you need another rack full of expensive blades and switches. This type of horizontal scaling gets expensive very quick. Once you're invested in this type of technology stack you are pretty much locked in for the long haul and it's a vicious cycle. You can bank on your ROI margins dropping over time.

Problem 2 - It's tough to efficiently store massive amounts of data

I’m not talking about a few hundred thousand gigs. I'm talking about storing and processing terabytes, petabytes or greater of information on a daily basis. The amount of data that is passed over the internet alone is mind boggling. If you can imagine the birth of the internet as being the Big Bang then we haven’t even entered the cooling phase. That is how explosive and ever expanding the internet and modern systems continue to be. According to studies performed by to Cisco, as of March 2010 the amount of data that is exchanged across the internet per month is *drumroll*... 21 exabytes.

Wait, what? 21 exabytes per month?! If you were to store all of the words ever spoken by human beings since the dawn of time, it would take only 5 exabytes. Cisco supects that in 2013, that number of data exchanged per month on the internet will grow to 56 exabytes.

That’s a lot of data.

Problem 3 - It's tough to analyze and process massive amounts of data

Data comes in many formats such as text, audio, video and images that ranges from structured, semi-unstructured and unstructured. How would you begin perform analysis on this data to find patterns across different formats and types? For instance, how do you determine the patterns of a given user given his purchase history (structured data) and surfing behaviors (semi-structured log files)? It’s pretty tough... and costly... and computation could take a long time if resources aren't allocated in an efficient manner. Regardless, companies know that the answers to these questions provide insight that is paramount to their business. It can discover patterns that can be key to understanding markets, drive new strategies, improve business process, perform scientific calculations to enhance offerings etc.

That’s why companies that regularly deal with massive amounts of data, like Google, started looking for cheaper, more efficient solutions. Their systems needed to be scalable and elastic as ever.

Enter Google File System (GFS)

In 2003, Google published a paper detailing some of the science and design behind “The Google File System”. I’ll dive into the GFS in the next post but at a high level, this file system was developed to provide fault tolerant, reliable data storage using clusters of inexpensive commodity servers. Not the high-end servers that cost thousands of dollars. Just simple, affordable rack mountable Linux servers.

This design that Google developed directly solved problems #1 and #2.

Then Google MapReduce

The very next year after Google released it’s paper on GFS, they released another paper called “MapReduce: Simplified Data Processing on Large Clusters”. The purpose of their now patented MapReduce framework is to be able to process and generate large data sets across a cluster of commodity machines in a distrubuted fashion. In essence, it is always better to move computation to data rather than moving data to computation. This exploits the benefits of utilizing data locality in a clustered environment. It is modeled after the map/reduce strategies of functional programming wherein there are two phases, map then reduce.

During the map phase, computation is broken up into smaller chunks and sent  to all data nodes in the cluster. Since data is distributed across the cluster, computation will be executed in a parallel fashion since each data node has its own subset of data. When each node is done with its computation, the resulting data set then sent to the master node.

The master node then starts the reduce phase. All information sent back from the data nodes are combined to get the original “answer” or output. In typical cases, this data is saved for processing by another program.

Google MapReduce solved problem #3 by providing faster computation and analysis using distributed techniques in tandem with GFS.


- There are three main reasons why’d you want to look into a distributed computing framework.
  1. You need cheap, reliable, fault tolerant data storage that scales horizontally.
  2. The amount of data you collect is growing exponentially. (terabytes, petabytes, exabytes and beyond)
  3. You need to efficiently process and analyze high volumes of unstructured, semi-structured, and structured data.
- Although various distributed computing techniques have been around for decades they are often proprietary and expensive. 

- Google's white papers on GFS and MapReduce sparked an open source distributed computing revolution (like Hadoop).

In the next posts, we’ll explore GFS and MapReduce then introduce you to Hadoop.

I hope this helps shed a little light on distributed computing and hope that you'll to read the next articles in the series. Until then, happy coding!



jmurphy said...

Karmasphere Studio is a graphical environment to develop, debug, deploy and monitor MapReduce jobs. It cuts the time, cost and effort to get results from Hadoop. By making it easy to learn, implement and optimize MapReduce jobs, Karmasphere Studio increases productivity by shielding users from the intricacies of Hadoop.

Karmasphere Studio

Micky Dionisio said...

Thanks for the link jmurphy. I'm assuming the community edition is free while the premium edition is paid?

At first glance it looks like an entire IDE for MapReduce built into Eclipse?

I'll have a deeper look, thanks for this!