Hey my fellow tech geeks. Just like you, I love tech. I live and breathe it. It consumes my 9-5 and 5-9. But if your life and job are as crazy as mine, you know that it's tough finding time to catch up on all your tech feeds. If you fall behind for even a couple of days you start to feel guilty and "disconnected". So I sought out a way to try and change that.
My commute to the Yahoo! campus in Burbank, CA takes about 45 minutes (that's on a good day!). I often load up on audiobooks, but sometimes I have A.D.D. and can't focus on listening to one topic at length. Music and radio get old pretty quickly too. While I do appreciate the way that Ryan Seacrest can talk about nothing and make it sound interesting, I can't handle it for more than 15 minutes.
One day as I was driving home from work it hit me that my time on the road could be better spent catching up on my tech feeds. So in classic programmer spirit, I set out to create a mobile app that would read my favorite tech blogs to me while I'm on the road.
So came the birth of Talk Techy to Me. This version is free of charge while I work on a paid version that is fully customizable. At the time of this posting it supports six popular tech feeds:
1) TechCrunch
2) Official Google Blog
3) High Scalability (for my programming peeps!)
4) Mashable
5) ReadWriteWeb
6) VentureBeat
Give it a whirl, spread the word, and send me some feedback. Remember, it's in beta, so when you find bugs please contact me and I'll get them sorted out ASAP. And hey, if you like the app and find it useful, a good review wouldn't hurt ;)
I guess this constitutes my foray into mobile development...
UPDATE - By popular demand, Engadget has been added to the list of free feeds!
-Mick
Talk Techy to Me
Talk Techy to Me in the Android market
... thoughts on software design, development and everything in between.
Wednesday, March 16, 2011
Monday, January 10, 2011
Distributed Computing, Part 1
Abstract
You've read or heard about them. Distributed computing frameworks or technologies like Hadoop or Google File System (GFS) and Google MapReduce. This series of posts aims to shed some light on when, why and how you would implement distributed computing techniques. If most of the systems you've encountered don't deal with storing and processing large amounts of data (terabytes and up) on a daily basis, or your company chooses to purchase 3rd party data storage and analysis services, then you may not have realized the critical advances that have been made this past decade in the distributed computing field.
Since this is part 1 of the series, I'll touch on the three most important problems that distributed systems aim to solve. In subsequent posts I’ll dive deep into what sparked it all - Google File System and Google MapReduce. Then we’ll explore an open source distributed computing platform based on GFS and MapReduce. This platform, called Hadoop, was created by Doug Cutting and spearheaded by Yahoo! It is used by many technology giants such as Facebook, Twitter, Amazon and IBM, to name a few. We’ll eventually take Hadoop out of the lot for a test drive.
Let’s get our feet wet.
Problem 1 - High-end servers are expensive
No company likes shelling out thousands of dollars for high-end servers. Ten years ago, some could say that servers like these were a necessary evil, since data collection is the lifeblood of any organization and all data must be collected at all costs. Enterprises today continue to take in an enormous amount of structured and unstructured data per day, and it's growing exponentially.
How do you begin to store massive amounts of data? If we take a look back to the early 2000s, companies were spending some serious coin either building massive fault-tolerant storage systems from scratch or enlisting companies like IBM, HP, SAS etc. to help manage and analyze their data. Either way, you needed high-end servers that come with all the bells and whistles. These servers come with a very high total cost of ownership. They cost a fortune to acquire and a fortune to maintain and, as always, hardware will fail often and need expensive replacement.
On top of that, as data capacity limits redline, more hardware needs to be acquired and added to the system. Sure, these systems are built for scalability, but it's not easy explaining to your CIO that you need another rack full of expensive blades and switches. This type of horizontal scaling gets expensive very quickly. Once you're invested in this type of technology stack you are pretty much locked in for the long haul, and it's a vicious cycle. You can bank on your ROI margins dropping over time.
Problem 2 - It's tough to efficiently store massive amounts of data
I’m not talking about a few hundred gigs. I'm talking about storing and processing terabytes, petabytes or more of information on a daily basis. The amount of data that is passed over the internet alone is mind boggling. If you can imagine the birth of the internet as being the Big Bang, then we haven’t even entered the cooling phase. That is how explosive and ever expanding the internet and modern systems continue to be. According to studies performed by Cisco, as of March 2010 the amount of data that is exchanged across the internet per month is *drumroll*... 21 exabytes.
Wait, what? 21 exabytes per month?! By one often-quoted estimate, storing all of the words ever spoken by human beings since the dawn of time would take only 5 exabytes. Cisco expects that in 2013, the amount of data exchanged per month on the internet will grow to 56 exabytes.
That’s a lot of data.
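To put those Cisco figures in perspective, here is a quick back-of-the-envelope sketch. It assumes decimal units (1 exabyte = 10^18 bytes), a 30-day month, and a three-year gap between the 2010 and 2013 figures; those assumptions and the function names are mine, not Cisco's.

```python
# Back-of-the-envelope arithmetic on the monthly traffic figures quoted above.
# Assumptions (not from the Cisco study): decimal units, a 30-day month,
# and exactly three years between the two data points.

EXABYTE = 10**18  # bytes
SECONDS_PER_MONTH = 30 * 24 * 60 * 60


def average_rate_tbps(exabytes_per_month):
    """Average sustained rate, in terabits per second."""
    bits = exabytes_per_month * EXABYTE * 8
    return bits / SECONDS_PER_MONTH / 10**12


def annual_growth(start, end, years):
    """Compound annual growth rate needed to go from start to end."""
    return (end / start) ** (1 / years) - 1


if __name__ == "__main__":
    print(f"21 EB/month is about {average_rate_tbps(21):.1f} Tbit/s on average")
    print(f"56 EB/month is about {average_rate_tbps(56):.1f} Tbit/s on average")
    print(f"21 -> 56 EB/month in 3 years is about {annual_growth(21, 56, 3):.0%} per year")
```

Under those assumptions, 21 exabytes per month works out to roughly 65 terabits per second around the clock, and getting to 56 exabytes takes compound growth of a little under 40% per year.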
Problem 3 - It's tough to analyze and process massive amounts of data
Data comes in many formats, such as text, audio, video and images, and ranges from structured to semi-structured to unstructured. How would you begin to perform analysis on this data to find patterns across different formats and types? For instance, how do you determine the patterns of a given user from their purchase history (structured data) and surfing behavior (semi-structured log files)? It’s pretty tough... and costly... and computation could take a long time if resources aren't allocated in an efficient manner. Regardless, companies know that the answers to these questions provide insight that is paramount to their business. That insight can reveal patterns that are key to understanding markets, driving new strategies, improving business processes, performing scientific calculations to enhance offerings etc.
That’s why companies that regularly deal with massive amounts of data, like Google, started looking for cheaper, more efficient solutions. Their systems needed to be more scalable and elastic than ever.
Enter Google File System (GFS)
In 2003, Google published a paper detailing some of the science and design behind “The Google File System”. I’ll dive into the GFS in the next post but at a high level, this file system was developed to provide fault tolerant, reliable data storage using clusters of inexpensive commodity servers. Not the high-end servers that cost thousands of dollars. Just simple, affordable rack mountable Linux servers.
This design that Google developed directly solved problems #1 and #2.
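To make "fault tolerant storage on commodity servers" a little more concrete, here is a toy sketch of the replication idea. The 64 MB chunk size and three replicas per chunk come from the GFS paper; the round-robin placement, the node names and both function names are simplifications I made up for illustration, and real GFS placement is considerably smarter.

```python
# Toy model of replicated chunk storage. This is NOT how GFS places chunks;
# round-robin placement is a stand-in chosen to keep the sketch short.

CHUNK_SIZE = 64 * 1024 * 1024  # chunk size described in the GFS paper
REPLICAS = 3                   # default replica count described in the GFS paper


def place_chunks(file_size, nodes, replicas=REPLICAS):
    """Assign every chunk of a file to `replicas` distinct nodes."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_chunks = -(-file_size // CHUNK_SIZE)  # ceiling division
    return {
        chunk: [nodes[(chunk + i) % len(nodes)] for i in range(replicas)]
        for chunk in range(num_chunks)
    }


def readable_after_failure(placement, failed_nodes):
    """The file survives if every chunk has a replica on a healthy node."""
    failed = set(failed_nodes)
    return all(
        any(node not in failed for node in holders)
        for holders in placement.values()
    )


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]  # hypothetical cheap servers
    placement = place_chunks(1024**3, nodes)  # a 1 GiB file
    print(len(placement), "chunks")
    print("survives two dead nodes:", readable_after_failure(placement, ["node-a", "node-b"]))
```

Because every chunk lives on three different machines, any two of the cheap servers can die without losing data. That is the trade: buy more inexpensive boxes and expect them to fail, rather than paying for a few expensive ones that aren't supposed to.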
Then Google MapReduce
The very next year after Google released its paper on GFS, they released another paper called “MapReduce: Simplified Data Processing on Large Clusters”. The purpose of their now patented MapReduce framework is to process and generate large data sets across a cluster of commodity machines in a distributed fashion. The core idea is that it is generally cheaper to move computation to the data than to move the data to the computation. This exploits data locality in a clustered environment. It is modeled after the map and reduce operations of functional programming, and a job runs in two phases: map, then reduce.
During the map phase, the job is broken up into smaller tasks that are sent out to the data nodes in the cluster. Since the data is already distributed across the cluster, each node runs the map computation on its own subset of the data, in parallel with every other node. Each node emits its results as intermediate key/value pairs, and the master node keeps track of where those results live.
Then comes the reduce phase, which the master node coordinates. The intermediate results from the data nodes are grouped by key and combined to produce the final “answer” or output. In typical cases, this output is saved for processing by another program.
Google MapReduce solved problem #3 by providing faster computation and analysis using distributed techniques in tandem with GFS.
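The two phases are easier to see with the classic word count example. What follows is a minimal single-process sketch, not Google's implementation: each string in the list stands in for the slice of data held by one node, and the function names are my own. In a real cluster the map calls would run in parallel on the nodes that hold the data.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one node's slice of data."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(mapped_outputs):
    """Group intermediate values by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(partitions):
    # One map task per partition; here they simply run one after another.
    mapped = [map_phase(document) for document in partitions]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    partitions = [
        "the quick brown fox",
        "the lazy dog",
        "the fox jumps over the lazy dog",
    ]
    print(word_count(partitions))
```

Notice that `map_phase` only ever looks at its own partition and `reduce_phase` only ever looks at one key. That independence is what lets the framework spread the work across as many machines as you have.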
Conclusion
- There are three main reasons why you’d want to look into a distributed computing framework:
  - You need cheap, reliable, fault tolerant data storage that scales horizontally.
  - The amount of data you collect is growing exponentially (terabytes, petabytes, exabytes and beyond).
  - You need to efficiently process and analyze high volumes of unstructured, semi-structured, and structured data.
- Google's white papers on GFS and MapReduce sparked an open source distributed computing revolution (like Hadoop).
In the next posts, we’ll explore GFS and MapReduce, then introduce you to Hadoop.
I hope this helps shed a little light on distributed computing and that you'll read the next articles in the series. Until then, happy coding!
Micky