Search This Blog

Tuesday, October 09, 2007

Cluster Computing and MapReduce

Lecture 1: Introducing Mapreduce and Cluster Computing
Distributed systems overview, review of synchronization and networking.



Lecture 2: The MapReduce programming model.
Overview of the MapReduce programming model.



Lecture 3: Distributed File Systems
Overview of distributed file systems with attention to the Google File System.




Lecture 4: Clustering Algorithms
Types of clustering algorithms, MapReduce implementations of K-Means and Canopy Clustering




Lecture 5: Graph Algorithms
Graph representations, distributed Pagerank, distributed Dijkstra.





Just as people are social animals, computers are social machines—the more, the merrier. Twenty or thirty years ago, large, centralized mainframes sat alone in sheltered bunkers in computer science departments and government offices alike, choking for hours on mere megabytes of data. Even with recent advances in server technology, large, centralized machines are still struggling to cope with today’s modern computational challenges, which now involve terabytes of data and processing requirements well beyond a single CPU (or two, or four, or eight). One computer just won’t hack it; these days, to support a new paradigm of massively parallel systems architecture, we need to break the machine out of its bunker and give it some friends.

In this age of “Internet-scale” computing, the new, evolving problems faced by computer science students and researchers require a new, evolving set of skills. It’s no longer enough to program one machine well; to tackle tomorrow’s challenges, students need to be able to program thousands of machines to manage massive amounts of data in the blink of an eye. This is how I, along with my good friend and mentor Ed Lazowska of the University of Washington’s CSE department, started to think about CS curricula and the obstacles to teaching a practical and authentic approach to massively parallel computing.

It's no easy feat. Teaching these methods effectively requires access to huge clusters and innovative new approaches to curricula. That's why we are pleased to announce the successful implementation of our Academic Cluster Computing Initiative pilot program at a handful of schools, including the University of Washington, Carnegie-Mellon University, Massachusetts Institute of Technology, Stanford University, the University of California at Berkeley and the University of Maryland. This pilot extends our expertise in large scale systems to strong undergraduate programs at the pilot schools, allowing individual students to take advantage of the hundreds of processors being made available. As the pilot progresses, we'll work with our technology partner IBM to shake the bugs out of the system so that we can expand the program to include more educators and academic researchers.

The future of computing is already taking shape on campuses today, and Google and IBM are thrilled to help inspire a new generation of computer scientists to think big. All of the course material developed by UW as well as other tools and resources to facilitate teaching this cutting- edge technology is available at http://code.google.com/edu. If you're a student wondering just what this sort of thing means for you, check out the five-part video lecture series (originally offered to Google Engineering interns) that introduces some of the fundamental concepts of large-scale cluster computing.

Earlier this year, the University of Washington partnered with Google to develop and implement a course to teach large-scale distributed computing based on MapReduce and the Google File System (GFS). The goal of developing the course was to expose students to the methods needed to address the problems associated with hundreds (or thousands) of computers processing huge datasets ranging into terabytes. I was excited to take the first version of the class, and stoked to serve as a TA in the second round.

But you can't program air, so Google provided a cluster computing environment to get us started. And since computers can't program themselves (yet?), UW provided the most essential component: students with sweet ideas for a huge cluster. After learning the ropes with these new tools, students finished the course by producing an impressive array of final projects, including an n-body simulator, a bot to perform Bayesian analysis on Wikipedia edits to search for spam, and an RSS aggregator that clustered news articles by geographic location and displayed them using the Google Maps API. Check out Geozette.

We are looking at ways to encourage other universities to get similar classes going, so we've also published the course material that was used at the University of Washington on Google Code for Educators. You're more than welcome to check out the Google Summer Intern video lectures on MapReduce, GFS, and parallelizing algorithms for large scale data processing. This summer I've been working on exposing these educational resources and other tools so that anyone can work on and think about cool distributed computing problems without the overhead of installing his or her own cluster. In that vein, we've released a virtual machine containing a pre-configured single node instance of Hadoop that has the same interface as a full cluster without any of the overhead. Feel free to give it a whirl.

We're happy to be able to expose students and researchers to the tools Googlers use everyday to tackle enormous computing challenges, and we hope that this work will encourage others to take advantage of the incredible potential of modern, highly parallel computing. Virtually all of this material is Creative Commons licensed, and we encourage educators to remix it, build upon it, and discuss it in the Google Code for Educators Forum.

Lastly, a quick shout out to the other interns who helped out on our team this summer: Aaron Kimball, Christophe Taton, Kuang Chen, and Kat Townsend. I'll miss you guys!

Resources:
http://google-code-updates.blogspot.com/2007/09/uw-and-google-teaching-in-parallel.html
http://code.google.com/edu/
http://code.google.com/edu/content/submissions/mapreduce-minilecture/listing.html
http://googleblog.blogspot.com/2007/10/let-thousand-servers-bloom.html

No comments: