HDFS: Report from my visit to the Berkeley RAD Labs

I attended the UC Berkeley RAD Lab Spring Retreat, held in Santa Cruz. The lab has about 30 PhD students, and the quality of their work really impressed me. Most of it is research on distributed systems. Many of the students are working with Hadoop, and it is amazing to see Hadoop at the core of so much research activity. When the Hadoop project started three years ago, I definitely did not imagine that it would get to this state!

I had read David Patterson's papers during my graduate studies at the University of Wisconsin-Madison, and it was really nice to meet him in person. The group of students that he leads at the RAD Lab is of very high calibre. Most people have probably already seen the Above the Clouds whitepaper that the RAD Lab has produced. It tries to clear up the muddle about what Cloud Computing really is, its benefits, and its possible usage scenarios. A good read!

A paper titled Detecting Large-Scale System Problems by Mining Console Logs describes how the logs of a distributed application can be used to detect bugs, problems and anomalies in the system. As an example, the authors process 24 million log lines produced by HDFS to detect a bug (page 11 of the paper). I am not really convinced about the bug, but this is an effort in the right direction.

My employer, Facebook, is an enabler of research in distributed systems. To this end, Facebook has allowed researchers from premier universities to analyze Hadoop system logs. These logs typically record machine usage and Hadoop job performance metrics. The Hadoop job records are inserted into a MySQL database for easy analysis (HADOOP-3708). Some students at the RAD Lab have used this database to show that machine learning techniques can be used to predict Hadoop job performance. That paper is not yet published. Another paper analyzes the performance characteristics of the Hadoop fair share scheduler. Abstracts of most of these publications are available here.

Last but not least, SCADS is a scalable storage system designed specifically for social-networking-style applications. It has a declarative query language and supports various consistency models. It also supports a rich data model that includes joins on pre-computed queries.

Thursday, May 28, 2009
