Tuesday, June 29, 2010

America’s Most Wanted – a metric to detect faulty machines in Hadoop

Handling failures in Hadoop
I have been asked many many questions about the failure rates of machines in our Hadoop cluster. These questions vary from the innocuous how much time do you spend everyday fixing bad machines in your cluster to the more involved ones like does failure of hadoop machines depend on the heat-map of the data center? I do not have answer to these questions and this is an area of research that has been of focus lately. But I have seen that the efficiency of a Hadoop cluster is directly dependent on the amount of manpower needed to operate such a cluster.

Common Types of Failures
The three most common categories of failures I have observed are
  • System Errors – Hardware, OS, jvm, hadoop, compiler, etc – Hadoop aims to reduce the effect of this broad category of errors
  • User Application Errors – Bad code written by an user – Bloated memory usage
  • Anomalous Behaviour -- Not working according to expectation – Slow nodes – Causes most harm to Hadoop cluster because they go undetected for long periods of time
America's Most Wanted (AMW)
I have the previledge of working with one of the most experienced Hadoop cluster administrator Andrew Ryan. He observed that a few machines in the Hadoop cluster are always repeat offenders: they land into trouble, gets incarcerated and fixed and then when put back online they create trouble again. He came up with this metric to determine when to throw a machine out of the Hadoop cluster. The following chart shows that 3% of our machines cause 43% of all manual repair events.

This clearly shows the need for a three-strikes-you-are-out law: if a machine goes into repair three times it is better to take it permanently out of the Hadoop cluster.

An exotic question
Other exotic questions that I am frequently asked are like do map-reduce jobs written in python have a higher probability of failure? Here is a chart that tries to answer this question:
  • 5% of all jobs in cluster are written in Python
  • 15% of cluster CPU is consumed by Python jobs
  • 20% of all failed jobs are written in python
This does show that jobs written in python consume more CPU on the average than jobs written in Java. It also shows that a greater percentage of these jobs are likely to fail. Why is this? I do not have a definite answer but I my guess is that a developer is more likely to write the first few version of his experimental query in python because it is an easy-to-prototype language.

I presented a more detailed version of this in a IFIP Working Group on Dependable Computing (DSN2010). The aim of this workshop is to understand more about failure patterns on Hadoop nodes, automatic ways to analyze and handle these failures and how the research community can help Hadoop become more fault-tolerant.