tag:blogger.com,1999:blog-85560037863248049732024-03-05T03:46:42.021-08:00HDFSThese are my musings related to the Apache Hadoop Project.Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.comBlogger21125tag:blogger.com,1999:blog-8556003786324804973.post-33185423869907155902012-05-05T22:48:00.001-07:002012-05-05T22:49:23.375-07:00Hadoop and Solid State Drives<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
Is there a story for the Hadoop Storage Stack (HDFS+HBase) on <a href="http://en.wikipedia.org/wiki/Solid-state_drive">Solid State Drives (SSDs)</a>? This is a question I have been asked by quite a few people in the last two days, mostly by people at the <a href="http://opencompute.org/summit-2012/">OpenCompute Summit</a>. This piece discusses the possible use cases for SSDs with Hadoop or HBase.<br />
<br />
<b>Use Case</b><br />
Currently, there are two primary use cases for HDFS: <a href="http://borthakur.com/ftp/sigmodwarehouse2010.pdf">data warehousing</a> using map-reduce and a key-value store via <a href="http://hbase.apache.org/">HBase</a>. In the data warehouse case, data is mostly accessed sequentially from HDFS, so there isn't much benefit from using an SSD to store data. In a data warehouse, a large portion of queries access only recent data, so one could argue that keeping the last few days of data on SSDs could make queries run faster. But most of our map-reduce jobs are CPU bound (decompression, deserialization, etc.) and bottlenecked on map-output-fetch; reducing the data access time from HDFS does not impact the latency of a map-reduce job. Another use case would be to put map outputs on SSDs; this could potentially reduce map-output-fetch times and is one option that needs some benchmarking.<br />
<br />
For the second use case, HDFS+HBase could theoretically use the full potential of SSDs to make online-transaction-processing workloads run faster. This is the use case that the rest of this blog post tries to address.<br />
<br />
<b>Background</b><br />
The read/write latency of data from an SSD is orders of magnitude smaller than that of spinning disk storage; this is especially true for random reads and writes. For example, a random read from an SSD takes about 30 microseconds while a random read from a spinning disk takes 5 to 10 milliseconds. Also, an SSD device can support 100K to 200K operations/sec while a spinning disk controller can typically issue only 200 to 300 ops/sec. This means that random reads/writes are not a bottleneck on SSDs. On the other hand, most of our existing database technology is designed to store data on spinning disks, so the natural question is "can these databases harness the full potential of the SSDs?" To answer this question, we ran two separate artificial random-read workloads, one on HDFS and one on HBase. The goal was to stretch these products to the limit and establish their maximum sustainable throughput on SSDs.<br />
<div>
<br /></div>
<div>
<div>
<b>HDFS random-read on cached data</b></div>
<div>
In the first experiment, we created an HDFS cluster with a single NameNode and a single DataNode. We created a 2 GB HDFS file with an HDFS block size of 256 MB and a replication factor of 1. We configured the DataNode to run on a machine with 16 hyper-threaded cores, and it stored block data on xfs. Our benchmark program was co-located on the DataNode machine and had hdfs-read-shortcircuit switched on, i.e. the DFSClient bypassed the DataNode and issued read calls directly to the local xfs filesystem. The entire 2 GB of data was cached in the OS buffer cache and this benchmark did not trigger any IO from disk. The fact that all the data was in the OS cache essentially simulated the behavior of an ultra-fast SSD. We varied the number of client threads, and each client thread did a pre-configured number of 16K read calls from HDFS. Since there were only 8 blocks in the file, the DFSClient cached the locations of all these 8 blocks and there were no repeated calls to the NameNode. The first few iterations of this test showed that HDFS can sustain a max random-read throughput of around 50K ops/sec, but surprisingly the CPU was not maxed out. We found that the read-shortcircuit code path spent considerable time in DNS lookup calls and in updating metric counters. We fixed these two pieces of code and observed that HDFS could sustain a peak random-read throughput of around 92K ops/sec, with the CPUs now close to 95% usage. <a href="http://borthakur.com/ftp/HdfsPreadImage.jpg">HdfsPreadImage</a> is a plot that captures this scenario. The takeaway is that a database that is layered above HDFS would not be able to utilize all the IOPS offered by a single SSD.</div>
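<div>
<br /></div>
<div>
For readers who want to try a similar measurement, here is a minimal sketch of the kind of client-side random-read loop used in this test. The file path, read size, single thread, and the exact short-circuit configuration key are illustrative assumptions (the property name differs across Hadoop versions); it is not the actual benchmark code.</div>
<pre>
import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// A sketch of a single-threaded random-read driver against an HDFS file
// that is assumed to be fully resident in the OS buffer cache.
public class HdfsRandomReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed property name; older releases used a differently named key.
    conf.setBoolean("dfs.client.read.shortcircuit", true);
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/bench/data2g");          // hypothetical 2 GB test file
    long fileLen = fs.getFileStatus(file).getLen();
    int readSize = 16 * 1024;                       // 16K preads, as in the experiment
    byte[] buf = new byte[readSize];
    Random rand = new Random();
    int numReads = 100000;

    FSDataInputStream in = fs.open(file);
    long start = System.currentTimeMillis();
    for (int i = 0; i < numReads; i++) {
      long pos = (long) (rand.nextDouble() * (fileLen - readSize));
      in.readFully(pos, buf, 0, readSize);          // positional read; no seek state
    }
    long elapsedMs = System.currentTimeMillis() - start;
    in.close();
    System.out.println(numReads * 1000L / Math.max(1, elapsedMs) + " reads/sec");
  }
}
</pre>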
<div>
<br /></div>
<div>
A profiled run of the HDFS code shows that the DFSClient's code paths are quite long and cause an appreciable impact on throughput for cached random reads. If data-fetch times are in the millisecond range (from spinning disks), the long code paths in the DFSClient do not add appreciable overhead, but when the data is cached in the OS cache (or on SSDs), these code paths need some major rework. Another option would be to write an HDFS read-only client in C or C++, thereby avoiding some of the overhead of the current Java-based DFSClient.</div>
<div>
<br /></div>
<div>
<b>HBase random-get on cached data</b></div>
<div>
In the second experiment, we ran a similar test on HBase. We created a single table with a single region and all data was cached in the OS cache of a single HBase regionserver. The OS cache again simulates a super-fast SSD device. We used a set of 4 client machines to drive random-get calls to the regionserver. The regionserver was configured to use a maximum of 2000 threads. The HBase table had LZO compression and delta-encoding-fast-diff enabled. Since the data set was cached in OS buffers, this benchmark did not cause any disk IO from spinning disks. We saw that the HBase throughput maxed out at around 35K ops/sec and we were not able to drive the CPU usage on that machine to more than 45%. Heavy lock contention and heavy context switching prevent the regionserver from using all the available CPU on the machine. The detailed chart is at <a href="http://borthakur.com/ftp/Cache4G.jpg">Cache4G</a>.</div>
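<div>
<br /></div>
<div>
A rough sketch of one client thread in this kind of random-get driver is shown below. The table name, column family, row-key format, and key-space size are illustrative assumptions, and the calls shown are from the older HTable-based (HBase 0.90-era) client API; the real benchmark spread the load over several client machines and many threads.</div>
<pre>
import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// A sketch of one client thread issuing random point reads against a region
// whose data is assumed to be fully cached on the regionserver.
public class HBaseRandomGetSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "bench");        // assumed table name
    byte[] family = Bytes.toBytes("cf");             // assumed column family
    long keySpace = 10 * 1000 * 1000L;               // assumed number of rows
    int numGets = 100000;
    Random rand = new Random();

    long start = System.currentTimeMillis();
    int found = 0;
    for (int i = 0; i < numGets; i++) {
      long k = (long) (rand.nextDouble() * keySpace);
      Get get = new Get(Bytes.toBytes(String.format("row%010d", k)));
      get.addFamily(family);
      Result result = table.get(get);                // one random get
      if (!result.isEmpty()) found++;
    }
    long elapsedMs = System.currentTimeMillis() - start;
    table.close();
    System.out.println(found + " non-empty results, "
        + numGets * 1000L / Math.max(1, elapsedMs) + " gets/sec");
  }
}
</pre>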
</div>
<div>
<br /></div>
<div>
<div>
<b>What does this mean</b></div>
<div>
The two experiments show that HBase+HDFS, as it stands today, will not be able to harness the full potential that is offered by SSDs. It is possible that some code restructuring could improve the random-read-throughput of these solutions but my guess is that it will need significant engineering time to make HBase+HDFS sustain a throughput of 200K ops/sec.</div>
</div>
<div>
<br /></div>
<div>
<div>
These results are not unique to HBase+HDFS. Experiments on other non-Hadoop databases show that they also need to be re-engineered to achieve SSD-capable throughputs. My conclusion is that database and storage technologies would need to be developed from scratch if we want to utilize the full potential of Solid State Devices. The search is on for these new technologies!</div>
</div>
<div>
<br /></div>
<div>
<br /></div>
</div>Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com31tag:blogger.com,1999:blog-8556003786324804973.post-52114989446942138002012-02-19T21:21:00.002-08:002012-02-20T10:39:57.091-08:00Salient features for a BigData Benchmark<div><span>Recently, I was asked to write up about my vision of a BigData Benchmark. That begs the question: <i>What is BigData</i>? Does it refer to a dataset that is large in size, and if so, what is <i>large</i>? Does it refer to the type of data that such a data store contains? Shall we refer to BigData only if it does not conform to a relational schema? Here are some of my random thoughts.</span></div><div><span><br /></span></div><div><span>Software industry professionals have started to use the term BigData to refer to data sets that are typically many orders of magnitude larger than traditional databases. The largest Oracle database or the largest NetApp filer could be many hundred terabytes at most, but BigData refers to storage sets that can scale to many hundred petabytes. Thus, the first and foremost characteristic of a BigData store is that a single instance of it can be <a href="http://hadoopblog.blogspot.com/2010/05/facebook-has-worlds-largest-hadoop.html">many petabytes</a> in size. These data stores can have a multitude of interfaces, starting from traditional <a href="http://hive.apache.org/">SQL-like</a> queries to customized <a href="http://hbase.apache.org/">key-value </a>access methods. Some of them are batch systems while others are interactive systems. Again, some of them are organized for full-scan, index-free access while others have fine-grain indexes and low-latency access. How can we design benchmarks for such a wide variety of data stores? Most benchmarks focus on latency and throughput of queries, and rightly so. However, in my opinion, the key to designing a BigData benchmark lies in understanding the deeper commonalities of these systems. A BigData benchmark should measure latencies and throughput, but with a great deal of variation in the workload, skew in the data set, and in the presence of faults. </span><span style="font-size: 100%; ">I list below some of the common characteristics that distinguish BigData installations from other data storage systems.</span></div><div><span><br /></span></div><div><span><b>Elasticity of resources</b></span></div><div><span>A primary feature of a BigData system is that it should be elastic in nature. One should be able to add software and hardware resources when needed. Most BigData installations do not want to pre-provision for all the data that they might collect in the future, and the trick to be cost-efficient is to be able to add resources to a production store without incurring downtime. A BigData system typically has the ability to decommission parts of </span><span>the hardware and software without off-lining the service, so that obsolete or defective hardware can be repl</span><span style="font-family: Georgia, serif; ">aced dynamically. In my mind, this is one of the most important features of a BigData system, thus a benchmark </span><span style="font-family: Georgia, serif; ">should be able to measure this feature.
The benchmark should be such that we can add resources to and re</span><span style="font-family: Georgia, serif; ">move resources from the system while the benchmark is executing.</span></div><div><span><br /></span></div><div><span><b>Fault Tolerance</b></span></div><div><span>The Elasticity feature described above indirectly implies that the system has to be fault-tolerant. If a workload is running on your system and some parts of the system fail, the other parts of the </span><span style="font-family: Georgia, serif; ">system should configure themselves to share the work of the failed parts. This means that the service does not fail even in the face of some component failures. The benchmark should me</span><span style="font-family: Georgia, serif; ">asure this aspect of BigData systems. One simple option could be that the benchmark itself introduces component failures as part of its execution.</span></div><div><span><br /></span></div><div><span><div><b>Skew in the data set</b></div><div>Many big data systems take in un-curated data. That means there are always data points that are extreme outliers and introduce hotspots in the system. The workload on a BigData system is not uniform; some small parts of it are major <i>hotspots</i> and incur a tremendously higher load than the rest of the system. Our benchmarks should be designed to operate on datasets that have large skew and introduce workload <i>hotspots</i>.</div><div><br /></div><div>There are a few previous attempts to define a unified benchmark for BigData. DeWitt and Stonebraker touched upon a few areas in their <a href="http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf">SIGMOD paper</a>. They describe experiments that use a grep task, a join task and a simple SQL aggregation query. But none of those experiments are run in the presence of system faults, nor do they add or remove hardware while the experiment is in progress. Similarly, the <a href="http://research.yahoo.com/node/3202">YCSB benchmark</a> proposed by Cooper and Ramakrishnan suffers from the same deficiency.</div><div><br /></div><div>How would I run the experiments proposed by DeWitt and Stonebraker? Here are some of my early thoughts:</div><div><ul><li>Focus on a 100-node experiment only. This is the setting that is appropriate for BigData systems.</li><li>Increase the number of URLs such that the data set is at least a few hundred terabytes.</li><li>Make the benchmark run for at least an hour or so. The workload should be a set of multiple queries. Pace the workload so that there are constant fluctuations in the number of in-flight queries.</li><li>Introduce skew in the data set. The URL data should be such that maybe 0.1% of those URLs occur 1000 times more frequently than other URLs.</li><li>Introduce system faults by killing one of the 100 nodes every minute, keeping it shut down for a minute, then bringing it back online, and continuing this process with the remainder of the nodes until the entire benchmark is done.</li></ul></div><div>My hope is that there is somebody out there who can repeat the experiments with the modified settings listed above and present their findings. This research would greatly benefit the BigData community of users and developers!</div><div><br /></div><div>On a side note, I am working with some of my esteemed colleagues to document a specific data model and custom workload for online serving of queries from a multi-petabyte BigData system.
I will write about it in greater detail in a future post.</div></span></div>Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com8tag:blogger.com,1999:blog-8556003786324804973.post-76814038675786767652011-07-02T18:02:00.000-07:002011-07-02T18:32:40.480-07:00Realtime Hadoop usage at Facebook: The Complete StoryI had earlier blogged about <a href="http://hadoopblog.blogspot.com/2011/05/realtime-hadoop-usage-at-facebook-part.html">why</a> Facebook is starting to use Apache Hadoop technologies to serve realtime <a href="http://hadoopblog.blogspot.com/2011/05/realtime-hadoop-usage-at-facebook-part_28.html">workloads</a>. We presented the paper at the <a href="http://www.sigmod2011.org/ind_list.shtml">SIGMOD 2011</a> conference and it was very well received. <div><br /></div><div>Here is a <a href="http://borthakur.com/ftp/RealtimeHadoopSigmod2011.pdf">link</a> to the complete paper for those who are interested in understanding the details of why we decided to use Hadoop technologies, the workloads that we have on realtime Hadoop, the enhancements that we made to Hadoop to support our workloads, and the processes and methodologies we have adopted to deploy these workloads successfully. A shortened version of the first two sections of the paper is also available in the <a href="http://borthakur.com/ftp/SIGMODRealtimeHadoopPresentation.pdf">slides</a> that you can find <a href="http://borthakur.com/ftp/SIGMODRealtimeHadoopPresentation.pdf">here</a>.</div>Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com20tag:blogger.com,1999:blog-8556003786324804973.post-376297426937752362011-05-28T00:56:00.000-07:002011-06-02T22:29:51.221-07:00Realtime Hadoop usage at Facebook -- Part 2 - Workload TypesThis is the second part of our <a href="http://www.sigmod2011.org/ind_list.shtml">SIGMOD-2011</a> paper that describes our use cases for <a href="http://hadoop.apache.org/hdfs/">Apache Hadoop</a> and <a href="http://hbase.apache.org/">Apache HBase</a> in realtime workloads. You can find the <a href="http://hadoopblog.blogspot.com/2011/05/realtime-hadoop-usage-at-facebook-part.html">first part here</a>. We describe why Hadoop and HBase fit the requirements of each of these applications.
<br /><span><span>
<br /><b><font class="Apple-style-span" size="5">OUR WORKLOADS</font></b></span></span><div><span><span>
<br /></span></span><div><span><span><span><span>Before deciding on a particular software stack and whether or not to move away from our MySQL-based architecture, we looked at a few specific applications where existing solutions may be problematic. These use cases would have workloads that are challenging to scale because of very high write throughput, massive datasets, unpredictable growth, or other patterns that may be difficult or suboptimal in a sharded RDBMS environment.</span></span></span></span></div><div><span><span><span><span>
<br /><b><font class="Apple-style-span" size="5">1. Facebook Messaging</font></b></span></span></span></span></div><div><span><span><span><span><b><font class="Apple-style-span" size="4">
<br /></font></b>The latest generation of Facebook Messaging combines existing Facebook messages with e-mail, chat, and SMS. In addition to persisting all of these messages, a new threading model also requires messages to be stored for each participating user. As part of the application server requirements, each user will be sticky to a single data center.
<br />
<br /></span></span></span></span></div><div><span><span><span><span><b>1.1 High Write Throughput</b></span></span></span></span></div><div><span><span><span><span>With an existing rate of millions of messages and billions of instant messages every day, the volume of ingested data would be very large from day one and only continue to grow. The denormalized requirement would further increase the number of writes to the system as each message could be written several times.</span></span></span></span></div><div><span><span><span><span>
<br /><b>1.2 Large Tables
<br /></b>As part of the product requirements, messages would not be deleted unless explicitly deleted by the user, so each mailbox would grow indefinitely. As is typical with most messaging applications, messages are read only a handful of times when they are recent, and then are rarely looked at again. As such, a vast majority would not be read from the database but must be available at all times and with low latency, so archiving would be difficult. Storing all of a user’s thousands of messages meant that we’d have a database schema that was indexed by the user with an ever-growing list of threads and messages. With this type of random write workload, write performance will typically degrade in a system like MySQL as the number of rows in the table increases. The sheer number of new messages would also mean a heavy write workload, which could translate to a high number of random IO operations in this type of system.</span></span></span></span></div><div><span><span><span><span>
<br /><b>1.3 Data Migration
<br /></b>One of the most challenging aspects of the new Messaging product was the new data model. This meant that all existing users’ messages needed to be manipulated and joined for the new threading paradigm and then migrated to the new system. The ability to perform large scans, random access, and fast bulk imports would help to reduce the time spent migrating users to the new system.</span></span></span></span></div><div><span><span><span><span>
<br /><font class="Apple-style-span" size="5"><b>2 Facebook Insight</b>s</font></span></span></span></span></div><div><span><span><span><span><font class="Apple-style-span" size="4">
<br /></font>Facebook Insights provides developers and website owners with access to real-time analytics related to Facebook activity across websites with social plugins, Facebook Pages, and Facebook Ads. Using anonymized data, Facebook surfaces activity such as impressions, click through rates and website visits. These analytics can help everyone from businesses to bloggers gain insights into how people are interacting with their content so they can optimize their services. Domain and URL analytics were previously generated in a periodic, offline fashion through our Hadoop and Hive analytics data warehouse. However, this does not yield a rich user experience as the data is only available several hours after it has occurred.</span></span></span></span></div><div><span><span><span><span>
<br /><font class="Apple-style-span" size="3"><b>2.1 Realtime Analytics
<br /></b></font>The insights teams wanted to make statistics available to their users within seconds of user actions rather than the hours previously supported. This would require a large-scale, asynchronous queuing system for user actions as well as systems to process, aggregate, and persist these events. All of these systems need to be fault-tolerant and support more than a million events per second.</span></span></span></span></div><div><span><span><span><span>
<br /><b>2.2 High Throughput Increments
<br /></b>To support the existing Insights functionality, time- and demographic-based aggregations would be necessary. However, these aggregations must be kept up-to-date and thus processed on the fly, one event at a time, through numeric counters. With millions of unique aggregates and billions of events, this meant a very large number of counters with an even larger number of operations against them.</span></span></span></span></div><div><span><span><span><span>
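<br />
As a concrete illustration of the counter pattern described above, the sketch below applies one event to an aggregate using HBase's atomic increment call. The table name, column family, and aggregate-key format are assumptions for illustration and are not the actual Insights schema; the API shown is the older HTable-based client API.
<br />
<pre>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: apply a single impression event to a per-domain, per-hour counter.
public class CounterIncrementSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "insights_counters");   // assumed table name

    String aggregateKey = "example.com#2011052810";          // assumed key format
    long newValue = table.incrementColumnValue(
        Bytes.toBytes(aggregateKey),
        Bytes.toBytes("stats"),                               // assumed column family
        Bytes.toBytes("impressions"),
        1L);                                                  // atomic +1 applied on the server

    System.out.println("impressions so far: " + newValue);
    table.close();
  }
}
</pre>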
<br /><b><font class="Apple-style-span" size="5">3. Facebook Metrics System</font></b></span></span></span></span></div><div><span><span><span><span><font class="Apple-style-span" size="4"><b>
<br /></b></font>At Facebook, all hardware and software feed statistics into a metrics collection system called ODS (Operations Data Store). For example, we may collect the amount of CPU usage on a given server or tier of servers, or we may track the number of write operations to an HBase cluster. For each node or group of nodes we track hundreds or thousands of different metrics, and engineers will ask to plot them over time at various granularities. While this application has hefty requirements for write throughput, some of the bigger pain points with the existing MySQL-based system are around the resharding of data and the ability to do table scans for analysis and time roll-ups. This use-case is gearing up to be in production very shortly.
<br />
<br /></span></span></span></span></div><div><span><span><span><span><b>3.1 Automatic Sharding</b></span></span></span></span></div><div><span><span><span><span>The massive number of indexed and time-series writes and the unpredictable growth patterns are difficult to reconcile on a sharded MySQL setup. For example, a given product may only collect ten metrics over a long period of time, but following a large rollout or product launch, the same product may produce thousands of metrics. With the existing system, a single MySQL server may suddenly be handling much more load than it can handle, forcing the team to manually re-shard data from this server onto multiple servers.</span></span></span></span></div><div><span><span><span><span>
<br /><b>3.2 Fast Reads of Recent Data and Table Scans
<br /></b>The vast majority of reads to the metrics system are for very recent, raw data; however, all historical data must also be available. Recently written data should be available quickly, but the entire dataset will also be periodically scanned in order to perform time-based rollups.</span></span></span></span></div></div><div><span><span><span><span>
<br /></span></span></span></span></div><div><span><span><span><span><meta charset="utf-8"><span class="Apple-style-span" style="color: rgb(102, 102, 102); font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 14px; line-height: 15px; ">(<i>Credit to the authors of the paper: Dhruba Borthakur, Kannan Muthukkaruppan, Karthik Ranganathan, Samuel Rash, Joydeep Sen Sarma, Jonathan Gray, Nicolas Spiegelberg, Hairong Kuang, Dmytro Molkov, Aravind Menon, Rodrigo Schmidt, Amitanand Aiyer)</i></span></span></span></span></span></div>Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com8tag:blogger.com,1999:blog-8556003786324804973.post-87612207876135196472011-05-17T08:50:00.000-07:002011-05-18T21:39:20.122-07:00Realtime Hadoop usage at Facebook -- Part 1<span class="Apple-style-span"></span><span><span><span class="Apple-style-span" >Facebook recently deployed Facebook Messages, its first-ever user-facing application built on the Apache Hadoop platform. It uses <a href="http://hadoop.apache.org/hdfs/">HDFS</a> and <a href="http://hbase.apache.org/">HBase</a> as core technologies for this solution. Since then, many more applications have started to use HBase. We have gained some experience in deploying and operating HDFS and HBase at petabyte scale for realtime workloads and decided to write a paper detailing some of these insights. This paper will be published in SIGMOD 2011. </span></span></span><div><span><span><span class="Apple-style-span" >
<br /></span></span></span></div><div><span><span><span class="Apple-style-span" >You will be able to find the full paper here later; for now, here are some highlights:
<br />
<br /></span></span></span></div><div><span><span><span class="Apple-style-span" ><b>WHY HADOOP AND HBASE</b></span></span></span></div><div><span><span><span class="Apple-style-span" ><b>
<br /></b> The requirements for the storage system for our workloads can be summarized as follows:</span></span></span></div><div><span><span><span class="Apple-style-span" >
<br /><b>1. Elasticity</b>: We need to be able to add incremental capacity to our storage systems with minimal overhead and no downtime. In some cases we may want to add capacity rapidly and the system should automatically balance load and utilization across new hardware.</span></span></span></div><div><span><span><span class="Apple-style-span" >
<br /><b>2. High write throughput</b>: Most of the applications store (and optionally index) tremendous amounts of data and require high aggregate write throughput.</span></span></span></div><div><span><span><span class="Apple-style-span" >
<br /><b>3. Efficient and low-latency strong consistency semantics within a data center:</b> There are important applications like Messages that require strong consistency within a data center. This requirement often arises directly from user expectations. For example, ‘unread’ message counts displayed on the home page and the messages shown in the inbox page view should be consistent with respect to each other. While a globally distributed strongly consistent system is practically impossible, a system that could at least provide strong consistency within a data center would make it possible to provide a good user experience. We also knew that (unlike other Facebook applications) Messages was easy to federate so that a particular user could be served entirely out of a single data center, making strong consistency within a single data center a critical requirement for the Messages project. Similarly, other projects, like realtime log aggregation, may be deployed entirely within one data center and are much easier to program if the system provides strong consistency guarantees.</span></span></span></div><div><span><span><span class="Apple-style-span" >
<br /><b>4. Efficient random reads from disk</b>: In spite of the widespread use of application level caches (whether embedded or via memcached), at Facebook scale, a lot of accesses miss the cache and hit the back-end storage system. MySQL is very efficient at performing random reads from disk and any new system would have to be comparable.</span></span></span></div><div><span><span><span class="Apple-style-span" >
<br /><b>5. High Availability and Disaster Recovery:</b> We need to provide a service with very high uptime to users that covers both planned and unplanned events (examples of the former being events like software upgrades and addition of hardware/capacity and the latter exemplified by failures of hardware components). We also need to be able to tolerate the loss of a data center with minimal data loss and be able to serve data out of another data center in a reasonable time frame.
<br /><b>6. Fault Isolation:</b> Our long experience running large farms of MySQL databases has shown us that fault isolation is critical. Individual databases can and do go down, but only a small fraction of users are affected by any such event. Similarly, in our warehouse usage of Hadoop, individual disk failures affect only a small part of the data and the system quickly recovers from such faults.</span></span></span></div><div><span><span><span class="Apple-style-span" >
<br /><b>7. Atomic read-modify-write primitives: </b>Atomic increments and compare-and-swap APIs have been very useful in building lockless concurrent applications and are a must-have from the underlying storage system.</span></span></span></div><div><span><span><span class="Apple-style-span" >
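<br />
To make the compare-and-swap requirement concrete, here is a small illustrative sketch using HBase's checkAndPut call, which applies a Put only if the checked cell still holds the expected value. The table, family, qualifier, and values are assumptions for illustration, not an actual Facebook schema.
<br />
<pre>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: a lockless state transition using a server-side compare-and-swap.
public class CheckAndPutSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "app_state");            // assumed table name

    byte[] row = Bytes.toBytes("user123");
    byte[] family = Bytes.toBytes("s");
    byte[] qualifier = Bytes.toBytes("status");

    Put put = new Put(row);
    put.add(family, qualifier, Bytes.toBytes("archived"));

    // Applied only if the current status is still "active"; no client-side lock needed.
    boolean applied = table.checkAndPut(
        row, family, qualifier, Bytes.toBytes("active"), put);
    System.out.println("state transition applied: " + applied);
    table.close();
  }
}
</pre>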
<br /><b>8. Range Scans:</b> Several applications require efficient retrieval of a set of rows in a particular range. For example, all of the last 100 messages for a given user, or the hourly impression counts over the last 24 hours for a given advertiser.
<br />
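A small illustrative sketch of such a range scan is shown below: it retrieves up to the 100 most recent rows for one user, assuming a row-key layout of userId followed by '#' and a reverse timestamp so that newer messages sort first. The table name, column family, and key layout are assumptions for illustration only.
<br />
<pre>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: scan the contiguous key range that holds all rows for one user.
public class RangeScanSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "messages");              // assumed table name

    // All keys for user "u42" fall between "u42#" (inclusive) and "u42$" (exclusive).
    Scan scan = new Scan(Bytes.toBytes("u42#"), Bytes.toBytes("u42$"));
    scan.addFamily(Bytes.toBytes("m"));                       // assumed column family
    scan.setCaching(100);                                     // rows fetched per RPC

    ResultScanner scanner = table.getScanner(scan);
    int count = 0;
    for (Result r : scanner) {
      count++;
      if (count == 100) break;                                // only the latest 100 rows
    }
    scanner.close();
    table.close();
    System.out.println("rows scanned: " + count);
  }
}
</pre>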
<br /></span></span></span></div><div><span><span><span class="Apple-style-span" >It is also worth pointing out non-requirements:
<br />
<br /></span></span></span></div><div><span><span><span class="Apple-style-span" ><b>1. Tolerance of network partitions within a single data center</b>: Different system components are often inherently centralized. For example, MySQL servers may all be located within a few racks, and network partitions within a data center would cause major loss in serving capabilities therein. Hence every effort is made to eliminate the possibility of such events at the hardware level by having a highly redundant network design. </span></span></span></div><div><span><span><span class="Apple-style-span" ><b>
<br /></b></span></span></span></div><div><span><span><span class="Apple-style-span" ><b>2. Zero Downtime in case of individual data center failure:</b> In our experience such failures are very rare, though not impossible. In a less than ideal world where the choice of system design boils down to the choice of compromises that are acceptable, this is one compromise that we are willing to make given the low occurrence rate of such events. We might revise this non-requirement at a later time.</span></span></span></div><div><span><span><span class="Apple-style-span" ><b>
<br /></b></span></span></span></div><div><span><span><span class="Apple-style-span" ><b>3. Active-active serving capability across different data centers</b>: As mentioned before, we were comfortable making the assumption that user data could be federated across different data centers (based ideally on user locality). Latency (when user and data locality did not match up) could be masked by using an application cache close to the user. </span></span></span></div><div><span><span><span class="Apple-style-span" >
<br />Some less tangible factors were also at work. Systems with existing production experience for Facebook and in-house expertise were greatly preferred. When considering open-source projects, the strength of the community was an important factor. Given the level of engineering investment in building and maintaining systems like these, it also made sense to choose a solution that was broadly applicable (rather than adopt point solutions based on differing architecture and codebases for each workload).
<br />
<br /></span></span></span></div><div><span><span><span class="Apple-style-span" >After considerable research and experimentation, we chose Hadoop and HBase as the foundational storage technology for these next-generation applications. The decision was based on the state of HBase at the point of evaluation as well as our confidence in addressing the features that were lacking at that point via in-house engineering. HBase already provided a highly consistent, high write-throughput key-value store. The HDFS NameNode stood out as a central point of failure, but we were confident that our HDFS team could build a highly-available NameNode (<a href="http://hadoopblog.blogspot.com/2010/02/hadoop-namenode-high-availability.html">AvatarNode</a>) in a reasonable time-frame, and this would be useful for our warehouse operations as well. Good disk read-efficiency seemed to be within striking reach (pending adding Bloom filters to HBase’s version of LSM Trees, making local DataNode reads efficient and caching NameNode metadata). Based on our experience operating the Hive/Hadoop warehouse, we knew HDFS was stellar in tolerating and isolating faults in the disk subsystem. The failure of entire large HBase/HDFS clusters was a scenario that ran against the goal of fault-isolation, but could be considerably mitigated by storing data in smaller HBase clusters. Wide area replication projects, both in-house and within the HBase community, seemed to provide a promising path to achieving disaster recovery.
<br />
<br /></span></span></span></div><div><span><span><span class="Apple-style-span" >HBase is massively scalable and delivers fast random writes as well as random and streaming reads. It also provides row-level atomicity guarantees, but no native cross-row transactional support. From a data model perspective, column-orientation gives extreme flexibility in storing data, and wide rows allow the creation of billions of indexed values within a single table. HBase is ideal for workloads that are write-intensive, need to maintain a large amount of data and large indices, and need the flexibility to scale out quickly.
<br />
<br /></span></span></span></div><div><span><span><span class="Apple-style-span" >HBase is now being used by many other workloads internally at Facebook. I will describe these different workloads in a later post.
<br />
<br /></span></span></span></div><div><span><span><span class="Apple-style-span" >(<i>Credit to the authors of the paper: Dhruba Borthakur, Kannan Muthukkaruppan, Karthik Ranganathan, Samuel Rash, Joydeep Sen Sarma, Jonathan Gray, Nicolas Spiegelberg, Hairong Kuang, Dmytro Molkov, Aravind Menon, Rodrigo Schmidt, Amitanand Aiyer)</i></span></span></span></div>Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com20tag:blogger.com,1999:blog-8556003786324804973.post-57843268878526419002011-04-14T00:15:00.001-07:002011-04-14T00:27:01.775-07:00Data warehousing at Facebook<meta charset="utf-8"><span class="Apple-style-span"><div style="color: rgb(102, 102, 102); "><span class="Apple-style-span">Many people have asked me to describe the <i>best practices </i>that we have adopted to run a multi-PB data warehouse using Hadoop. Most of the details were described in a <span class="Apple-style-span"><a href="http://borthakur.com/ftp/sigmodwarehouse2010.pdf">paper</a></span> that we presented at SIGMOD 2010. This document refers to our state of affairs as it was about a year back, but is still an interesting read. Below is the abstract of this paper. You can find the <a href="http://borthakur.com/ftp/sigmodwarehouse2010.pdf">complete paper here.</a></span></div><div style="color: rgb(102, 102, 102); "><span class="Apple-style-span">
<br /></span></div><div style="color: rgb(102, 102, 102); "><span class="Apple-style-span">ABSTRACT</span></div><div style="color: rgb(102, 102, 102); "><span class="Apple-style-span" ><meta charset="utf-8"><span class="Apple-style-span" style="color: rgb(51, 51, 51); font-family: 'Times New Roman'; ">Scalable analysis on large data sets has been core to the functions of a number of teams at Facebook - both engineering and non-engineering. Apart from ad hoc analysis of data and creation of business intelligence dashboards by analysts across the company, a number of Facebook's site features are also based on analyzing large data sets. These features range from simple reporting applications like Insights for the Facebook Advertisers, to more advanced kinds such as friend recommendations. In order to support this diversity of use cases on the ever increasing amount of data, a flexible infrastructure that scales up in a cost effective manner, is critical. We have leveraged, authored and contributed to a number of open source technologies in order to address these requirements at Facebook. These include Scribe, Hadoop and Hive which together form the cornerstones of the log collection, storage and analytics infrastructure at Facebook. In this paper we will present how these systems have come together and enabled us to implement a data warehouse that stores more than 15PB of data (2.5PB after compression) and loads more than 60TB of new data (10TB after compression) every day. We discuss the motivations behind our design choices, the capabilities of this solution, the challenges that we face in day-to-day operations and future capabilities and improvements that we are working on.</span></span></div><div><span class="Apple-style-span" >
<br /></span></div><div style="color: rgb(102, 102, 102); ">Facebook has open-sourced the version of Apache Hadoop that we use to power our production clusters. You can find more details about our usage of Hadoop at the <a href="http://www.facebook.com/notes/facebook-engineering/looking-at-the-code-behind-our-three-uses-of-apache-hadoop/468211193919">Facebook Engineering Blog</a>.</div></span>Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com5tag:blogger.com,1999:blog-8556003786324804973.post-28055493852717208212010-11-21T08:22:00.000-08:002010-11-21T08:49:05.123-08:00Hadoop Research TopicsRecently, I visited a few premier educational institutes in India, e.g. the Indian Institutes of Technology (IIT) at Delhi and Guwahati. Most of the undergraduate students at these two institutes are somewhat familiar with Hadoop and would like to work on Hadoop-related projects as part of their course work. One commonly asked question that I got from these students is <i>what Hadoop feature can I work on?</i><div><br /></div><div>Here are some items that I have in mind that are good topics for students to attempt if they want to work on Hadoop.<div><div><ul><li><span class="Apple-style-span" style="font-size: 15.8333px; ">Ability to make the Hadoop scheduler resource-aware, especially for CPU, memory and IO resources. The current implementation is based on statically configured <i>slots.</i></span></li><li><span class="Apple-style-span" style="font-size: 15.8333px; ">Ability to make a map-reduce job take new input splits even after the job has already started.</span></li><li><span class="Apple-style-span" style="font-size: 15.8333px; ">Ability to dynamically increase replicas of data in HDFS based on access patterns. This is needed to handle hot-spots of data.</span></li><li><span class="Apple-style-span" style="font-size: 15.8333px; ">Ability to extend the map-reduce framework to be able to process data that resides partly in memory. One assumption of the current implementation is that the map-reduce framework is used to scan data that resides on disk devices. But memory on commodity machines is becoming larger and larger. A cluster of 3000 machines with 64 GB each can keep about 200TB of data in memory! It would be nice if the Hadoop framework could support caching the hot set of data in the RAM of the tasktracker machines. Performance should increase dramatically because it is costly to serialize/compress data from the disk into memory for every query.</span></li><li><span class="Apple-style-span" style="font-size: 15.8333px; ">Heuristics to efficiently 'speculate' map-reduce tasks to help work around machines that are laggards. In the cloud, the biggest challenge for fault tolerance is not handling outright failures but rather anomalies that make parts of the cloud slow (but not fail completely); these impact the performance of jobs.</span></li><li><span class="Apple-style-span" style="font-size: 15.8333px; "></span><span class="Apple-style-span" style="font-size: 15.8333px; ">Make map-reduce jobs work across data centers. In many cases, a single Hadoop cluster cannot fit into a single data center and a user has to partition the dataset into two Hadoop clusters in two different data centers.</span></li><li><span class="Apple-style-span" style="font-size: 15.8333px; "></span><span class="Apple-style-span" style="font-size: 15.8333px; ">High Availability of the JobTracker.
In the current implementation, if the JobTracker machine dies, then all currently running jobs fail.</span></li><li><span class="Apple-style-span" style="font-size: 15.8333px; ">Ability to create snapshots in HDFS. The primary use of these snapshots is to retrieve a dataset that was erroneously modified/deleted by a buggy application.</span></li></ul><div>The first thing for a student who wants to do any of these projects is to download the code from <a href="http://hadoop.apache.org/hdfs/version_control.html">HDFS</a> and <a href="http://hadoop.apache.org/mapreduce/version_control.html">MAPREDUCE</a>. Then create an account in the <a href="http://hadoop.apache.org/hdfs/issue_tracking.html">bug tracking</a> software here. Please search for an existing JIRA that describes your project; if none exists then please create a new JIRA. Then please write a design document and post it to the relevant JIRA so that the greater Apache Hadoop community can deliberate on the proposal.</div><div><br /></div><div>If anybody else has any new project ideas, please add them as comments to this blog post.</div></div></div></div>Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com44tag:blogger.com,1999:blog-8556003786324804973.post-87485355859795927372010-06-29T22:50:00.001-07:002010-06-30T00:55:53.000-07:00America’s Most Wanted – a metric to detect faulty machines in Hadoop<div style="text-align: center;"><br /></div><b>Handling failures in Hadoop</b><div><b></b><span class="Apple-style-span" style=" ;font-size:15.8333px;"><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;">I have been asked many, many questions about the failure rates of machines in our Hadoop cluster. These questions vary from the innocuous</span></span><i><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;"> how much time do you spend every day fixing bad machines in your cluster</span></span></i><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;"> to the more involved ones like </span></span><i><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;">does the failure of Hadoop machines depend on the heat map of the data center</span></span></i><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;">? I do not have answers to these questions, and this is an area of research that has been in focus lately.
But I have seen that the efficiency of a Hadoop cluster is directly dependent on the amount of manpower needed to operate such a cluster.</span></span></span><div><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;"><br /></span></span></div><div><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;"><b>Common Types of Failures </b></span></span></div><div><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;">The three most common categories of failures I have observed are</span></span></div><div><div><ul><li><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;">System Errors – Hardware, OS, JVM, Hadoop, compiler, etc. – Hadoop aims to reduce the effect of this broad category of errors</span></span></li><li><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;">User Application Errors – Bad code written by a user – Bloated memory usage</span></span></li><li><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;">Anomalous Behaviour – </span></span><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;">Not working according to expectation – Slow nodes – Causes most harm to the Hadoop cluster because they go undetected for long periods of time</span></span></li></ul></div></div><div><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;"><b>America's Most Wanted (AMW)</b></span></span></div><div><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;">I have the privilege of working with one of the most experienced Hadoop cluster administrators, Andrew Ryan. He observed that a few machines in the Hadoop cluster are always repeat offenders: they land in trouble, get taken out and fixed, and then when put back online they create trouble again. He came up with this metric to determine when to throw a machine out of the Hadoop cluster.
The following chart shows that 3% of our machines cause 43% of all manual repair events.</span></span></div><div><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;"><br /></span></span></div><div><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style=" color: rgb(0, 0, 238); -webkit-text-decorations-in-effect: underline; font-family:Georgia, serif;font-size:15.8333px;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUakuzetb45-8Ntkk-d5piSzAh7elspEG5uQx88aoU13myt44i5CxrlDyS1MO_DTQ_5jWmR3Bui6BksOvHUFzmmJ9NS49kvUVxoXOV0vKSAPNZmx55eHy0Uf8cy50GRg6OWDHleIShQGU/s400/chart1.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5488451103155873362" style="display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; text-align: center; cursor: pointer; width: 480px; height: 333px; " /></span></span></span></div><div><span class="Apple-style-span" style="color:#0000EE;"><br /></span></div><div><span class="Apple-style-span" style=" ;font-family:arial;font-size:15.8333px;">This clearly shows the need for a three-strikes-you-are-out law: if a machine goes into repair three times it is better to take it permanently out of the Hadoop cluster. </span></div><div><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;"><br /></span></span></div><div><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;"><b>An exotic question</b></span></span></div><div><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;">Other exotic questions that I am frequently asked are lik</span></span><i><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;">e do map-reduce jobs written in python have a higher probability of failure</span></span></i><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;">? Here is a chart that tries to answer this question</span></span>:</div><div><span class="Apple-style-span" style=" color: rgb(0, 0, 238); -webkit-text-decorations-in-effect: underline; font-size:15.8333px;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjdqSVQB0lQDbj2OK9L_ZPHpmHpH9vInB4ySNrYcVCqcSsl00d93fnf8R2pShZDYsaYkEzXHQu-XG9KrEZtIW7vkjsW71XfcqHBpJABHTrsWolQYmJd6ViaWKpNZQlW6iO6VJQuaXjdyE/s320/python.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5488445976339780546" style="display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; text-align: center; cursor: pointer; width: 384px; height: 375px; " /></span></div><div><div><ul><li><span class="Apple-style-span"><span class="Apple-style-span" style="font-size:medium;">5% of all jobs in cluster are written in Python</span></span></li><li><span class="Apple-style-span"><span class="Apple-style-span" style="font-size:medium;">15% of cluster CPU is consumed by Python jobs</span></span></li><li><span class="Apple-style-span"><span class="Apple-style-span" style="font-size:medium;"> 20% of all failed jobs are written in python</span></span></li></ul></div></div><div><span class="Apple-style-span" style="font-size:medium;">This does show that jobs written in python consume more CPU on the average than jobs written in Java. 
It also shows that a greater percentage of these jobs are likely to fail. Why is this? I do not have a definite answer, but my guess is that a developer is more likely to write the first few versions of his experimental query in Python because it is an easy-to-prototype language.</span></div><div><br /></div><div>I presented a <a href="http://borthakur.com/ftp/hadoop_failures.pdf">more detailed version</a> of this at an IFIP Working Group on Dependable Computing workshop (<a href="http://www.dsn.org/">DSN2010</a>). The aim of this workshop is to understand more about failure patterns on Hadoop nodes, automatic ways to analyze and handle these failures, and how the research community can help Hadoop become more fault-tolerant.</div><div><br /></div><div><br /></div><div><span class="Apple-style-span" style=" border-collapse: collapse; font-family:arial, sans-serif;font-size:12.5px;"><br /></span></div><div><span class="Apple-style-span" style="font-family:arial, sans-serif;font-size:130%;"><span class="Apple-style-span" style="border-collapse: collapse; font-size:15px;"><br /></span></span></div></div>Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com2tag:blogger.com,1999:blog-8556003786324804973.post-34496872414081953662010-05-09T01:14:00.000-07:002010-06-12T15:34:06.552-07:00Facebook has the world's largest Hadoop cluster!<div style="text-align: left;"><span class="Apple-style-span" style=" ;font-family:arial;font-size:15.8333px;"><b><span class="Apple-style-span" style="font-weight: normal; "><b><span class="Apple-style-span" style="font-size:large;">It is not a secret anymore!</span></b></span></b></span></div><div style="text-align: left;"><span class="Apple-style-span" style="font-family:arial;"><b><span class="Apple-style-span" style="font-weight: normal; "><b><span class="Apple-style-span" style="font-size:medium;"><br /></span></b></span></b></span></div><div><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;">The Datawarehouse Hadoop cluster at Facebook has become the largest known Hadoop storage cluster in the world.
Here are some of the details about this single HDFS cluster:</span></span></div><div><ul><li>21 PB of storage in a single <a href="http://hadoop.apache.org/common/docs/current/hdfs_design.html">HDFS</a> cluster</li><li>2000 machines</li><li>12 TB per machine (a few machines have 24 TB each)</li><li>1200 machines with 8 cores each + 800 machines with 16 cores each</li><li>32 GB of RAM per machine</li><li>15 map-reduce tasks per machine</li></ul></div><div>That's a total of more than 21 PB of configured storage capacity! This is larger than the previously known largest cluster, Yahoo!'s <a href="http://developer.yahoo.net/blogs/hadoop/2010/05/scalability_of_the_hadoop_dist.html">cluster</a> of 14 PB. Here are the cluster statistics from the HDFS cluster at Facebook:</div><div><br /></div><div><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKLUA-7UPMrTjbZhG_fHIu0bMO3gGcSIfyQOyJVh42C_wIdTd_PiRsWJglwlaA4BFpPaNLNcvSuftieG9zOO8GNZvULhF3cpiQ6eoI-Yj9m-inTfaxj-gJWMqn29XoBtjx4R8uSoPSaFs/s400/Screen+shot+2010-05-11+at+11.54.43+PM.png" style="width: 480px; height: 192px;" border="0" alt="Cluster statistics from the Facebook HDFS cluster" id="BLOGGER_PHOTO_ID_5470273732177902914" /></div><div><br /></div><div>Hadoop started at Yahoo! and full marks to Yahoo! for developing such critical infrastructure technology in the open. I started working with Hadoop when I joined Yahoo! in 2006. Hadoop was in its infancy at that time and I was fortunate to be part of the core set of Hadoop engineers at Yahoo!. Many thanks to <a href="http://www.facebook.com/doug.cutting">Doug Cutting</a> for creating Hadoop and <a href="http://www.facebook.com/jeric14">Eric14</a> for convincing the executive management at Yahoo! to develop Hadoop as open source software.</div><div><br /></div><div>Facebook engineers work closely with the Hadoop engineering team at Yahoo! to push Hadoop to greater scalability and performance. Facebook has many Hadoop clusters; the largest among them is the one that is used for Datawarehousing. Here are some statistics that describe a few characteristics of Facebook's Datawarehousing Hadoop cluster:</div><div><ul><li>12 TB of compressed data added per day</li><li>800 TB of compressed data scanned per day</li><li>25,000 map-reduce jobs per day</li><li>65 million files in HDFS</li><li>30,000 simultaneous clients to the HDFS NameNode</li></ul></div><div>A majority of this data arrives via <a href="http://www.facebook.com/pages/Scribe/84758500701">scribe</a>, as described in <a href="http://hadoopblog.blogspot.com/2009/06/hdfs-scribe-integration.html">scribe-hdfs</a> integration. This data is loaded into <a href="http://hadoop.apache.org/hive/">Hive</a>. Hive provides a very elegant way to query the data stored in Hadoop. Almost 99.9% of Hadoop jobs at Facebook are generated by a Hive front-end system. We provide lots more details about our scale of operations in our paper at <a href="http://www.sigmod2010.org/index.shtml">SIGMOD</a> titled <b><i><a href="http://portal.acm.org/citation.cfm?id=1807167.1807278&coll=GUIDE&dl=GUIDE&type=series&idx=SERIES473&part=series&WantType=Proceedings&title=SIGMOD&CFID=91115719&CFTOKEN=27338520">Datawarehousing and Analytics Infrastructure at Facebook</a></i></b> (<a href="http://www.borthakur.com/sigmod2010.pdf">pdf</a>).</div><div><br /></div><div>Here are two pictorial representations of the rate of growth of the Hadoop cluster:</div><div><br /></div><div><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4IXC2u_7dHPm14xWEGE3qaDWY7R_2DyxFoZ__taEzE61cET6_bD5zmyes2sqnQkDpKs3AisQ-3fugWA2hm0MK-PG-g5N8ol2faOTX6fu-C1-QFPSi05a_Y2c9R7g6-mqESmdjbnuUrTI/s400/size.jpg" alt="Growth in total cluster size" /> <img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhKuhJK9aR7u4-a4i9f9KKGJchb9YpcXu_qG7rhesiUkK1g3jMGpGyBTDc-oAIaSdMYV6IjPskPC6gJpnyIvn-acgoSSsTxSjw1H1MW8TfjWZzBGpggqt8oJ5gctJ-06xjmdUAAGS6uoc/s400/numMachines.jpg" alt="Growth in number of machines" /></div><div><br /></div><div><span style="font-size:large;"><b>Details about our Hadoop configuration</b></span></div><div><br /></div><div>I have fielded many questions from developers and system administrators about the Hadoop configuration that is deployed in the Facebook Hadoop Datawarehouse. Some of these questions are from Linux kernel developers who would like to make Linux <a href="http://en.wikipedia.org/wiki/Paging">swapping</a> work better with Hadoop workloads; other questions are from <a href="http://en.wikipedia.org/wiki/Java_Virtual_Machine">JVM</a> developers who may attempt to make Hadoop run faster for processes with large heap sizes; yet others are from <a href="http://en.wikipedia.org/wiki/Graphics_processing_unit">GPU</a> architects who would like to port a Hadoop workload to run on GPUs. To enable this type of outside research, here are the details about <a href="http://borthakur.com/ftp/conf.tar.gz">Facebook's Hadoop warehouse configurations</a>. I hope this open sharing of infrastructure details from Facebook jumpstarts the research community to design ways and means to optimize systems for Hadoop usage.</div><div><br /></div></div>Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com53tag:blogger.com,1999:blog-8556003786324804973.post-76208989573073705472010-04-25T21:42:00.000-07:002010-04-26T00:51:07.290-07:00The Curse of the Singletons! The Vertical Scalability of Hadoop NameNode<div><b>Introduction</b></div><div><br /></div> HDFS is designed to be a highly scalable storage system and sites at <a href="http://www.facebook.com/">Facebook</a> and <a href="http://www.yahoo.com/">Yahoo</a> have 20PB size file systems in <a href="http://borthakur.com/ftp/hadoopmicrosoft.pdf">production deployments</a>. The <a href="http://hadoop.apache.org/common/docs/current/hdfs_design.html">HDFS</a> <a href="http://hadoop.apache.org/common/docs/current/hdfs_design.html#NameNode+and+DataNodes">NameNode</a> is the master of the Hadoop Distributed File System (HDFS). It maintains the critical data structures of the entire file system. Most of HDFS design has focused on scalability of the system, i.e. the ability to support a large number of slave nodes in the cluster and an even larger number of files and blocks. However, a 20PB size cluster with 30K simultaneous clients requesting service from a single NameNode means that the NameNode has to run on a high-end non-commodity machine. There have been some efforts to scale the NameNode horizontally, i.e. allow the NameNode to run on multiple machines. 
I will defer analyzing those horizontal-scalability-efforts for a future blog post; instead, let's discuss ways and means to make our <a href="http://en.wikipedia.org/wiki/Singleton_pattern"><b>singleton</b></a> NameNode support an even greater load.<div><br /><br /><b>What are the bottlenecks of the NameNode?</b></div><div><br /></div><div><b>Network:</b> We have around 2000 nodes in our cluster and each node is running 9 mappers and 6 reducers simultaneously. This means that there are around 30K simultaneous clients requesting service from the NameNode. The Hive <a href="http://wiki.apache.org/hadoop/Hive/GettingStarted#Metadata_Store">Metastore</a> and the HDFS <a href="http://hadoopblog.blogspot.com/2009/08/hdfs-and-erasure-codes-hdfs-raid.html">RaidNode</a> impose additional load on the NameNode. The Hadoop <a href="http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/ipc/RPC.Server.html">RPCServer</a> has a <b>singleton</b> <i>Listener Thread</i> that pulls data from all incoming RPCs and hands it to a bunch of NameNode handler threads. Only after all the incoming parameters of the RPC are copied and deserialized by the <i>Listener Thread</i> do the NameNode handler threads get to process the <a href="http://en.wikipedia.org/wiki/Remote_procedure_call">RPC</a>. One CPU core on our NameNode machine is completely consumed by the <i>Listener Thread</i>. This means that during times of high load, the <i>Listener Thread</i> is unable to copy and deserialize all incoming RPC data in time, thus leading to clients encountering RPC socket errors. This is one big bottleneck to vertically scaling the NameNode.</div><div><br /></div><div><b>CPU:</b> The second bottleneck to scalability is the fact that most critical sections of the NameNode are protected by a <b>singleton</b> lock called the FSNamesystem lock. I had done some major restructuring of this code about three years ago via <a href="https://issues.apache.org/jira/browse/HADOOP-1269">HADOOP-1269</a>, but even that is not enough for supporting current workloads. Our NameNode machine has 8 cores but a fully loaded system can use at most only 2 cores simultaneously on the average; the reason being that most NameNode handler threads encounter serialization via the FSNamesystem lock.</div><div><br /></div><div><b>Memory:</b> The NameNode stores all its metadata in the main memory of the <b>singleton</b> machine on which it is deployed. In our cluster, we have about 60 million files and 80 million blocks; this requires the NameNode to have a heap size of about 58GB. This is huge! There isn't any more memory left to grow the NameNode's heap size! What can we do to support an even greater number of files and blocks in our system?</div><div><br /></div><div><b>Can we break the impasse?</b></div><div><br /></div><div><b>RPC Server:</b> We enhanced the Hadoop RPC Server to have a pool of <i>Reader Threads</i> that work in conjunction with the <i>Listener Thread</i>. The <i>Listener Thread</i> accepts a new connection from a client and then hands over the work of RPC-parameter-deserialization to one of the <i>Reader Threads</i>, roughly as sketched below.</div>
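<div><br /></div><div>To make the handoff concrete, here is a tiny, hypothetical sketch of the listener/reader split. This is <i>not</i> the actual org.apache.hadoop.ipc.Server code (which is built on non-blocking NIO selectors); it only illustrates the pattern of a single accepting thread delegating the expensive copy-and-deserialize work to a small pool of reader threads. The port number and helper method below are made up for the example.</div><div><pre>
// Hypothetical sketch of the Listener/Reader split; not the real Hadoop ipc.Server.
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ListenerWithReaderPool {
  public static void main(String[] args) throws Exception {
    final ExecutorService readers = Executors.newFixedThreadPool(8); // the "Reader Threads"
    ServerSocket listener = new ServerSocket(9000);                  // port is arbitrary here
    while (true) {
      final Socket conn = listener.accept();   // cheap: the single Listener only accepts
      readers.submit(new Runnable() {          // expensive parameter copy/deserialization
        public void run() {                    // now happens on a Reader thread
          try {
            deserializeAndQueueRpc(conn);      // hand the parsed call to handler threads
          } catch (Exception e) {
            // log and drop the connection
          }
        }
      });
    }
  }

  // placeholder for reading the RPC parameters off the socket and queueing the call
  static void deserializeAndQueueRpc(Socket conn) throws Exception { }
}
</pre></div><div><br /></div>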
<div>In our case, we configured our system so that the <i>Reader Threads</i> pool consists of 8 threads. This change has <b>doubled</b> the number of RPCs that the NameNode can process at full throttle. This change has been contributed to the Apache code via <a href="https://issues.apache.org/jira/browse/HADOOP-6713">HADOOP-6713</a>.</div><div><br /></div><div>The above change allowed a simulated workload to consume 4 CPU cores out of a total of 8 CPU cores in the NameNode machine. Sadly enough, we still cannot get it to use all 8 CPU cores!</div><div><br /></div><div><b>FSNamesystem lock:</b> A review of our workload showed that our NameNode typically has the following distribution of requests:</div><div><ul><li>stat a file or directory 47%</li><li>open a file for read 42%</li><li>create a new file 3%</li><li>create a new directory 3%</li><li>rename a file 2%</li><li>delete a file 1%</li></ul>The first two operations constitute about 90% of the workload for the NameNode and are readonly operations: they do not change file system metadata and do not trigger any synchronous transactions (the <a href="http://issues.apache.org/jira/browse/HADOOP-1869">access time</a> of a file is updated asynchronously). This means that if we change the FSNamesystem lock to a <a href="http://en.wikipedia.org/wiki/Readers-writer_lock">Readers-Writer</a> lock we can achieve the full power of all processing cores in our NameNode machine. We did just that, and we saw yet another <b>doubling</b> of the processing rate of the NameNode! The load simulator can now make the NameNode process use all 8 CPU cores of the machine simultaneously. This code has been contributed to Apache Hadoop via <a href="http://issues.apache.org/jira/browse/HDFS-1093">HDFS-1093</a>.</div><div><br /></div><div>The memory bottleneck issue is still unresolved. People have asked me if the NameNode can keep some portion of its metadata on disk, but this will require a change in the locking model design first. One cannot hold the FSNamesystem lock while reading in data from disk: this will cause all other threads to block, thus throttling the performance of the NameNode. Could one use flash memory effectively here? Maybe an LRU cache of file system metadata will work well with current metadata access patterns? If anybody has good ideas here, please share them with the Apache Hadoop community.</div><div><br /></div><div><b>In a Nutshell</b></div><div><br /></div><div>The two proposed enhancements have improved NameNode scalability by a factor of 8. Sweet, isn't it?</div><div><br /></div>Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com13tag:blogger.com,1999:blog-8556003786324804973.post-7434428049249711312010-02-06T17:00:00.000-08:002010-04-25T02:40:48.903-07:00Hadoop AvatarNode High Availability<div><b>Our Use-Case</b></div><div><br /></div><div>The Hadoop Distributed File System's (HDFS) NameNode is a single point of failure. 
This has been a major stumbling block in using HDFS for a 24x7 type of deployment. It has been a <a href="http://hadoopblog.blogspot.com/2009/11/hdfs-high-availability.html">topic of discussion</a> among a wide circle of engineers.</div><div><br /></div><div>I am part of a team that is operating a cluster of 1200 nodes with a total size of 12 PB. This cluster is currently running hadoop 0.20. The NameNode is configured to write its transaction logs to a shared NFS filer. The biggest cause of service downtime is when we need to deploy hadoop patches to our cluster. A fix to the DataNode software is easy to deploy without cluster downtime: we deploy new code and restart a few DataNodes at a time. We can afford to do this because the DFSClient is equipped (via multiple retries to different replicas of the same block) to handle transient DataNode outages. The major problem arises when new software has to be deployed to our NameNode. A HDFS NameNode restart for a cluster of our size typically takes about an hour. During this time, the applications that use the Hadoop service cannot run. This led us to believe that some kind of High Availability (HA) for the NameNode is needed. We wanted a design that can support a failover in about a minute. </div><div><br /></div><div><b>Why AvatarNode?</b></div><div><br /></div><div>We took a serious look at the options available to us. The HDFS <a href="https://issues.apache.org/jira/browse/HADOOP-4539">BackupNode</a> is not available in 0.20, and upgrading our cluster to hadoop 0.21 is not a feasible option for us because it has to go through extensive time-consuming testing before we can deploy it in production. We started off white-boarding the simplest solution: the existence of a Primary NameNode and a Standby NameNode in our HDFS cluster. And we did not have to worry about split-brain scenarios and <a href="http://en.wikipedia.org/wiki/Fencing_(computing)">IO-fencing</a> methods... the administrator ensures that the Primary NameNode is really, really dead before converting the Standby NameNode into the Primary NameNode. We also did not want to change the code of the NameNode one single bit and wanted to build HA as a software layer on top of HDFS. We wanted a daemon that can switch from being a Primary to a Standby and vice-versa... as if switching <a href="http://en.wikipedia.org/wiki/Avatar">avatars</a>... and this is how our AvatarNode was born!</div><div><br /></div><div>The word avatar means a <a href="http://www.merriam-webster.com/dictionary/avatar">variant-phase</a> or incarnation of a single entity, and we thought it appropriate that a wrapper that can make the NameNode take on two different roles be aptly named the AvatarNode. Our design of the AvatarNode is such that it is a pure wrapper around the existing NameNode in Hadoop 0.20; thus the AvatarNode inherits the scalability, performance and robustness of the NameNode.</div><div><br /></div><div><b>What is an AvatarNode?</b></div><div><br /></div><div>The AvatarNode encapsulates an instance of the NameNode. You can start an AvatarNode in the Primary avatar or the Standby avatar. If you start it as a Primary avatar, then it behaves exactly like the NameNode you know now. In fact, it runs exactly the same code as the NameNode. The AvatarNode in its Primary avatar writes the HDFS transaction logs into the shared NFS filer. You can then start another instance of the AvatarNode on a different machine, but this time telling it to start as a Standby avatar. 
</div><div><br /></div><div>The Standby AvatarNode encapsulates a NameNode and a <a href="http://hadoop.apache.org/common/docs/r0.17.2/hdfs_user_guide.html#Secondary+Namenode">SecondaryNameNode</a> within it. The Standby AvatarNode continuously keeps reading the HDFS transaction logs from the same NFS filer and keeps feeding those transactions to the encapsulated NameNode instance. The NameNode that runs within the Standby AvatarNode is kept in <a href="http://hadoop.apache.org/common/docs/current/hdfs_user_guide.html#Safemode">SafeMode</a>. Keeping the Standby AvatarNode in SafeMode prevents it from performing any active duties of the NameNode while at the same time keeping a copy of all NameNode metadata <i>hot</i> in its in-memory data structures. </div><div><br /></div><div>HDFS clients are configured to access the NameNode via a <a href="http://en.wikipedia.org/wiki/Virtual_IP_address">Virtual IP Address</a> (VIP). At the time of failover, the administrator kills the Primary AvatarNode on machine M1 and instructs the Standby AvatarNode on machine M2 (via a command line utility) to assume the Primary avatar. It is guaranteed that the Standby AvatarNode ingests all committed transactions because it reopens the edits log and consumes all transactions till the end of the file; this guarantee depends on the fact that NFS-v3 supports <a href="http://nfs.sourceforge.net/">close-to-open</a> cache coherency semantics. The Standby AvatarNode finishes ingestion of all transactions from the shared NFS filer and then leaves SafeMode. Then the administrator switches the VIP from M1 to M2, thus causing all HDFS clients to start accessing the NameNode instance on M2. This is practically instantaneous and the failover typically takes a few seconds!</div><div><br /></div><div>There is no need to run a separate instance of the SecondaryNameNode. The AvatarNode, when run in the Standby avatar, performs the duties of the SecondaryNameNode too. The Standby AvatarNode continually ingests transactions from the Primary, but once in a while it pauses the ingestion, invokes the SecondaryNameNode to create and upload a checkpoint to the Primary AvatarNode, and then resumes the ingestion of transaction logs.</div><div><br /></div><div><b>The AvatarDataNode</b></div><div><br /></div><div>The AvatarDataNode is a wrapper around the vanilla DataNode found in hadoop 0.20. It is configured to send block reports and blockReceived messages to two AvatarNodes. The AvatarDataNode does not use the VIP to talk to the AvatarNode(s) (only HDFS clients use the VIP). An <a href="http://issues.apache.org/jira/browse/HDFS-839">alternative to this approach</a> would have been to make the Primary AvatarNode forward block reports to the Standby AvatarNode. We discounted this alternative approach because it adds code complexity to the Primary AvatarNode: the Primary AvatarNode would have to do buffering and flow control as part of forwarding block reports. </div><div><br /></div><div>The Hadoop DataNode is already equipped to send (and retry if needed) blockReceived messages to the NameNode. We extend this functionality by making the AvatarDataNode send blockReceived messages to both the Primary and Standby AvatarNodes.</div><div><br /></div><div><b>Does the failover affect HDFS clients?</b></div><div><br /></div><div><div>An HDFS client that was reading a file caches the locations of the replicas of almost all blocks of a file at the time of opening the file. 
This means that HDFS readers are not impacted by AvatarNode failover.</div><div><br /></div><div>What happens to files that were being written by a client when the failover occurred? The Hadoop 0.20 NameNode does not record new block allocations to a file until the time when the file is closed. This means that a client that was writing to a file when the failover occured will encounter an IO exception after a successful failover event. This is somewhat tolerable for map-reduce job because the Hadoop MapReduce framework will retry any failed tasks. This solution also works well for <a href="http://hadoop.apache.org/hbase/">HBase</a> because the HBase issues a <a href="http://www.netlikon.de/docs/javadoc-hadoop/branch-0.20/org/apache/hadoop/fs/FSDataOutputStream.html#sync()">sync/hflush</a> call to persist HDFS file contents before marking a HBase-transaction to be completed. In future, it is possible that HDFS may be enhanced to record every new block allocation in the HDFS transaction log, details <a href="http://issues.apache.org/jira/browse/HDFS-978">here</a>. In that case, a failover event will not impact HDFS writers as well.</div><div><br /></div><div><b><span class="Apple-style-span" style="font-family:arial;">Other NameNode Failover techniques</span></b></div><div><br /></div></div><div>People have asked me to compare the AvatarNode HA implementation with the<a href="http://de-de.facebook.com/note.php?note_id=106157472002"> Hadoop with DRBD and Linux HA</a>. Both approaches require that the Primary writes transaction logs to a shared device. The difference is that the Standby AvatarNode is a hot standby whereas the DRBD-LinuxHA solution is a cold standby. The failover time for our AvatarNode is about a minute, whereas the failover time of the DRBD-LinuxHA would require about an hour for our cluster that has about 50 million files. The reason that the AvatarNode can function as a hot standby is because the Standby AvatarNode has an instance of the NameNode encapsulated within it and this encapsulated NameNode receives messages from the DataNodes thus keeping its metadata state fully upto-date.</div><div><br /></div><div>Another implementation that I have heard about is from a team at China Mobile Research Institue, their approach is described <a href="http://gnawux.info/hadoop/2010/01/pratice-of-namenode-cluster-for-hdfs-ha/">here</a>. </div><div><br /></div><div><span class="Apple-style-span" style="font-family:arial;"><b>Where is the code?</b></span></div><div><br /></div><div>This code has been contributed to the Apache HDFS project via <a href="http://issues.apache.org/jira/browse/HDFS-976">HDFS-976</a>. A prerequisite for this patch is <a href="http://issues.apache.org/jira/browse/HDFS-966">HDFS-966</a>. </div>Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com61tag:blogger.com,1999:blog-8556003786324804973.post-83435983763219362009-11-09T00:32:00.001-08:002009-11-09T23:29:17.396-08:00HDFS High AvailabilityI have encountered plenty of questions about the single point of failure for the HDFS NameNode. The most common concern being that if the NameNode dies, then the whole cluster is unavailable. This means that HDFS is unsuitable for applications that need a high degree of uptime. 
This is not a problem when you run <a href="http://en.wikipedia.org/wiki/MapReduce">map-reduce</a> jobs on HDFS, especially because a map-reduce system is a batch system and the uptime requirements of a batch system are typically not that stringent.<br /><br />In recent times, I am starting to see lots of distributed applications that are using HDFS as a general purpose storage system. These applications range from multimedia servers that store mails, photos, videos, etc. to systems that store updates from thousands of distributed sensors. These applications prefer HDFS because they can store a large volume of data. These applications do not use map-reduce to mine any of the information stored in HDFS. However, these applications do need consistency and availability of data. <a href="http://www.julianbrowne.com/article/viewer/brewers-cap-theorem">Brewer's CAP Theorem</a> states that most distributed systems need some tradeoffs among Consistency, Availability and Partition tolerance. HDFS does an excellent job of providing Consistency of data at all times. Traditionally, it did not address Availability and Partition tolerance. That could change to a certain extent with the HDFS 0.21 release. HDFS 0.21 has a new entity called the <a href="http://issues.apache.org/jira/browse/HADOOP-4539">BackupNode</a> that receives real-time updates about transactions from the NameNode. This is the first step in making HDFS highly available.<br /><br />Here is a <a href="http://www.borthakur.com/ftp/hdfs_high_availability.pdf">slide-deck</a> that I wanted to present at <a href="http://us.apachecon.com/c/acus2009/">ApacheCon 2009</a> about the current state of affairs regarding High Availability with HDFS.Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com0tag:blogger.com,1999:blog-8556003786324804973.post-22890581078241443292009-10-19T02:20:00.000-07:002009-10-19T02:47:36.732-07:00Hadoop discussions at Microsoft ResearchI was invited to present a talk about Hadoop File System Architecture at Microsoft Research in Seattle. This is a research group that is focused on long-term research, so it is no surprise that they are interested in knowing how a growing company like Facebook is using Hadoop to its advantage.<br /><br />I met a few folks who chatted with me about how Microsoft SQL Server is being modified to handle large-scale databases. These folks heartily agreed with a comment I made in my presentation that Dr. Dewitt and Dr. Stonebraker are missing the point when they compare performance numbers between Hadoop and traditional Database systems... rather than comparing the scalability and fault-tolerance of these systems. I had learned some of the fundamentals of Database systems from Professor Dewitt during my graduate studies at the University of Wisconsin Madison, but Dr Dewitt is a Microsoft employee now!<br /><br />The fact that Facebook uses the SQL interface of Hive layered over Hadoop makes it even more interesting to Microsoft. 
They wanted to know the performance difference between Hive and PIG and would like to compare them to their distributed-SQL-Server software.<br /><br />Here are the <a href="http://borthakur.com/ftp/hadoopmicrosoft.pdf">slides</a> I used for my presentation.Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com2tag:blogger.com,1999:blog-8556003786324804973.post-28280035920900799782009-10-02T11:57:00.000-07:002009-10-07T10:46:10.662-07:00I presented a set of slides that describes the <a href="http://borthakur.com/ftp/hadoopworld.pdf">Hadoop development at Facebook</a> at the <a href="http://www.cloudera.com/hadoop-world-nyc">HadoopWorld</a> conference in New York today. It was well received by more than 100 people. I have presented at many-a-conferences in the west coast but this is the first time I have presented at a conference in New York... there are more hadoop users here versus mostly hadoop developers in the west coast. There were plenty of questions, especially about Hadoop-Archive and Realtime-Hadoop. There were people asking me questions about <a href="http://issues.apache.org/jira/browse/HDFS-245">HDFS Symbolic links</a> and <a href="http://issues.apache.org/jira/browse/HIVE-809">HDFS-scribe </a>copier.<br /><br />Earlier, I visited the university of <a href="http://www.cse.nd.edu/">Notre Dame</a> to conduct a department seminar and present a guest lecture for the graduate students at the Department of Computer Science. There is plenty of interesting research being led by <a href="http://www.cse.nd.edu/%7Edthain/">Prof Douglas Thain.</a> One interesting research idea that came up was to place HDFS block replicas by analyzing HDFS access patterns. It is possible that we can provide HDFS datanode/namenode logs to researchers who can analyze these logs to come up with better algorithms for HDFS block replica placement.Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com4tag:blogger.com,1999:blog-8556003786324804973.post-72085420495390811642009-09-14T22:41:00.000-07:002009-09-14T23:54:09.917-07:00HDFS block replica placement in your hands now!Most Hadoop administrators set the default replication factor for their files to be three. The main assumption here is that if you keep three copies of the data, your data is safe. I have observed this to be true in the big clusters that we manage and operate. In actuality, administrators are managing two failure aspects: <span style="font-style: italic;">data corruption </span>and <span style="font-style: italic;">data availability</span>.<br /><br />If all the datanodes on which the replicas of a block exist catch fire at the same time, then that data is lost and cannot be recovered. Or if an administrative error causes all the existing replicas of a block to be deleted, then it is a catastrophic failure. This is <span style="font-style: italic;">data corruption. </span>On the other hand, if a rack switch goes down for sometime, the datanodes on that rack are in-accessible during that time. When that faulty rack switch is fixed, the data on the rack rejoins the HDFS cluster and life goes on as usual. This is a <span style="font-style: italic;">data avilability </span>issue; in this case data was not corrupted or lost, it was just unavailable for some time. HDFS keeps three copies of a block on three different datanodes to protect against true <span style="font-style: italic;">data corruption</span>. 
HDFS also tries to distribute these three replicas on more than one rack to protect against <span style="font-style: italic;">data availability</span> issues. The fact that HDFS actively monitors any failed datanode(s) and upon failure detection immediately schedules re-replication of blocks (if needed) implies that three copies of data on three different nodes is sufficient to avoid corrupted files.<br /><br />HDFS uses a simple but highly effective policy to allocate replicas for a block. If a process that is running on any of the HDFS cluster nodes open a file for writing a block, then one replica of that block is allocated on the same machine on which the client is running. The second replica is allocated on a randomly chosen rack that is different from the rack on which the first replica was allocated. The third replica is allocated on a randomly chosen machine on the same remote rack that was chosen in the earlier step. This means that a block is present on two unique racks. One point to note is that there is no relationship between replicas of different blocks of the same file as far as their location is concerned. Each block is allocated independently.<br /><br />The above algorithm is great for availability and scalability. However, there are scenarios where co-locating many block of the same file on the same set of datanode(s) or rack(s) is beneficial for performance reasons. For example, if many blocks of the same file are present on the same datanode(s), a single mapper instance could process all these blocks using the <a href="http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html">CombineFileInputFormat</a>. Similarly, if a dataset contains many small files that are co-located on the same datanode(s) or rack(s), one can use CombineFileInputFormat to process all these file together by using fewer mapper instances via <a href="http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html">CombineFileInputFormat</a>. If an application always uses one dataset with another dataset (think <a href="http://hadoop.apache.org/hive/">Hive</a> or <a href="http://hadoop.apache.org/pig/">Pig</a> join), then co-locating these two datasets on the same set of datanodes is beneficial.<br /><br />Another reason when one might want to allocate replicas using a different policy is to ensure that replicas and their <a href="http://hadoopblog.blogspot.com/2009/08/hdfs-and-erasure-codes-hdfs-raid.html">parity blocks </a>truly reside in different failure domains. The <a href="http://issues.apache.org/jira/browse/HDFS-503">erasure code work in HDFS</a> could effectively bring down the physical replication factor of a file to about 1.5 (while keeping the logical replication factor at 3) if it can place replicas of all blocks in a stripe more intelligently.<br /><br />Yet another reason, however exotic, is to allow HDFS to place replicas based on the <a href="http://en.wikipedia.org/wiki/Heat_map">HeatMap</a> of your cluster. If one of of the node in the cluster is at a higher temperature than that of another, then it might be better to prefer the cooler node while allocating a new replica. If you want to experiment with HDFS across two <a href="http://en.wikipedia.org/wiki/Data_center">data centers,</a> you might want to try out new policies for replica placement.<br /><br />Well, now you can finally get your hands wet! 
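<br /><br />For a flavor of what a custom policy could look like, here is a deliberately simplified, hypothetical sketch. It is <i>not</i> the real pluggable API (the actual hook introduced in HDFS, described next, has a richer, version-dependent interface); the class and method names below are made up purely for illustration.<br /><pre>
// Hypothetical sketch only; NOT the actual HDFS pluggable-placement API.
import java.util.Random;

class ExampleReplicaPlacement {
  private final Random rand = new Random();

  // given the writer's host and the datanodes in the cluster, pick targets for one block
  String[] chooseTargets(String writerHost, String[] datanodes, int numReplicas) {
    String[] targets = new String[numReplicas];
    int chosen = 0;
    // first replica on the node where the writer runs, if that node is a datanode
    for (String d : datanodes) {
      if (d.equals(writerHost)) { targets[chosen++] = d; break; }
    }
    // remaining replicas on randomly chosen distinct nodes; a real policy would also
    // consult rack topology, free space, co-location hints, or even a cluster heat map
    while (chosen < numReplicas) {
      String candidate = datanodes[rand.nextInt(datanodes.length)];
      boolean alreadyChosen = false;
      for (int i = 0; i < chosen; i++) {
        if (targets[i].equals(candidate)) { alreadyChosen = true; }
      }
      if (!alreadyChosen) { targets[chosen++] = candidate; }
    }
    return targets;
  }
}
</pre><br />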
<a href="http://issues.apache.org/jira/browse/HDFS-385">HDFS-385</a> is part of the Hadoop trunk and will be part of the next major HDFS 0.21 release. This feature provides a way for the adventurous developer to write Java code that specifies how HDFS should allocate replicas of blocks of a file. The API is experimental in nature, and could change in the near future if we discover any in-efficiencies in it. Please let the Hadoop community know if you need any changes in this API or if you come across novel uses of this API.Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com26tag:blogger.com,1999:blog-8556003786324804973.post-11760040521226672802009-08-28T22:58:00.000-07:002009-08-30T21:54:19.999-07:00HDFS and Erasure Codes (HDFS-RAID)The Hadoop Distributed File System has been great in providing a cloud-type file system. It is robust (when administered correctly :-)) and highly scalable. However, one of the main drawbacks of HDFS is that each piece of data is replicated in three places. This is acceptable because disk storage is cheap and is becoming cheaper by the day; this isn't a problem if you have a relatively small to medium size cluster. The price difference (in absolute terms) is not much whether you use 15 disks or whether you use 10 disks. If we consider the cost of $1 per GByte, the price difference between fifteen 1 TB disk and ten 1 TB disk is only $5K. But when the total size of your cluster is 10 PBytes, then the costs savings in storing the data in two places versus three is a huge ten million dollars!<br /><br />The reason HDFS stores disk blocks in triplicate is because it uses commodity hardware and there is non-negligible probability of a disk failure. It has been observed that a replication factor of 3 and the fact the HDFS aggressively detects failures and immediately replicates failed -block-replicas is sufficient to never lose any data in practice. The challenge now is to achieve an effective replication factor of 3 while keeping the real physical replication factor at close to 2! How best to do it than by using <a href="http://en.wikipedia.org/wiki/Erasure_code">Erasure Codes.</a><br /><br />I heard about this idea called <a href="https://opencirrus.org/system/files/Gibson-OpenCirrus-June9-09.ppt">DiskReduce</a> from the folks at CMU. The <a href="http://www.pdl.cmu.edu/">CMU PDL Labs</a> has been a powerhouse of research in file systems and it is no surprise that they proposed a elegant way of implementing erasure codes in HDFS. I borrowed heavily from their idea in my implementation of Erasure Codes in HDFS described in <a href="http://issues.apache.org/jira/browse/HDFS-503">HDFS-503</a>. One of the main motivation of my design is to keep the HDFS Erasure Coding as a software layer above HDFS rather than inter-twining it inside of HDFS code. The HDFS code is complex by itself and it is really nice to not have to make it more complex and heavyweight.<br /><br />Distributed Raid File System consists of two main software components. The first component is the RaidNode, a daemon that creates parity files from specified HDFS files. The second component "raidfs" is a software that is layered over a HDFS client and it intercepts all calls that an application makes to the HDFS client. If the HDFS client encounters corrupted data while reading a file, the raidfs client detects it; it uses the relevant parity blocks to recover the corrupted data (if possible) and returns the data to the application. 
The application is completely transparent to the fact that parity data was used to satisfy it's read request. The Distributed Raid File System can be configured in such a way that a set of data blocks of a file are combined together to form one or more parity blocks. This allows one to reduce the replication factor of a HDFS file from 3 to 2 while keeping the failure probabilty relatively same as before.<br /><br />I have seen that using a stripe size of 10 blocks decreases the physical replication factor of a file to 2.2 while keeping the effective replication factor of a file at 3. This typically results in saving 25% to 30% of storage space in a HDFS cluster.<br /><br />One of the shortcoming of this implementation is that we need a parity file for every file in HDFS. This potentially increases the number of files in the NameNode. To alleviate this problem, I will enhance this implementation (in future) to use the <a href="http://hadoop.apache.org/common/docs/current/hadoop_archives.html">Hadoop Archive</a> feature to archive all the parity files together in larger containers so that the NameNode does not have to support additional files when the HDFS Erasure Coding is switched on. This works reasonably well because it is a very very rare case that the parity files are ever used to satisfy a read request.<br /><br />I am hoping that this feature becomes part of Hadoop 0.21 release scheduled for September 2009!Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com2tag:blogger.com,1999:blog-8556003786324804973.post-65871771134345898452009-07-28T23:28:00.000-07:002009-08-02T00:44:59.067-07:00Hadoop and CondorMy graduate work in the mid-nineties at the University of Wisconsin focussed on <a href="http://www.cs.wisc.edu/condor/">Condor</a>. Condor has an amazing way to do process checkpointing and migrating processes from one machine to another if needed. It also has a very powerful scheduler that matches job requirements with machine characteristics.<br /><br />One of the major inefficiencies with Hadoop schedulers (Fairshare and Capacity scheduler) is that they are not resource-aware. There has been some work-in-progress in this area, <a href="http://issues.apache.org/jira/browse/HADOOP-5881">HADOOP-5881</a>. Condor's <a href="http://www.cs.wisc.edu/condor/manual/v6.4/4_1Condor_s_ClassAd.html">ClassAds</a> mechanism can be used to match hadoop jobs with machines very elegantly.<br /><br />Here is one of my recent <a href="http://www.borthakur.com/ftp/hadoop_condor.pdf">presentation</a> at the The Israeli Association of Grid Technologies that talks about the synergies between Condor and Hadoop.Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com6tag:blogger.com,1999:blog-8556003786324804973.post-31703371225619092152009-06-22T23:53:00.000-07:002009-11-24T03:47:48.779-08:00Hadoop at NetflixNetflix is interested in using Hadooo/Hive to process click logs from the users of their website. Here is what I presented to them in a meeting that was well attended by about 50 engineers. Following the meeting, a bunch of engineers asked me question related to the integration of scribe and hdfs and how Facebook imports click logs into Hadoop.<br /><br />Here is a copy of my presentation .. 
<a href="http://borthakur.com/ftp/newflixalone.pdf">slides</a>.Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com2tag:blogger.com,1999:blog-8556003786324804973.post-78762983047586342792009-06-06T00:34:00.000-07:002009-08-02T00:47:06.021-07:00HDFS Scribe IntegrationIt is finally here: you can configure the open source log-aggregator, <a href="https://sourceforge.net/projects/scribeserver/">scribe</a>, to log data directly into the Hadoop distributed file system. <div><br /></div><div>Many Web 2.0 companies have to deploy a bunch of costly filers to capture weblogs being generated by their application. Currently, there is no option other than a costly filer because the write-rate for this stream is huge. The Hadoop-Scribe integration allows this write-load to be distributed among a bunch of commodity machines, thus reducing the total cost of this infrastructure. </div><div><br />The challenge was to make HDFS be real-timeish in behaviour. Scribe uses <a href="http://wiki.apache.org/hadoop/LibHDFS">libhdfs</a> which is the C-interface to the HDFs client. There were various bugs in libhdfs that needed to be solved first. Then came the FileSystem API. One of the major issues was that the FileSystem API caches FileSystem handles and always returned the same FileSystem handle when called from multiple threads. There was no reference counting of the handle. This caused problems with scribe, because Scribe is highly multi-threaded. A new API FileSystem.newInstance() was introduced to support Scribe.<div><br /></div><div>Making the HDFS write code path more real-time was painful. There are various timeouts/settings in HDFS that were hardcoded and needed to be changed to allow the application to fail fast. At the bottom of this blog-post, I am attaching the settings that we have currently configured to make the HDFS-write very real-timeish. The last of the JIRAS, <a href="http://issues.apache.org/jira/browse/HADOOP-2757">HADOOP-2757 </a>is in the pipeline to be committed to Hadoop trunk very soon.</div><div><br /></div><div>What about Namenode being the single point of failure? This is acceptable in a warehouse type of application but cannot be tolerated by a realtime application. Scribe typically aggregates click-logs from a bunch of webservers, and losing *all* click log data of a website for a 10 minutes or so (minimum time for a namenode restart) cannot be tolerated. The solution is to configure two <span class="Apple-style-span" style="font-style: italic;">overlapping</span> clusters on the same hardware. Run two separate namenodes N1 and N2 on two different machines. Run one set of datanode software on all slave machines that report to N1 and the other set of datanode software on the same set of slave machines that report to N2. The two datanode instances on a single slave machine share the same data directories. This configuration allows HDFS to be highly available for writes! </div><div><br /></div><div>The highly-available-for-writes-HDFS configuration is also required for software upgrades on the cluster. We can shutdown one of the <span class="Apple-style-span" style="font-style: italic;">overlapping</span> HDFS clusters, upgrade it to new hadoop software, and then put it back online before starting the same process for the second HDFS cluster. </div><div><br /></div><div>What are the main changes to scribe that were needed? Scribe already had the feature that it buffers data when it is unable to write to the configured storage. 
The default scribe behaviour is to replay this buffer back to the storage when the storage is back online. Scribe is configured to support no-buffer-replay when the primary storage is back online. Scribe-hdfs is configured to write data to a cluster N1 and if N1 fails then it writes data to cluster N2. Scribe treats N1 and N2 as two equivalent primary stores. The scribe configuration should have fs_type=hdfs. For <a href="https://sourceforge.net/forum/forum.php?forum_id=962339">scribe compilation</a>, you can use <i>./configure --enable-hdfs LDFLAGS="-ljvm -lhdfs"</i>. A good example for configuring scribe-hdfs is in a file called hdfs_example2.conf in the scribe code base.</div><div><br /></div><div>Here are the settings for the Hadoop 0.17 configuration that are needed by an application doing writes in realtime:</div><div><br /></div><div><pre>
<property>
  <name>ipc.client.idlethreshold</name>
  <value>10000</value>
  <description>Defines the threshold number of connections after which
  connections will be inspected for idleness.</description>
</property>

<property>
  <name>ipc.client.connection.maxidletime</name>
  <value>10000</value>
  <description>The maximum time in msec after which a client will bring down the
  connection to the server.</description>
</property>

<property>
  <name>ipc.client.connect.max.retries</name>
  <value>2</value>
  <description>Indicates the number of retries a client will make to establish
  a server connection.</description>
</property>

<property>
  <name>ipc.server.listen.queue.size</name>
  <value>128</value>
  <description>Indicates the length of the listen queue for servers accepting
  client connections.</description>
</property>

<property>
  <name>ipc.server.tcpnodelay</name>
  <value>true</value>
  <description>Turn on/off Nagle's algorithm for the TCP socket connection on
  the server. Setting to true disables the algorithm and may decrease latency
  with a cost of more/smaller packets.</description>
</property>

<property>
  <name>ipc.client.tcpnodelay</name>
  <value>true</value>
  <description>Turn on/off Nagle's algorithm for the TCP socket connection on
  the client. Setting to true disables the algorithm and may decrease latency
  with a cost of more/smaller packets.</description>
</property>

<property>
  <name>ipc.ping.interval</name>
  <value>5000</value>
  <description>The Client sends a ping message to the server every period. This is helpful
  to detect socket connections that were idle and have been terminated by a failed server.</description>
</property>

<property>
  <name>ipc.client.connect.maxwaittime</name>
  <value>5000</value>
  <description>The Client waits for this much time for a socket connect call to be established
  with the server.</description>
</property>

<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>20000</value>
  <description>The dfs Client waits for this much time for a socket write call to the datanode.</description>
</property>

<property>
  <name>ipc.client.ping</name>
  <value>false</value>
  <description>HADOOP-2757</description>
</property>
</pre></div>
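<div><br /></div><div>For completeness, here is a small Java sketch of how an embedded client might apply a few of the same fail-fast settings programmatically and use the FileSystem.newInstance() API mentioned above, so that each writer thread gets its own, independently closeable FileSystem handle. This is only an illustrative sketch (the path is made up, and method names such as newInstance() and sync()/hflush() vary across Hadoop releases); it is not part of scribe, which goes through libhdfs instead.</div><div><pre>
// Illustrative sketch: an embedded Java writer applying fail-fast client settings.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RealtimeLogWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("ipc.client.connect.max.retries", 2);        // fail fast on connect
    conf.setInt("ipc.ping.interval", 5000);
    conf.setBoolean("ipc.client.ping", false);               // see HADOOP-2757
    conf.setInt("dfs.datanode.socket.write.timeout", 20000);

    // newInstance() (unlike the cached FileSystem.get()) returns a private handle,
    // so a multi-threaded writer can close its own instance without affecting others
    FileSystem fs = FileSystem.newInstance(conf);
    FSDataOutputStream out = fs.create(new Path("/logs/example/current"));  // path is made up
    out.write("one log line\n".getBytes("UTF-8"));
    out.sync();   // hadoop 0.20-era flush to datanodes; later releases call this hflush()
    out.close();
    fs.close();
  }
}
</pre></div>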
<div><br /></div></div>Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com29tag:blogger.com,1999:blog-8556003786324804973.post-90128899490874283202009-05-28T23:21:00.000-07:002009-05-29T00:36:51.097-07:00Report from my visit to the Berkeley RAD LabsI went to attend the <a href="http://radlab.cs.berkeley.edu/wiki/RAD_Lab">UC Berkeley RAD Lab</a> Spring Retreat held at Santa Cruz. This lab has about 30 PhD students and the quality of their work really impressed me a lot. Most of their work is based on research on distributed systems. There were many students who were working with Hadoop and it is amazing to see Hadoop being the core of so much research activity... 
when the Hadoop project started three years ago, I definitely did not imagine that it would get to this state!<br /><br />I had read <a href="http://www.eecs.berkeley.edu/%7Epattrsn/">David Patterson</a>'s papers during my graduate studies at the <a href="http://www.cs.wisc.edu/">Univ of Wisconsin Madison</a>, and it was really nice to meet him in person. The group of students he leads at the RAD Lab is of very high calibre. Most people must have already seen the <a href="http://d1smfj0g31qzek.cloudfront.net/abovetheclouds.pdf">Above the Clouds</a> whitepaper that the RAD Lab has produced. It tries to clear up the muddle around what cloud computing really is, its benefits, and its possible usage scenarios. A good read!<br /><br />A paper titled <a href="http://www.usenix.org/events/sysml08/tech/full_papers/xu/xu.pdf">Detecting Large Scale System Problems by Mining Console Logs</a> describes how application logs from a distributed application can be used to detect bugs, problems, and anomalies in the system. The authors provide an example in which they process 24 million log lines produced by HDFS to detect a bug (page 11 of the paper). I am not really convinced about the bug itself, but this is an effort in the right direction.<br />My employer <a href="http://www.facebook.com">Facebook</a> is an enabler for research in distributed systems. To this effect, Facebook has allowed researchers from premier universities to analyze Hadoop system logs. These logs typically record machine usage and Hadoop job performance metrics. The Hadoop job records are inserted into a MySQL database for easy analysis (<a href="http://issues.apache.org/jira/browse/HADOOP-3708">HADOOP-3708</a>). Some students at the RAD Lab have used this database to show that machine learning techniques can predict Hadoop job performance; that paper is not yet published. Another paper analyzes the performance characteristics of the Hadoop fair share scheduler. Abstracts of these publications are available <a href="http://radlab.cs.berkeley.edu/wiki/Publications#2009">here</a>.<br /><br />Last, but not least, <a href="http://www-db.cs.wisc.edu/cidr/cidr2009/Paper_86.pdf">SCADS</a> is a scalable storage system designed specifically for social-networking style software. It has a declarative query language and supports various consistency models. It also supports a rich data model that includes joins on pre-computed queries.Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com0tag:blogger.com,1999:blog-8556003786324804973.post-82454087518576028572009-05-24T02:17:00.000-07:002009-05-24T02:29:59.379-07:00Better Late than NeverFor quite a while, I have been thinking about blogging on Hadoop in general, and the Hadoop Distributed File System (HDFS) in particular. Why, you may ask?<br /><br />Firstly, I have been contacted by students from as far away as Bangladesh and Fiji asking me questions about HDFS via email. This made me think that disseminating internal details about HDFS to the whole wide world would benefit a lot of people. 
I like to interact with these budding engineers, and their questions, though elementary in nature, sometimes make me ruminate on why we adopted a particular design and not another. I will sprinkle in a few of these examples next week.<br /><br />Secondly, I visited a few universities last month, among them Carnegie Mellon University and my alma mater, the Univ of Wisconsin. On my flight, I was bored to death: I did not like the movie that was playing, and I had not carried any reading material. (Usually I like to read and re-read Sherlock Holmes.) But as they say, "an idle mind is the devil's workshop"... I started to jot down some exotic design ideas for HDFS... and lo and behold, I now have a list of ideas that I would like to share! I will post them next week as well.Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com1