Thursday, April 14, 2011

Data warehousing at Facebook

Many people have asked me to describe the best practices that we have adopted to run a multi PB data warehouse using Hadoop. Most of the details were described in a paper that we presented at SIGMOD 2010. This document refers to our state-of-affairs as it was about a year back, but is still an interesting read. Below is the abstract of this paper. You can find the complete paper here.

Scalable analysis on large data sets has been core to the functions of a number of teams at Facebook - both engineering and non- engineering. Apart from ad hoc analysis of data and creation of business intelligence dashboards by analysts across the company, a number of Facebook's site features are also based on analyzing large data sets. These features range from simple reporting applications like Insights for the Facebook Advertisers, to more advanced kinds such as friend recommendations. In order to support this diversity of use cases on the ever increasing amount of data, a flexible infrastructure that scales up in a cost effective manner, is critical. We have leveraged, authored and contributed to a number of open source technologies in order to address these requirements at Facebook. These include Scribe, Hadoop and Hive which together form the cornerstones of the log collection, storage and analytics infrastructure at Facebook. In this paper we will present how these systems have come together and enabled us to implement a data warehouse that stores more than 15PB of data (2.5PB after compression) and loads more than 60TB of new data (10TB after compression) every day. We discuss the motivations behind our design choices, the capabilities of this solution, the challenges that we face in day today operations and future capabilities and improvements that we are working on.

Facebook has opensourced the version of Apache Hadoop that we use to power our production clusters. You can find more details about our usage of Hadoop at the Facebook Engineering Blog.


  1. This comment has been removed by the author.

  2. hi Dhruba:
    we encountered a situation like this: we are using the shortCircuit first report to reduce the namenode restart circuit and we used it for a long time. But recently we found this situation,when namenode start, and after all of the datanodes' report has been processed, the safemode can't leave automaticly because of the safe mode count can't reach the threshold pct of 0.999 default. It stop at 0.998 and we fsck the whole hdfs and it's reported that the hdfs is health. I noticed that the shortCircuit first report just skip the reportDiff and addStoredBlock to blocksmap,Even though I don't think this is the reason after review our hdfs code,is there any possibility that this is the course of our problem? Or did you ever encountered this?I think fb's experience will offer great help to us.

    thank you very much

  3. The default fsck does not analyze files that were being written when the namenode was killed. Please try running:

    bin/hadoop fsck / -files -blocks -locations -openforwrite

    this will print the files that have missing blocks (and the existence of missing blocks means that the NN will not exit safemode). You can manually exit safemode via

    bin/hadoop dfsadmin -safemode leave

  4. This comment has been removed by a blog administrator.

  5. Data warehousing combines data from multiple, usually varied, sources into one comprehensive and easily manipulated database.

  6. How to reduce the cost of the warehouse?
    please tell me how can a company reduce the cost of the warehouse for sale in their company. Its all related to the supply chain management.

  7. I must say, I thought this was a pretty interesting read when it comes to this topic. Liked the material. . . . . Jay Catlin