HDFS: Facebook has the world's largest Hadoop cluster!

Sunday, May 9, 2010

It is not a secret anymore!

The Datawarehouse Hadoop cluster at Facebook has become the largest known Hadoop storage cluster in the world. Here are some of the details about this single HDFS cluster:

21 PB of storage in a single HDFS cluster
2000 machines
12 TB per machine (a few machines have 24 TB each)
1200 machines with 8 cores each + 800 machines with 16 cores each
32 GB of RAM per machine
15 map-reduce tasks per machine
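
As a rough cross-check of the list above, here is a minimal sketch that multiplies the per-machine figures out to cluster-wide totals. The function and constant names are made up for illustration, and the disk total assumes every machine has exactly 12 TB, so it is only a lower bound (a few machines have 24 TB).

```python
# Figures copied from the list above; everything derived is plain arithmetic.
MACHINES_8_CORE = 1200
MACHINES_16_CORE = 800
TB_PER_MACHINE = 12        # assumption: ignores the few 24 TB machines (lower bound)
RAM_GB_PER_MACHINE = 32
TASKS_PER_MACHINE = 15


def cluster_totals():
    """Return cluster-wide totals derived from the per-machine figures."""
    machines = MACHINES_8_CORE + MACHINES_16_CORE
    return {
        "machines": machines,
        "cores": MACHINES_8_CORE * 8 + MACHINES_16_CORE * 16,
        "raw_disk_tb_lower_bound": machines * TB_PER_MACHINE,
        "ram_gb": machines * RAM_GB_PER_MACHINE,
        "task_slots": machines * TASKS_PER_MACHINE,
    }


if __name__ == "__main__":
    for name, value in cluster_totals().items():
        print(f"{name}: {value:,}")
```

The raw-disk lower bound (24,000 TB) is higher than the 21 PB of configured capacity quoted in this post, and the post does not say how the difference is accounted for. One possible reading, which is only an assumption, is a unit difference: 24,000 decimal TB is about 21.3 PiB in binary units.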

Together, these machines provide more than 21 PB of configured storage capacity! This is larger than Yahoo!'s previously known cluster of 14 PB. Here are the cluster statistics from the HDFS cluster at Facebook:

Hadoop started at Yahoo! and full marks to Yahoo! for developing such critical infrastructure technology in the open. I started working with Hadoop when I joined Yahoo! in 2006. Hadoop was in its infancy at that time and I was fortunate to be part of the core set of Hadoop engineers at Yahoo!. Many thanks to Doug Cutting for creating Hadoop and Eric14 for convincing the executive management at Yahoo! to develop Hadoop as open source software.

Facebook engineers work closely with the Hadoop engineering team at Yahoo! to push Hadoop to greater scalability and performance. Facebook has many Hadoop clusters; the largest among them is the one used for Datawarehousing. Here are some statistics that describe a few characteristics of Facebook's Datawarehousing Hadoop cluster:

12 TB of compressed data added per day
800 TB of compressed data scanned per day
25,000 map-reduce jobs per day
65 million files in HDFS
30,000 simultaneous clients to the HDFS NameNode
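
To put the daily figures above on a per-second scale, here is a small sketch that averages them over a day. The helper names are made up for illustration, decimal units (1 TB = 1000 GB) are an assumption, and real load is certainly not spread evenly across the day, so these are averages only.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Daily figures copied from the list above.
TB_ADDED_PER_DAY = 12
TB_SCANNED_PER_DAY = 800
JOBS_PER_DAY = 25_000


def per_second(daily_value):
    """Average a per-day quantity over the 86,400 seconds in a day."""
    return daily_value / SECONDS_PER_DAY


def workload_averages():
    """Return average rates implied by the daily statistics."""
    return {
        "jobs_per_second": per_second(JOBS_PER_DAY),
        # Assumption: decimal units, 1 TB = 1000 GB.
        "gb_scanned_per_second": per_second(TB_SCANNED_PER_DAY * 1000),
        # How many times more data is scanned than added each day.
        "scanned_to_added_ratio": TB_SCANNED_PER_DAY / TB_ADDED_PER_DAY,
    }


if __name__ == "__main__":
    for name, value in workload_averages().items():
        print(f"{name}: {value:.2f}")
```

On these assumptions, the cluster starts a new job roughly every three and a half seconds on average and scans several gigabytes of compressed data per second.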

A majority of this data arrives via Scribe, as described in scribe-hdfs integration. This data is loaded into Hive. Hive provides a very elegant way to query the data stored in Hadoop. Almost 99.9% of Hadoop jobs at Facebook are generated by a Hive front-end system. We provide many more details about our scale of operations in our SIGMOD paper titled Datawarehousing and Analytics Infrastructure at Facebook.

Here are two pictorial representations of the rate of growth of the Hadoop cluster:

Details about our Hadoop configuration

I have fielded many questions from developers and system administrators about the Hadoop configuration that is deployed in the Facebook Hadoop Datawarehouse. Some of these questions are from Linux kernel developers who would like to make Linux swapping work better with Hadoop workloads; other questions are from JVM developers who may attempt to make Hadoop run faster for processes with large heap sizes; yet others are from GPU architects who would like to port a Hadoop workload to run on GPUs. To enable this type of outside research, here are the details of Facebook's Hadoop warehouse configuration. I hope this open sharing of infrastructure details from Facebook jumpstarts the research community to design ways and means to optimize systems for Hadoop usage.

53 comments:

yongqiang he, June 14, 2010 at 12:24 AM
first.

ursula, June 14, 2010 at 4:12 AM
amazing hadoop

Jiaqi, June 14, 2010 at 8:49 AM
what's the carbon footprint/power consumption? mind-boggling..

Ashwin Jayaprakash, June 14, 2010 at 10:30 AM
That SIGMOD link is broken. Here it is - link

Minti, June 14, 2010 at 6:09 PM
Wow....amazing...!!!

Unknown, June 24, 2010 at 11:21 AM
Where's the like button on this thing, Dhruba?

Michael N Marcus, June 25, 2010 at 5:14 PM
what's the carbon footprint/power consumption? mind-boggling..

Anonymous, June 30, 2010 at 8:37 PM
Hadoop, the rainforest killer

Unknown, July 12, 2010 at 4:30 AM
For people who are concerned about the carbon footprint, here is my answer: the scenario would have been worse. The number of servers needed to serve such a huge task is humongous, and hadoop optimizes the resources.

Jonathan Disher, September 7, 2010 at 3:53 PM
Are you sure that cluster is bigger than the newer 4k machine clusters at Yahoo? I seem to recall they had a couple bigger than this....

Dhruba Borthakur, September 7, 2010 at 9:21 PM
@funjon, from what I hear, all of the 4K nodes in Yahoo's cluster have 4 TB of disk each. http://developer.yahoo.net/blogs/hadoop/2010/05/scalability_of_the_hadoop_dist.html

Brad Fallon, March 5, 2011 at 10:42 PM
That's great! This is how technology works at its finest. That is really amazing!

Unknown, March 12, 2011 at 6:54 PM
With disks failing (possibly resulting in node shutdown) and the rebuild/recovery that needs to be done, can you let me know how many people it would take to manage a cluster of the size that FB has?

Dhruba Borthakur, March 12, 2011 at 10:53 PM
@Naren: we have one admin person who manages the hdfs cluster. He is responsible for deploying new software, monitoring health, reporting and categorizing issues that arise as part of operations, etc. Then maybe another virtual person(s) who spends a few hours every week gathering all failed machines/disks and sending them to a repair facility.

Anonymous, March 16, 2011 at 2:36 PM
Nice article...

hadoophelp.blogspot.com

Anonymous, July 7, 2011 at 5:03 PM
A few things that interest me about your configuration files (many thanks for posting!)

1. You don't use LZO compression, but rather Gzip.

2. With 12TB/24TB, I'm assuming 12 spindles. Mapper contention on spindles usually creates problems with one DataNode handling > 8 spindles.

3. With 16 cores, only having 15 slots (9 map, 5 reduce) seems low. And 1GB per task means only using 15GB out of the 32GB on the box.

Thanks for any feedback on the above,

-- Ken

Rudra455, July 7, 2011 at 7:27 PM
This comment has been removed by a blog administrator.

Dhruba Borthakur, July 8, 2011 at 12:03 AM
1. we use LZO for map outputs (less CPU) but use GZIP for reduce outputs (less disk space).

2. we have 12 spindles.

3. our map or reduce computations are very CPU heavy and the cluster is bottlenecked on CPU (rather than IOPS). The 1 GB per task is just the default. Most jobs (via Hive) are allowed to set their own JVM heap size.

क्षण ...., July 19, 2011 at 1:29 PM
Hi, I have 4 machines running SUSE Linux 11 and I need to set up a 4-node hadoop cluster. Each machine has 16 GB of RAM and 16 cores.
I need to know how many maps and reduces I should configure. Also, can I have multiple clusters on the same 4 machines just by changing the port numbers and other directories and running hadoop as a separate user?

Dhruba Borthakur, July 19, 2011 at 10:23 PM
Yes, u can run multiple hdfs clusters on the same set of machines (as long as they use different ports).

Ved Antani, September 5, 2011 at 11:28 PM
Dhruba, what do you do for realtime analytics? Do you use something like Flume, or do you have your own?

Dhruba Borthakur, September 5, 2011 at 11:31 PM
for realtime analytics, we use HBase. http://hadoopblog.blogspot.com/2011/07/realtime-hadoop-usage-at-facebook.html

toni, September 5, 2011 at 11:40 PM
Do you guys use puppet, chef or custom scripts to configure the machines and keep them up to date?

Anonymous, September 5, 2011 at 11:41 PM
What is your backup plan for the Hadoop cluster? Does backup of a hadoop cluster make sense for you? If so, do you quiesce Hive before backup? And how is new/modified data detected (as the data sizes are so huge)?

Sebastian, September 6, 2011 at 12:57 AM
Hadoop did NOT start at Yahoo. It was born out of the Apache Nutch project.

Anonymous, September 6, 2011 at 7:50 AM
Two questions:

1) What are the required versus achieved IOPS and latency of each node's storage subsystem? Asked another way... what were you aiming for and what did you actually get in terms of performance?

2) How does the failure of -- for example -- 10 nodes affect the cluster?

Sri, September 6, 2011 at 11:50 AM
It's really amazing....

Dhruba Borthakur, September 6, 2011 at 12:10 PM
@toni: we use custom scripts to configure and deploy software on hadoop machines.

@Anonymous: The Hive cluster is a pure warehouse. That means that if you back up the 20+ TB of new data that comes in every day, all other data can be derived from that stream. So, we have processes to replicate data across data centers and as long as we can copy the source data to multiple data centers, we have a good story on backup (including DR).

@Jeff: we focussed on job-pipeline latencies. That means a certain pipeline (a bunch of hive jobs) has to finish within a certain time. Regarding ur other question: we have had cases when a rack fails. A rack has 20 machines. When this happens, we see that HDFS re-replicates the data and this re-replication finishes in about an hour, i.e. our mean-time-to-recover from a failed rack is about 1 hour. However, jobs continue to run normally during this period.

SVman, November 25, 2011 at 11:12 PM
Ya know, I think I have an idea that would reduce all the hardware requirements down to a fraction of the thousands of servers currently employed.

It's a purely analytic solution, but it would work and would be very scalable, especially with the larger sets of data.

Unknown, January 18, 2012 at 8:45 PM
Great helpful information. Thanks for providing wonderful stats of hadoop usage at FB.
Hadoop can be used for OLAP as well as OLTP.
Please click why hadoop is introduced

Anonymous, March 29, 2012 at 4:57 AM
@Dhruba: Thanks for (all) the post(s). Can you give us updated figures about the cluster size at the beginning of 2012? Is the growth still amazing?

Unknown, May 24, 2012 at 2:23 AM
Great post....

worldofhadoop.blogspot.com

Facebook Cover Photos, September 25, 2012 at 2:22 AM
Wow....amazing...!!

raova, January 10, 2013 at 8:03 AM
very nice post.

Unknown, March 17, 2013 at 9:06 PM
Hello,
what amazing news this is! That the Datawarehouse Hadoop cluster at Facebook has become the largest known Hadoop storage cluster in the world is really excellent information. I love it. Thanks a lot
Used Pallet Racks

Unknown, May 22, 2013 at 2:28 AM
I think Yahoo has around 42000 nodes in their cluster and LinkedIn has around 4000 nodes. Maybe FB has a large amount of data in it. But when it comes to the number of data nodes, it will be Yahoo, I guess...

Dhruba Borthakur, May 22, 2013 at 8:23 AM
@Pradeep: The 42000 nodes number from Yahoo is the total number of nodes in all the hdfs clusters in production at Y!, and not from a single cluster.

Venkat, May 22, 2013 at 9:32 AM
Wonderful Info.

Chloe Phillips, June 26, 2013 at 4:18 AM
I think the things you covered in the post are quite impressive; good job and great effort. I found it very interesting and enjoyed reading all of it... keep it up, good job.

Debu, November 7, 2013 at 4:24 AM
Is there an architecture diagram explaining the latest Hadoop cluster configuration at Hadoop? Such as the size of data processed, the number of nodes, etc.
