Introduction
What are the bottlenecks of the NameNode?
Network: We have around 2000 nodes in our cluster, and each node runs 9 mappers and 6 reducers simultaneously. This means there are around 30K simultaneous clients requesting service from the NameNode. The Hive Metastore and the HDFS RaidNode impose additional load on the NameNode. The Hadoop RPC Server has a singleton Listener Thread that pulls data from all incoming RPCs and hands it to a bunch of NameNode handler threads. Only after all the incoming parameters of the RPC are copied and deserialized by the Listener Thread do the NameNode handler threads get to process the RPC. One CPU core on our NameNode machine is completely consumed by the Listener Thread. This means that during times of high load, the Listener Thread is unable to copy and deserialize all incoming RPC data in time, leading to clients encountering RPC socket errors. This is one big bottleneck to vertically scaling the NameNode.
CPU: The second bottleneck to scalability is the fact that most critical sections of the NameNode are protected by a single global lock called the FSNamesystem lock. I had done some major restructuring of this code about three years ago via HADOOP-1269, but even that is not enough to support current workloads. Our NameNode machine has 8 cores, but a fully loaded system can use at most only 2 cores simultaneously on average; the reason is that most NameNode handler threads encounter serialization via the FSNamesystem lock.
Memory: The NameNode stores all its metadata in the main memory of the single machine on which it is deployed. In our cluster, we have about 60 million files and 80 million blocks; this requires the NameNode to have a heap size of about 58GB. This is huge! There isn't any more memory left to grow the NameNode's heap! What can we do to support an even greater number of files and blocks in our system?
Can we break the impasse?
RPC Server: We enhanced the Hadoop RPC Server to have a pool of Reader Threads that work in conjunction with the Listener Thread. The Listener Thread accepts a new connection from a client and then hands over the work of RPC-parameter-deserialization to one of the Reader Threads. In our case, we configured the pool to have 8 Reader Threads. This change has doubled the number of RPCs that the NameNode can process at full throttle. This change has been contributed to the Apache code via HADOOP-6713.
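The threading pattern described above can be sketched as follows. This is an illustration, not Hadoop's actual RPC classes, and all names here are hypothetical: the cheap accept path enqueues raw call bytes to a small pool of reader threads, which perform the expensive deserialization before handing the call to the handler pool, so the single listener never becomes the bottleneck.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a listener/reader split: the listener thread only
// accepts work; a pool of reader threads deserializes RPC parameters and
// then hands the call to the handler pool.
public class RpcReaderPool {
    private final ExecutorService readers;   // deserialization pool
    private final ExecutorService handlers;  // processes deserialized calls
    private final AtomicInteger processed = new AtomicInteger();

    public RpcReaderPool(int readerThreads, int handlerThreads) {
        readers = Executors.newFixedThreadPool(readerThreads);
        handlers = Executors.newFixedThreadPool(handlerThreads);
    }

    // Called by the listener thread: cheap, returns immediately.
    public void accept(byte[] rawRpc) {
        readers.submit(() -> {
            // Stand-in for the real parameter deserialization work.
            String call = new String(rawRpc);
            handlers.submit(() -> {
                if (!call.isEmpty()) processed.incrementAndGet(); // stand-in handler
            });
        });
    }

    // Drain both pools and return how many calls were fully processed.
    public int drain() {
        try {
            readers.shutdown();
            readers.awaitTermination(10, TimeUnit.SECONDS);
            handlers.shutdown();
            handlers.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed.get();
    }

    public static void main(String[] args) {
        RpcReaderPool pool = new RpcReaderPool(8, 32);
        for (int i = 0; i < 1000; i++) pool.accept(("rpc-" + i).getBytes());
        System.out.println(pool.drain()); // prints 1000
    }
}
```

The key design point is that the listener does no per-call parsing at all; a slow or large RPC ties up one of the 8 reader threads instead of stalling every incoming connection.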
The above change allowed a simulated workload to consume 4 of the 8 CPU cores in the NameNode machine. Sadly enough, we still cannot get it to use all 8 CPU cores!
FSNamesystem lock: A review of our workload showed that our NameNode typically has the following distribution of requests:
- stat a file or directory 47%
- open a file for read 42%
- create a new file 3%
- create a new directory 3%
- rename a file 2%
- delete a file 1%
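Since stat and open, which only read the namespace, together account for roughly 89% of requests, a lock that admits concurrent readers is a natural way to relieve the FSNamesystem bottleneck (the comments below refer to this read/write lock change). A minimal sketch of the idea with java.util.concurrent follows; the class and method names are hypothetical, not the real FSNamesystem code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: guard namespace metadata with a readers-writer lock
// so that the read-dominated operations (stat, open) can run concurrently,
// while mutations (create, rename, delete) still serialize.
public class Namespace {
    private final Map<String, Long> fileLengths = new HashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Read path: many threads may hold the read lock at once.
    public Long stat(String path) {
        lock.readLock().lock();
        try {
            return fileLengths.get(path);
        } finally {
            lock.readLock().unlock();
        }
    }

    // Write path: exclusive, blocks readers and other writers.
    public void create(String path, long length) {
        lock.writeLock().lock();
        try {
            fileLengths.put(path, length);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

With a 89%-read mix, most handler threads take only the shared read lock, so they no longer serialize behind one another the way they do under a single exclusive lock.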
The memory bottleneck issue is still unresolved. People have asked me if the NameNode can keep some portion of its metadata on disk, but this will first require a change in the locking model design. One cannot hold the FSNamesystem lock while reading in data from disk: this would cause all other threads to block, thus throttling the performance of the NameNode. Could one use flash memory effectively here? Maybe an LRU cache of file system metadata would work well with current metadata access patterns? If anybody has good ideas here, please share them with the Apache Hadoop community.
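To make the LRU-cache idea concrete, here is a minimal sketch using LinkedHashMap's access-order mode. The class name and capacity are illustrative only; a real NameNode cache would additionally need to page evicted entries out to disk or flash and fault them back in outside the lock.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: keep only the hottest metadata entries in RAM and
// evict the least-recently-used ones (which would then live on disk/flash).
public class MetadataLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public MetadataLruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder=true gives LRU iteration order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict once we exceed capacity
    }
}
```

For example, with capacity 2, inserting a third entry silently drops whichever of the first two was touched least recently.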
In a Nutshell
The two proposed enhancements have improved NameNode scalability by a factor of 8. Sweet, isn't it?
Very cool!!
Awesome optimization and great article
Great article Dhruba. I think for the last part, memory optimization (disk writes), SSDs might be useful. The NameNode is special anyway.
Thanks for the feedback folks.
@Nitin: I am planning to play around with flash memory to alleviate the memory pressure on the NameNode. Will report my findings to this blog when I get any performance numbers.
http://digitalcommons.unl.edu/cgi/viewcontent.cgi?article=1064&context=csearticles
Excellent! Looking forward to 0.21.0.
Hi Dhruba, I've been following your work with interest for a few months now.
I think the best approach is to focus on distributing the NameServer across multiple machines (like the next version of GFS, http://queue.acm.org/detail.cfm?id=1594206 ). With your insight into Facebook's operations, do you anticipate any problems with moving to a distributed NameServer model? Appropriately distributing the NS seems like the best long-term solution to both scaling and fault-tolerance.
An even more radical solution would be to drop HDFS and use something like Ceph, which was designed for multiple metadata servers. A group at UC Santa Cruz is adapting Hadoop to work with Ceph:
http://users.soe.ucsc.edu/~carlosm/Papers/eestolan-nsdi10-abstract.pdf
Ceph is now in the mainline Linux kernel, so support is improving. More information about using Ceph on the webpage: http://ceph.newdream.net/
Design document (OSDI 2006 paper):
http://www.ssrc.ucsc.edu/Papers/weil-osdi06.pdf
Thanks for your comments Andrew.
Some folks have started work on multiple NameNodes, and there is a preliminary document posted at http://issues.apache.org/jira/browse/HDFS-1052
Your comments about this proposed design are very welcome.
Ceph is an interesting idea and I have been following it since 2007. The NSDI abstract you refer to has some interesting performance tidbits. I am looking forward to the full document, especially about performance on a realistic-size cluster.
Hi Dhruba: is there any NameNode RPC performance comparison between the HDFS version after merging the read/write lock change and the version before?
We have not measured RPC latencies... I expect them to be the same. However, we did measure NN throughput; it is about 3 times better than what it was earlier!
Lustre/Ceph have a great advantage over HDFS in high-performance clustering environments.
Great article!! I have one doubt: what happens when the NameNode's memory gets 100% full? Will it go into safe mode or become unavailable? Please help me understand. Thanks.
If the NameNode's memory is 100% full, then it will become unavailable.