HDFS is designed to be a highly scalable storage system and sites at Facebook and Yahoo have 20PB size file systems in production deployments. The HDFS NameNode is the master of the Hadoop Distributed File System (HDFS). It maintains the critical data structures of the entire file system. Most of HDFS design has focussed on scalability of the system, i.e. the ability to support a large number of slave nodes in the cluster and an even larger number of files and blocks. However, a 20PB size cluster with 30K simultaneous clients requesting service from a single NameNode means that the NameNode has to run on a high-end non-commodity machine. There has been some efforts to scale the NameNode horizontally, i.e. allow the NameNode to run on multiple machines. I will defer analyzing those horizontal-scalability-efforts for a future blog post, instead let's discuss ways and means to make our singleton NameNode support an even greater load.
What are the bottlenecks of the NameNode?
Network: We have around 2000 nodes in our cluster and each node is running 9 mappers and 6 reducers simultaneously. This means that there are around 30K simultaneous clients requesting service from the NameNode. The Hive Metastore and the HDFS RaidNode imposes additional load on the NameNode. The Hadoop RPCServer has a singleton Listener Thread that pulls data from all incoming RPCs and hands it to a bunch of NameNode handler threads. Only after all the incoming parameters of the RPC are copied and deserialized by the Listener Thread does the NameNode handler threads get to process the RPC. One CPU core on our NameNode machine is completely consumed by the Listener Thread. This means that during times of high load, the Listener Thread is unable to copy and deserialize all incoming RPC data in time, thus leading to clients encountering RPC socket errors. This is one big bottleneck to vertically scalabiling of the NameNode.
CPU: The second bottleneck to scalability is the fact that most critical sections of the NameNode is protected by a singleton lock called the FSNamesystem lock. I had done some major restructuring of this code about three years ago via HADOOP-1269 but even that is not enough for supporting current workloads. Our NameNode machine has 8 cores but a fully loaded system can use at most only 2 cores simultaneously on the average; the reason being that most NameNode handler threads encounter serialization via the FSNamesystem lock.
Memory: The NameNode stores all its metadata in the main memory of the singleton machine on which it is deployed. In our cluster, we have about 60 million files and 80 million blocks; this requires the NameNode to have a heap size of about 58GB. This is huge! There isn't any more memory left to grow the NameNode's heap size! What can we do to support even greater number of files and blocks in our system?
Can we break the impasse?
RPC Server: We enhanced the Hadoop RPC Server to have a pool of Reader Threads that work in conjunction with the Listener Thread. The Listener Thread accepts a new connection from a client and then hands over the work of RPC-parameter-deserialization to one of the Reader Threads. In our case, we configured our system so that the Reader Threads consist of 8 threads. This change has doubled the number of RPCs that the NameNode can process at full throttle. This change has been contributed to the Apache code via HADOOP-6713.
The above change allowed a simulated workload to be able to consume 4 CPU cores out of a total of 8 CPU cores in the NameNode machine. Sadly enough, we still cannot get it to use all the 8 CPU cores!
FSNamesystem lock: A review of our workload showed that our NameNode typically has the following distribution of requests:
- stat a file or directory 47%
- open a file for read 42%
- create a new file 3%
- create a new directory 3%
- rename a file 2%
- delete a file 1%
The memory bottleneck issue is still unresolved. People have asked me if the NameNode can keep some portion of its metadata in disk, but this will require a change in locking model design first. One cannot keep the FSNamesystem lock while reading in data from the disk: this will cause all other threads to block thus throttling the performance of the NameNode. Could one use flash memory effectively here? Maybe an LRU cache of file system metadata will work well with current metadata access patterns? If anybody has good ideas here, please share it with the Apache Hadoop community.
In a Nutshell
The two proposed enhancements have improved NameNode scalability by a factor of 8. Sweet, isn't it?