Monday, September 14, 2009

HDFS block replica placement in your hands now!

Most Hadoop administrators set the default replication factor for their files to three. The main assumption here is that keeping three copies of the data keeps it safe. I have observed this to hold true in the big clusters that we manage and operate. In actuality, administrators are guarding against two distinct failure modes: data corruption and data availability.

If all the datanodes holding the replicas of a block catch fire at the same time, then that data is lost and cannot be recovered. Likewise, if an administrative error causes all existing replicas of a block to be deleted, the failure is catastrophic. This is data corruption. On the other hand, if a rack switch goes down for some time, the datanodes on that rack are inaccessible during that period. When the faulty rack switch is fixed, the data on the rack rejoins the HDFS cluster and life goes on as usual. This is a data availability issue; in this case data was not corrupted or lost, it was just unavailable for a while. HDFS keeps three copies of a block on three different datanodes to protect against true data corruption, and it also tries to distribute those three replicas across more than one rack to protect against data availability issues. Because HDFS actively monitors for failed datanodes and, upon detecting a failure, immediately schedules re-replication of any under-replicated blocks, three copies of data on three different nodes are sufficient to avoid corrupted files.

HDFS uses a simple but highly effective policy to allocate replicas for a block. If a process running on one of the HDFS cluster nodes opens a file for writing a block, then one replica of that block is allocated on the same machine on which the client is running. The second replica is allocated on a randomly chosen rack that is different from the rack of the first replica. The third replica is allocated on a randomly chosen machine on that same remote rack. This means that a block is present on exactly two distinct racks. One point to note is that there is no relationship between the locations of replicas of different blocks of the same file; each block is allocated independently.
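The default policy described above can be sketched in a few lines. This is a minimal, self-contained simulation of the decision logic, not the actual NameNode code; the node and rack names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Sketch of the default replica placement: replica 1 on the writer's node,
// replica 2 on a random node of a randomly chosen remote rack,
// replica 3 on a different node of that same remote rack.
class DefaultPlacementSketch {
    static final Random rnd = new Random();

    static List<String> chooseTargets(String writerNode,
                                      Map<String, List<String>> rackToNodes) {
        // Find the rack that contains the writer.
        String writerRack = null;
        for (Map.Entry<String, List<String>> e : rackToNodes.entrySet()) {
            if (e.getValue().contains(writerNode)) writerRack = e.getKey();
        }

        List<String> targets = new ArrayList<>();
        // 1st replica: local to the writing client.
        targets.add(writerNode);

        // 2nd replica: random node on a randomly chosen remote rack.
        List<String> remoteRacks = new ArrayList<>(rackToNodes.keySet());
        remoteRacks.remove(writerRack);
        String remoteRack = remoteRacks.get(rnd.nextInt(remoteRacks.size()));
        List<String> remoteNodes = new ArrayList<>(rackToNodes.get(remoteRack));
        targets.add(remoteNodes.remove(rnd.nextInt(remoteNodes.size())));

        // 3rd replica: a different node on the same remote rack.
        targets.add(remoteNodes.remove(rnd.nextInt(remoteNodes.size())));
        return targets;
    }

    public static void main(String[] args) {
        Map<String, List<String>> topology = Map.of(
            "/rack0", List.of("node0", "node1", "node2"),
            "/rack1", List.of("node3", "node4", "node5"));
        System.out.println(chooseTargets("node1", topology));
    }
}
```

Note that the block ends up on two racks but three nodes: losing either rack leaves at least one live replica.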

The above algorithm is great for availability and scalability. However, there are scenarios where co-locating many blocks of the same file on the same set of datanodes or racks is beneficial for performance. For example, if many blocks of the same file are present on the same datanode, a single mapper instance could process all those blocks using CombineFileInputFormat. Similarly, if a dataset contains many small files that are co-located on the same datanodes or racks, CombineFileInputFormat lets fewer mapper instances process all those files together. And if an application always uses one dataset together with another (think Hive or Pig joins), co-locating those two datasets on the same set of datanodes is beneficial.

Another reason one might want to allocate replicas using a different policy is to ensure that replicas and their parity blocks truly reside in different failure domains. The erasure coding work in HDFS could effectively bring the physical replication factor of a file down to about 1.5 (while keeping the logical replication factor at 3) if it can place the replicas of all blocks in a stripe more intelligently.

Yet another reason, however exotic, is to allow HDFS to place replicas based on a heat map of your cluster: if one node in the cluster is at a higher temperature than another, it might be better to prefer the cooler node when allocating a new replica. And if you want to experiment with HDFS spanning two data centers, you might want to try out new policies for replica placement.

Well, now you can finally get your hands dirty! HDFS-385 is part of the Hadoop trunk and will be part of the next major HDFS 0.21 release. This feature provides a way for the adventurous developer to write Java code that specifies how HDFS should allocate replicas of the blocks of a file. The API is experimental in nature, and could change in the near future if we discover any inefficiencies in it. Please let the Hadoop community know if you need any changes in this API or if you come across novel uses of it.
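As a sketch of how a custom policy gets wired in: the NameNode can be pointed at a pluggable placement class via `hdfs-site.xml`. This assumes the configuration key introduced by HDFS-385; the class name below is hypothetical.

```xml
<!-- hdfs-site.xml: point the NameNode at a custom placement policy.
     com.example.MyPlacementPolicy is a hypothetical class extending
     org.apache.hadoop.hdfs.server.namenode.BlockPlacementPolicy. -->
<property>
  <name>dfs.block.replicator.classname</name>
  <value>com.example.MyPlacementPolicy</value>
</property>
```

The custom class must be on the NameNode's classpath, and the default behavior is preserved if the property is left unset.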

24 comments:

  1. Correct me if I am wrong: using this HDFS-385 patch, we can make HDFS store all blocks of a file on the same node (physically)?

    ReplyDelete
  2. Hi Jenium, your reasoning is correct. You can now supply your own block placement policies to HDFS.

    ReplyDelete
  3. Dhruba, thanks for the answer. Maybe this is not the best place to ask, but what happened to the HDFS snapshot capability feature? Is anybody working on it right now? If not, why not? Incremental snapshot-based backup is only one of many use cases. We actually have at least two of them at our company.

    -Vladimir

    ReplyDelete
  4. Nobody is working on HDFS snapshots (as far as I know). Here is a very preliminary design document that I had written:

    http://issues.apache.org/jira/browse/HDFS-233

    ReplyDelete
  5. Here is an example of how we wrote our own block placement policy:
    https://github.com/facebook/hadoop-20-append/blob/master/src/hdfs/org/apache/hadoop/hdfs/server/namenode/BlockPlacementPolicyConfigurable.java

    ReplyDelete
    Replies
    1. Hello Sir,
      I am implementing my own replication policy for Hadoop as part of my dissertation. Please guide me. Also, the link you mentioned is not working; it gives a "404 page not found" error.

      Delete
  6. Thanks a lot for your kind help. But I still have difficulty working with Hadoop under Eclipse. Please suggest simple ways if you have any. I think your help is very useful for me. Thank you in advance.

    ReplyDelete
  7. Hey Dhruba,
    It seems the link to Configurable Block Placement Policy example is broken.
    Could you please re-post it?

    Thank you.

    ReplyDelete
  8. https://github.com/facebook/hadoop-20/blob/master/src/hdfs/org/apache/hadoop/hdfs/server/namenode/BlockPlacementPolicyConfigurable.java

    ReplyDelete
  9. Thank you, Dhruba.

    ReplyDelete
  10. On read, the replicas are selected according to their distance from the client.
    How does HDFS calculate that distance?

    ReplyDelete
  11. The distance is calculated via the network topology script, more details here: http://developer.yahoo.com/hadoop/tutorial/module2.html#rack

    ReplyDelete
  12. Dhruba,
    Can you modify your git hub file movement code to move ,
    a from <192.168.1.1> to <192.168.1.2>

    Thanks and Regards
    Ram

    ReplyDelete
  13. Ram: I am not sure I understand your request.

    ReplyDelete
    Replies
    1. Thanks for the fast response. A portion was missed in the question; posting it again.
      Ram

      Delete
  14. Dhruba,
    Can you modify your GitHub file movement code to move
    a < block > from <192.168.1.1> to <192.168.1.2>?

    Thanks and Regards
    Ram

    ReplyDelete
  15. Hi,
    I have a small experimental Hadoop setup with 1 namenode and 3 datanodes (say slave1, slave2, slave3) and a replication factor of two.
    I did a CopyFromLocal command to store a small file into Hadoop.
    When I ran hadoop fsck -files -blocks -racks, I found it is stored on slave1 and slave3.
    Then when I tried CopyToLocal from slave1 and slave3 (where the data is stored), it takes more time than from slave2 (where the data is not present).
    I checked the "duration" in the datanode log file as well as

    And one more thing: the first CopyToLocal takes more time (27341012) than the corresponding later CopyToLocal (562176).

    Please tell me why this is so?
    Reetha

    ReplyDelete
  16. Reetha, HDFS does buffered IO, which means data in HDFS files will be cached in the OS buffer cache. That is the reason why your CopyToLocal times might be faster the second time around.

    Regarding why reading from slave2 is much faster, I have no idea. You could analyze the network traffic, disk IO, etc. happening simultaneously at the time of your test run to figure out what was going on.

    ReplyDelete
  17. Where can I find details on how to use this plug-in?
    My requirement is actually very simple: when I place a block, I want to place it on the same datanode as another existing block. So I just want to retrieve the location of the previous block (using FileSystem) and force HDFS to place the new block on the same set of nodes. Will this plug-in be useful for my purpose?
    Is there a simpler way to achieve what I want to implement?

    ReplyDelete
  18. We would like to store 1 of the 3 replicas off-site for (possible) recovery on a site failure.

    We have an HDFS cluster with two replicas in two different racks (rack0, rack1). Now we need to keep one more replica in another geographical location for backup. Is it possible to configure this? Do you know about datacenter rack awareness in HDFS?

    ReplyDelete
  19. Hey, can anyone tell me how you can limit the blocks to be located on a single node? I am working on this right now.

    ReplyDelete
  20. Sir, I have one doubt: if all blocks of a file are stored on the same node, how will you get scalability at processing time in the context of parallel processing?

    ReplyDelete