Monday, September 14, 2009

HDFS block replica placement in your hands now!

Most Hadoop administrators set the default replication factor for their files to three. The main assumption here is that keeping three copies of the data keeps it safe. I have observed this to hold true in the big clusters that we manage and operate. In actuality, administrators are guarding against two distinct failure modes: data corruption and data availability.

If all the datanodes holding the replicas of a block catch fire at the same time, then that data is lost and cannot be recovered. Likewise, if an administrative error causes all existing replicas of a block to be deleted, the failure is catastrophic. This is data corruption. On the other hand, if a rack switch goes down for some time, the datanodes on that rack are inaccessible during that period. When the faulty rack switch is fixed, the data on the rack rejoins the HDFS cluster and life goes on as usual. This is a data availability issue; in this case data was not corrupted or lost, it was just unavailable for a while. HDFS keeps three copies of a block on three different datanodes to protect against true data corruption, and it also tries to distribute those three replicas across more than one rack to protect against data availability issues. Because HDFS actively monitors for failed datanodes and, upon detecting a failure, immediately schedules re-replication of any under-replicated blocks, three copies of data on three different nodes are sufficient to avoid corrupted files.

HDFS uses a simple but highly effective policy to allocate replicas for a block. If a process running on one of the HDFS cluster nodes opens a file for writing a block, then one replica of that block is allocated on the same machine on which the client is running. The second replica is allocated on a randomly chosen rack that is different from the rack of the first replica. The third replica is allocated on a randomly chosen machine on that same remote rack. This means that a block is present on exactly two distinct racks. One point to note is that there is no relationship between the locations of replicas of different blocks of the same file; each block is allocated independently.
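The default policy described above can be sketched in a few lines. This is a minimal, self-contained simulation of the decision logic, not the actual NameNode code; the node and rack names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Sketch of the default replica placement: replica 1 on the writer's node,
// replica 2 on a random node of a randomly chosen remote rack,
// replica 3 on a different node of that same remote rack.
class DefaultPlacementSketch {
    static final Random rnd = new Random();

    static List<String> chooseTargets(String writerNode,
                                      Map<String, List<String>> rackToNodes) {
        // Find the rack that contains the writer.
        String writerRack = null;
        for (Map.Entry<String, List<String>> e : rackToNodes.entrySet()) {
            if (e.getValue().contains(writerNode)) writerRack = e.getKey();
        }

        List<String> targets = new ArrayList<>();
        // 1st replica: local to the writing client.
        targets.add(writerNode);

        // 2nd replica: random node on a randomly chosen remote rack.
        List<String> remoteRacks = new ArrayList<>(rackToNodes.keySet());
        remoteRacks.remove(writerRack);
        String remoteRack = remoteRacks.get(rnd.nextInt(remoteRacks.size()));
        List<String> remoteNodes = new ArrayList<>(rackToNodes.get(remoteRack));
        targets.add(remoteNodes.remove(rnd.nextInt(remoteNodes.size())));

        // 3rd replica: a different node on the same remote rack.
        targets.add(remoteNodes.remove(rnd.nextInt(remoteNodes.size())));
        return targets;
    }

    public static void main(String[] args) {
        Map<String, List<String>> topology = Map.of(
            "/rack0", List.of("node0", "node1", "node2"),
            "/rack1", List.of("node3", "node4", "node5"));
        System.out.println(chooseTargets("node1", topology));
    }
}
```

Note that the block ends up on two racks but three nodes: losing either rack leaves at least one live replica.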

The above algorithm is great for availability and scalability. However, there are scenarios where co-locating many blocks of the same file on the same set of datanodes or racks is beneficial for performance. For example, if many blocks of the same file are present on the same datanode, a single mapper instance could process all those blocks using CombineFileInputFormat. Similarly, if a dataset contains many small files that are co-located on the same datanodes or racks, CombineFileInputFormat lets fewer mapper instances process all those files together. And if an application always uses one dataset together with another (think Hive or Pig joins), co-locating those two datasets on the same set of datanodes is beneficial.

Another reason one might want to allocate replicas using a different policy is to ensure that replicas and their parity blocks truly reside in different failure domains. The erasure coding work in HDFS could effectively bring the physical replication factor of a file down to about 1.5 (while keeping the logical replication factor at 3) if it can place the replicas of all blocks in a stripe more intelligently.

Yet another reason, however exotic, is to allow HDFS to place replicas based on a heat map of your cluster: if one node in the cluster is at a higher temperature than another, it might be better to prefer the cooler node when allocating a new replica. And if you want to experiment with HDFS spanning two data centers, you might want to try out new policies for replica placement.

Well, now you can finally get your hands dirty! HDFS-385 is part of the Hadoop trunk and will be part of the next major HDFS 0.21 release. This feature provides a way for the adventurous developer to write Java code that specifies how HDFS should allocate replicas of the blocks of a file. The API is experimental in nature, and could change in the near future if we discover any inefficiencies in it. Please let the Hadoop community know if you need any changes in this API or if you come across novel uses of it.
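As a sketch of how a custom policy gets wired in: the NameNode can be pointed at a pluggable placement class via `hdfs-site.xml`. This assumes the configuration key introduced by HDFS-385; the class name below is hypothetical.

```xml
<!-- hdfs-site.xml: point the NameNode at a custom placement policy.
     com.example.MyPlacementPolicy is a hypothetical class extending
     org.apache.hadoop.hdfs.server.namenode.BlockPlacementPolicy. -->
<property>
  <name>dfs.block.replicator.classname</name>
  <value>com.example.MyPlacementPolicy</value>
</property>
```

The custom class must be on the NameNode's classpath, and the default behavior is preserved if the property is left unset.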

24 comments:

  1. Correct me if I am wrong: using this HDFS-385 patch, we can make HDFS store all blocks of a file on the same node (physically)?

    ReplyDelete
  2. Hi Jenium, your reasoning is correct. You can now supply your own block placement policies to HDFS.

    ReplyDelete
  3. Dhruba, thanks for the answer. Maybe this is not the best place to ask, but what happened to the HDFS snapshot capability feature? Is anybody working on it right now? If not, why not? Incremental snapshot-based backup is only one of many use cases. We actually have at least two of them at our company.

    -Vladimir

    ReplyDelete
  4. Nobody is working on HDFS snapshots (as far as I know). Here is a very preliminary design document that I had written:

    http://issues.apache.org/jira/browse/HDFS-233

    ReplyDelete
  5. Here is an example of how we wrote our own block placement policy:
    https://github.com/facebook/hadoop-20-append/blob/master/src/hdfs/org/apache/hadoop/hdfs/server/namenode/BlockPlacementPolicyConfigurable.java

    ReplyDelete
    Replies
    1. Hello Sir,
      I am implementing my own replication policy for Hadoop as part of my dissertation. Please guide me. Also, the link you mentioned is not working; it gives a "404 page not found" error.

      Delete
  6. Thanks a lot for your kind help. But I still have difficulty working with Hadoop under Eclipse. Please suggest simple ways if you have any. I think your help is very useful for me. Thank you in advance.

    ReplyDelete
  7. Hey Dhruba,
    It seems the link to Configurable Block Placement Policy example is broken.
    Could you please re-post it?

    Thank you.

    ReplyDelete
  8. https://github.com/facebook/hadoop-20/blob/master/src/hdfs/org/apache/hadoop/hdfs/server/namenode/BlockPlacementPolicyConfigurable.java

    ReplyDelete
  9. Thank you, Dhruba.

    ReplyDelete
  10. On read, the replicas are selected according to their distance from the client.
    How does HDFS calculate that distance?

    ReplyDelete
  11. The distance is calculated via the network topology script, more details here: http://developer.yahoo.com/hadoop/tutorial/module2.html#rack

    ReplyDelete
  12. Dhruba,
    Can you modify your git hub file movement code to move ,
    a from <192.168.1.1> to <192.168.1.2>

    Thanks and Regards
    Ram

    ReplyDelete
  13. Ram: I am not sure I understand your request.

    ReplyDelete
    Replies
    1. Thanks for the fast response. A portion was missed in the question; posting it again.
      Ram

      Delete
  14. Dhruba,
    Can you modify your GitHub file movement code to move
    a < block > from <192.168.1.1> to <192.168.1.2>?

    Thanks and Regards
    Ram

    ReplyDelete
  15. Hi,
    I have a small experimental Hadoop setup with 1 namenode and 3 datanodes (say slave1, slave2, slave3) and a replication factor of two.
    I did a CopyFromLocal command to store a small file into Hadoop.
    When I ran hadoop fsck -files -blocks -racks, I found it is stored on slave1 and slave3.
    Then when I tried CopyToLocal from slave1 and slave3 (where the data is stored), it takes more time than from slave2 (where the data is not present).
    I checked the "duration" in the datanode log file as well as

    And one more thing: the first CopyToLocal takes more time (27341012) than the corresponding later CopyToLocal (562176).

    Please tell me why this is so?
    Reetha

    ReplyDelete
  16. Reetha, HDFS does buffered IO, which means data in HDFS files will be cached in the OS buffer cache. That is the reason why your CopyToLocal times might be faster the second time around.

    Regarding why reading from slave2 is much faster, I have no idea. You could analyze the network traffic, disk IO, etc. happening simultaneously at the time of your test run to figure out what was going on.

    ReplyDelete
  17. Where can I find details on how to use this plug-in?
    My requirement is actually very simple: when I place a block, I want to place it on the same datanode as another existing block. So I just want to retrieve the location of the previous block (using FileSystem) and force HDFS to place the new block on the same set of nodes. Will this plug-in be useful for my purpose?
    Is there a simpler way to achieve what I want to implement?

    ReplyDelete
  18. We would like to store 1 of the 3 replicas off-site for (possible) recovery on a site failure.

    We have an HDFS cluster with two replicas in two different racks (rack0, rack1). Now we need to keep one more replica in another geographical location for backup. Is it possible to configure this? Do you know about datacenter rack awareness in HDFS?

    ReplyDelete
  19. Hey, can anyone tell me how you can limit the blocks to be located on a single node? I am working on this right now.

    ReplyDelete
  20. Sir, I have one doubt: if all blocks of a file are stored on the same node, how will you get scalability at processing time in the context of parallel processing?

    ReplyDelete