Saturday, February 6, 2010

Hadoop AvatarNode High Availability



Our Use-Case

The Hadoop Distributed File System's (HDFS) NameNode is a single point of failure. This has been a major stumbling block in using HDFS for a 24x7 type of deployment, and it has been a topic of discussion among a wide circle of engineers.

I am part of a team that is operating a cluster of 1200 nodes with a total size of 12 PB. This cluster is currently running hadoop 0.20. The NameNode is configured to write its transaction logs to a shared NFS filer. The biggest cause of service downtime is when we need to deploy hadoop patches to our cluster. A fix to the DataNode software is easy to deploy without cluster downtime: we deploy the new code and restart a few DataNodes at a time. We can afford to do this because the DFSClient is equipped (via multiple retries to different replicas of the same block) to handle transient DataNode outages. The major problem arises when new software has to be deployed to our NameNode. An HDFS NameNode restart for a cluster of our size typically takes about an hour, and during this time the applications that use the Hadoop service cannot run. This led us to believe that some kind of High Availability (HA) for the NameNode is needed. We wanted a design that can support a failover in about a minute.

Why AvatarNode?

We took a serious look at the options available to us. The HDFS BackupNode is not available in 0.20, and upgrading our cluster to hadoop 0.21 is not a feasible option for us because it has to go through extensive, time-consuming testing before we can deploy it in production. We started off white-boarding the simplest solution: the existence of a Primary NameNode and a Standby NameNode in our HDFS cluster. We did not have to worry about split-brain scenarios and IO-fencing methods... the administrator ensures that the Primary NameNode is really, really dead before converting the Standby NameNode into the Primary NameNode. We also did not want to change the NameNode code one single bit and wanted to build HA as a software layer on top of HDFS. We wanted a daemon that can switch from being a Primary to a Standby and vice-versa... as if switching avatars... and this is how our AvatarNode was born!

The word avatar means a variant phase or incarnation of a single entity, and we thought it appropriate that a wrapper that can make the NameNode behave in two different roles be aptly named the AvatarNode. Our design of the AvatarNode is such that it is a pure wrapper around the existing NameNode in Hadoop 0.20; thus the AvatarNode inherits the scalability, performance and robustness of the NameNode.

What is an AvatarNode?

The AvatarNode encapsulates an instance of the NameNode. You can start an AvatarNode in the Primary avatar or the Standby avatar. If you start it in the Primary avatar, it behaves exactly like the NameNode you know now; in fact, it runs exactly the same code as the NameNode. The AvatarNode in its Primary avatar writes the HDFS transaction logs to the shared NFS filer. You can then start another instance of the AvatarNode on a different machine, this time telling it to start in the Standby avatar.

The Standby AvatarNode encapsulates a NameNode and a SecondaryNameNode within it. The Standby AvatarNode continuously reads the HDFS transaction logs from the same NFS filer and keeps feeding those transactions to the encapsulated NameNode instance. The NameNode that runs within the Standby AvatarNode is kept in SafeMode. Keeping the Standby AvatarNode in SafeMode prevents it from performing any active duties of the NameNode while keeping a copy of all NameNode metadata hot in its in-memory data structures.
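
To make the Standby's role concrete, here is a minimal, self-contained sketch (plain Java, with hypothetical class and method names; it is not the actual AvatarNode code) of what the ingestion loop conceptually does: tail the edits log on the shared NFS filer and replay each new record into a namespace that never serves client traffic, just as the encapsulated NameNode stays in SafeMode.

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Illustrative sketch only -- not the HDFS-976 code. The real Standby
    // AvatarNode replays binary FSEditLog records into an encapsulated
    // NameNode held in SafeMode; this toy version uses one text record per
    // line so the loop structure is easy to see.
    public class StandbyIngestSketch {
      public static void main(String[] args) throws IOException, InterruptedException {
        String sharedEdits = args.length > 0 ? args[0] : "/mnt/nfs/dfs/name/current/edits";
        long offset = 0;                                  // how far we have replayed so far
        while (true) {
          try (RandomAccessFile edits = new RandomAccessFile(sharedEdits, "r")) {
            edits.seek(offset);
            String txn;
            while ((txn = edits.readLine()) != null) {    // read any newly appended records
              applyToStandbyNamespace(txn);
              offset = edits.getFilePointer();
            }
          }
          Thread.sleep(1000);                             // poll the filer again; interval is illustrative
        }
      }

      private static void applyToStandbyNamespace(String txn) {
        // In the real AvatarNode this feeds the transaction to the encapsulated
        // NameNode, which remains in SafeMode so it never performs active duties.
        System.out.println("replayed: " + txn);
      }
    }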

HDFS clients are configured to access the NameNode via a Virtual IP Address (VIP). At the time of failover, the administrator kills the Primary AvatarNode on machine M1 and instructs the Standby AvatarNode on machine M2 (via a command-line utility) to assume the Primary avatar. It is guaranteed that the Standby AvatarNode ingests all committed transactions because it reopens the edits log and consumes all transactions till the end of the file; this guarantee depends on the fact that NFSv3 supports close-to-open cache coherency semantics. The Standby AvatarNode finishes ingesting all transactions from the shared NFS filer and then leaves SafeMode. The administrator then switches the VIP from M1 to M2, causing all HDFS clients to start accessing the NameNode instance on M2. This is practically instantaneous and the failover typically takes a few seconds!
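
The failover itself reduces to three steps on the Standby side. A minimal sketch of that sequence, assuming a hypothetical StandbyState interface and a moveVipTo helper that stand in for the administrator's command-line utility and the VIP switch (an illustration, not the HDFS-976 interfaces):

    // Illustrative sketch only -- the names are stand-ins, not the real code.
    // Run against the Standby on M2 only after the administrator has made sure
    // the Primary on M1 is really, really dead.
    public class FailoverSketch {

      interface StandbyState {
        void replayEditsToEndOfFile();  // NFS close-to-open semantics guarantee we see
                                        // every transaction the dead Primary committed
        void leaveSafeMode();           // the encapsulated NameNode starts serving clients
      }

      static void becomePrimary(StandbyState standby) {
        standby.replayEditsToEndOfFile();
        standby.leaveSafeMode();
        moveVipTo("M2");                // clients keep using the same Virtual IP Address
      }

      static void moveVipTo(String host) {
        System.out.println("VIP now points at " + host);
      }
    }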

There is no need to run a separate instance of the SecondaryNameNode. The AvatarNode, when run in the Standby avatar, performs the duties of the SecondaryNameNode too. The Standby AvatarNode continually ingests transactions from the Primary, but once in a while it pauses the ingestion, invokes the SecondaryNameNode to create and upload a checkpoint to the Primary AvatarNode, and then resumes the ingestion of transaction logs.
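
A sketch of that checkpoint cycle, again with hypothetical names (Ingest, Checkpointer) rather than the real classes:

    // Illustrative sketch only: the Standby pauses ingestion, lets its embedded
    // SecondaryNameNode build and upload a checkpoint to the Primary, and then
    // resumes tailing the shared edits log.
    public class StandbyCheckpointSketch {
      interface Ingest {
        void pause();
        void resume();
      }
      interface Checkpointer {
        void createAndUploadCheckpoint();
      }

      static void checkpointLoop(Ingest ingest, Checkpointer checkpointer,
                                 long intervalMillis) throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
          Thread.sleep(intervalMillis);              // checkpoint period; value is illustrative
          ingest.pause();                            // stop replaying transactions
          checkpointer.createAndUploadCheckpoint();  // fsimage built and shipped to the Primary
          ingest.resume();                           // back to tailing the shared edits log
        }
      }
    }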

The AvatarDataNode

The AvatarDataNode is a wrapper around the vanilla DataNode found in hadoop 0.20. It is configured to send block reports and blockReceived messages to two AvatarNodes. The AvatarDataNode does not use the VIP to talk to the AvatarNodes (only HDFS clients use the VIP). An alternative to this approach would have been to make the Primary AvatarNode forward block reports to the Standby AvatarNode. We discounted this alternative because it adds code complexity to the Primary AvatarNode: the Primary AvatarNode would have to do buffering and flow control as part of forwarding block reports.

The Hadoop DataNode is already equipped to send (and retry if needed) blockReceived messages to the NameNode. We extend this functionality by making the AvatarDataNode send blockReceived messages to both the Primary and Standby AvatarNodes.
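
The dual reporting can be pictured with a small sketch (a hypothetical NamenodeEndpoint interface, not the real DatanodeProtocol): every report goes to both AvatarNodes independently, so neither AvatarNode has to forward anything to the other.

    import java.util.Arrays;
    import java.util.List;

    // Illustrative sketch only -- not the AvatarDataNode code.
    public class DualReportSketch {
      interface NamenodeEndpoint {
        void blockReport(List<String> blockIds);
        void blockReceived(String blockId);
      }

      private final NamenodeEndpoint primary;
      private final NamenodeEndpoint standby;

      DualReportSketch(NamenodeEndpoint primary, NamenodeEndpoint standby) {
        this.primary = primary;
        this.standby = standby;
      }

      void sendBlockReport(List<String> blockIds) {
        for (NamenodeEndpoint nn : Arrays.asList(primary, standby)) {
          try {
            nn.blockReport(blockIds);   // each AvatarNode is reported to independently
          } catch (RuntimeException e) {
            // The real datanode retries with backoff; if its buffer grows too large
            // it falls back to a full block report on the next successful call.
          }
        }
      }
    }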

Does the failover affect HDFS clients?

An HDFS client that is reading a file caches the locations of the replicas of almost all blocks of that file at the time it opens the file. This means that HDFS readers are not impacted by an AvatarNode failover.

What happens to files that were being written by a client when the failover occurred? The Hadoop 0.20 NameNode does not record new block allocations to a file until the file is closed. This means that a client that was writing to a file when the failover occurred will encounter an IO exception after a successful failover event. This is somewhat tolerable for map-reduce jobs because the Hadoop MapReduce framework retries any failed tasks. This solution also works well for HBase because HBase issues a sync/hflush call to persist HDFS file contents before marking an HBase transaction as complete. In the future, HDFS may be enhanced to record every new block allocation in the HDFS transaction log, details here. In that case, a failover event will not impact HDFS writers either.

Other NameNode Failover techniques

People have asked me to compare the AvatarNode HA implementation with Hadoop on DRBD and Linux-HA. Both approaches require that the Primary write transaction logs to a shared device. The difference is that the Standby AvatarNode is a hot standby whereas the DRBD-LinuxHA solution is a cold standby. The failover time for our AvatarNode is about a minute, whereas a DRBD-LinuxHA failover would take about an hour for our cluster, which has about 50 million files. The reason the AvatarNode can function as a hot standby is that the Standby AvatarNode has an instance of the NameNode encapsulated within it, and this encapsulated NameNode receives messages from the DataNodes, thus keeping its metadata state fully up-to-date.

Another implementation that I have heard about is from a team at the China Mobile Research Institute; their approach is described here.

Where is the code?

This code has been contributed to the Apache HDFS project via HDFS-976. A prerequisite for this patch is HDFS-966.

61 comments:

  1. Isn't this the job for a ClusterWare solution like the linux-ha project with PaceMaker and DRBD configured?

    I'm sure the system you outline works, but it seems a bit of a, well, developer's solution.

    Now you're going to have to maintain the codebase for the Avatar wrappers, but instead you could have made Red Hat support it.

  2. Thanks for your comments. The problem with just the DRBD-Linux solution is that the Standby cannot be a hot standby. In that solution, the primary NameNode writes its transaction logs to an ext3-like filesystem on a shared disk. When a failover event occurs, you have to make the shared disk accessible to the Standby machine, and then the Standby NameNode has to do the following:

    1. read in the transaction log from the ext3 filesystem
    2. process block reports from all DataNodes

    Step 1 takes about 8 minutes and step 2 takes about 25 minutes for an HDFS filesystem that has 50 million files. Thus, the total failover time for the HDFS NameNode is more than half an hour.

  3. Thanks for all of the great info on this topic (here & in other posts).

    Elsewhere, you mention you're familiar with efforts to make HDFS store transaction logs in Zookeeper (http://issues.apache.org/jira/browse/HDFS-234). Could a cluster using AvatarNodes avoid NFS using that approach? How easy or difficult might this be using existing classes/interfaces?

  4. The AvatarNode approach is very much compatible with using BookKeeper/ZooKeeper (instead of NFS). The Primary AvatarNode could write its edits logs to BookKeeper/ZooKeeper while the Standby AvatarNode could continue to read the edits log from the same BookKeeper. This will ensure that the Standby AvatarNode closely follows the transactions that occur in the primary namenode.

    HDFS-234 discusses integration of the Namenode with BookKeeper. If that is done, then there is no additional work to make it work well with the AvatarNode.

    hope this explanation helps,
    dhruba

  5. Since HDFS-966 is a prerequisite, I uploaded a patch for this JIRA. You can fetch the patch from http://issues.apache.org/jira/browse/HDFS-966

  6. Hi, I have taken the AvatarNode patch.
    Can you please help... how do I switch the Avatar standby to active if the main active node fails?


    From the README I have not found much info related to the configuration (nothing is mentioned about dfs1.http.address and related settings).

  7. hi siva, thanks for your comments. I will update the patch in jira HDFS-976. It will have better documentation. I will appreciate your feedback on it.

  8. hi Dhruba: thanks for your nice idea for namenode HA. We encounter the same problem with hdfs upgrade and restart. I know that the time consumed on an hdfs restart is spent partly on loading the fsimage and editlog, and partly on block reports, but the ratio is hard to compute because we can't reproduce the real environment on our test cluster. Is there any info about this issue?

  9. our cluster has 2000 nodes. The fsimage is about 12 GB. The cluster has a total of around 90 million blocks. It takes about 6 minutes to read and process the fsimage file. Then it takes another 35 minutes to process block reports from all datanodes.

  10. and then how many files and directories are in the 12GB fsimage? I find that the time spent in loadFSImage is almost entirely consumed in the loop of the method loadFSImage(File curFile) in FSImage.java, and the more files and directories, the more time it takes.

  11. there are a total of about 70 million files and directories. It takes about 6 minutes to read the entire 12 GB fsimage file.

  12. hi Dhruba, I noticed that when the NN restarts, it takes a long time to call DatanodeDescriptor.reportDiff() as every dn reports its blocks. I think it is not necessary to do this when the NN has just restarted. I also debugged the nn code and found that when the nn restarts, the toRemove and toInvalidate lists after reportDiff are all empty. This also leads me to think that it is not necessary to run reportDiff when restarting the NN. Did I get anything wrong?

  13. Thanks for the proposal. In fact, we have deployed such a fix to our clusters and I had posted the patch here:
    http://issues.apache.org/jira/browse/HDFS-1295

  14. Does anyone know a place to get Hadoop patches for 0.20.2? I keep getting errors when I try to use patches from

    https://issues.apache.org/jira/browse/HDFS-630

  15. hi Dhruba, in http://issues.apache.org/jira/browse/HDFS-1295 you extended the attached unit test to gather the average time to process the first block report from a datanode. Can I get that tool? That would help greatly!

  16. What's the reason for the one minute latency in failover? Is this the time for administrators to recognize the primary is dead, or is this additional system time for the failover?

    I wonder if you couldn't also have application-level reliability by letting your DFS client retry failed requests to one or more avatar standby namenodes. Since the avatar's namenode is in safe mode it would only allow that for reads, but for some applications that would still be very helpful.

  17. @luo: the patch for HDFS-1295 has the change to the unit test to measure the improvement in block report processing. I do not have a separate tool to measure it on a live system.

    @rbodkin: the failover takes less than a minute, typically 10-20 seconds. This latency is partly because we have to switch the Virtual IP Address (VIP) from one machine to another.

    On a related note, we are exploring options to make the Standby AvatarNode serve read-only requests from live applications.

  18. Dhruba,

    This does look like a cool thing. Specifically, I am loving the "unchanged" Avatar NameNode. However, I do see some problems with this approach, though I am yet to observe them in practice.

    1. It can cause problems for NameNode crashes. What the Primary AvatarNode is doing is writing its transaction log to NFS. Transaction log replay time over NFS can be orders of magnitude higher, causing delays during NameNode crash recovery.

    2. The Primary and Standby need to be on the same subnet for the VIP to be effective. This can potentially hog network bandwidth (writing and reading transaction logs continuously), causing additional delays during shuffling. The problem multiplies when you have Avatar replicas of a Federated NameNode.

    3. The problem I have often seen is a non-responsive NameNode. The NameNode takes a giant lock to process a request, causing the number of outstanding requests to build up. The AvatarNode provides HA but adds to this problem: by writing over NFS, the latency per request increases, thereby increasing the chances of an unresponsive NameNode.

    Did you observe any of the above? If yes, how did you work around it?

  19. Hi Sridhar, thanks for your comments.

    @1: I agree that NFS is not the fastest way to read/access the transaction log. However, it is not really NFS but rather the NFS implementation that could be an issue. We use a NetApp NFS filer and the filer's uptime and latencies are hard to beat! Also, we depend on NFS close-to-open cache coherency semantics: http://nfs.sourceforge.net/

    @2: The VIP is an issue. We found this while testing and we are moving to a non-VIP setup where the clients are actually aware (via zookeeper) of who the primary is. If the client is unable to connect to the currently known primary, it polls zookeeper to see if the primary has changed, and if so, it retries the connection to the new primary. This approach completely removes any dependency on the VIP.

    @3: Again, the latency supported by a NetApp NFS filer is hard to beat. I agree that if your NFS server is another commodity machine then you could suffer from increased latency. Also, the writes to the transaction log are batched and are done outside the global FSNamesystem lock.

    We measure and monitor namenode latencies. We are running a namenode with 500+ threads, and we see that the RPC call queue size hardly ever becomes non-zero, which implies that our current workload is not bottlenecked on the transaction log.

  20. For client failover via zookeeper, we have a DistributedAvatarFileSystem (DAFS) layered over DistributedFileSystem for all clients. The DAFS contacts zk to figure out who the current primary is.

    Regarding the NetApp filer issues, the latencies of a sync are on the order of 10 ms. This, by itself, is not bad, especially because the NN batches syncs and does not hold locks while doing the syncs. Also, the edits log is always written sequentially by the primary and read sequentially by the standby. Thus, reads are probably mostly serviced by the read-ahead code in the nfs client on the standby machine (I do not have hard numbers).

    > The main reason for Hadoop is to provide Reliability, Scalability and Performance over cheap commodity hardware to minimize costs. From this standpoint, NetApp Filers are not a fit in the Hadoop cluster.

    I agree with you here. It would be nice to be able to design something that uses commodity hardware but still reaches the latency/uptime/durability guarantees of a NetApp filer.

  21. Hello, where can I find code to run hdfs clients to talk to namenodes via zookeeper? Thanks.

  22. We are in the process of open-sourcing the AvatarNode integration with zookeeper. We will post this patch as part of HDFS-976 very soon.

  23. Dhruba, any advice on applying your patches to CDH3b2?.. The FSImage API seems incompatible with your code, but since I'm not sure how to track down the versions that work with your posted patches, I don't have a feel for what it would take to fix it.

  24. hi dnquark, we will post the most recent version of this patch on HDFS-976 before the end of Nov. That will be the right one for you to integrate with CDH.

  25. Hello, does HDFS-976 incorporate HBase handling, in the sense that when the zookeeper quorum becomes aware that the primary avatarnode is dead, all HBase servers, master and region, must then switch to the secondary avatarnode?

    Would you mind checking in or posting sample config to get started with avatarnode setup?

    Thanks.
    -Jack

  26. This comment has been removed by the author.

  27. hi jack, hbase is a layer on top of hdfs. When the hdfs avatarnode failover occurs, the hbase application (region servers, master, etc) will pause for a few seconds before automatically using the new primary avatarnode. We will post a new version of the avatarnode code into HDFS-976, but you can see the 0.20 code base that we are running here: http://github.com/facebook/hadoop-20-warehouse/tree/master/src/contrib/highavailability

  28. So, the namenode IP does not change? HBase does specify the namenode "url" to talk to its hdfs root dir. If the IP does change when the avatar nodes switch, then how would hbase be able to access its hdfs directory with an IP pointing to a server that's down (the downed namenode)?

  29. The AvatarNode uses zookeeper. Zookeeper has a list of the two machine names that run the AvatarNodes. All clients use a layered file system called DistributedAvatarFileSystem (DAFS). DAFS retrieves the name of the machine that is marked as "primary" in zk and then connects to it. If that fails, it rechecks with zk to see if the primary has changed, and if so, it reconnects to the new primary. All of this is completely transparent to the application.
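
    The retry loop amounts to something like the following minimal sketch (hypothetical names, not the real DAFS code): look up the current primary in zk, try the call, and on any failure re-read zk and retry against whichever machine is now marked primary.

    // Illustrative sketch only -- not the real DistributedAvatarFileSystem.
    public class DafsRetrySketch {
      interface Registry {                 // stands in for the zookeeper lookup
        String currentPrimary();
      }
      interface NamenodeCall<T> {
        T invoke(String primaryHost) throws Exception;
      }

      private final Registry registry;

      DafsRetrySketch(Registry registry) {
        this.registry = registry;
      }

      <T> T callWithFailover(NamenodeCall<T> call, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
          String primary = registry.currentPrimary();  // re-check zk on every attempt
          try {
            return call.invoke(primary);
          } catch (Exception e) {
            last = e;                                  // the primary may have just failed over
            Thread.sleep(1000L * (attempt + 1));       // simple backoff, illustrative
          }
        }
        throw last != null ? last : new IllegalStateException("maxAttempts must be > 0");
      }
    }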

  30. Dhruba,

    thank you for sharing this very valuable information, and even more for sharing the code.

    Question - controlled failovers cause no downtime, only performance degradation for DataNodes. What about failovers caused by malfunctioning? I did not see this mentioned.

    Thank you,
    Mark

  31. Failovers caused by malfunctioning are not yet covered by this approach... i.e., there is no automatic failover. If the namenode malfunctions, the cluster will come to a grinding halt until the administrator runs a manual command to trigger the failover.

  32. Dhruba,

    Is there any chance this is available in a packaged format, either deb or rpm?

    We would love to give this a try but do not have an environment set up to apply the patches.

    Thanks much,
    John

  33. The code that we are running is at https://github.com/facebook/hadoop-20-warehouse/tree/master/src/contrib/highavailability. Unfortunately, we do not have an rpm or any other packaged form: you would have to compile and package it yourself.

    The code at the above URL is about 2 months old; we have since made some important fixes to HA. We will be re-uploading our latest version of the code at the same URL very shortly.

  34. Hi Dhruba,

    Very interesting work! Can you share with us how much memory the master has?

    Thanks!
    Thanh

  35. Hi Dhruba,
    Why does the AvatarNode have to read the fstime of the remote node when it is formatted?

    Thanks!
    wl

  36. when the avatarnode standby starts up, it reads the fstime of the remote primary namenode to remember the time when the primary last checkpointed. If, somehow, the primary checkpoints again by itself (without consulting the standby), the standby can still detect this fact by looking at the fstime file.

  37. Elegant solution.

    Was curious to know what the I/O rates are for the secondary trying to keep up with the primary.

    Did you consider a fast interconnect between primary and secondary as a means for the secondary to read transactions and go to the filer only as a fallback?

    Regards
    TW @shrusamira

  38. The workload on our largest 30 PB HDFS cluster writes its transaction log at around 6 MByte/second. We do not use a fast interconnect.

  39. Is it possible to have both Avatar/NameNodes in a state where they can write? If so, you could put an LVS load balancer in front that holds the VIP and routes to the NameNodes through direct routing, which would also solve the problem of the NameNode as a bottleneck in HDFS.

    If this is not possible, I do want to make a remark on the first comment. I like pacemaker very much for its flexibility, and I think it could fit here. DRBD might be the wrong approach, but you can have shared storage like NFS, as you did, and put your NameNodes as cloned resources (or master-slave) through pacemaker on the different hosts. One of them gets the VIP. You do have a hot standby through the cloned/master-slave resource concept, and you are very flexible in scripting for it through the OCF framework. Pacemaker does the monitoring and failover for you.

    I think it would be nice to have this integration into the "standard" linux-ha suite.

  40. Hi Lars, when we talk about VIP, isn't it true that the pair of machines have to be in the same subnet?

  41. They don't need to be, but they need to have one interface in that subnet, because it is only one unique VIP that belongs to only one subnet, if I understand you right. The VIP has to be made available somehow on the machines.
    Everything else can be subnet-independent. Pacemaker often uses multicast for cluster-wide communication, which is independent of the underlying topology.

  42. Hi, would you please tell me how to configure and start it up?
    You also mentioned that one has to 'run a manual command to trigger the failover'; what is that command?
    Thank you very much.

  43. Hi Dhruba!

    I am curious about the restart time of non-avatar NameNode. You mentioned that it could take 1 hour to restart, including:
    - 6 minutes to load the 12GB fsimage
    - 35 minutes to process block reports
    So, did it take about 20 minutes to process the transaction log? What is the typical size of an edits file in your cluster?

  44. Hi Thanh, our average transaction log is about 2GB every half hour, but peaks are much higher than this.

  45. Dhruba, if I remember correctly, a checkpoint will be done once the edits file exceeds 64MB, right? So restarting a non-avatar NameNode shouldn't spend much time processing the transaction log, and the major time spent on restarting comes from processing block reports, right? Please correct me if I am wrong.

  46. absolutely right, most of the restart time is from processing block reports.

  47. Hi Dhruba. Can you shed some light on why processing block reports takes so much time?

  48. Konstantin once mentioned in his paper (http://developer.yahoo.com/blogs/hadoop/posts/2010/05/scalability_of_the_hadoop_dist/) that the NameNode is able to process about 10 block reports per second. So if we have a cluster of around 2000 nodes, it will take around 200 seconds to process all block reports. Can you comment on this?

  49. Processing a block report takes about 100 ms. This means we can process about 10 block reports per second. On namenode startup, our 3000 node cluster takes about 5 minutes to process all block reports.
    Then it takes another 10-15 minutes to exit safemode, the reason being that when the namenode decides to exit safemode it loops through all existing blocks to determine under-replicated/excess-replicated blocks. Since we have close to 300 million blocks it takes a long, long time to loop through all those blocks. Another reason is that since we use HDFS RAID, there can be lots of blocks with excess replicas, and the excess-replica deletion code in the NN is very heavy-weight.

  50. Hi Dhruba,

    I was working on a version of the AvatarNode that uses the BackupNode. But the paper for SIGMOD '11 says that this does not bring any advantage over the current approach. I thought it was because the BackupNode was not available in 0.20. I also thought that failures of the BackupNode would not affect the NameNode, but it does not look so. Could you comment on that?

    Could you also give some overview of the implementation of DAFS? I understood that you replaced the ClientProtocol proxy (dfs.namenode) with FailoverClientProtocol, which redirects requests to the currently active avatar. But I did not understand why, in the event of failover, failoverFS, a new DistributedFileSystem, is created and FailoverClientProtocol is set to point to its dfs.namenode. Why couldn't we just create a new ClientProtocol proxy?

    Thanks,
    André

  51. I am not sure what you mean by "version of Avatar Node that uses the BackupNode", can you please explain?

    My point is that if you use the BackupNode from 0.20 (or 0.22), the system will be as reliable as the combined availability of both nodes (which is typically lower than the availability of just the NN). The NN synchronously sends each transaction to the BN.

  52. The datanodes send block reports to both the namenode and the backupnode, so the latter can act as a hot standby.

  53. If you make the BackupNameNode process block reports/blockReceiveds from datanodes, then it will be hot in nature. This is one step closer to the design of the AvatarNode. The difference being that the AvatarNode still uses the shared NFS mount point to store transaction logs.

    If you decide to make the primary namenode synchronously replicate transactions to the backupnode, you would have to measure the performance of transactions. Will the response from the backupnode be as fast/reliable as an NFS appliance? If the backup node falls out of sync, how quickly/reliably does it get the full transaction log from the primary? These are some issues that are worth measuring on a real cluster.

  54. hi, I am trying to build the avatar node and the build has failed here with the jsp:
    Compiling 5 source files to C:\Chaitanya\Projects\Hadoop\Sample\facebook-hadoop-20-warehouse-bbfed86\build\classes
    [javac] C:\Chaitanya\Projects\Hadoop\Sample\facebook-hadoop-20-warehouse-bbfed86\build\src\org\apache\hadoop\hdfs\server\namenode\corrupt_005ffiles_jsp.java:77: cannot find symbol
    [javac] symbol : method getNamesystem()
    [javac] location: class org.apache.hadoop.hdfs.server.namenode.NameNode
    [javac] FSNamesystem fsn = nn.getNamesystem();
    [javac] ^
    [javac] C:\Chaitanya\Projects\Hadoop\Sample\facebook-hadoop-20-warehouse-bbfed86\build\src\org\apache\hadoop\hdfs\server\namenode\corrupt_005ffiles_jsp.java:80: cannot find symbol
    [javac] symbol : class CorruptFileBlockInfo
    [javac] location: class org.apache.hadoop.hdfs.server.namenode.FSNamesystem
    [javac] Collection corruptFileBlocks =
    [javac] ^
    [javac] C:\Chaitanya\Projects\Hadoop\Sample\facebook-hadoop-20-warehouse-bbfed86\build\src\org\apache\hadoop\hdfs\server\namenode\corrupt_005ffiles_jsp.java:81: cannot find symbol
    [javac] symbol : method listCorruptFileBlocks(java.lang.String,)
    [javac] location: class org.apache.hadoop.hdfs.server.namenode.FSNamesystem
    [javac] fsn.listCorruptFileBlocks("/", null);

    please help on this

  55. hi tiru, please do a clean build by first running "ant clean", that should solve your problem.

  56. Hi Sid...
    I have done that and I am facing this problem after that:
    ivy-retrieve-common:
    [ivy:retrieve] :: retrieving :: org.apache.hadoop#hmon [sync]
    [ivy:retrieve] confs: [common]
    [ivy:retrieve] 0 artifacts copied, 8 already retrieved (0kB/0ms)
    No ivy:settings found for the default reference 'ivy.instance'. A default instance will be used
    DEPRECATED: 'ivy.conf.file' is deprecated, use 'ivy.settings.file' instead
    :: loading settings :: file = H:\Webproject\hadoop\Hadoop\facebook-hadoop-20-warehouse-bbfed86\ivy\ivysettings.xml
    compile:
    [echo] contrib: hmon
    compile:

    BUILD FAILED
    H:\Webproject\hadoop\Hadoop\facebook-hadoop-20-warehouse-bbfed86\build.xml:522: The following error occurred while executing this line:
    H:\Webproject\hadoop\Hadoop\facebook-hadoop-20-warehouse-bbfed86\src\contrib\build.xml:30: The following error occurred while executing this line:
    H:\Webproject\hadoop\Hadoop\facebook-hadoop-20-warehouse-bbfed86\src\contrib\hod\build.xml:29: Execute failed: java.io.IOException: Cannot run program "echo" (in directory "H:\Webproject\hadoop\Hadoop\facebook-hadoop-20-warehouse-bbfed86\src\contrib\hod"): CreateProcess error=2, The system cannot find the file specified

  57. Would you mind providing a brief description of the NN/DN heartbeat process from time T1 when NN is alive, through time T2 when NN is dead but AN (Avatar Node) switchover has not yet occurred, through T3 where AN is now what DN heartbeat is talking to. Specifically, I am trying to understand how and when DN knows where to send its heartbeat report. Thanks.

  58. The datanodes send blockReceived messages and block reports to both AvatarNodes. One of the AvatarNodes runs as the primary and the other one as the standby. From the perspective of sending heartbeats, the datanode does not distinguish between the primary AvatarNode and the standby AvatarNode.

    If the datanode cannot send a report to one of the namenodes, it retries for a certain period (with backoff), but if the buffer becomes too big, it drops all the buffered blockReceiveds and remembers to send a full block report on the next successful call to that AvatarNode.

  59. Hi,

    I need to do failover testing on a cluster of 5-10 nodes.
    Can you let me know what all the cases are that I need to take care of?
