Comments on "HDFS: Hadoop AvatarNode High Availability", a blog post by Dhruba Borthakur.
Anonymous (2012-02-23):
Hi,
I need to do failover testing on a cluster of 5-10 nodes. Can you let me know which cases I need to take care of?

aira (2012-01-22):
Thanks for sharing it with us. Keep it up!
Dhruba Borthakur (2011-09-08):
The datanodes send blockReceived messages and block reports to both AvatarNodes. One AvatarNode runs as the primary and the other as the standby, but from the perspective of sending heartbeats the datanode does not distinguish between the primary AvatarNode and the standby AvatarNode.

If the datanode cannot send a report to one of the namenodes, it retries for a certain period (with backoff). If the buffer of pending messages becomes too big, it drops all the queued blockReceived messages and remembers to send a full block report at the next successful call to that AvatarNode.
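To illustrate the retry-and-drop behaviour described above, here is a minimal, self-contained Java sketch. It is not the actual AvatarDataNode code; the class and field names (DualReportSketch, AvatarNodeClient, ReportTarget, MAX_PENDING) and the thresholds are hypothetical. The point is that each AvatarNode gets its own queue of blockReceived notifications, a failing target is retried with backoff, and an overflowing queue is dropped in favour of a full block report on the next successful call.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Queue;

    public class DualReportSketch {

        // Stand-in for the RPC interface a datanode uses to talk to one AvatarNode.
        interface AvatarNodeClient {
            void blockReceived(List<String> blockIds) throws Exception;   // incremental notification
            void blockReport(List<String> allBlockIds) throws Exception;  // full block report
        }

        // Per-AvatarNode reporting state; the datanode keeps one instance for the
        // primary and one for the standby and treats them identically.
        static class ReportTarget {
            static final int MAX_PENDING = 10_000;     // drop threshold, arbitrary for this sketch
            final AvatarNodeClient client;
            final Queue<String> pending = new ArrayDeque<>();
            boolean needFullReport = false;
            long backoffMs = 0;

            ReportTarget(AvatarNodeClient client) { this.client = client; }

            // Called for every newly received block; the same id is enqueued for both targets.
            void enqueue(String blockId) {
                pending.add(blockId);
                if (pending.size() > MAX_PENDING) {    // buffer grew too big: drop and remember
                    pending.clear();
                    needFullReport = true;
                }
            }

            // One periodic reporting attempt against this AvatarNode.
            void flush(List<String> allBlocksOnThisDatanode) {
                try {
                    if (needFullReport) {
                        client.blockReport(allBlocksOnThisDatanode);
                        needFullReport = false;
                        pending.clear();               // the full report supersedes queued notifications
                    } else if (!pending.isEmpty()) {
                        client.blockReceived(new ArrayList<>(pending));
                        pending.clear();
                    }
                    backoffMs = 0;                     // success resets the backoff
                } catch (Exception unreachable) {      // keep retrying later, with exponential backoff
                    backoffMs = Math.min(backoffMs == 0 ? 1_000 : backoffMs * 2, 60_000);
                }
            }
        }

        public static void main(String[] args) {
            AvatarNodeClient reachable = new AvatarNodeClient() {
                public void blockReceived(List<String> ids) { System.out.println("incremental: " + ids.size()); }
                public void blockReport(List<String> ids)   { System.out.println("full report: " + ids.size()); }
            };
            AvatarNodeClient unreachable = new AvatarNodeClient() {
                public void blockReceived(List<String> ids) throws Exception { throw new Exception("down"); }
                public void blockReport(List<String> ids)   throws Exception { throw new Exception("down"); }
            };

            ReportTarget primary = new ReportTarget(reachable);
            ReportTarget standby = new ReportTarget(unreachable);   // pretend the standby is not responding

            List<String> allBlocks = new ArrayList<>();
            for (int i = 1; i <= 25_000; i++) {
                String id = "blk_" + i;
                allBlocks.add(id);
                primary.enqueue(id);
                standby.enqueue(id);
                if (i % 5_000 == 0) {          // periodic reporting tick
                    primary.flush(allBlocks);  // sends incremental batches of 5,000
                    standby.flush(allBlocks);  // keeps failing; its queue eventually overflows
                }
            }
            System.out.println("standby flagged for full block report: " + standby.needFullReport);
        }
    }

Running main with one reachable and one unreachable target shows the first receiving incremental batches while the second ends up flagged for a full block report.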
Paul (2011-09-08):
Would you mind providing a brief description of the NN/DN heartbeat process from time T1 when the NN is alive, through time T2 when the NN is dead but the AvatarNode switchover has not yet occurred, through T3 when the AvatarNode is now what the DN heartbeat is talking to? Specifically, I am trying to understand how and when the DN knows where to send its heartbeat report. Thanks.

tiru (2011-09-08):
Hi Sid,
I have done that and I am facing this problem after that:

    ivy-retrieve-common:
    [ivy:retrieve] :: retrieving :: org.apache.hadoop#hmon [sync]
    [ivy:retrieve]   confs: [common]
    [ivy:retrieve]   0 artifacts copied, 8 already retrieved (0kB/0ms)
    No ivy:settings found for the default reference 'ivy.instance'. A default instance will be used
    DEPRECATED: 'ivy.conf.file' is deprecated, use 'ivy.settings.file' instead
    :: loading settings :: file = H:\Webproject\hadoop\Hadoop\facebook-hadoop-20-warehouse-bbfed86\ivy\ivysettings.xml
    compile:
    [echo] contrib: hmon
    compile:

    BUILD FAILED
    H:\Webproject\hadoop\Hadoop\facebook-hadoop-20-warehouse-bbfed86\build.xml:522: The following error occurred while executing this line:
    H:\Webproject\hadoop\Hadoop\facebook-hadoop-20-warehouse-bbfed86\src\contrib\build.xml:30: The following error occurred while executing this line:
    H:\Webproject\hadoop\Hadoop\facebook-hadoop-20-warehouse-bbfed86\src\contrib\hod\build.xml:29: Execute failed: java.io.IOException: Cannot run program "echo" (in directory "H:\Webproject\hadoop\Hadoop\facebook-hadoop-20-warehouse-bbfed86\src\contrib\hod"): CreateProcess error=2, The system cannot find the file specified

Dhruba Borthakur (2011-09-05):
Hi tiru, please do a clean build by first running "ant clean"; that should solve your problem.

tiru (2011-09-05):
Hi, I am trying to build the AvatarNode and the build has failed here with the jsp:

    Compiling 5 source files to C:\Chaitanya\Projects\Hadoop\Sample\facebook-hadoop-20-warehouse-bbfed86\build\classes
    [javac] C:\Chaitanya\Projects\Hadoop\Sample\facebook-hadoop-20-warehouse-bbfed86\build\src\org\apache\hadoop\hdfs\server\namenode\corrupt_005ffiles_jsp.java:77: cannot find symbol
    [javac] symbol  : method getNamesystem()
    [javac] location: class org.apache.hadoop.hdfs.server.namenode.NameNode
    [javac] FSNamesystem fsn = nn.getNamesystem();
    [javac] ^
    [javac] C:\Chaitanya\Projects\Hadoop\Sample\facebook-hadoop-20-warehouse-bbfed86\build\src\org\apache\hadoop\hdfs\server\namenode\corrupt_005ffiles_jsp.java:80: cannot find symbol
    [javac] symbol  : class CorruptFileBlockInfo
    [javac] location: class org.apache.hadoop.hdfs.server.namenode.FSNamesystem
    [javac] Collection corruptFileBlocks =
    [javac] ^
    [javac] C:\Chaitanya\Projects\Hadoop\Sample\facebook-hadoop-20-warehouse-bbfed86\build\src\org\apache\hadoop\hdfs\server\namenode\corrupt_005ffiles_jsp.java:81: cannot find symbol
    [javac] symbol  : method listCorruptFileBlocks(java.lang.String,)
    [javac] location: class org.apache.hadoop.hdfs.server.namenode.FSNamesystem
    [javac] fsn.listCorruptFileBlocks("/", null);

Please help with this.
Dhruba Borthakur (2011-07-10):
If you make the BackupNameNode process block reports and blockReceiveds from the datanodes, then it will be hot in nature. This is one step closer to the design of the AvatarNode; the difference is that the AvatarNode still uses the shared NFS mount point to store transaction logs.

If you decide to make the primary namenode synchronously replicate transactions to the BackupNode, you would have to measure the performance of transactions: will the response from the BackupNode be as fast and reliable as an NFS appliance? If the BackupNode falls out of sync, how quickly and reliably does it get the full transaction log from the primary? These are some issues that are worth measuring on a real cluster.

Anonymous (2011-07-09):
The datanodes send block reports to both the namenode and the backupnode, so the latter can act as a hot standby.

Dhruba Borthakur (2011-07-06):
I am not sure what you mean by "version of Avatar Node that uses the BackupNode"; can you please explain?

My point is that if you use the BackupNode from 0.20 (or 0.22), the system will only be as reliable as the combined availability of both nodes (which is typically lower than the availability of just the NN), because the NN synchronously sends each transaction to the BN.
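A quick back-of-envelope illustration of that combined-availability point (the 99.9% figures below are illustrative, not measurements): with synchronous logging to the BN, the NN can only commit a transaction while both nodes are up, so if the NN and the BN are each independently available 99.9% of the time, the pair is writable only about 0.999 x 0.999 = 99.8% of the time, which is lower than the availability of the NN on its own.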
Anonymous (2011-07-03):
Hi Dhruba,

I was working on a version of the AvatarNode that uses the BackupNode, but the SIGMOD '11 paper says that this does not bring any advantage over the current approach. I thought that was because the BackupNode was not available in 0.20. I also thought that failures of the BackupNode would not affect the NameNode, but that does not seem to be the case. Could you comment on that?

Could you also give some overview of the implementation of DAFS? I understood that you replaced the ClientProtocol proxy (dfs.namenode) with FailoverClientProtocol, which redirects requests to the currently active avatar. But I did not understand why, in the event of a failover, failoverFS (a new DistributedFileSystem) is created and FailoverClientProtocol is set to point to its dfs.namenode. Why couldn't we just create a new ClientProtocol proxy?

Thanks,
André

Dhruba Borthakur (2011-07-03):
Processing a block report takes about 100 ms, which means we can process about 10 block reports per second. On namenode startup, our 3000-node cluster takes about 5 minutes to process all block reports.

It then takes another 10-15 minutes to exit safemode. The reason is that when the namenode decides to exit safemode, it loops through all existing blocks to determine under-replicated and excess-replicated blocks, and since we have close to 300 million blocks it takes a very long time to loop through all of them. Another reason is that since we use HDFS RAID, there can be lots of blocks with excess replicas, and the excess-replica deletion code in the NN is very heavy-weight.
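Spelling out the arithmetic behind those startup numbers (assuming block reports are processed essentially one at a time): 3000 datanodes x ~100 ms per report is roughly 300 seconds, i.e. about 5 minutes. At the same ~10 reports per second, the 2000-node case raised in the next comment works out to roughly 200 seconds.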
Thanh D. Do (2011-07-03):
Konstantin once mentioned in his paper (http://developer.yahoo.com/blogs/hadoop/posts/2010/05/scalability_of_the_hadoop_dist/) that the NameNode is able to process about 10 block reports per second. So if we have a cluster of around 2000 nodes, it will take around 200 seconds to process all the block reports. Can you comment on this? Thanks.

Thanh D. Do (2011-07-03):
Hi Dhruba. Can you shed some light on why processing block reports takes so much time?

Dhruba Borthakur (2011-06-27):
Absolutely right, most of the restart time is from processing block reports.

Thanh D. Do (2011-06-27):
Dhruba, if I remember correctly, a checkpoint is done once the edits file exceeds 64MB, right? So restarting a non-avatar NameNode shouldn't spend much time processing the transaction log, and the major time spent on restart comes from processing block reports, right? Please correct me if I am wrong.

Dhruba Borthakur (2011-06-27):
Hi Thanh, our average transaction log is about 2GB every half hour, but peaks are much higher than this.
Thanh D. Do (2011-06-26):
Hi Dhruba!

I am curious about the restart time of a non-avatar NameNode. You mentioned that it could take 1 hour to restart, including:
- 6 minutes to load the 12GB fsimage
- 35 minutes to process block reports
So, did it take about 20 minutes to process the transaction log? What is the typical size of an edits file in your cluster?

Anonymous (2011-06-12):
Hi, would you please tell me how to configure and start this up? You also mentioned that one 'runs a manual command to trigger the failover'; what is that command? Thank you very much.
Lars Fronius (2011-05-27):
It belongs. The machines don't need to, but they need to have one interface in that subnet, because there is only one unique VIP and it belongs to only one subnet, if I understand you right? It has to be made available somehow on the machines. Everything else can be subnet-independent. Pacemaker often uses multicast for cluster-wide communication, which is independent of the underlying topology.

Dhruba Borthakur (2011-05-25):
Hi Lars, when we talk about a VIP, isn't it true that the pair of machines has to be in the same subnet?

Lars Fronius (2011-05-25):
Is it possible to have both Avatar/NameNodes in a state where they can write? If so, you could put an LVS load balancer in front that holds the VIP and routes to the NameNodes through direct routing, which would also solve the problem of the NameNode as a bottleneck in HDFS.

If this is not possible, I want to make a remark on the first comment. I like Pacemaker very much for its flexibility, and I think it could fit here. DRBD might be the wrong approach, but you can keep shared storage like NFS, as you did, and run your NameNodes as cloned resources (or master/slave) through Pacemaker on the different hosts; one of them gets the VIP. You get a hot standby through the cloned/master-slave resource concept, and you are very flexible in scripting for it through the OCF framework. Pacemaker does the monitoring and failover for you.

I think it would be nice to have this integrated into the "standard" Linux-HA suite.

Dhruba Borthakur (2011-05-22):
The workload on our largest 30 PB HDFS cluster writes the transaction log at around 6 MByte/second. We do not use a fast interconnect.
Anonymous (2011-05-20):
Elegant solution.

I was curious to know what the I/O rates are for the secondary trying to keep up with the primary.

Did you consider a fast interconnect between primary and secondary as a means for the secondary to read transactions, going to the filer only as a fallback?

Regards,
TW @shrusamira

Dhruba Borthakur (2011-03-12):
When the AvatarNode standby starts up, it reads the fstime of the remote primary namenode to remember the time when the primary last checkpointed. If, somehow, the primary checkpoints again by itself (without consulting the standby), the standby can still detect this fact by looking at the fstime file.
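To make that fstime check concrete, here is a minimal Java sketch. It assumes the fstime file holds a single timestamp written as one long (the format used by FSImage in Apache HDFS 0.20) and that the primary's name directory is reachable from the standby, for example over the shared NFS mount; the class and variable names are hypothetical, not the actual AvatarNode code.

    import java.io.DataInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;

    public class FstimeCheckSketch {

        // fstime holds a single long: the time of the last checkpoint.
        static long readCheckpointTime(File fstime) throws IOException {
            try (DataInputStream in = new DataInputStream(new FileInputStream(fstime))) {
                return in.readLong();
            }
        }

        public static void main(String[] args) throws IOException {
            // args[0]: the primary's name directory, visible to the standby (e.g. the shared NFS mount).
            File fstime = new File(args[0], "current/fstime");

            // At startup the standby remembers when the primary last checkpointed.
            long rememberedAtStartup = readCheckpointTime(fstime);

            // Later (for example before ingesting more of the transaction log), re-read and compare.
            long current = readCheckpointTime(fstime);
            if (current != rememberedAtStartup) {
                System.out.println("Primary checkpointed on its own: fstime moved from "
                    + rememberedAtStartup + " to " + current);
            } else {
                System.out.println("No independent checkpoint detected; fstime still " + current);
            }
        }
    }

If the value read later differs from the value remembered at startup, the standby knows the primary checkpointed without its involvement.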