HDFS: Hadoop Research Topics

Sunday, November 21, 2010

Hadoop Research Topics

Recently, I visited a few premier educational institutes in India, e.g. Indian Institute of Technology (IIT) at Delhi and Guwahati. Most of the undergraduate students at these two institutes are somewhat familiar with Hadoop and would like to work on Hadoop related projects as part of their course work. One commonly asked question that I got from these students is what Hadoop feature can I work on?

Here are some items that I have in mind that are good topics for students to attempt if they want to work in Hadoop.

Ability to make Hadoop scheduler resource aware, especially CPU, memory and IO resources. The current implementation is based on statically configured slots.
Abilty to make a map-reduce job take new input splits even after a map-reduce job has already started.
Ability to dynamically increase replicas of data in HDFS based on access patterns. This is needed to handle hot-spots of data.
Ability to extend the map-reduce framework to be able to process data that resides partly in memory. One assumption of the current implementation is that the map-reduce framework is used to scan data that resides on disk devices. But memory on commodity machines is becoming larger and larger. A cluster of 3000 machines with 64 GB each can keep about 200TB of data in memory! It would be nice if the hadoop framework can support caching the hot set of data on the RAM of the tasktracker machines. Performance should increase dramatically because it is costly to serialize/compress data from the disk into memory for every query.
Heuristics to efficiently 'speculate' map-reduce tasks to help work around machines that are laggards. In the cloud, the biggest challenge for fault tolerance is not to handle failures but rather anomalies that makes parts of the cloud slow (but not fail completely), these impact performance of jobs.
Make map-reduce jobs work across data centers. In many cases, a single hadoop cluster cannot fit into a single data center and a user has to partition the dataset into two hadoop clusters in two different data centers.
High Availability of the JobTracker. In the current implementation, if the JobTracker machine dies, then all currently running jobs fail.
Ability to create snapshots in HDFS. The primary use of these snapshots is to retrieve a dataset that was erroneously modified/deleted by a buggy application.

The first thing for a student who wants to do any of these projects is to download the code from HDFS and MAPREDUCE. Then create an account in the bug tracking software here. Please search for an existing JIRA that describes your project; if none exists then please create a new JIRA. Then please write a design document proposal so that the greater Apache Hadoop community can deliberate on the proposal and post this document to the relevant JIRA.

If anybody else have any new project ideas, please add them as comments to this blog post.

44 comments:

AnonymousNovember 23, 2010 at 12:42 AM
Hi,

I recently wrote two emails on HDFS-dev about the same question. These were my suggestions:

you could write a developer documentation of the inner workings of HDFS
(+HBASE, +MAPREDUCE?) that could be understood by HDFS users. Additionally to
the documentation of the current state, you could include:

- Different strategies to make the NameNode distributed
- The different Approaches to append
- How does Security with Kerberos work?

One of the challenges of such a work would be to make it as easy as possible
for developers to understand some part of HDFS they're interested in.
Another challenge is to choose a documentation format and workflow that would
make it easy to keep this documentation current without much effort.

A totally other project that I also consider important for Hadoop: Help Apache
to implement an infrastructure based on GIT. This could help many projects in
the long run. If you're interested in this, you should subscribe to
infrastructure-dev(at)apache.org and get in contact with Jukka Zitting.

- Yesterday I got hit by MAPREDUCE-1283[1]. This issue by itself is of course
not enough for three months, but my idea is that you could in general have a
look over the developer tools and what's missing, what needs improvement.

- HBasene[2] is a project to store a lucene index natively on BigTable. It was
inspired by Lucandra[3]. The HBasene project however has stalled. Still it
would be a very promissing project IMHO especially considering the upcoming
talk on Googles new Index infrastructure Percolator[4] that uses BigTable to
store the index.

- A backup system on top of HBase. It should try to store similar files near
to each other so that tablet compression can work best. HBase's timestamps
could be used to hold several versions of a file and let HBase handle the
expiration of old versions of files.
As an additional task you could evaluate the feasability of installing HBase
as a backup system with office desktop computers as regionservers. This could
utilize otherwise unused hard drive space.

[1] https://issues.apache.org/jira/browse/MAPREDUCE-1283
[2] http://github.com/akkumar/hbasene
[3] http://github.com/tjake/Lucandra
[4] http://www.theregister.co.uk/2010/09/24/google_percolator/
ReplyDelete
Replies
Dhruba BorthakurNovember 27, 2010 at 11:02 PM
Thanks thkoch for your comments.
ReplyDelete
Replies
HarshJanuary 1, 2011 at 5:45 AM
Hello Dhruba,

I'd love to attack a specific one on your list -- Has work begun related to the third point feature yet? If it has been, could you guide me to the associated JIRA page so I may butt in and/or begin?
ReplyDelete
Replies
HarshJanuary 1, 2011 at 11:20 AM
Ok, I found it -- HDFS-782
ReplyDelete
Replies
Dhruba BorthakurJanuary 1, 2011 at 10:42 PM
@Harsh, here are some more details: http://issues.apache.org/jira/browse/HDFS-782
ReplyDelete
Replies
gopiJanuary 10, 2011 at 11:17 PM
@ ppl,
Could you suggest any good topic in hdfs that can be possibly done with in a semester.Thank you.
ReplyDelete
Replies
Dhruba BorthakurJanuary 11, 2011 at 10:31 AM
Another project that can be done in a month or so:

a distributed copy (similar to distcp http://hadoop.apache.org/common/docs/r0.19.2/distcp.html, but instead of copying file-after-file, copy all blocks in parallel (using the new HDFS concat API http://issues.apache.org/jira/browse/HDFS-222) and then stiching al the blocks together into appropriate files at the destination.

You can discuss this in more detail via http://issues.apache.org/jira/browse/MAPREDUCE-2257
ReplyDelete
Replies
gopiJanuary 11, 2011 at 9:47 PM
Thanks dhruba, It sounds good, I hope I can come up with something.
ReplyDelete
Replies
LuisJanuary 25, 2011 at 11:27 AM
Hello Dhruba,

I wonder if the point five (Heuristics to efficiently 'speculate'...) is related with the issue https://issues.apache.org/jira/browse/HDFS-97 or with https://issues.apache.org/jira/browse/HADOOP-3585 . Is there any document which contains more info?

Actually, I'm working in a propose to deliver high-availability in HDFS clusters, however I see you also are working on it (https://issues.apache.org/jira/browse/HDFS-839) . Is there any account where I can send you my propose and see if it could work with the point five?

Thanks,

Luis
ReplyDelete
Replies
Lizhe WangFebruary 9, 2011 at 1:35 PM
Quite good thoughts,
I agree most of your ideas.
In fact, we are developing one project of run Hadoop on multiple data centers. Some other ideas that might be interesting, like energy aware hadoop that reduce the max temperature of a hadoop cluster; provenance support for Hadoop.
ReplyDelete
Replies
Lizhe WangFebruary 9, 2011 at 1:55 PM
Another question, do you or any one can convince me that the lineally streaming copy from one block to another has advantages over alternative solutions?
How about copy data from one block to another in a binary tree or binary fat tree like method?
ReplyDelete
Replies
Dhruba BorthakurFebruary 10, 2011 at 12:54 AM
hi lizhe, i agree that there could be quite a few advantages to using a binary tree to do data copy.
ReplyDelete
Replies
Min ZhouFebruary 28, 2011 at 1:56 AM
hi Dhruba,
How do you plan to take advantage of in-memory cache reside in TaskTracker? There should be a lot of table joins among the offline analysis workloads in Facebook.

If do joins on table A and B, like select * from A join B on A.id = B.id where ..., and B is an in-memory table, A is an on-disk (on-HDFS, more accurately) one. The MR job should scan both in-memory and on-disk tables during the map stage. In-memory scans should be finished quickly, however, they need to wait for the much slower on-disk scans. Hence, it doesn't significantly improve performance.
ReplyDelete
Replies
Dhruba BorthakurFebruary 28, 2011 at 9:34 PM
Hi Min, a majority of our workload works on data (joins, filters, etc) on a subset of our data. For example, most queries are for the most recent day, followed by the most recent week. So, if we can put all the data for the last day (or last week) entirely in memory, then we can do any kind of analysis (filters, join, aggregate) purely on in-memory based data.

On the other hand, if the entire dataset is not in memory, then the problems u point out are entirely likely to occur.
ReplyDelete
Replies
Min ZhouMarch 1, 2011 at 6:49 PM
This comment has been removed by the author.
ReplyDelete
Replies
Min ZhouMarch 1, 2011 at 6:53 PM
Hi Dhruba,

Thanks for your reply. The workloads you described are very much like ours. We often scan data produced on the last day.

Even if all of those data were warmed into Tasktracker's cache, however, I think there should be another problem. The dependency relations among the SQL queries are very common. So the intermediate tables, produced by the queries, are clearly much needed to be kept in-memory for gaining a better performance. Thus more memory footprint is needed. If not, the subsequent queries could't benefit from this mechanism.
ReplyDelete
Replies
AnonymousMarch 11, 2011 at 1:10 PM
Do you have any performance metrics to evaluate whether an application is data intensive?
or any good benchmark for Hadoop in addition to terasort?
any Hadoop version of Graph computing?
ReplyDelete
Replies
Dhruba BorthakurMarch 12, 2011 at 11:00 PM
each map-reduce task reports the amount of CPU it has consumed as well as the amount of data it has processed. It is feasible to make the JT calculate the ratio of these two numbers... a lower ratio will indicate a more data-intensive program.

I do not use any Hadoop code for general purpose graph computing.
ReplyDelete
Replies
Jen GoldmanApril 5, 2011 at 3:11 PM
Hello Dhruba,

Return Path is an award winning mid-size company located outside Boulder, Colorado. Our mission is to bring an end to messaging abuse by identifying, improving, and certifying all the legitimate email in the world. We work with some of the largest email senders in the world as well as the largest ISPs. We're looking for a Hadoop Sys Admin to join the team. Learn more and apply at www.returnpath.net

Do you know of anyone that might be a good fit?
ReplyDelete
Replies
UnknownMay 12, 2011 at 8:33 AM
Dhruba,

Does it make sense to add slave nodes to a hadoop clustur running a job? Will the slave be able to participate in the current job?

Thanks.

Shridhar
ReplyDelete
Replies
Dhruba BorthakurMay 12, 2011 at 11:18 AM
@Sridhar: if u add machines to a running cluster, they will automatically be used to run tasks of currently running jobs.
ReplyDelete
Replies
Bharath RaviSeptember 14, 2011 at 5:59 PM
Hi Dhruba,

I just came across this list, as I was scanning for ideas on Hadoop to work on. I'm a newcomer to Hadoop, and am generally interested in improving scalability and "flexibility": the second point in the list especially interests me.

Is dynamic appending of input a requirement in real-world use cases? I was wondering why this is needed, when the extra input could become a second mapreduce job.
ReplyDelete
Replies
Dhruba BorthakurSeptember 14, 2011 at 6:13 PM
@Bharath: if you want to do realtime analytics, then new data is arriving even after the map-reduce job has been submitted. Currently, we will wait for the current mr job to finish and then submit a new job. Instead, it will be nice if we can add new inputsplits to the currently running mr jobs.. the end results will be available earlier
ReplyDelete
Replies
Bharath RaviSeptember 14, 2011 at 7:44 PM
@Dhruba, I see what you're saying now: this could essentially provide "streaming input" capability to MR. Sounds cool!

Any pointers on whether this is being attempted somewhere? From what I gathered from the community, it's not.

Thanks for the ideas!
ReplyDelete
Replies
Dhruba BorthakurSeptember 14, 2011 at 9:15 PM
@Bharath: I do not think anybody is attempting to do it. But there is a real need to do realtime analytics!
ReplyDelete
Replies
amindriOctober 17, 2011 at 6:59 PM
Hi Dhurba
I need to carry out a research for one year for my undergraduate program.
I'm interested about "Make map-reduce jobs work across data centers."

Has the hadoop community started ay work on this topic?
Also I need to perform a literature review on the topic from existing papers before commencing on the research. But I couldn't find much research papers regarding this topic. Any of your advices would be very valuable.
thanks :)
ReplyDelete
Replies
Dhruba BorthakurOctober 22, 2011 at 12:26 AM
@amindri: there is not much work/research on hadoop across data centers. I tried something called HighTide (https://issues.apache.org/jira/browse/HDFS-1432) and you can find the code at (https://github.com/facebook/hadoop-20/tree/master/src/hdfs/org/apache/hadoop/hdfs/server/hightidenode)

But HighTide is not yet in production in any of our clusters.
ReplyDelete
Replies
Wei YanOctober 25, 2011 at 10:17 AM
Hi, Dhruba

I have a question about the 2nd in the list. Is it a incremental computing problem? Can Percolator be used to solve this?
ReplyDelete
Replies
MadhaviNovember 30, 2011 at 6:29 AM
Hello Dhruba,
this is what you said earlier,there is not much work/research on hadoop across data centers, but when we talk about the data centers then they can be the nodes distributed in nature, then such type of research must have been done before, Isn't it?
Is it possible to study fault tolerant nature of Hadoop as a research topic? what else can be there?
your valuable suggestions are required on the same
ReplyDelete
Replies
Dhruba BorthakurNovember 30, 2011 at 11:16 AM
@Madhavi: Hadoop is by definition "fault tolerant". But one angle of research is to create a standard benchmark to measure "fault-tolerance". Then you can run it on verios versions of hadoop to figure out if its fault-tolerance is increasing or not.
ReplyDelete
Replies
MadhaviDecember 1, 2011 at 5:49 AM
Thanks a lot Dhruba. Even wanted to ask to how much level the security issues are handled. If the system is distributed in nature then with the help of replication the data is maintained, the data from the failed node can be recovered, pls throw some light on security measures on such nodes. what research can be done in this area?
Even wanted your elaboration on research areas you suggested :-
Ability to make a map-reduce job take new input splits even after a map-reduce job has already started.
ReplyDelete
Replies
shaileshJuly 27, 2012 at 10:24 AM
Hi, Dhruba

I am post graduate student.I am beginner in Hadoop and Mapreduce programming. I am pursuing post graduation in computer science and engg.I have selected load balancing in cloud as my major project. I have gone through http://wiki.apache.org/hadoop/EclipseEnvironment for hadoop development using eclipse.But getting problem. I am all the steps they mentioned.So if there some other way for hadoop development.
ReplyDelete
Replies
Walchand M.Tech I 2011September 11, 2012 at 4:11 AM
Hi, Dhruba
I am currently working on Hadoop Project as my M.Tech Study, Inspired from your Blog.
My current Project is "Accelerate Hadoop performance using Distributed Cache".

I am currently studying Hadoop code, but not getting the exact entry point that where I must add Caching code of mine so that Intermediate results will get cached and obviously performance can be increased.

Will you please suggest which classes need to modify.

Thanks
ReplyDelete
Replies
AnonymousOctober 20, 2012 at 2:21 AM
This is an excellent blog.

Facebook Applications Starting $39.99 ONLY!
ReplyDelete
Replies
Raja ThiruvathuruJanuary 8, 2013 at 12:38 AM
Wealth of information. Ton of ideas...nice blog
ReplyDelete
Replies
OVJJanuary 11, 2013 at 8:52 PM
Hi,

I have worked on Dynamic replication in our semester project. I wanted to submit that as a patch. I have replied to the original jira-782. But didn't get any reply. May I know what is the best way to communicate with hadoop developers?

Thank you,
Omkar
ReplyDelete
Replies
UnknownMarch 27, 2013 at 12:39 PM
I want to know if hdfs can tolerate byzantine faults?
ReplyDelete
Replies
Oodles TechnologiesMarch 31, 2013 at 11:13 PM
hello, you blog is a full of complate Hadoop Technology information. it is very Excellent blog ilke it.
ReplyDelete
Replies
UnknownJuly 7, 2013 at 6:36 AM
great post. for beginners do follow
http://thehadooptutorial.blogspot.com
ReplyDelete
Replies
Gaurav AsharaJanuary 22, 2014 at 9:40 AM
It really very use full guidance for any of Researcher or Hadoop Developer. Thank You

Actually i want work on hadoop security as my M.phil Research theses. I read many papers,articles lots work already done in this area can you plz suggest me any track on this area. Now I have only 1.5 Month of time for research work.

Can you Plz Help me...
ReplyDelete
Replies
UnknownMay 14, 2014 at 11:41 PM
Hi @Dhruba i want some project ideas for developing projects. Can you please share them. Thanks in advance.
ReplyDelete
Replies
AnonymousJune 18, 2014 at 11:38 AM
ha ha....Every software engineer wants to be on hadoop. When talking to these dumb technical managers even they keep saying future will be cloud, hadoop ...blah..blah...blah.... This seriously looks like that hype which we saw in 2001-2002 on EJB. For people whose business data is completely structured why do they have to go to Hadoop? I completely understand potential of Hadoop and non-sql database. The problem area which it addresses hardly 5% of the industry might be in it. A simple client-server request-response model is what is used everywhere.
ReplyDelete
Replies

Add comment

HDFS

Sunday, November 21, 2010

Hadoop Research Topics

44 comments:

Search This Blog

StatCounter

Followers