Sunday, November 21, 2010

Hadoop Research Topics

Recently, I visited a few premier educational institutes in India, e.g. Indian Institute of Technology (IIT) at Delhi and Guwahati. Most of the undergraduate students at these two institutes are somewhat familiar with Hadoop and would like to work on Hadoop related projects as part of their course work. One commonly asked question that I got from these students is what Hadoop feature can I work on?

Here are some items that I have in mind that are good topics for students to attempt if they want to work in Hadoop.
  • Ability to make Hadoop scheduler resource aware, especially CPU, memory and IO resources. The current implementation is based on statically configured slots.
  • Abilty to make a map-reduce job take new input splits even after a map-reduce job has already started.
  • Ability to dynamically increase replicas of data in HDFS based on access patterns. This is needed to handle hot-spots of data.
  • Ability to extend the map-reduce framework to be able to process data that resides partly in memory. One assumption of the current implementation is that the map-reduce framework is used to scan data that resides on disk devices. But memory on commodity machines is becoming larger and larger. A cluster of 3000 machines with 64 GB each can keep about 200TB of data in memory! It would be nice if the hadoop framework can support caching the hot set of data on the RAM of the tasktracker machines. Performance should increase dramatically because it is costly to serialize/compress data from the disk into memory for every query.
  • Heuristics to efficiently 'speculate' map-reduce tasks to help work around machines that are laggards. In the cloud, the biggest challenge for fault tolerance is not to handle failures but rather anomalies that makes parts of the cloud slow (but not fail completely), these impact performance of jobs.
  • Make map-reduce jobs work across data centers. In many cases, a single hadoop cluster cannot fit into a single data center and a user has to partition the dataset into two hadoop clusters in two different data centers.
  • High Availability of the JobTracker. In the current implementation, if the JobTracker machine dies, then all currently running jobs fail.
  • Ability to create snapshots in HDFS. The primary use of these snapshots is to retrieve a dataset that was erroneously modified/deleted by a buggy application.
The first thing for a student who wants to do any of these projects is to download the code from HDFS and MAPREDUCE. Then create an account in the bug tracking software here. Please search for an existing JIRA that describes your project; if none exists then please create a new JIRA. Then please write a design document proposal so that the greater Apache Hadoop community can deliberate on the proposal and post this document to the relevant JIRA.

If anybody else have any new project ideas, please add them as comments to this blog post.


  1. Hi,

    I recently wrote two emails on HDFS-dev about the same question. These were my suggestions:

    you could write a developer documentation of the inner workings of HDFS
    (+HBASE, +MAPREDUCE?) that could be understood by HDFS users. Additionally to
    the documentation of the current state, you could include:

    - Different strategies to make the NameNode distributed
    - The different Approaches to append
    - How does Security with Kerberos work?

    One of the challenges of such a work would be to make it as easy as possible
    for developers to understand some part of HDFS they're interested in.
    Another challenge is to choose a documentation format and workflow that would
    make it easy to keep this documentation current without much effort.

    A totally other project that I also consider important for Hadoop: Help Apache
    to implement an infrastructure based on GIT. This could help many projects in
    the long run. If you're interested in this, you should subscribe to
    infrastructure-dev(at) and get in contact with Jukka Zitting.

    - Yesterday I got hit by MAPREDUCE-1283[1]. This issue by itself is of course
    not enough for three months, but my idea is that you could in general have a
    look over the developer tools and what's missing, what needs improvement.

    - HBasene[2] is a project to store a lucene index natively on BigTable. It was
    inspired by Lucandra[3]. The HBasene project however has stalled. Still it
    would be a very promissing project IMHO especially considering the upcoming
    talk on Googles new Index infrastructure Percolator[4] that uses BigTable to
    store the index.

    - A backup system on top of HBase. It should try to store similar files near
    to each other so that tablet compression can work best. HBase's timestamps
    could be used to hold several versions of a file and let HBase handle the
    expiration of old versions of files.
    As an additional task you could evaluate the feasability of installing HBase
    as a backup system with office desktop computers as regionservers. This could
    utilize otherwise unused hard drive space.


  2. Hello Dhruba,

    I'd love to attack a specific one on your list -- Has work begun related to the third point feature yet? If it has been, could you guide me to the associated JIRA page so I may butt in and/or begin?

  3. Ok, I found it -- HDFS-782

  4. @Harsh, here are some more details:

  5. @ ppl,
    Could you suggest any good topic in hdfs that can be possibly done with in a semester.Thank you.

  6. Another project that can be done in a month or so:

    a distributed copy (similar to distcp, but instead of copying file-after-file, copy all blocks in parallel (using the new HDFS concat API and then stiching al the blocks together into appropriate files at the destination.

    You can discuss this in more detail via

  7. Thanks dhruba, It sounds good, I hope I can come up with something.

  8. Hello Dhruba,

    I wonder if the point five (Heuristics to efficiently 'speculate'...) is related with the issue or with . Is there any document which contains more info?

    Actually, I'm working in a propose to deliver high-availability in HDFS clusters, however I see you also are working on it ( . Is there any account where I can send you my propose and see if it could work with the point five?



  9. Quite good thoughts,
    I agree most of your ideas.
    In fact, we are developing one project of run Hadoop on multiple data centers. Some other ideas that might be interesting, like energy aware hadoop that reduce the max temperature of a hadoop cluster; provenance support for Hadoop.

  10. Another question, do you or any one can convince me that the lineally streaming copy from one block to another has advantages over alternative solutions?
    How about copy data from one block to another in a binary tree or binary fat tree like method?

  11. hi lizhe, i agree that there could be quite a few advantages to using a binary tree to do data copy.

  12. hi Dhruba,
    How do you plan to take advantage of in-memory cache reside in TaskTracker? There should be a lot of table joins among the offline analysis workloads in Facebook.

    If do joins on table A and B, like select * from A join B on = where ..., and B is an in-memory table, A is an on-disk (on-HDFS, more accurately) one. The MR job should scan both in-memory and on-disk tables during the map stage. In-memory scans should be finished quickly, however, they need to wait for the much slower on-disk scans. Hence, it doesn't significantly improve performance.

  13. Hi Min, a majority of our workload works on data (joins, filters, etc) on a subset of our data. For example, most queries are for the most recent day, followed by the most recent week. So, if we can put all the data for the last day (or last week) entirely in memory, then we can do any kind of analysis (filters, join, aggregate) purely on in-memory based data.

    On the other hand, if the entire dataset is not in memory, then the problems u point out are entirely likely to occur.

  14. This comment has been removed by the author.

  15. Hi Dhruba,

    Thanks for your reply. The workloads you described are very much like ours. We often scan data produced on the last day.

    Even if all of those data were warmed into Tasktracker's cache, however, I think there should be another problem. The dependency relations among the SQL queries are very common. So the intermediate tables, produced by the queries, are clearly much needed to be kept in-memory for gaining a better performance. Thus more memory footprint is needed. If not, the subsequent queries could't benefit from this mechanism.

  16. Do you have any performance metrics to evaluate whether an application is data intensive?
    or any good benchmark for Hadoop in addition to terasort?
    any Hadoop version of Graph computing?

  17. each map-reduce task reports the amount of CPU it has consumed as well as the amount of data it has processed. It is feasible to make the JT calculate the ratio of these two numbers... a lower ratio will indicate a more data-intensive program.

    I do not use any Hadoop code for general purpose graph computing.

  18. Hello Dhruba,

    Return Path is an award winning mid-size company located outside Boulder, Colorado. Our mission is to bring an end to messaging abuse by identifying, improving, and certifying all the legitimate email in the world. We work with some of the largest email senders in the world as well as the largest ISPs. We're looking for a Hadoop Sys Admin to join the team. Learn more and apply at

    Do you know of anyone that might be a good fit?

  19. Dhruba,

    Does it make sense to add slave nodes to a hadoop clustur running a job? Will the slave be able to participate in the current job?



  20. @Sridhar: if u add machines to a running cluster, they will automatically be used to run tasks of currently running jobs.

  21. Hi Dhruba,

    I just came across this list, as I was scanning for ideas on Hadoop to work on. I'm a newcomer to Hadoop, and am generally interested in improving scalability and "flexibility": the second point in the list especially interests me.

    Is dynamic appending of input a requirement in real-world use cases? I was wondering why this is needed, when the extra input could become a second mapreduce job.

  22. @Bharath: if you want to do realtime analytics, then new data is arriving even after the map-reduce job has been submitted. Currently, we will wait for the current mr job to finish and then submit a new job. Instead, it will be nice if we can add new inputsplits to the currently running mr jobs.. the end results will be available earlier

  23. @Dhruba, I see what you're saying now: this could essentially provide "streaming input" capability to MR. Sounds cool!

    Any pointers on whether this is being attempted somewhere? From what I gathered from the community, it's not.

    Thanks for the ideas!

  24. @Bharath: I do not think anybody is attempting to do it. But there is a real need to do realtime analytics!

  25. Hi Dhurba
    I need to carry out a research for one year for my undergraduate program.
    I'm interested about "Make map-reduce jobs work across data centers."

    Has the hadoop community started ay work on this topic?
    Also I need to perform a literature review on the topic from existing papers before commencing on the research. But I couldn't find much research papers regarding this topic. Any of your advices would be very valuable.
    thanks :)

  26. @amindri: there is not much work/research on hadoop across data centers. I tried something called HighTide ( and you can find the code at (

    But HighTide is not yet in production in any of our clusters.

  27. Hi, Dhruba

    I have a question about the 2nd in the list. Is it a incremental computing problem? Can Percolator be used to solve this?

  28. Hello Dhruba,
    this is what you said earlier,there is not much work/research on hadoop across data centers, but when we talk about the data centers then they can be the nodes distributed in nature, then such type of research must have been done before, Isn't it?
    Is it possible to study fault tolerant nature of Hadoop as a research topic? what else can be there?
    your valuable suggestions are required on the same

  29. @Madhavi: Hadoop is by definition "fault tolerant". But one angle of research is to create a standard benchmark to measure "fault-tolerance". Then you can run it on verios versions of hadoop to figure out if its fault-tolerance is increasing or not.

  30. Thanks a lot Dhruba. Even wanted to ask to how much level the security issues are handled. If the system is distributed in nature then with the help of replication the data is maintained, the data from the failed node can be recovered, pls throw some light on security measures on such nodes. what research can be done in this area?
    Even wanted your elaboration on research areas you suggested :-
    Ability to make a map-reduce job take new input splits even after a map-reduce job has already started.

  31. Hi, Dhruba

    I am post graduate student.I am beginner in Hadoop and Mapreduce programming. I am pursuing post graduation in computer science and engg.I have selected load balancing in cloud as my major project. I have gone through for hadoop development using eclipse.But getting problem. I am all the steps they mentioned.So if there some other way for hadoop development.

  32. Hi, Dhruba
    I am currently working on Hadoop Project as my M.Tech Study, Inspired from your Blog.
    My current Project is "Accelerate Hadoop performance using Distributed Cache".

    I am currently studying Hadoop code, but not getting the exact entry point that where I must add Caching code of mine so that Intermediate results will get cached and obviously performance can be increased.

    Will you please suggest which classes need to modify.


  33. Wealth of information. Ton of ideas...nice blog

  34. Hi,

    I have worked on Dynamic replication in our semester project. I wanted to submit that as a patch. I have replied to the original jira-782. But didn't get any reply. May I know what is the best way to communicate with hadoop developers?

    Thank you,

    1. to attend the hadoop summit...

  35. I want to know if hdfs can tolerate byzantine faults?

  36. hello, you blog is a full of complate Hadoop Technology information. it is very Excellent blog ilke it.

  37. great post. for beginners do follow

  38. Hadoop keeps track of where the data resides ,there are multiple copy stores, data stored on a server.
    Hadoop Development

  39. It really very use full guidance for any of Researcher or Hadoop Developer. Thank You

    Actually i want work on hadoop security as my M.phil Research theses. I read many papers,articles lots work already done in this area can you plz suggest me any track on this area. Now I have only 1.5 Month of time for research work.

    Can you Plz Help me...

  40. Hi @Dhruba i want some project ideas for developing projects. Can you please share them. Thanks in advance.

  41. ha ha....Every software engineer wants to be on hadoop. When talking to these dumb technical managers even they keep saying future will be cloud, hadoop ...blah..blah...blah.... This seriously looks like that hype which we saw in 2001-2002 on EJB. For people whose business data is completely structured why do they have to go to Hadoop? I completely understand potential of Hadoop and non-sql database. The problem area which it addresses hardly 5% of the industry might be in it. A simple client-server request-response model is what is used everywhere.