tag:blogger.com,1999:blog-85560037863248049732024-03-05T03:46:42.021-08:00HDFSThese are my musings related to the Apache Hadoop Project.Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.comBlogger21125tag:blogger.com,1999:blog-8556003786324804973.post-33185423869907155902012-05-05T22:48:00.001-07:002012-05-05T22:49:23.375-07:00Hadoop and Solid State Drives<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
Is there a story for the Hadoop Storage Stack (HDFS+HBase) on <a href="http://en.wikipedia.org/wiki/Solid-state_drive">Solid State Drives (SSDs)</a>? This is a question I have been asked by quite a few people in the last two days, mostly by people at the <a href="http://opencompute.org/summit-2012/">OpenCompute Summit</a>. This piece discusses the possible use cases for SSDs with Hadoop or HBase.<br />
<br />
<b>Use Case</b><br />
Currently, there are two primary use cases for HDFS: <a href="http://borthakur.com/ftp/sigmodwarehouse2010.pdf">data warehousing</a> using map-reduce and a key-value store via <a href="http://hbase.apache.org/">HBase</a>. In the data warehouse case, data is mostly accessed sequentially from HDFS, so there isn't much benefit from using an SSD to store data. In a data warehouse, a large portion of queries access only recent data, so one could argue that keeping the last few days of data on SSDs could make queries run faster. But most of our map-reduce jobs are CPU bound (decompression, deserialization, etc.) and bottlenecked on map-output-fetch; reducing the data access time from HDFS does not impact the latency of a map-reduce job. Another use case would be to put map outputs on SSDs; this could potentially reduce map-output-fetch times and is one option that needs some benchmarking.<br />
<br />
For the second use case, HDFS+HBase could theoretically use the full potential of SSDs to make online-transaction-processing workloads run faster. This is the use case that the rest of this blog post tries to address.<br />
<br />
<b>Background</b><br />
The read/write latency of data from an SSD is orders of magnitude smaller than that of spinning disk storage; this is especially true for random reads and writes. For example, a random read from an SSD takes about 30 microseconds while a random read from a spinning disk takes 5 to 10 milliseconds. Also, an SSD device can support 100K to 200K operations/sec while a spinning disk controller can typically issue only 200 to 300 ops/sec. This means that random reads/writes are not a bottleneck on SSDs. On the other hand, most of our existing database technology is designed to store data on spinning disks, so the natural question is "can these databases harness the full potential of the SSDs?" To answer this question, we ran two separate artificial random-read workloads, one on HDFS and one on HBase. The goal was to stretch these products to the limit and establish their maximum sustainable throughput on SSDs.<br />
<div>
<br /></div>
<div>
<div>
<b>HDFS random-read on cached data</b></div>
<div>
In the first experiment, we created an HDFS cluster with a single NameNode and a single DataNode. We created a 2 GB HDFS file with an HDFS block size of 256 MB and a replication factor of 1. We configured the DataNode to run on a machine with 16 hyper-threaded cores, and it stored block data on xfs. Our benchmark program was co-located on the DataNode machine and had hdfs-read-shortcircuit switched on, i.e. the DFSClient bypassed the DataNode and issued read calls directly to the local xfs filesystem. The entire 2 GB of data was cached in the OS buffer cache and this benchmark did not trigger any IO from disk. The fact that all the data was in the OS cache essentially simulated the behavior of an ultra-fast SSD. We varied the number of client threads, and each client thread did a pre-configured number of 16K read calls from HDFS. Since there were only 8 blocks in the file, the DFSClient cached the locations of all these 8 blocks and there were no repeated calls to the NameNode. The first few iterations of this test showed that HDFS can sustain a max random-read throughput of around 50K ops/sec, but surprisingly the CPU was not maxed out. We found that the read-shortcircuit code path spent considerable time in DNS lookup calls and in updating metric counters. We fixed these two pieces of code and observed that HDFS could sustain a peak random-read throughput of around 92K ops/sec, with the CPUs now close to 95% usage. <a href="http://borthakur.com/ftp/HdfsPreadImage.jpg">HdfsPreadImage</a> is a plot that captures this scenario. The takeaway is that a database that is layered above HDFS would not be able to utilize all the IOPS offered by a single SSD.</div>
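<div>
<br /></div>
<div>
For readers who want to try a similar measurement, here is a minimal sketch of the kind of client-side random-read loop used in this test. The file path, read size, single thread, and the exact short-circuit configuration key are illustrative assumptions (the property name differs across Hadoop versions); it is not the actual benchmark code.</div>
<pre>
import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// A sketch of a single-threaded random-read driver against an HDFS file
// that is assumed to be fully resident in the OS buffer cache.
public class HdfsRandomReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed property name; older releases used a differently named key.
    conf.setBoolean("dfs.client.read.shortcircuit", true);
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/bench/data2g");          // hypothetical 2 GB test file
    long fileLen = fs.getFileStatus(file).getLen();
    int readSize = 16 * 1024;                       // 16K preads, as in the experiment
    byte[] buf = new byte[readSize];
    Random rand = new Random();
    int numReads = 100000;

    FSDataInputStream in = fs.open(file);
    long start = System.currentTimeMillis();
    for (int i = 0; i < numReads; i++) {
      long pos = (long) (rand.nextDouble() * (fileLen - readSize));
      in.readFully(pos, buf, 0, readSize);          // positional read; no seek state
    }
    long elapsedMs = System.currentTimeMillis() - start;
    in.close();
    System.out.println(numReads * 1000L / Math.max(1, elapsedMs) + " reads/sec");
  }
}
</pre>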
<div>
<br /></div>
<div>
A profiled run of the HDFS code shows that the DFSClient's code paths are quite long and cause an appreciable impact on throughput for cached random reads. If data-fetch times are in the millisecond range (from spinning disks), the long code paths in the DFSClient do not add appreciable overhead, but when the data is cached in the OS cache (or on SSDs), these code paths need some major rework. Another option would be to write an HDFS read-only client in C or C++, thereby avoiding some of the overhead of the current Java-based DFSClient.</div>
<div>
<br /></div>
<div>
<b>HBase random-get on cached data</b></div>
<div>
In the second experiment, we ran a similar test on HBase. We created a single table with a single region and all data was cached in the OS cache of a single HBase regionserver. The OS cache again simulates a super-fast SSD device. We used a set of 4 client machines to drive random-get calls to the regionserver. The regionserver was configured to use a maximum of 2000 threads. The HBase table had LZO compression and delta-encoding-fast-diff enabled. Since the data set was cached in OS buffers, this benchmark did not cause any disk IO from spinning disks. We saw that the HBase throughput maxed out at around 35K ops/sec and we were not able to drive the CPU usage on that machine to more than 45%. Heavy lock contention and heavy context switching prevent the regionserver from using all the available CPU on the machine. The detailed chart is at <a href="http://borthakur.com/ftp/Cache4G.jpg">Cache4G</a>.</div>
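<div>
<br /></div>
<div>
A rough sketch of one client thread in this kind of random-get driver is shown below. The table name, column family, row-key format, and key-space size are illustrative assumptions, and the calls shown are from the older HTable-based (HBase 0.90-era) client API; the real benchmark spread the load over several client machines and many threads.</div>
<pre>
import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// A sketch of one client thread issuing random point reads against a region
// whose data is assumed to be fully cached on the regionserver.
public class HBaseRandomGetSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "bench");        // assumed table name
    byte[] family = Bytes.toBytes("cf");             // assumed column family
    long keySpace = 10 * 1000 * 1000L;               // assumed number of rows
    int numGets = 100000;
    Random rand = new Random();

    long start = System.currentTimeMillis();
    int found = 0;
    for (int i = 0; i < numGets; i++) {
      long k = (long) (rand.nextDouble() * keySpace);
      Get get = new Get(Bytes.toBytes(String.format("row%010d", k)));
      get.addFamily(family);
      Result result = table.get(get);                // one random get
      if (!result.isEmpty()) found++;
    }
    long elapsedMs = System.currentTimeMillis() - start;
    table.close();
    System.out.println(found + " non-empty results, "
        + numGets * 1000L / Math.max(1, elapsedMs) + " gets/sec");
  }
}
</pre>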
</div>
<div>
<br /></div>
<div>
<div>
<b>What does this mean</b></div>
<div>
The two experiments show that HBase+HDFS, as it stands today, will not be able to harness the full potential that is offered by SSDs. It is possible that some code restructuring could improve the random-read-throughput of these solutions but my guess is that it will need significant engineering time to make HBase+HDFS sustain a throughput of 200K ops/sec.</div>
</div>
<div>
<br /></div>
<div>
<div>
These results are not unique to HBase+HDFS. Experiments on other non-Hadoop databases show that they also need to be re-engineered to achieve SSD-capable throughputs. My conclusion is that database and storage technologies would need to be developed from scratch if we want to utilize the full potential of Solid State Devices. The search is on for these new technologies!</div>
</div>
<div>
<br /></div>
<div>
<br /></div>
</div>Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com31tag:blogger.com,1999:blog-8556003786324804973.post-52114989446942138002012-02-19T21:21:00.002-08:002012-02-20T10:39:57.091-08:00Salient features for a BigData Benchmark<div><span>Recently, I was asked to write up about my vision of a BigData Benchmark. That begs the question: <i>What is BigData</i>? Does it refer to a dataset that is large in size, and if so, what is <i>large</i>? Does it refer to the type of data that such a data store contains? Shall we refer to BigData only if it does not conform to a relational schema? Here are some of my random thoughts.</span></div><div><span><br /></span></div><div><span>Software industry professionals have started to use the term BigData to refer to data sets that are typically many orders of magnitude larger than traditional databases. The largest Oracle database or the largest NetApp filer could be many hundred terabytes at most, but BigData refers to storage sets that can scale to many hundred petabytes. Thus, the first and foremost characteristic of a BigData store is that a single instance of it can be <a href="http://hadoopblog.blogspot.com/2010/05/facebook-has-worlds-largest-hadoop.html">many petabytes</a> in size. These data stores can have a multitude of interfaces, starting from traditional <a href="http://hive.apache.org/">SQL-like</a> queries to customized <a href="http://hbase.apache.org/">key-value </a>access methods. Some of them are batch systems while others are interactive systems. Again, some of them are organized for full-scan, index-free access while others have fine-grain indexes and low-latency access. How can we design benchmarks for such a wide variety of data stores? Most benchmarks focus on latency and throughput of queries, and rightly so. However, in my opinion, the key to designing a BigData benchmark lies in understanding the deeper commonalities of these systems. A BigData benchmark should measure latencies and throughput, but with a great deal of variation in the workload, skew in the data set, and in the presence of faults. </span><span style="font-size: 100%; ">I list below some of the common characteristics that distinguish BigData installations from other data storage systems.</span></div><div><span><br /></span></div><div><span><b>Elasticity of resources</b></span></div><div><span>A primary feature of a BigData system is that it should be elastic in nature. One should be able to add software and hardware resources when needed. Most BigData installations do not want to pre-provision for all the data that they might collect in the future, and the trick to be cost-efficient is to be able to add resources to a production store without incurring downtime. A BigData system typically has the ability to decommission parts of </span><span>the hardware and software without off-lining the service, so that obsolete or defective hardware can be repl</span><span style="font-family: Georgia, serif; ">aced dynamically. In my mind, this is one of the most important features of a BigData system, thus a benchmark </span><span style="font-family: Georgia, serif; ">should be able to measure this feature.
The benchmark should be such that we can add resources to and re</span><span style="font-family: Georgia, serif; ">move resources from the system while the benchmark is executing.</span></div><div><span><br /></span></div><div><span><b>Fault Tolerance</b></span></div><div><span>The Elasticity feature described above indirectly implies that the system has to be fault-tolerant. If a workload is running on your system and some parts of the system fail, the other parts of the </span><span style="font-family: Georgia, serif; ">system should configure themselves to share the work of the failed parts. This means that the service does not fail even in the face of some component failures. The benchmark should me</span><span style="font-family: Georgia, serif; ">asure this aspect of BigData systems. One simple option could be that the benchmark itself introduces component failures as part of its execution.</span></div><div><span><br /></span></div><div><span><div><b>Skew in the data set</b></div><div>Many big data systems take in un-curated data. That means there are always data points that are extreme outliers and introduce hotspots in the system. The workload on a BigData system is not uniform; some small parts of it are major <i>hotspots</i> and incur a tremendously higher load than the rest of the system. Our benchmarks should be designed to operate on datasets that have large skew and introduce workload <i>hotspots</i>.</div><div><br /></div><div>There are a few previous attempts to define a unified benchmark for BigData. DeWitt and Stonebraker touched upon a few areas in their <a href="http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf">SIGMOD paper</a>. They describe experiments that use a grep task, a join task and a simple SQL aggregation query. But none of those experiments are run in the presence of system faults, nor do they add or remove hardware while the experiment is in progress. Similarly, the <a href="http://research.yahoo.com/node/3202">YCSB benchmark</a> proposed by Cooper and Ramakrishnan suffers from the same deficiency.</div><div><br /></div><div>How would I run the experiments proposed by DeWitt and Stonebraker? Here are some of my early thoughts:</div><div><ul><li>Focus on a 100-node experiment only. This is the setting that is appropriate for BigData systems.</li><li>Increase the number of URLs such that the data set is at least a few hundred terabytes.</li><li>Make the benchmark run for at least an hour or so. The workload should be a set of multiple queries. Pace the workload so that there are constant fluctuations in the number of in-flight queries.</li><li>Introduce skew in the data set. The URL data should be such that maybe 0.1% of those URLs occur 1000 times more frequently than other URLs.</li><li>Introduce system faults by killing one of the 100 nodes every minute, keeping it shut down for a minute, then bringing it back online, and continuing this process with the remainder of the nodes until the entire benchmark is done.</li></ul></div><div>My hope is that there is somebody out there who can repeat the experiments with the modified settings listed above and present their findings. This research would greatly benefit the BigData community of users and developers!</div><div><br /></div><div>On a side note, I am working with some of my esteemed colleagues to document a specific data model and custom workload for online serving of queries from a multi-petabyte BigData system.
I will write about it in greater detail in a future post.</div></span></div>Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com8tag:blogger.com,1999:blog-8556003786324804973.post-76814038675786767652011-07-02T18:02:00.000-07:002011-07-02T18:32:40.480-07:00Realtime Hadoop usage at Facebook: The Complete StoryI had earlier blogged about <a href="http://hadoopblog.blogspot.com/2011/05/realtime-hadoop-usage-at-facebook-part.html">why</a> Facebook is starting to use Apache Hadoop technologies to serve realtime <a href="http://hadoopblog.blogspot.com/2011/05/realtime-hadoop-usage-at-facebook-part_28.html">workloads</a>. We presented the paper at the <a href="http://www.sigmod2011.org/ind_list.shtml">SIGMOD 2011</a> conference and it was very well received. <div><br /></div><div>Here is a <a href="http://borthakur.com/ftp/RealtimeHadoopSigmod2011.pdf">link</a> to the complete paper for those who are interested in understanding the details of why we decided to use Hadoop technologies, the workloads that we have on realtime Hadoop, the enhancements that we made to Hadoop to support our workloads, and the processes and methodologies we have adopted to deploy these workloads successfully. A shortened version of the first two sections of the paper is also available in the <a href="http://borthakur.com/ftp/SIGMODRealtimeHadoopPresentation.pdf">slides</a> that you can find <a href="http://borthakur.com/ftp/SIGMODRealtimeHadoopPresentation.pdf">here</a>.</div>Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com20tag:blogger.com,1999:blog-8556003786324804973.post-376297426937752362011-05-28T00:56:00.000-07:002011-06-02T22:29:51.221-07:00Realtime Hadoop usage at Facebook -- Part 2 - Workload TypesThis is the second part of our <a href="http://www.sigmod2011.org/ind_list.shtml">SIGMOD-2011</a> paper that describes our use cases for <a href="http://hadoop.apache.org/hdfs/">Apache Hadoop</a> and <a href="http://hbase.apache.org/">Apache HBase</a> in realtime workloads. You can find the <a href="http://hadoopblog.blogspot.com/2011/05/realtime-hadoop-usage-at-facebook-part.html">first part here</a>. We describe why Hadoop and HBase fit the requirements of each of these applications.
<br /><span><span>
<br /><b><font class="Apple-style-span" size="5">OUR WORKLOADS</font></b></span></span><div><span><span>
<br /></span></span><div><span><span><span><span>Before deciding on a particular software stack and whether or not to move away from our MySQL-based architecture, we looked at a few specific applications where existing solutions may be problematic. These use cases would have workloads that are challenging to scale because of very high write throughput, massive datasets, unpredictable growth, or other patterns that may be difficult or suboptimal in a sharded RDBMS environment.</span></span></span></span></div><div><span><span><span><span>
<br /><b><font class="Apple-style-span" size="5">1. Facebook Messaging</font></b></span></span></span></span></div><div><span><span><span><span><b><font class="Apple-style-span" size="4">
<br /></font></b>The latest generation of Facebook Messaging combines existing Facebook messages with e-mail, chat, and SMS. In addition to persisting all of these messages, a new threading model also requires messages to be stored for each participating user. As part of the application server requirements, each user will be sticky to a single data center.
<br />
<br /></span></span></span></span></div><div><span><span><span><span><b>1.1 High Write Throughput</b></span></span></span></span></div><div><span><span><span><span>With an existing rate of millions of messages and billions of instant messages every day, the volume of ingested data would be very large from day one and only continue to grow. The denormalized requirement would further increase the number of writes to the system as each message could be written several times.</span></span></span></span></div><div><span><span><span><span>
<br /><b>1.2 Large Tables
<br /></b>As part of the product requirements, messages would not be deleted unless explicitly deleted by the user, so each mailbox would grow indefinitely. As is typical with most messaging applications, messages are read only a handful of times when they are recent, and then are rarely looked at again. As such, a vast majority would not be read from the database but must be available at all times and with low latency, so archiving would be difficult. Storing all of a user’s thousands of messages meant that we’d have a database schema that was indexed by the user with an ever-growing list of threads and messages. With this type of random write workload, write performance will typically degrade in a system like MySQL as the number of rows in the table increases. The sheer number of new messages would also mean a heavy write workload, which could translate to a high number of random IO operations in this type of system.</span></span></span></span></div><div><span><span><span><span>
<br /><b>1.3 Data Migration
<br /></b>One of the most challenging aspects of the new Messaging product was the new data model. This meant that all existing users’ messages needed to be manipulated and joined for the new threading paradigm and then migrated to the new system. The ability to perform large scans, random access, and fast bulk imports would help to reduce the time spent migrating users to the new system.</span></span></span></span></div><div><span><span><span><span>
<br /><font class="Apple-style-span" size="5"><b>2 Facebook Insight</b>s</font></span></span></span></span></div><div><span><span><span><span><font class="Apple-style-span" size="4">
<br /></font>Facebook Insights provides developers and website owners with access to real-time analytics related to Facebook activity across websites with social plugins, Facebook Pages, and Facebook Ads. Using anonymized data, Facebook surfaces activity such as impressions, click through rates and website visits. These analytics can help everyone from businesses to bloggers gain insights into how people are interacting with their content so they can optimize their services. Domain and URL analytics were previously generated in a periodic, offline fashion through our Hadoop and Hive analytics data warehouse. However, this does not yield a rich user experience as the data is only available several hours after it has occurred.</span></span></span></span></div><div><span><span><span><span>
<br /><font class="Apple-style-span" size="3"><b>2.1 Realtime Analytics
<br /></b></font>The insights teams wanted to make statistics available to their users within seconds of user actions rather than the hours previously supported. This would require a large-scale, asynchronous queuing system for user actions as well as systems to process, aggregate, and persist these events. All of these systems need to be fault-tolerant and support more than a million events per second.</span></span></span></span></div><div><span><span><span><span>
<br /><b>2.2 High Throughput Increments
<br /></b>To support the existing Insights functionality, time- and demographic-based aggregations would be necessary. However, these aggregations must be kept up-to-date and thus processed on the fly, one event at a time, through numeric counters. With millions of unique aggregates and billions of events, this meant a very large number of counters with an even larger number of operations against them.</span></span></span></span></div><div><span><span><span><span>
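<br />
As a concrete illustration of the counter pattern described above, the sketch below applies one event to an aggregate using HBase's atomic increment call. The table name, column family, and aggregate-key format are assumptions for illustration and are not the actual Insights schema; the API shown is the older HTable-based client API.
<br />
<pre>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: apply a single impression event to a per-domain, per-hour counter.
public class CounterIncrementSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "insights_counters");   // assumed table name

    String aggregateKey = "example.com#2011052810";          // assumed key format
    long newValue = table.incrementColumnValue(
        Bytes.toBytes(aggregateKey),
        Bytes.toBytes("stats"),                               // assumed column family
        Bytes.toBytes("impressions"),
        1L);                                                  // atomic +1 applied on the server

    System.out.println("impressions so far: " + newValue);
    table.close();
  }
}
</pre>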
<br /><b><font class="Apple-style-span" size="5">3. Facebook Metrics System</font></b></span></span></span></span></div><div><span><span><span><span><font class="Apple-style-span" size="4"><b>
<br /></b></font>At Facebook, all hardware and software feed statistics into a metrics collection system called ODS (Operations Data Store). For example, we may collect the amount of CPU usage on a given server or tier of servers, or we may track the number of write operations to an HBase cluster. For each node or group of nodes we track hundreds or thousands of different metrics, and engineers will ask to plot them over time at various granularities. While this application has hefty requirements for write throughput, some of the bigger pain points with the existing MySQL-based system are around the resharding of data and the ability to do table scans for analysis and time roll-ups. This use-case is gearing up to be in production very shortly.
<br />
<br /></span></span></span></span></div><div><span><span><span><span><b>3.1 Automatic Sharding</b></span></span></span></span></div><div><span><span><span><span>The massive number of indexed and time-series writes and the unpredictable growth patterns are difficult to reconcile on a sharded MySQL setup. For example, a given product may only collect ten metrics over a long period of time, but following a large rollout or product launch, the same product may produce thousands of metrics. With the existing system, a single MySQL server may suddenly be handling much more load than it can handle, forcing the team to manually re-shard data from this server onto multiple servers.</span></span></span></span></div><div><span><span><span><span>
<br /><b>3.2 Fast Reads of Recent Data and Table Scans
<br /></b>The vast majority of reads to the metrics system are for very recent, raw data; however, all historical data must also be available. Recently written data should be available quickly, but the entire dataset will also be periodically scanned in order to perform time-based rollups.</span></span></span></span></div></div><div><span><span><span><span>
<br /></span></span></span></span></div><div><span><span><span><span><meta charset="utf-8"><span class="Apple-style-span" style="color: rgb(102, 102, 102); font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 14px; line-height: 15px; ">(<i>Credit to the authors of the paper: Dhruba Borthakur, Kannan Muthukkaruppan, Karthik Ranganathan, Samuel Rash, Joydeep Sen Sarma, Jonathan Gray, Nicolas Spiegelberg, Hairong Kuang, Dmytro Molkov, Aravind Menon, Rodrigo Schmidt, Amitanand Aiyer)</i></span></span></span></span></span></div>Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com8tag:blogger.com,1999:blog-8556003786324804973.post-87612207876135196472011-05-17T08:50:00.000-07:002011-05-18T21:39:20.122-07:00Realtime Hadoop usage at Facebook -- Part 1<span class="Apple-style-span"></span><span><span><span class="Apple-style-span" >Facebook recently deployed Facebook Messages, its first-ever user-facing application built on the Apache Hadoop platform. It uses <a href="http://hadoop.apache.org/hdfs/">HDFS</a> and <a href="http://hbase.apache.org/">HBase</a> as core technologies for this solution. Since then, many more applications have started to use HBase. We have gained some experience in deploying and operating HDFS and HBase at petabyte scale for realtime workloads and decided to write a paper detailing some of these insights. This paper will be published in SIGMOD 2011. </span></span></span><div><span><span><span class="Apple-style-span" >
<br /></span></span></span></div><div><span><span><span class="Apple-style-span" >You will be able to find the full paper here later; for now, here are some highlights:
<br />
<br /></span></span></span></div><div><span><span><span class="Apple-style-span" ><b>WHY HADOOP AND HBASE</b></span></span></span></div><div><span><span><span class="Apple-style-span" ><b>
<br /></b> The requirements for the storage system for our workloads can be summarized as follows:</span></span></span></div><div><span><span><span class="Apple-style-span" >
<br /><b>1. Elasticity</b>: We need to be able to add incremental capacity to our storage systems with minimal overhead and no downtime. In some cases we may want to add capacity rapidly and the system should automatically balance load and utilization across new hardware.</span></span></span></div><div><span><span><span class="Apple-style-span" >
<br /><b>2. High write throughput</b>: Most of the applications store (and optionally index) tremendous amounts of data and require high aggregate write throughput.</span></span></span></div><div><span><span><span class="Apple-style-span" >
<br /><b>3. Efficient and low-latency strong consistency semantics within a data center:</b> There are important applications like Messages that require strong consistency within a data center. This requirement often arises directly from user expectations. For example, ‘unread’ message counts displayed on the home page and the messages shown in the inbox page view should be consistent with respect to each other. While a globally distributed strongly consistent system is practically impossible, a system that could at least provide strong consistency within a data center would make it possible to provide a good user experience. We also knew that (unlike other Facebook applications) Messages was easy to federate so that a particular user could be served entirely out of a single data center, making strong consistency within a single data center a critical requirement for the Messages project. Similarly, other projects, like realtime log aggregation, may be deployed entirely within one data center and are much easier to program if the system provides strong consistency guarantees.</span></span></span></div><div><span><span><span class="Apple-style-span" >
<br /><b>4. Efficient random reads from disk</b>: In spite of the widespread use of application level caches (whether embedded or via memcached), at Facebook scale, a lot of accesses miss the cache and hit the back-end storage system. MySQL is very efficient at performing random reads from disk and any new system would have to be comparable.</span></span></span></div><div><span><span><span class="Apple-style-span" >
<br /><b>5. High Availability and Disaster Recovery:</b> We need to provide a service with very high uptime to users that covers both planned and unplanned events (examples of the former being events like software upgrades and addition of hardware/capacity and the latter exemplified by failures of hardware components). We also need to be able to tolerate the loss of a data center with minimal data loss and be able to serve data out of another data center in a reasonable time frame.
<br /><b>6. Fault Isolation:</b> Our long experience running large farms of MySQL databases has shown us that fault isolation is critical. Individual databases can and do go down, but only a small fraction of users are affected by any such event. Similarly, in our warehouse usage of Hadoop, individual disk failures affect only a small part of the data and the system quickly recovers from such faults.</span></span></span></div><div><span><span><span class="Apple-style-span" >
<br /><b>7. Atomic read-modify-write primitives: </b>Atomic increments and compare-and-swap APIs have been very useful in building lockless concurrent applications and are a must-have from the underlying storage system.</span></span></span></div><div><span><span><span class="Apple-style-span" >
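<br />
To make the compare-and-swap requirement concrete, here is a small illustrative sketch using HBase's checkAndPut call, which applies a Put only if the checked cell still holds the expected value. The table, family, qualifier, and values are assumptions for illustration, not an actual Facebook schema.
<br />
<pre>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: a lockless state transition using a server-side compare-and-swap.
public class CheckAndPutSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "app_state");            // assumed table name

    byte[] row = Bytes.toBytes("user123");
    byte[] family = Bytes.toBytes("s");
    byte[] qualifier = Bytes.toBytes("status");

    Put put = new Put(row);
    put.add(family, qualifier, Bytes.toBytes("archived"));

    // Applied only if the current status is still "active"; no client-side lock needed.
    boolean applied = table.checkAndPut(
        row, family, qualifier, Bytes.toBytes("active"), put);
    System.out.println("state transition applied: " + applied);
    table.close();
  }
}
</pre>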
<br /><b>8. Range Scans:</b> Several applications require efficient retrieval of a set of rows in a particular range. For example, all of the last 100 messages for a given user, or the hourly impression counts over the last 24 hours for a given advertiser.
<br />
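A small illustrative sketch of such a range scan is shown below: it retrieves up to the 100 most recent rows for one user, assuming a row-key layout of userId followed by '#' and a reverse timestamp so that newer messages sort first. The table name, column family, and key layout are assumptions for illustration only.
<br />
<pre>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: scan the contiguous key range that holds all rows for one user.
public class RangeScanSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "messages");              // assumed table name

    // All keys for user "u42" fall between "u42#" (inclusive) and "u42$" (exclusive).
    Scan scan = new Scan(Bytes.toBytes("u42#"), Bytes.toBytes("u42$"));
    scan.addFamily(Bytes.toBytes("m"));                       // assumed column family
    scan.setCaching(100);                                     // rows fetched per RPC

    ResultScanner scanner = table.getScanner(scan);
    int count = 0;
    for (Result r : scanner) {
      count++;
      if (count == 100) break;                                // only the latest 100 rows
    }
    scanner.close();
    table.close();
    System.out.println("rows scanned: " + count);
  }
}
</pre>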
<br /></span></span></span></div><div><span><span><span class="Apple-style-span" >It is also worth pointing out non-requirements:
<br />
<br /></span></span></span></div><div><span><span><span class="Apple-style-span" ><b>1. Tolerance of network partitions within a single data center</b>: Different system components are often inherently centralized. For example, MySQL servers may all be located within a few racks, and network partitions within a data center would cause major loss in serving capabilities therein. Hence every effort is made to eliminate the possibility of such events at the hardware level by having a highly redundant network design. </span></span></span></div><div><span><span><span class="Apple-style-span" ><b>
<br /></b></span></span></span></div><div><span><span><span class="Apple-style-span" ><b>2. Zero Downtime in case of individual data center failure:</b> In our experience such failures are very rare, though not impossible. In a less than ideal world where the choice of system design boils down to the choice of compromises that are acceptable, this is one compromise that we are willing to make given the low occurrence rate of such events. We might revise this non-requirement at a later time.</span></span></span></div><div><span><span><span class="Apple-style-span" ><b>
<br /></b></span></span></span></div><div><span><span><span class="Apple-style-span" ><b>3. Active-active serving capability across different data centers</b>: As mentioned before, we were comfortable making the assumption that user data could be federated across different data centers (based ideally on user locality). Latency (when user and data locality did not match up) could be masked by using an application cache close to the user. </span></span></span></div><div><span><span><span class="Apple-style-span" >
<br />Some less tangible factors were also at work. Systems with existing production experience for Facebook and in-house expertise were greatly preferred. When considering open-source projects, the strength of the community was an important factor. Given the level of engineering investment in building and maintaining systems like these, it also made sense to choose a solution that was broadly applicable (rather than adopt point solutions based on differing architecture and codebases for each workload).
<br />
<br /></span></span></span></div><div><span><span><span class="Apple-style-span" >After considerable research and experimentation, we chose Hadoop and HBase as the foundational storage technology for these next-generation applications. The decision was based on the state of HBase at the point of evaluation as well as our confidence in addressing the features that were lacking at that point via in-house engineering. HBase already provided a highly consistent, high write-throughput key-value store. The HDFS NameNode stood out as a central point of failure, but we were confident that our HDFS team could build a highly-available NameNode (<a href="http://hadoopblog.blogspot.com/2010/02/hadoop-namenode-high-availability.html">AvatarNode</a>) in a reasonable time-frame, and this would be useful for our warehouse operations as well. Good disk read-efficiency seemed to be within striking reach (pending adding Bloom filters to HBase’s version of LSM Trees, making local DataNode reads efficient and caching NameNode metadata). Based on our experience operating the Hive/Hadoop warehouse, we knew HDFS was stellar in tolerating and isolating faults in the disk subsystem. The failure of entire large HBase/HDFS clusters was a scenario that ran against the goal of fault-isolation, but could be considerably mitigated by storing data in smaller HBase clusters. Wide area replication projects, both in-house and within the HBase community, seemed to provide a promising path to achieving disaster recovery.
<br />
<br /></span></span></span></div><div><span><span><span class="Apple-style-span" >HBase is massively scalable and delivers fast random writes as well as random and streaming reads. It also provides row-level atomicity guarantees, but no native cross-row transactional support. From a data model perspective, column-orientation gives extreme flexibility in storing data, and wide rows allow the creation of billions of indexed values within a single table. HBase is ideal for workloads that are write-intensive, need to maintain a large amount of data and large indices, and need the flexibility to scale out quickly.
<br />
<br /></span></span></span></div><div><span><span><span class="Apple-style-span" >HBase is now being used by many other workloads internally at Facebook. I will describe these different workloads in a later post.
<br />
<br /></span></span></span></div><div><span><span><span class="Apple-style-span" >(<i>Credit to the authors of the paper: Dhruba Borthakur, Kannan Muthukkaruppan, Karthik Ranganathan, Samuel Rash, Joydeep Sen Sarma, Jonathan Gray, Nicolas Spiegelberg, Hairong Kuang, Dmytro Molkov, Aravind Menon, Rodrigo Schmidt, Amitanand Aiyer)</i></span></span></span></div>Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com20tag:blogger.com,1999:blog-8556003786324804973.post-57843268878526419002011-04-14T00:15:00.001-07:002011-04-14T00:27:01.775-07:00Data warehousing at Facebook<meta charset="utf-8"><span class="Apple-style-span"><div style="color: rgb(102, 102, 102); "><span class="Apple-style-span">Many people have asked me to describe the <i>best practices </i>that we have adopted to run a multi-PB data warehouse using Hadoop. Most of the details were described in a <span class="Apple-style-span"><a href="http://borthakur.com/ftp/sigmodwarehouse2010.pdf">paper</a></span> that we presented at SIGMOD 2010. This document refers to our state of affairs as it was about a year back, but is still an interesting read. Below is the abstract of this paper. You can find the <a href="http://borthakur.com/ftp/sigmodwarehouse2010.pdf">complete paper here.</a></span></div><div style="color: rgb(102, 102, 102); "><span class="Apple-style-span">
<br /></span></div><div style="color: rgb(102, 102, 102); "><span class="Apple-style-span">ABSTRACT</span></div><div style="color: rgb(102, 102, 102); "><span class="Apple-style-span" ><meta charset="utf-8"><span class="Apple-style-span" style="color: rgb(51, 51, 51); font-family: 'Times New Roman'; ">Scalable analysis on large data sets has been core to the functions of a number of teams at Facebook - both engineering and non-engineering. Apart from ad hoc analysis of data and creation of business intelligence dashboards by analysts across the company, a number of Facebook's site features are also based on analyzing large data sets. These features range from simple reporting applications like Insights for the Facebook Advertisers, to more advanced kinds such as friend recommendations. In order to support this diversity of use cases on the ever increasing amount of data, a flexible infrastructure that scales up in a cost effective manner, is critical. We have leveraged, authored and contributed to a number of open source technologies in order to address these requirements at Facebook. These include Scribe, Hadoop and Hive which together form the cornerstones of the log collection, storage and analytics infrastructure at Facebook. In this paper we will present how these systems have come together and enabled us to implement a data warehouse that stores more than 15PB of data (2.5PB after compression) and loads more than 60TB of new data (10TB after compression) every day. We discuss the motivations behind our design choices, the capabilities of this solution, the challenges that we face in day-to-day operations and future capabilities and improvements that we are working on.</span></span></div><div><span class="Apple-style-span" >
<br /></span></div><div style="color: rgb(102, 102, 102); ">Facebook has open-sourced the version of Apache Hadoop that we use to power our production clusters. You can find more details about our usage of Hadoop at the <a href="http://www.facebook.com/notes/facebook-engineering/looking-at-the-code-behind-our-three-uses-of-apache-hadoop/468211193919">Facebook Engineering Blog</a>.</div></span>Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com5tag:blogger.com,1999:blog-8556003786324804973.post-28055493852717208212010-11-21T08:22:00.000-08:002010-11-21T08:49:05.123-08:00Hadoop Research TopicsRecently, I visited a few premier educational institutes in India, e.g. the Indian Institutes of Technology (IIT) at Delhi and Guwahati. Most of the undergraduate students at these two institutes are somewhat familiar with Hadoop and would like to work on Hadoop-related projects as part of their course work. One commonly asked question that I got from these students is <i>what Hadoop feature can I work on?</i><div><br /></div><div>Here are some items that I have in mind that are good topics for students to attempt if they want to work on Hadoop.<div><div><ul><li><span class="Apple-style-span" style="font-size: 15.8333px; ">Ability to make the Hadoop scheduler resource-aware, especially for CPU, memory and IO resources. The current implementation is based on statically configured <i>slots.</i></span></li><li><span class="Apple-style-span" style="font-size: 15.8333px; ">Ability to make a map-reduce job take new input splits even after the job has already started.</span></li><li><span class="Apple-style-span" style="font-size: 15.8333px; ">Ability to dynamically increase replicas of data in HDFS based on access patterns. This is needed to handle hot-spots of data.</span></li><li><span class="Apple-style-span" style="font-size: 15.8333px; ">Ability to extend the map-reduce framework to be able to process data that resides partly in memory. One assumption of the current implementation is that the map-reduce framework is used to scan data that resides on disk devices. But memory on commodity machines is becoming larger and larger. A cluster of 3000 machines with 64 GB each can keep about 200TB of data in memory! It would be nice if the Hadoop framework could support caching the hot set of data in the RAM of the tasktracker machines. Performance should increase dramatically because it is costly to serialize/compress data from the disk into memory for every query.</span></li><li><span class="Apple-style-span" style="font-size: 15.8333px; ">Heuristics to efficiently 'speculate' map-reduce tasks to help work around machines that are laggards. In the cloud, the biggest challenge for fault tolerance is not handling outright failures but rather anomalies that make parts of the cloud slow (but not fail completely); these impact the performance of jobs.</span></li><li><span class="Apple-style-span" style="font-size: 15.8333px; "></span><span class="Apple-style-span" style="font-size: 15.8333px; ">Make map-reduce jobs work across data centers. In many cases, a single Hadoop cluster cannot fit into a single data center and a user has to partition the dataset into two Hadoop clusters in two different data centers.</span></li><li><span class="Apple-style-span" style="font-size: 15.8333px; "></span><span class="Apple-style-span" style="font-size: 15.8333px; ">High Availability of the JobTracker.
In the current implementation, if the JobTracker machine dies, then all currently running jobs fail.</span></li><li><span class="Apple-style-span" style="font-size: 15.8333px; ">Ability to create snapshots in HDFS. The primary use of these snapshots is to retrieve a dataset that was erroneously modified/deleted by a buggy application.</span></li></ul><div>The first thing for a student who wants to do any of these projects is to download the code from <a href="http://hadoop.apache.org/hdfs/version_control.html">HDFS</a> and <a href="http://hadoop.apache.org/mapreduce/version_control.html">MAPREDUCE</a>. Then create an account in the <a href="http://hadoop.apache.org/hdfs/issue_tracking.html">bug tracking</a> software here. Please search for an existing JIRA that describes your project; if none exists then please create a new JIRA. Then please write a design document and post it to the relevant JIRA so that the greater Apache Hadoop community can deliberate on the proposal.</div><div><br /></div><div>If anybody else has any new project ideas, please add them as comments to this blog post.</div></div></div></div>Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com44tag:blogger.com,1999:blog-8556003786324804973.post-87485355859795927372010-06-29T22:50:00.001-07:002010-06-30T00:55:53.000-07:00America’s Most Wanted – a metric to detect faulty machines in Hadoop<div style="text-align: center;"><br /></div><b>Handling failures in Hadoop</b><div><b></b><span class="Apple-style-span" style=" ;font-size:15.8333px;"><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;">I have been asked many, many questions about the failure rates of machines in our Hadoop cluster. These questions vary from the innocuous</span></span><i><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;"> how much time do you spend every day fixing bad machines in your cluster</span></span></i><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;"> to the more involved ones like </span></span><i><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;">does the failure of Hadoop machines depend on the heat map of the data center</span></span></i><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;">? I do not have answers to these questions, and this is an area of research that has been in focus lately.
But I have seen that the efficiency of a Hadoop cluster is directly dependent on the amount of manpower needed to operate such a cluster.</span></span></span><div><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;"><br /></span></span></div><div><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;"><b>Common Types of Failures </b></span></span></div><div><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;">The three most common categories of failures I have observed are</span></span></div><div><div><ul><li><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;">System Errors – Hardware, OS, JVM, Hadoop, compiler, etc. – Hadoop aims to reduce the effect of this broad category of errors</span></span></li><li><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;">User Application Errors – Bad code written by a user – Bloated memory usage</span></span></li><li><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;">Anomalous Behaviour – </span></span><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;">Not working according to expectation – Slow nodes – Causes most harm to the Hadoop cluster because they go undetected for long periods of time</span></span></li></ul></div></div><div><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;"><b>America's Most Wanted (AMW)</b></span></span></div><div><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;">I have the privilege of working with one of the most experienced Hadoop cluster administrators, Andrew Ryan. He observed that a few machines in the Hadoop cluster are always repeat offenders: they land in trouble, get taken out and fixed, and then when put back online they create trouble again. He came up with this metric to determine when to throw a machine out of the Hadoop cluster.
The following chart shows that 3% of our machines cause 43% of all manual repair events.</span></span></div><div><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;"><br /></span></span></div><div><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;"><span class="Apple-style-span" style=" color: rgb(0, 0, 238); -webkit-text-decorations-in-effect: underline; font-family:Georgia, serif;font-size:15.8333px;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUakuzetb45-8Ntkk-d5piSzAh7elspEG5uQx88aoU13myt44i5CxrlDyS1MO_DTQ_5jWmR3Bui6BksOvHUFzmmJ9NS49kvUVxoXOV0vKSAPNZmx55eHy0Uf8cy50GRg6OWDHleIShQGU/s400/chart1.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5488451103155873362" style="display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; text-align: center; cursor: pointer; width: 480px; height: 333px; " /></span></span></span></div><div><span class="Apple-style-span" style="color:#0000EE;"><br /></span></div><div><span class="Apple-style-span" style=" ;font-family:arial;font-size:15.8333px;">This clearly shows the need for a three-strikes-you-are-out law: if a machine goes into repair three times it is better to take it permanently out of the Hadoop cluster. </span></div><div><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;"><br /></span></span></div><div><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;"><b>An exotic question</b></span></span></div><div><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;">Other exotic questions that I am frequently asked are lik</span></span><i><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;">e do map-reduce jobs written in python have a higher probability of failure</span></span></i><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;">? Here is a chart that tries to answer this question</span></span>:</div><div><span class="Apple-style-span" style=" color: rgb(0, 0, 238); -webkit-text-decorations-in-effect: underline; font-size:15.8333px;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjdqSVQB0lQDbj2OK9L_ZPHpmHpH9vInB4ySNrYcVCqcSsl00d93fnf8R2pShZDYsaYkEzXHQu-XG9KrEZtIW7vkjsW71XfcqHBpJABHTrsWolQYmJd6ViaWKpNZQlW6iO6VJQuaXjdyE/s320/python.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5488445976339780546" style="display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; text-align: center; cursor: pointer; width: 384px; height: 375px; " /></span></div><div><div><ul><li><span class="Apple-style-span"><span class="Apple-style-span" style="font-size:medium;">5% of all jobs in cluster are written in Python</span></span></li><li><span class="Apple-style-span"><span class="Apple-style-span" style="font-size:medium;">15% of cluster CPU is consumed by Python jobs</span></span></li><li><span class="Apple-style-span"><span class="Apple-style-span" style="font-size:medium;"> 20% of all failed jobs are written in python</span></span></li></ul></div></div><div><span class="Apple-style-span" style="font-size:medium;">This does show that jobs written in python consume more CPU on the average than jobs written in Java. 
It also shows that a greater percentage of these jobs are likely to fail. Why is this? I do not have a definite answer, but my guess is that a developer is more likely to write the first few versions of his experimental query in Python because it is an easy-to-prototype language.</span></div><div><br /></div><div>I presented a <a href="http://borthakur.com/ftp/hadoop_failures.pdf">more detailed version</a> of this at an IFIP Working Group on Dependable Computing workshop (<a href="http://www.dsn.org/">DSN2010</a>). The aim of this workshop is to understand more about failure patterns on Hadoop nodes, automatic ways to analyze and handle these failures, and how the research community can help Hadoop become more fault-tolerant.</div><div><br /></div><div><br /></div><div><span class="Apple-style-span" style=" border-collapse: collapse; font-family:arial, sans-serif;font-size:12.5px;"><br /></span></div><div><span class="Apple-style-span" style="font-family:arial, sans-serif;font-size:130%;"><span class="Apple-style-span" style="border-collapse: collapse; font-size:15px;"><br /></span></span></div></div>Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com2tag:blogger.com,1999:blog-8556003786324804973.post-34496872414081953662010-05-09T01:14:00.000-07:002010-06-12T15:34:06.552-07:00Facebook has the world's largest Hadoop cluster!<div style="text-align: left;"><span class="Apple-style-span" style=" ;font-family:arial;font-size:15.8333px;"><b><span class="Apple-style-span" style="font-weight: normal; "><b><span class="Apple-style-span" style="font-size:large;">It is not a secret anymore!</span></b></span></b></span></div><div style="text-align: left;"><span class="Apple-style-span" style="font-family:arial;"><b><span class="Apple-style-span" style="font-weight: normal; "><b><span class="Apple-style-span" style="font-size:medium;"><br /></span></b></span></b></span></div><div><span class="Apple-style-span" style="font-family:arial;"><span class="Apple-style-span" style="font-size:medium;">The Datawarehouse Hadoop cluster at Facebook has become the largest known Hadoop storage cluster in the world.
Here are some of the details about this single HDFS cluster:</span></span></div><div><ul><li>21 PB of storage in a single <a href="http://hadoop.apache.org/common/docs/current/hdfs_design.html">HDFS</a> cluster</li><li>2000 machines</li><li>12 TB per machine (a few machines have 24 TB each)</li><li>1200 machines with 8 cores each + 800 machines with 16 cores each</li><li>32 GB of RAM per machine</li><li>15 map-reduce tasks per machine</li></ul></div><div>That's a total of more than 21 PB of configured storage capacity! This is larger than the previously known largest cluster, Yahoo!'s <a href="http://developer.yahoo.net/blogs/hadoop/2010/05/scalability_of_the_hadoop_dist.html">cluster</a> of 14 PB. Here are the cluster statistics from the HDFS cluster at Facebook:</div><div><br /></div><div><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKLUA-7UPMrTjbZhG_fHIu0bMO3gGcSIfyQOyJVh42C_wIdTd_PiRsWJglwlaA4BFpPaNLNcvSuftieG9zOO8GNZvULhF3cpiQ6eoI-Yj9m-inTfaxj-gJWMqn29XoBtjx4R8uSoPSaFs/s400/Screen+shot+2010-05-11+at+11.54.43+PM.png" style="width: 480px; height: 192px;" border="0" alt="Cluster statistics from the Facebook HDFS cluster" id="BLOGGER_PHOTO_ID_5470273732177902914" /></div><div><br /></div><div>Hadoop started at Yahoo! and full marks to Yahoo! for developing such critical infrastructure technology in the open. I started working with Hadoop when I joined Yahoo! in 2006. Hadoop was in its infancy at that time and I was fortunate to be part of the core set of Hadoop engineers at Yahoo!. Many thanks to <a href="http://www.facebook.com/doug.cutting">Doug Cutting</a> for creating Hadoop and <a href="http://www.facebook.com/jeric14">Eric14</a> for convincing the executive management at Yahoo! to develop Hadoop as open source software.</div><div><br /></div><div>Facebook engineers work closely with the Hadoop engineering team at Yahoo! to push Hadoop to greater scalability and performance. Facebook has many Hadoop clusters; the largest among them is the one that is used for Datawarehousing. Here are some statistics that describe a few characteristics of Facebook's Datawarehousing Hadoop cluster:</div><div><ul><li>12 TB of compressed data added per day</li><li>800 TB of compressed data scanned per day</li><li>25,000 map-reduce jobs per day</li><li>65 million files in HDFS</li><li>30,000 simultaneous clients to the HDFS NameNode</li></ul></div><div>A majority of this data arrives via <a href="http://www.facebook.com/pages/Scribe/84758500701">scribe</a>, as described in <a href="http://hadoopblog.blogspot.com/2009/06/hdfs-scribe-integration.html">scribe-hdfs</a> integration. This data is loaded into <a href="http://hadoop.apache.org/hive/">Hive</a>. Hive provides a very elegant way to query the data stored in Hadoop. Almost 99.9% of Hadoop jobs at Facebook are generated by a Hive front-end system. We provide lots more details about our scale of operations in our paper at <a href="http://www.sigmod2010.org/index.shtml">SIGMOD</a> titled <b><i><a href="http://portal.acm.org/citation.cfm?id=1807167.1807278&coll=GUIDE&dl=GUIDE&type=series&idx=SERIES473&part=series&WantType=Proceedings&title=SIGMOD&CFID=91115719&CFTOKEN=27338520">Datawarehousing and Analytics Infrastructure at Facebook</a></i></b> (<a href="http://www.borthakur.com/sigmod2010.pdf">pdf</a>).</div><div><br /></div><div>Here are two pictorial representations of the rate of growth of the Hadoop cluster:</div><div><br /></div><div><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4IXC2u_7dHPm14xWEGE3qaDWY7R_2DyxFoZ__taEzE61cET6_bD5zmyes2sqnQkDpKs3AisQ-3fugWA2hm0MK-PG-g5N8ol2faOTX6fu-C1-QFPSi05a_Y2c9R7g6-mqESmdjbnuUrTI/s400/size.jpg" alt="Growth in total cluster size" /> <img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhKuhJK9aR7u4-a4i9f9KKGJchb9YpcXu_qG7rhesiUkK1g3jMGpGyBTDc-oAIaSdMYV6IjPskPC6gJpnyIvn-acgoSSsTxSjw1H1MW8TfjWZzBGpggqt8oJ5gctJ-06xjmdUAAGS6uoc/s400/numMachines.jpg" alt="Growth in number of machines" /></div><div><br /></div><div><span style="font-size:large;"><b>Details about our Hadoop configuration</b></span></div><div><br /></div><div>I have fielded many questions from developers and system administrators about the Hadoop configuration that is deployed in the Facebook Hadoop Datawarehouse. Some of these questions are from Linux kernel developers who would like to make Linux <a href="http://en.wikipedia.org/wiki/Paging">swapping</a> work better with Hadoop workloads; other questions are from <a href="http://en.wikipedia.org/wiki/Java_Virtual_Machine">JVM</a> developers who may attempt to make Hadoop run faster for processes with large heap sizes; yet others are from <a href="http://en.wikipedia.org/wiki/Graphics_processing_unit">GPU</a> architects who would like to port a Hadoop workload to run on GPUs. To enable this type of outside research, here are the details about <a href="http://borthakur.com/ftp/conf.tar.gz">Facebook's Hadoop warehouse configurations</a>. I hope this open sharing of infrastructure details from Facebook jumpstarts the research community to design ways and means to optimize systems for Hadoop usage.</div><div><br /></div></div>Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com53tag:blogger.com,1999:blog-8556003786324804973.post-76208989573073705472010-04-25T21:42:00.000-07:002010-04-26T00:51:07.290-07:00The Curse of the Singletons! The Vertical Scalability of Hadoop NameNode<div><b>Introduction</b></div><div><br /></div> HDFS is designed to be a highly scalable storage system and sites at <a href="http://www.facebook.com/">Facebook</a> and <a href="http://www.yahoo.com/">Yahoo</a> have 20PB size file systems in <a href="http://borthakur.com/ftp/hadoopmicrosoft.pdf">production deployments</a>. The <a href="http://hadoop.apache.org/common/docs/current/hdfs_design.html">HDFS</a> <a href="http://hadoop.apache.org/common/docs/current/hdfs_design.html#NameNode+and+DataNodes">NameNode</a> is the master of the Hadoop Distributed File System (HDFS). It maintains the critical data structures of the entire file system. Most of HDFS design has focused on scalability of the system, i.e. the ability to support a large number of slave nodes in the cluster and an even larger number of files and blocks. However, a 20PB size cluster with 30K simultaneous clients requesting service from a single NameNode means that the NameNode has to run on a high-end non-commodity machine. There have been some efforts to scale the NameNode horizontally, i.e. allow the NameNode to run on multiple machines. 
I will defer analyzing those horizontal-scalability-efforts for a future blog post; instead, let's discuss ways and means to make our <a href="http://en.wikipedia.org/wiki/Singleton_pattern"><b>singleton</b></a> NameNode support an even greater load.<div><br /><br /><b>What are the bottlenecks of the NameNode?</b></div><div><br /></div><div><b>Network:</b> We have around 2000 nodes in our cluster and each node is running 9 mappers and 6 reducers simultaneously. This means that there are around 30K simultaneous clients requesting service from the NameNode. The Hive <a href="http://wiki.apache.org/hadoop/Hive/GettingStarted#Metadata_Store">Metastore</a> and the HDFS <a href="http://hadoopblog.blogspot.com/2009/08/hdfs-and-erasure-codes-hdfs-raid.html">RaidNode</a> impose additional load on the NameNode. The Hadoop <a href="http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/ipc/RPC.Server.html">RPCServer</a> has a <b>singleton</b> <i>Listener Thread</i> that pulls data from all incoming RPCs and hands it to a bunch of NameNode handler threads. Only after all the incoming parameters of the RPC are copied and deserialized by the <i>Listener Thread</i> do the NameNode handler threads get to process the <a href="http://en.wikipedia.org/wiki/Remote_procedure_call">RPC</a>. One CPU core on our NameNode machine is completely consumed by the <i>Listener Thread</i>. This means that during times of high load, the <i>Listener Thread</i> is unable to copy and deserialize all incoming RPC data in time, thus leading to clients encountering RPC socket errors. This is one big bottleneck to vertically scaling the NameNode.</div><div><br /></div><div><b>CPU:</b> The second bottleneck to scalability is the fact that most critical sections of the NameNode are protected by a <b>singleton</b> lock called the FSNamesystem lock. I had done some major restructuring of this code about three years ago via <a href="https://issues.apache.org/jira/browse/HADOOP-1269">HADOOP-1269</a>, but even that is not enough for supporting current workloads. Our NameNode machine has 8 cores but a fully loaded system can use at most only 2 cores simultaneously on the average; the reason being that most NameNode handler threads encounter serialization via the FSNamesystem lock.</div><div><br /></div><div><b>Memory:</b> The NameNode stores all its metadata in the main memory of the <b>singleton</b> machine on which it is deployed. In our cluster, we have about 60 million files and 80 million blocks; this requires the NameNode to have a heap size of about 58GB. This is huge! There isn't any more memory left to grow the NameNode's heap size! What can we do to support an even greater number of files and blocks in our system?</div><div><br /></div><div><b>Can we break the impasse?</b></div><div><br /></div><div><b>RPC Server:</b> We enhanced the Hadoop RPC Server to have a pool of <i>Reader Threads</i> that work in conjunction with the <i>Listener Thread</i>. The <i>Listener Thread</i> accepts a new connection from a client and then hands over the work of RPC-parameter-deserialization to one of the <i>Reader Threads</i>, roughly as sketched below.</div>
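<div><br /></div><div>To make the handoff concrete, here is a tiny, hypothetical sketch of the listener/reader split. This is <i>not</i> the actual org.apache.hadoop.ipc.Server code (which is built on non-blocking NIO selectors); it only illustrates the pattern of a single accepting thread delegating the expensive copy-and-deserialize work to a small pool of reader threads. The port number and helper method below are made up for the example.</div><div><pre>
// Hypothetical sketch of the Listener/Reader split; not the real Hadoop ipc.Server.
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ListenerWithReaderPool {
  public static void main(String[] args) throws Exception {
    final ExecutorService readers = Executors.newFixedThreadPool(8); // the "Reader Threads"
    ServerSocket listener = new ServerSocket(9000);                  // port is arbitrary here
    while (true) {
      final Socket conn = listener.accept();   // cheap: the single Listener only accepts
      readers.submit(new Runnable() {          // expensive parameter copy/deserialization
        public void run() {                    // now happens on a Reader thread
          try {
            deserializeAndQueueRpc(conn);      // hand the parsed call to handler threads
          } catch (Exception e) {
            // log and drop the connection
          }
        }
      });
    }
  }

  // placeholder for reading the RPC parameters off the socket and queueing the call
  static void deserializeAndQueueRpc(Socket conn) throws Exception { }
}
</pre></div><div><br /></div>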
<div>In our case, we configured our system so that the <i>Reader Threads</i> pool consists of 8 threads. This change has <b>doubled</b> the number of RPCs that the NameNode can process at full throttle. This change has been contributed to the Apache code via <a href="https://issues.apache.org/jira/browse/HADOOP-6713">HADOOP-6713</a>.</div><div><br /></div><div>The above change allowed a simulated workload to consume 4 CPU cores out of a total of 8 CPU cores in the NameNode machine. Sadly enough, we still cannot get it to use all 8 CPU cores!</div><div><br /></div><div><b>FSNamesystem lock:</b> A review of our workload showed that our NameNode typically has the following distribution of requests:</div><div><ul><li>stat a file or directory 47%</li><li>open a file for read 42%</li><li>create a new file 3%</li><li>create a new directory 3%</li><li>rename a file 2%</li><li>delete a file 1%</li></ul>The first two operations constitute about 90% of the workload for the NameNode and are readonly operations: they do not change file system metadata and do not trigger any synchronous transactions (the <a href="http://issues.apache.org/jira/browse/HADOOP-1869">access time</a> of a file is updated asynchronously). This means that if we change the FSNamesystem lock to a <a href="http://en.wikipedia.org/wiki/Readers-writer_lock">Readers-Writer</a> lock we can achieve the full power of all processing cores in our NameNode machine. We did just that, and we saw yet another <b>doubling</b> of the processing rate of the NameNode! The load simulator can now make the NameNode process use all 8 CPU cores of the machine simultaneously. This code has been contributed to Apache Hadoop via <a href="http://issues.apache.org/jira/browse/HDFS-1093">HDFS-1093</a>.</div><div><br /></div><div>The memory bottleneck issue is still unresolved. People have asked me if the NameNode can keep some portion of its metadata on disk, but this will require a change in the locking model design first. One cannot hold the FSNamesystem lock while reading in data from disk: this will cause all other threads to block, thus throttling the performance of the NameNode. Could one use flash memory effectively here? Maybe an LRU cache of file system metadata will work well with current metadata access patterns? If anybody has good ideas here, please share them with the Apache Hadoop community.</div><div><br /></div><div><b>In a Nutshell</b></div><div><br /></div><div>The two proposed enhancements have improved NameNode scalability by a factor of 8. Sweet, isn't it?</div><div><br /></div>Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com13tag:blogger.com,1999:blog-8556003786324804973.post-7434428049249711312010-02-06T17:00:00.000-08:002010-04-25T02:40:48.903-07:00Hadoop AvatarNode High Availability<div><b>Our Use-Case</b></div><div><br /></div><div>The Hadoop Distributed File System's (HDFS) NameNode is a single point of failure. 
This has been a major stumbling block in using HDFS for a 24x7 type of deployment. It has been a <a href="http://hadoopblog.blogspot.com/2009/11/hdfs-high-availability.html">topic of discussion</a> among a wide circle of engineers.</div><div><br /></div><div>I am part of a team that is operating a cluster of 1200 nodes with a total size of 12 PB. This cluster is currently running hadoop 0.20. The NameNode is configured to write its transaction logs to a shared NFS filer. The biggest cause of service downtime is when we need to deploy hadoop patches to our cluster. A fix to the DataNode software is easy to deploy without cluster downtime: we deploy new code and restart a few DataNodes at a time. We can afford to do this because the DFSClient is equipped (via multiple retries to different replicas of the same block) to handle transient DataNode outages. The major problem arises when new software has to be deployed to our NameNode. A HDFS NameNode restart for a cluster of our size typically takes about an hour. During this time, the applications that use the Hadoop service cannot run. This led us to believe that some kind of High Availability (HA) for the NameNode is needed. We wanted a design that can support a failover in about a minute. </div><div><br /></div><div><b>Why AvatarNode?</b></div><div><br /></div><div>We took a serious look at the options available to us. The HDFS <a href="https://issues.apache.org/jira/browse/HADOOP-4539">BackupNode</a> is not available in 0.20, and upgrading our cluster to hadoop 0.21 is not a feasible option for us because it has to go through extensive time-consuming testing before we can deploy it in production. We started off white-boarding the simplest solution: the existence of a Primary NameNode and a Standby NameNode in our HDFS cluster. And we did not have to worry about split-brain scenarios and <a href="http://en.wikipedia.org/wiki/Fencing_(computing)">IO-fencing</a> methods... the administrator ensures that the Primary NameNode is really, really dead before converting the Standby NameNode into the Primary NameNode. We also did not want to change the code of the NameNode one single bit and wanted to build HA as a software layer on top of HDFS. We wanted a daemon that can switch from being a Primary to a Standby and vice-versa... as if switching <a href="http://en.wikipedia.org/wiki/Avatar">avatars</a>... and this is how our AvatarNode was born!</div><div><br /></div><div>The word avatar means a <a href="http://www.merriam-webster.com/dictionary/avatar">variant-phase</a> or incarnation of a single entity, and we thought it appropriate that a wrapper that can make the NameNode take on two different roles be aptly named the AvatarNode. Our design of the AvatarNode is such that it is a pure wrapper around the existing NameNode in Hadoop 0.20; thus the AvatarNode inherits the scalability, performance and robustness of the NameNode.</div><div><br /></div><div><b>What is an AvatarNode?</b></div><div><br /></div><div>The AvatarNode encapsulates an instance of the NameNode. You can start an AvatarNode in the Primary avatar or the Standby avatar. If you start it as a Primary avatar, then it behaves exactly like the NameNode you know now. In fact, it runs exactly the same code as the NameNode. The AvatarNode in its Primary avatar writes the HDFS transaction logs into the shared NFS filer. You can then start another instance of the AvatarNode on a different machine, but this time telling it to start as a Standby avatar. 
</div><div><br /></div><div>The Standby AvatarNode encapsulates a NameNode and a <a href="http://hadoop.apache.org/common/docs/r0.17.2/hdfs_user_guide.html#Secondary+Namenode">SecondaryNameNode</a> within it. The Standby AvatarNode continuously keeps reading the HDFS transaction logs from the same NFS filer and keeps feeding those transactions to the encapsulated NameNode instance. The NameNode that runs within the Standby AvatarNode is kept in <a href="http://hadoop.apache.org/common/docs/current/hdfs_user_guide.html#Safemode">SafeMode</a>. Keeping the Standby AvatarNode in SafeMode prevents it from performing any active duties of the NameNode while at the same time keeping a copy of all NameNode metadata <i>hot</i> in its in-memory data structures. </div><div><br /></div><div>HDFS clients are configured to access the NameNode via a <a href="http://en.wikipedia.org/wiki/Virtual_IP_address">Virtual IP Address</a> (VIP). At the time of failover, the administrator kills the Primary AvatarNode on machine M1 and instructs the Standby AvatarNode on machine M2 (via a command line utility) to assume the Primary avatar. It is guaranteed that the Standby AvatarNode ingests all committed transactions because it reopens the edits log and consumes all transactions till the end of the file; this guarantee depends on the fact that NFS-v3 supports <a href="http://nfs.sourceforge.net/">close-to-open</a> cache coherency semantics. The Standby AvatarNode finishes ingestion of all transactions from the shared NFS filer and then leaves SafeMode. Then the administrator switches the VIP from M1 to M2, thus causing all HDFS clients to start accessing the NameNode instance on M2. This is practically instantaneous and the failover typically takes a few seconds!</div><div><br /></div><div>There is no need to run a separate instance of the SecondaryNameNode. The AvatarNode, when run in the Standby avatar, performs the duties of the SecondaryNameNode too. The Standby AvatarNode continually ingests transactions from the Primary, but once in a while it pauses the ingestion, invokes the SecondaryNameNode to create and upload a checkpoint to the Primary AvatarNode, and then resumes the ingestion of transaction logs.</div><div><br /></div><div><b>The AvatarDataNode</b></div><div><br /></div><div>The AvatarDataNode is a wrapper around the vanilla DataNode found in hadoop 0.20. It is configured to send block reports and blockReceived messages to two AvatarNodes. The AvatarDataNode does not use the VIP to talk to the AvatarNode(s) (only HDFS clients use the VIP). An <a href="http://issues.apache.org/jira/browse/HDFS-839">alternative to this approach</a> would have been to make the Primary AvatarNode forward block reports to the Standby AvatarNode. We discounted this alternative approach because it adds code complexity to the Primary AvatarNode: the Primary AvatarNode would have to do buffering and flow control as part of forwarding block reports. </div><div><br /></div><div>The Hadoop DataNode is already equipped to send (and retry if needed) blockReceived messages to the NameNode. We extend this functionality by making the AvatarDataNode send blockReceived messages to both the Primary and Standby AvatarNodes.</div><div><br /></div><div><b>Does the failover affect HDFS clients?</b></div><div><br /></div><div><div>An HDFS client that was reading a file caches the locations of the replicas of almost all blocks of a file at the time of opening the file. 
This means that HDFS readers are not impacted by AvatarNode failover.</div><div><br /></div><div>What happens to files that were being written by a client when the failover occurred? The Hadoop 0.20 NameNode does not record new block allocations to a file until the time when the file is closed. This means that a client that was writing to a file when the failover occured will encounter an IO exception after a successful failover event. This is somewhat tolerable for map-reduce job because the Hadoop MapReduce framework will retry any failed tasks. This solution also works well for <a href="http://hadoop.apache.org/hbase/">HBase</a> because the HBase issues a <a href="http://www.netlikon.de/docs/javadoc-hadoop/branch-0.20/org/apache/hadoop/fs/FSDataOutputStream.html#sync()">sync/hflush</a> call to persist HDFS file contents before marking a HBase-transaction to be completed. In future, it is possible that HDFS may be enhanced to record every new block allocation in the HDFS transaction log, details <a href="http://issues.apache.org/jira/browse/HDFS-978">here</a>. In that case, a failover event will not impact HDFS writers as well.</div><div><br /></div><div><b><span class="Apple-style-span" style="font-family:arial;">Other NameNode Failover techniques</span></b></div><div><br /></div></div><div>People have asked me to compare the AvatarNode HA implementation with the<a href="http://de-de.facebook.com/note.php?note_id=106157472002"> Hadoop with DRBD and Linux HA</a>. Both approaches require that the Primary writes transaction logs to a shared device. The difference is that the Standby AvatarNode is a hot standby whereas the DRBD-LinuxHA solution is a cold standby. The failover time for our AvatarNode is about a minute, whereas the failover time of the DRBD-LinuxHA would require about an hour for our cluster that has about 50 million files. The reason that the AvatarNode can function as a hot standby is because the Standby AvatarNode has an instance of the NameNode encapsulated within it and this encapsulated NameNode receives messages from the DataNodes thus keeping its metadata state fully upto-date.</div><div><br /></div><div>Another implementation that I have heard about is from a team at China Mobile Research Institue, their approach is described <a href="http://gnawux.info/hadoop/2010/01/pratice-of-namenode-cluster-for-hdfs-ha/">here</a>. </div><div><br /></div><div><span class="Apple-style-span" style="font-family:arial;"><b>Where is the code?</b></span></div><div><br /></div><div>This code has been contributed to the Apache HDFS project via <a href="http://issues.apache.org/jira/browse/HDFS-976">HDFS-976</a>. A prerequisite for this patch is <a href="http://issues.apache.org/jira/browse/HDFS-966">HDFS-966</a>. </div>Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com61tag:blogger.com,1999:blog-8556003786324804973.post-83435983763219362009-11-09T00:32:00.001-08:002009-11-09T23:29:17.396-08:00HDFS High AvailabilityI have encountered plenty of questions about the single point of failure for the HDFS NameNode. The most common concern being that if the NameNode dies, then the whole cluster is unavailable. This means that HDFS is unsuitable for applications that need a high degree of uptime. 
This is not a problem when you run <a href="http://en.wikipedia.org/wiki/MapReduce">map-reduce</a> jobs on HDFS, especially because a map-reduce system is a batch system and the uptime requirements of a batch system are typically not that stringent.<br /><br />In recent times, I am starting to see lots of distributed applications that are using HDFS as a general purpose storage system. These applications range from multimedia servers that store mails, photos, videos, etc. to systems that store updates from thousands of distributed sensors. These applications prefer HDFS because they can store a large volume of data. These applications do not use map-reduce to mine any of the information stored in HDFS. However, these applications do need consistency and availability of data. <a href="http://www.julianbrowne.com/article/viewer/brewers-cap-theorem">Brewer's CAP Theorem</a> states that most distributed systems need some tradeoffs among Consistency, Availability and Partition tolerance. HDFS does an excellent job of providing Consistency of data at all times. Traditionally, it did not address Availability and Partition tolerance. That could change to a certain extent with the HDFS 0.21 release. HDFS 0.21 has a new entity called the <a href="http://issues.apache.org/jira/browse/HADOOP-4539">BackupNode</a> that receives real-time updates about transactions from the NameNode. This is the first step in making HDFS highly available.<br /><br />Here is a <a href="http://www.borthakur.com/ftp/hdfs_high_availability.pdf">slide-deck</a> that I wanted to present at <a href="http://us.apachecon.com/c/acus2009/">ApacheCon 2009</a> about the current state of affairs regarding High Availability with HDFS.Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com0tag:blogger.com,1999:blog-8556003786324804973.post-22890581078241443292009-10-19T02:20:00.000-07:002009-10-19T02:47:36.732-07:00Hadoop discussions at Microsoft ResearchI was invited to present a talk about Hadoop File System Architecture at Microsoft Research in Seattle. This is a research group that is focused on long-term research, so it is no surprise that they are interested in knowing how a growing company like Facebook is using Hadoop to its advantage.<br /><br />I met a few folks who chatted with me about how Microsoft SQL Server is being modified to handle large-scale databases. These folks heartily agreed with a comment I made in my presentation that Dr. Dewitt and Dr. Stonebraker are missing the point when they compare performance numbers between Hadoop and traditional Database systems... rather than comparing the scalability and fault-tolerance of these systems. I had learned some of the fundamentals of Database systems from Professor Dewitt during my graduate studies at the University of Wisconsin Madison, but Dr Dewitt is a Microsoft employee now!<br /><br />The fact that Facebook uses the SQL interface of Hive layered over Hadoop makes it even more interesting to Microsoft. 
They wanted to know the performance difference between Hive and PIG and would like to compare them to their distributed-SQL-Server software.<br /><br />Here are the <a href="http://borthakur.com/ftp/hadoopmicrosoft.pdf">slides</a> I used for my presentation.Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com2tag:blogger.com,1999:blog-8556003786324804973.post-28280035920900799782009-10-02T11:57:00.000-07:002009-10-07T10:46:10.662-07:00I presented a set of slides that describes the <a href="http://borthakur.com/ftp/hadoopworld.pdf">Hadoop development at Facebook</a> at the <a href="http://www.cloudera.com/hadoop-world-nyc">HadoopWorld</a> conference in New York today. It was well received by more than 100 people. I have presented at many-a-conferences in the west coast but this is the first time I have presented at a conference in New York... there are more hadoop users here versus mostly hadoop developers in the west coast. There were plenty of questions, especially about Hadoop-Archive and Realtime-Hadoop. There were people asking me questions about <a href="http://issues.apache.org/jira/browse/HDFS-245">HDFS Symbolic links</a> and <a href="http://issues.apache.org/jira/browse/HIVE-809">HDFS-scribe </a>copier.<br /><br />Earlier, I visited the university of <a href="http://www.cse.nd.edu/">Notre Dame</a> to conduct a department seminar and present a guest lecture for the graduate students at the Department of Computer Science. There is plenty of interesting research being led by <a href="http://www.cse.nd.edu/%7Edthain/">Prof Douglas Thain.</a> One interesting research idea that came up was to place HDFS block replicas by analyzing HDFS access patterns. It is possible that we can provide HDFS datanode/namenode logs to researchers who can analyze these logs to come up with better algorithms for HDFS block replica placement.Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com4tag:blogger.com,1999:blog-8556003786324804973.post-72085420495390811642009-09-14T22:41:00.000-07:002009-09-14T23:54:09.917-07:00HDFS block replica placement in your hands now!Most Hadoop administrators set the default replication factor for their files to be three. The main assumption here is that if you keep three copies of the data, your data is safe. I have observed this to be true in the big clusters that we manage and operate. In actuality, administrators are managing two failure aspects: <span style="font-style: italic;">data corruption </span>and <span style="font-style: italic;">data availability</span>.<br /><br />If all the datanodes on which the replicas of a block exist catch fire at the same time, then that data is lost and cannot be recovered. Or if an administrative error causes all the existing replicas of a block to be deleted, then it is a catastrophic failure. This is <span style="font-style: italic;">data corruption. </span>On the other hand, if a rack switch goes down for sometime, the datanodes on that rack are in-accessible during that time. When that faulty rack switch is fixed, the data on the rack rejoins the HDFS cluster and life goes on as usual. This is a <span style="font-style: italic;">data avilability </span>issue; in this case data was not corrupted or lost, it was just unavailable for some time. HDFS keeps three copies of a block on three different datanodes to protect against true <span style="font-style: italic;">data corruption</span>. 
HDFS also tries to distribute these three replicas on more than one rack to protect against <span style="font-style: italic;">data availability</span> issues. The fact that HDFS actively monitors any failed datanode(s) and upon failure detection immediately schedules re-replication of blocks (if needed) implies that three copies of data on three different nodes is sufficient to avoid corrupted files.<br /><br />HDFS uses a simple but highly effective policy to allocate replicas for a block. If a process that is running on any of the HDFS cluster nodes open a file for writing a block, then one replica of that block is allocated on the same machine on which the client is running. The second replica is allocated on a randomly chosen rack that is different from the rack on which the first replica was allocated. The third replica is allocated on a randomly chosen machine on the same remote rack that was chosen in the earlier step. This means that a block is present on two unique racks. One point to note is that there is no relationship between replicas of different blocks of the same file as far as their location is concerned. Each block is allocated independently.<br /><br />The above algorithm is great for availability and scalability. However, there are scenarios where co-locating many block of the same file on the same set of datanode(s) or rack(s) is beneficial for performance reasons. For example, if many blocks of the same file are present on the same datanode(s), a single mapper instance could process all these blocks using the <a href="http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html">CombineFileInputFormat</a>. Similarly, if a dataset contains many small files that are co-located on the same datanode(s) or rack(s), one can use CombineFileInputFormat to process all these file together by using fewer mapper instances via <a href="http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html">CombineFileInputFormat</a>. If an application always uses one dataset with another dataset (think <a href="http://hadoop.apache.org/hive/">Hive</a> or <a href="http://hadoop.apache.org/pig/">Pig</a> join), then co-locating these two datasets on the same set of datanodes is beneficial.<br /><br />Another reason when one might want to allocate replicas using a different policy is to ensure that replicas and their <a href="http://hadoopblog.blogspot.com/2009/08/hdfs-and-erasure-codes-hdfs-raid.html">parity blocks </a>truly reside in different failure domains. The <a href="http://issues.apache.org/jira/browse/HDFS-503">erasure code work in HDFS</a> could effectively bring down the physical replication factor of a file to about 1.5 (while keeping the logical replication factor at 3) if it can place replicas of all blocks in a stripe more intelligently.<br /><br />Yet another reason, however exotic, is to allow HDFS to place replicas based on the <a href="http://en.wikipedia.org/wiki/Heat_map">HeatMap</a> of your cluster. If one of of the node in the cluster is at a higher temperature than that of another, then it might be better to prefer the cooler node while allocating a new replica. If you want to experiment with HDFS across two <a href="http://en.wikipedia.org/wiki/Data_center">data centers,</a> you might want to try out new policies for replica placement.<br /><br />Well, now you can finally get your hands wet! 
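<br /><br />For a flavor of what a custom policy could look like, here is a deliberately simplified, hypothetical sketch. It is <i>not</i> the real pluggable API (the actual hook introduced in HDFS, described next, has a richer, version-dependent interface); the class and method names below are made up purely for illustration.<br /><pre>
// Hypothetical sketch only; NOT the actual HDFS pluggable-placement API.
import java.util.Random;

class ExampleReplicaPlacement {
  private final Random rand = new Random();

  // given the writer's host and the datanodes in the cluster, pick targets for one block
  String[] chooseTargets(String writerHost, String[] datanodes, int numReplicas) {
    String[] targets = new String[numReplicas];
    int chosen = 0;
    // first replica on the node where the writer runs, if that node is a datanode
    for (String d : datanodes) {
      if (d.equals(writerHost)) { targets[chosen++] = d; break; }
    }
    // remaining replicas on randomly chosen distinct nodes; a real policy would also
    // consult rack topology, free space, co-location hints, or even a cluster heat map
    while (chosen < numReplicas) {
      String candidate = datanodes[rand.nextInt(datanodes.length)];
      boolean alreadyChosen = false;
      for (int i = 0; i < chosen; i++) {
        if (targets[i].equals(candidate)) { alreadyChosen = true; }
      }
      if (!alreadyChosen) { targets[chosen++] = candidate; }
    }
    return targets;
  }
}
</pre><br />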
<a href="http://issues.apache.org/jira/browse/HDFS-385">HDFS-385</a> is part of the Hadoop trunk and will be part of the next major HDFS 0.21 release. This feature provides a way for the adventurous developer to write Java code that specifies how HDFS should allocate replicas of blocks of a file. The API is experimental in nature, and could change in the near future if we discover any in-efficiencies in it. Please let the Hadoop community know if you need any changes in this API or if you come across novel uses of this API.Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com26tag:blogger.com,1999:blog-8556003786324804973.post-11760040521226672802009-08-28T22:58:00.000-07:002009-08-30T21:54:19.999-07:00HDFS and Erasure Codes (HDFS-RAID)The Hadoop Distributed File System has been great in providing a cloud-type file system. It is robust (when administered correctly :-)) and highly scalable. However, one of the main drawbacks of HDFS is that each piece of data is replicated in three places. This is acceptable because disk storage is cheap and is becoming cheaper by the day; this isn't a problem if you have a relatively small to medium size cluster. The price difference (in absolute terms) is not much whether you use 15 disks or whether you use 10 disks. If we consider the cost of $1 per GByte, the price difference between fifteen 1 TB disk and ten 1 TB disk is only $5K. But when the total size of your cluster is 10 PBytes, then the costs savings in storing the data in two places versus three is a huge ten million dollars!<br /><br />The reason HDFS stores disk blocks in triplicate is because it uses commodity hardware and there is non-negligible probability of a disk failure. It has been observed that a replication factor of 3 and the fact the HDFS aggressively detects failures and immediately replicates failed -block-replicas is sufficient to never lose any data in practice. The challenge now is to achieve an effective replication factor of 3 while keeping the real physical replication factor at close to 2! How best to do it than by using <a href="http://en.wikipedia.org/wiki/Erasure_code">Erasure Codes.</a><br /><br />I heard about this idea called <a href="https://opencirrus.org/system/files/Gibson-OpenCirrus-June9-09.ppt">DiskReduce</a> from the folks at CMU. The <a href="http://www.pdl.cmu.edu/">CMU PDL Labs</a> has been a powerhouse of research in file systems and it is no surprise that they proposed a elegant way of implementing erasure codes in HDFS. I borrowed heavily from their idea in my implementation of Erasure Codes in HDFS described in <a href="http://issues.apache.org/jira/browse/HDFS-503">HDFS-503</a>. One of the main motivation of my design is to keep the HDFS Erasure Coding as a software layer above HDFS rather than inter-twining it inside of HDFS code. The HDFS code is complex by itself and it is really nice to not have to make it more complex and heavyweight.<br /><br />Distributed Raid File System consists of two main software components. The first component is the RaidNode, a daemon that creates parity files from specified HDFS files. The second component "raidfs" is a software that is layered over a HDFS client and it intercepts all calls that an application makes to the HDFS client. If the HDFS client encounters corrupted data while reading a file, the raidfs client detects it; it uses the relevant parity blocks to recover the corrupted data (if possible) and returns the data to the application. 
The application is completely transparent to the fact that parity data was used to satisfy it's read request. The Distributed Raid File System can be configured in such a way that a set of data blocks of a file are combined together to form one or more parity blocks. This allows one to reduce the replication factor of a HDFS file from 3 to 2 while keeping the failure probabilty relatively same as before.<br /><br />I have seen that using a stripe size of 10 blocks decreases the physical replication factor of a file to 2.2 while keeping the effective replication factor of a file at 3. This typically results in saving 25% to 30% of storage space in a HDFS cluster.<br /><br />One of the shortcoming of this implementation is that we need a parity file for every file in HDFS. This potentially increases the number of files in the NameNode. To alleviate this problem, I will enhance this implementation (in future) to use the <a href="http://hadoop.apache.org/common/docs/current/hadoop_archives.html">Hadoop Archive</a> feature to archive all the parity files together in larger containers so that the NameNode does not have to support additional files when the HDFS Erasure Coding is switched on. This works reasonably well because it is a very very rare case that the parity files are ever used to satisfy a read request.<br /><br />I am hoping that this feature becomes part of Hadoop 0.21 release scheduled for September 2009!Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com2tag:blogger.com,1999:blog-8556003786324804973.post-65871771134345898452009-07-28T23:28:00.000-07:002009-08-02T00:44:59.067-07:00Hadoop and CondorMy graduate work in the mid-nineties at the University of Wisconsin focussed on <a href="http://www.cs.wisc.edu/condor/">Condor</a>. Condor has an amazing way to do process checkpointing and migrating processes from one machine to another if needed. It also has a very powerful scheduler that matches job requirements with machine characteristics.<br /><br />One of the major inefficiencies with Hadoop schedulers (Fairshare and Capacity scheduler) is that they are not resource-aware. There has been some work-in-progress in this area, <a href="http://issues.apache.org/jira/browse/HADOOP-5881">HADOOP-5881</a>. Condor's <a href="http://www.cs.wisc.edu/condor/manual/v6.4/4_1Condor_s_ClassAd.html">ClassAds</a> mechanism can be used to match hadoop jobs with machines very elegantly.<br /><br />Here is one of my recent <a href="http://www.borthakur.com/ftp/hadoop_condor.pdf">presentation</a> at the The Israeli Association of Grid Technologies that talks about the synergies between Condor and Hadoop.Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com6tag:blogger.com,1999:blog-8556003786324804973.post-31703371225619092152009-06-22T23:53:00.000-07:002009-11-24T03:47:48.779-08:00Hadoop at NetflixNetflix is interested in using Hadooo/Hive to process click logs from the users of their website. Here is what I presented to them in a meeting that was well attended by about 50 engineers. Following the meeting, a bunch of engineers asked me question related to the integration of scribe and hdfs and how Facebook imports click logs into Hadoop.<br /><br />Here is a copy of my presentation .. 
<a href="http://borthakur.com/ftp/newflixalone.pdf">slides</a>.Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com2tag:blogger.com,1999:blog-8556003786324804973.post-78762983047586342792009-06-06T00:34:00.000-07:002009-08-02T00:47:06.021-07:00HDFS Scribe IntegrationIt is finally here: you can configure the open source log-aggregator, <a href="https://sourceforge.net/projects/scribeserver/">scribe</a>, to log data directly into the Hadoop distributed file system. <div><br /></div><div>Many Web 2.0 companies have to deploy a bunch of costly filers to capture weblogs being generated by their application. Currently, there is no option other than a costly filer because the write-rate for this stream is huge. The Hadoop-Scribe integration allows this write-load to be distributed among a bunch of commodity machines, thus reducing the total cost of this infrastructure. </div><div><br />The challenge was to make HDFS be real-timeish in behaviour. Scribe uses <a href="http://wiki.apache.org/hadoop/LibHDFS">libhdfs</a> which is the C-interface to the HDFs client. There were various bugs in libhdfs that needed to be solved first. Then came the FileSystem API. One of the major issues was that the FileSystem API caches FileSystem handles and always returned the same FileSystem handle when called from multiple threads. There was no reference counting of the handle. This caused problems with scribe, because Scribe is highly multi-threaded. A new API FileSystem.newInstance() was introduced to support Scribe.<div><br /></div><div>Making the HDFS write code path more real-time was painful. There are various timeouts/settings in HDFS that were hardcoded and needed to be changed to allow the application to fail fast. At the bottom of this blog-post, I am attaching the settings that we have currently configured to make the HDFS-write very real-timeish. The last of the JIRAS, <a href="http://issues.apache.org/jira/browse/HADOOP-2757">HADOOP-2757 </a>is in the pipeline to be committed to Hadoop trunk very soon.</div><div><br /></div><div>What about Namenode being the single point of failure? This is acceptable in a warehouse type of application but cannot be tolerated by a realtime application. Scribe typically aggregates click-logs from a bunch of webservers, and losing *all* click log data of a website for a 10 minutes or so (minimum time for a namenode restart) cannot be tolerated. The solution is to configure two <span class="Apple-style-span" style="font-style: italic;">overlapping</span> clusters on the same hardware. Run two separate namenodes N1 and N2 on two different machines. Run one set of datanode software on all slave machines that report to N1 and the other set of datanode software on the same set of slave machines that report to N2. The two datanode instances on a single slave machine share the same data directories. This configuration allows HDFS to be highly available for writes! </div><div><br /></div><div>The highly-available-for-writes-HDFS configuration is also required for software upgrades on the cluster. We can shutdown one of the <span class="Apple-style-span" style="font-style: italic;">overlapping</span> HDFS clusters, upgrade it to new hadoop software, and then put it back online before starting the same process for the second HDFS cluster. </div><div><br /></div><div>What are the main changes to scribe that were needed? Scribe already had the feature that it buffers data when it is unable to write to the configured storage. 
The default scribe behaviour is to replay this buffer back to the storage when the storage is back online. Scribe is configured to support no-buffer-replay when the primary storage is back online. Scribe-hdfs is configured to write data to a cluster N1 and if N1 fails then it writes data to cluster N2. Scribe treats N1 and N2 as two equivalent primary stores. The scribe configuration should have fs_type=hdfs. For <a href="https://sourceforge.net/forum/forum.php?forum_id=962339">scribe compilation</a>, you can use <i>./configure --enable-hdfs LDFLAGS="-ljvm -lhdfs"</i>. A good example for configuring scribe-hdfs is in a file called hdfs_example2.conf in the scribe code base.</div><div><br /></div><div>Here are the settings for the Hadoop 0.17 configuration that are needed by an application doing writes in realtime:</div><div><br /></div><div><pre>
<property>
  <name>ipc.client.idlethreshold</name>
  <value>10000</value>
  <description>Defines the threshold number of connections after which
  connections will be inspected for idleness.</description>
</property>

<property>
  <name>ipc.client.connection.maxidletime</name>
  <value>10000</value>
  <description>The maximum time in msec after which a client will bring down the
  connection to the server.</description>
</property>

<property>
  <name>ipc.client.connect.max.retries</name>
  <value>2</value>
  <description>Indicates the number of retries a client will make to establish
  a server connection.</description>
</property>

<property>
  <name>ipc.server.listen.queue.size</name>
  <value>128</value>
  <description>Indicates the length of the listen queue for servers accepting
  client connections.</description>
</property>

<property>
  <name>ipc.server.tcpnodelay</name>
  <value>true</value>
  <description>Turn on/off Nagle's algorithm for the TCP socket connection on
  the server. Setting to true disables the algorithm and may decrease latency
  with a cost of more/smaller packets.</description>
</property>

<property>
  <name>ipc.client.tcpnodelay</name>
  <value>true</value>
  <description>Turn on/off Nagle's algorithm for the TCP socket connection on
  the client. Setting to true disables the algorithm and may decrease latency
  with a cost of more/smaller packets.</description>
</property>

<property>
  <name>ipc.ping.interval</name>
  <value>5000</value>
  <description>The Client sends a ping message to the server every period. This is helpful
  to detect socket connections that were idle and have been terminated by a failed server.</description>
</property>

<property>
  <name>ipc.client.connect.maxwaittime</name>
  <value>5000</value>
  <description>The Client waits for this much time for a socket connect call to be established
  with the server.</description>
</property>

<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>20000</value>
  <description>The dfs Client waits for this much time for a socket write call to the datanode.</description>
</property>

<property>
  <name>ipc.client.ping</name>
  <value>false</value>
  <description>HADOOP-2757</description>
</property>
</pre></div>
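<div><br /></div><div>For completeness, here is a small Java sketch of how an embedded client might apply a few of the same fail-fast settings programmatically and use the FileSystem.newInstance() API mentioned above, so that each writer thread gets its own, independently closeable FileSystem handle. This is only an illustrative sketch (the path is made up, and method names such as newInstance() and sync()/hflush() vary across Hadoop releases); it is not part of scribe, which goes through libhdfs instead.</div><div><pre>
// Illustrative sketch: an embedded Java writer applying fail-fast client settings.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RealtimeLogWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("ipc.client.connect.max.retries", 2);        // fail fast on connect
    conf.setInt("ipc.ping.interval", 5000);
    conf.setBoolean("ipc.client.ping", false);               // see HADOOP-2757
    conf.setInt("dfs.datanode.socket.write.timeout", 20000);

    // newInstance() (unlike the cached FileSystem.get()) returns a private handle,
    // so a multi-threaded writer can close its own instance without affecting others
    FileSystem fs = FileSystem.newInstance(conf);
    FSDataOutputStream out = fs.create(new Path("/logs/example/current"));  // path is made up
    out.write("one log line\n".getBytes("UTF-8"));
    out.sync();   // hadoop 0.20-era flush to datanodes; later releases call this hflush()
    out.close();
    fs.close();
  }
}
</pre></div>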
<div><br /></div></div>Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com29tag:blogger.com,1999:blog-8556003786324804973.post-90128899490874283202009-05-28T23:21:00.000-07:002009-05-29T00:36:51.097-07:00Report from my visit to the Berkeley RAD LabsI went to attend the <a href="http://radlab.cs.berkeley.edu/wiki/RAD_Lab">UC Berkeley RAD Lab</a> Spring Retreat held at Santa Cruz. This lab has about 30 PhD students and the quality of their work really impressed me a lot. Most of their work is based on research on distributed systems. There were many students who were working with Hadoop and it is amazing to see Hadoop being the core of so much research activity... 
when the Hadoop project started three years ago, I definitely did not imagine that it would get to this state!<br /><br />I had read <a href="http://www.eecs.berkeley.edu/%7Epattrsn/">David Patterson</a>'s papers during my graduate studies at the <a href="http://www.cs.wisc.edu/">Univ of Wisconsin Madison</a>, and it was really nice to meet him in person. The group of students he leads at the RAD Lab is of very high calibre. Most people must have already seen the <a href="http://d1smfj0g31qzek.cloudfront.net/abovetheclouds.pdf">Above the Clouds</a> whitepaper that the RAD Lab has produced. It tries to clear up the muddle around what cloud computing really is, its benefits, and its possible usage scenarios. A good read!<br /><br />A paper titled <a href="http://www.usenix.org/events/sysml08/tech/full_papers/xu/xu.pdf">Detecting Large Scale System Problems by Mining Console Logs</a> describes how application logs from a distributed application can be used to detect bugs, problems, and anomalies in the system. The authors provide an example in which they process 24 million log lines produced by HDFS to detect a bug (page 11 of the paper). I am not really convinced about the bug itself, but this is an effort in the right direction.<br />My employer <a href="http://www.facebook.com">Facebook</a> is an enabler for research in distributed systems. To this effect, Facebook has allowed researchers from premier universities to analyze Hadoop system logs. These logs typically record machine usage and Hadoop job performance metrics. The Hadoop job records are inserted into a MySQL database for easy analysis (<a href="http://issues.apache.org/jira/browse/HADOOP-3708">HADOOP-3708</a>). Some students at the RAD Lab have used this database to show that machine learning techniques can predict Hadoop job performance; that paper is not yet published. Another paper analyzes the performance characteristics of the Hadoop fair share scheduler. Abstracts of these publications are available <a href="http://radlab.cs.berkeley.edu/wiki/Publications#2009">here</a>.<br /><br />Last, but not least, <a href="http://www-db.cs.wisc.edu/cidr/cidr2009/Paper_86.pdf">SCADS</a> is a scalable storage system designed specifically for social-networking style software. It has a declarative query language and supports various consistency models. It also supports a rich data model that includes joins on pre-computed queries.Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com0tag:blogger.com,1999:blog-8556003786324804973.post-82454087518576028572009-05-24T02:17:00.000-07:002009-05-24T02:29:59.379-07:00Better Late than NeverFor quite a while, I have been thinking about blogging on Hadoop in general, and the Hadoop Distributed File System (HDFS) in particular. Why, you may ask?<br /><br />Firstly, I have been contacted by students from as far away as Bangladesh and Fiji asking me questions about HDFS via email. This made me think that disseminating internal details about HDFS to the whole wide world would benefit a lot of people. 
I like to interact with these budding engineers, and their questions, though elementary in nature, sometimes make me ruminate on why we adopted a particular design and not another. I will sprinkle in a few of these examples next week.<br /><br />Secondly, I visited a few universities last month, among them Carnegie Mellon University and my alma mater, the Univ of Wisconsin. On my flight, I was bored to death: I did not like the movie that was playing, and I had not carried any reading material. (Usually I like to read and re-read Sherlock Holmes.) But as they say, "an idle mind is the devil's workshop"... I started to jot down some exotic design ideas for HDFS... and lo and behold, I now have a list of ideas that I would like to share! I will post them next week as well.Dhruba Borthakurhttp://www.blogger.com/profile/10832366855372649190noreply@blogger.com1