Saturday, May 5, 2012

Hadoop and Solid State Drives


Is there a story for the Hadoop Storage Stack (HDFS+HBase) on Solid State Drives (SSDs)? This is a question that I have been asked by quite a few people in the last two days, mostly by people at the Open Compute Summit. This piece discusses the possible use cases for SSDs with Hadoop and HBase.

Use Case
Currently, there are two primary use cases for HDFS: data warehousing using map-reduce, and a key-value store via HBase. In the data warehouse case, data is mostly accessed sequentially from HDFS, so there isn't much benefit from using an SSD to store data. In a data warehouse, a large portion of queries access only recent data, so one could argue that keeping the last few days of data on SSDs could make queries run faster. But most of our map-reduce jobs are CPU bound (decompression, deserialization, etc.) and bottlenecked on map-output fetch, so reducing the data access time from HDFS does not improve the latency of a map-reduce job. Another option would be to put the map outputs themselves on SSDs; this could potentially reduce map-output fetch times, and it is an option that needs some benchmarking.

For the second use case, HDFS+HBase could theoretically use the full potential of SSDs to make online transaction processing workloads run faster. This is the use case that the rest of this blog post tries to address.

Background
The read/write latency of data on an SSD is orders of magnitude smaller than the read/write latency of spinning disk storage; this is especially true for random reads and writes. For example, a random read from an SSD takes about 30 microseconds, while a random read from a spinning disk takes 5 to 10 milliseconds. Also, an SSD device can support 100K to 200K operations/sec, while a spinning disk controller can issue only about 200 to 300 ops/sec. This means that random reads/writes are not a bottleneck on SSDs. On the other hand, most of our existing database technology is designed to store data on spinning disks, so the natural question is: "can these databases harness the full potential of SSDs?" To answer that question, we ran two separate artificial random-read workloads, one on HDFS and one on HBase. The goal was to stretch these products to the limit and establish their maximum sustainable throughput on SSDs.
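To put these latencies and throughputs in perspective, here is a quick back-of-the-envelope calculation based on the numbers above:

    1 / 30 microseconds ≈ 33,000 random reads/sec from a single synchronous read stream
    1 / 10 milliseconds ≈ 100 random reads/sec per stream from a spinning disk

So roughly half a dozen concurrent read streams are enough to saturate an SSD rated at 200K ops/sec, whereas a spinning disk delivers only a few hundred ops/sec no matter how much concurrency the software throws at it.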

HDFS random-read on cached data
In the first experiment, we created an HDFS cluster with a single NameNode and a single DataNode. We created a 2 GB HDFS file with an HDFS block size of 256 MB and a replication factor of 1. The DataNode ran on a machine with 16 hyper-threaded cores and stored its block data on xfs. Our benchmark program was co-located on the DataNode machine and had hdfs-read-shortcircuit switched on, i.e. the DFSClient bypassed the DataNode and issued read calls directly to the local xfs filesystem. The entire 2 GB of data was cached in the OS buffer cache, so this benchmark did not trigger any IO from disk; the fact that all the data was in the OS cache essentially simulated the behavior of an ultra-fast SSD. We varied the number of client threads, and each client thread did a pre-configured number of 16K read calls from HDFS. Since there were only 8 blocks in the file, the DFSClient cached the locations of all 8 blocks and there were no repeated calls to the NameNode. The first few iterations of this test showed that HDFS can sustain a maximum random-read throughput of around 50K ops/sec, but surprisingly the CPU was not maxed out. We found that the read-shortcircuit code path spent considerable time in DNS lookup calls and in updating metric counters. We fixed these two pieces of code and observed that HDFS could then sustain a peak random-read throughput of around 92K ops/sec, with CPU usage now close to 95%. HdfsPreadImage is a plot that captures this scenario. The takeaway is that a database layered above HDFS would not be able to utilize all the iops offered by even a single SSD.
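For concreteness, a benchmark driver of this shape could look roughly like the sketch below. This is not our actual benchmark code: the namenode address, file path, thread count and iteration count are illustrative placeholders, and I am assuming here that the 16K refers to a 16 KB read size. Each thread issues positional reads at random offsets through the standard FileSystem API, with short-circuit reads enabled in the client configuration.

import java.net.URI;
import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// A minimal sketch of a multi-threaded HDFS random-pread driver.
// Cluster address, path and sizing constants are placeholders.
public class HdfsPreadBench {
  public static void main(String[] args) throws Exception {
    final int readSize = 16 * 1024;                  // 16 KB per read call
    final long fileSize = 2L * 1024 * 1024 * 1024;   // the 2 GB test file
    final int threads = 64;
    final int readsPerThread = 100000;

    Configuration conf = new Configuration();
    // Let the DFSClient bypass the DataNode and read local block files directly.
    conf.setBoolean("dfs.client.read.shortcircuit", true);

    final FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:8020"), conf);
    final Path path = new Path("/bench/random-2g");

    ExecutorService pool = Executors.newFixedThreadPool(threads);
    long start = System.nanoTime();
    for (int t = 0; t < threads; t++) {
      pool.submit(new Callable<Void>() {
        public Void call() throws Exception {
          byte[] buf = new byte[readSize];
          Random rand = new Random();
          FSDataInputStream in = fs.open(path);
          for (int i = 0; i < readsPerThread; i++) {
            // Positional read (pread) at a random 16 KB-aligned offset.
            long pos = (long) rand.nextInt((int) (fileSize / readSize)) * readSize;
            in.readFully(pos, buf, 0, readSize);
          }
          in.close();
          return null;
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
    double secs = (System.nanoTime() - start) / 1e9;
    System.out.printf("%.0f random reads/sec%n", threads * readsPerThread / secs);
  }
}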

A profiled run of the HDFS code shows that the DFSClient's code paths are quite long and cause an appreciable impact on throughput for cached random reads. When data-fetch times are in the millisecond range (from spinning disks), the long code paths in the DFSClient do not add appreciable overhead, but when the data is cached in the OS cache (or resides on SSDs), these code paths need some major rework. Another option would be to write an HDFS read-only client in C or C++, thereby avoiding some of the overhead of the current Java-based DFSClient.

HBase random-get on cached data
In the second experiment, we did a similar exercise with HBase. We created a single table with a single region, and all data was cached in the OS cache of a single HBase regionserver; again, the OS cache simulated a super-fast SSD device. We used a set of 4 client machines to drive random-get calls to the regionserver, which was configured to use a maximum of 2000 threads. The HBase table had LZO compression and fast-diff delta encoding enabled. Since the data set was cached in OS buffers, this benchmark did not cause any disk IO from spinning disks. We saw that HBase throughput maxed out at around 35K ops/sec, and we were not able to drive the CPU usage on that machine above 45%: heavy lock contention and heavy context switching prevented the regionserver from using all the available CPU on the machine. The detailed chart is at Cache4G.
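A skeleton of such a load generator, using the standard HBase client API, might look like the following. Again, this is a hedged sketch rather than our actual code: the table name, column family, row-key format and thread and row counts are made-up placeholders.

import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// A minimal sketch of a multi-threaded HBase random-get driver.
// Table name, column family and sizing constants are placeholders.
public class HBaseRandomGetBench {
  public static void main(String[] args) throws Exception {
    final int threads = 200;
    final int getsPerThread = 100000;
    final int numRows = 4 * 1000 * 1000;  // rows pre-loaded into the test table
    final Configuration conf = HBaseConfiguration.create();

    ExecutorService pool = Executors.newFixedThreadPool(threads);
    long start = System.nanoTime();
    for (int t = 0; t < threads; t++) {
      pool.submit(new Callable<Void>() {
        public Void call() throws Exception {
          // HTable instances are not thread-safe, so each thread opens its own.
          HTable table = new HTable(conf, "benchtable");
          Random rand = new Random();
          for (int i = 0; i < getsPerThread; i++) {
            String rowKey = String.format("row%010d", rand.nextInt(numRows));
            Get get = new Get(Bytes.toBytes(rowKey));
            get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"));
            Result result = table.get(get);
            if (result.isEmpty()) {
              throw new IllegalStateException("missing row " + rowKey);
            }
          }
          table.close();
          return null;
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
    double secs = (System.nanoTime() - start) / 1e9;
    System.out.printf("%.0f gets/sec%n", threads * getsPerThread / secs);
  }
}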

What does this mean
The two experiments show that HBase+HDFS, as it stands today, will not be able to harness the full potential offered by SSDs. Some code restructuring could improve the random-read throughput of these solutions, but my guess is that it will take significant engineering time to make HBase+HDFS sustain a throughput of 200K ops/sec.

These results are not unique to HBase+HDFS. Experiments on other non-Hadoop databases show that they, too, need to be re-engineered to achieve SSD-capable throughputs. My conclusion is that database and storage technologies would need to be developed from scratch if we want to utilize the full potential of Solid State Devices. The search is on for these new technologies!


Sunday, February 19, 2012

Salient features for a BigData Benchmark

Recently, I was asked to write up my vision of a BigData benchmark. That raises the question: what is BigData? Does it refer to a dataset that is large in size, and if so, what is large? Does it refer to the type of data that such a data store contains? Should we use the term BigData only when the data does not conform to a relational schema? Here are some of my random thoughts.

Software industry professionals have started to use the term BigData to refer to data sets that are typically many orders of magnitude larger than traditional databases. The largest Oracle database or the largest NetApp filer could be many hundred terabytes at most, but BigData refers to storage systems that can scale to many hundred petabytes. Thus, the first and foremost characteristic of a BigData store is that a single instance of it can be many petabytes in size. These data stores can expose a multitude of interfaces, from traditional SQL-like queries to customized key-value access methods. Some of them are batch systems while others are interactive systems. Again, some of them are organized for full-scan, index-free access while others have fine-grained indexes and low-latency access. How can we design a benchmark (or benchmarks) for such a wide variety of data stores? Most benchmarks focus on the latency and throughput of queries, and rightly so. However, in my opinion, the key to designing a BigData benchmark lies in understanding the deeper commonalities of these systems. A BigData benchmark should measure latency and throughput, but with a great deal of variation in the workload, skew in the data set, and in the presence of faults. I list below some of the common characteristics that distinguish BigData installations from other data storage systems.

Elasticity of resources
A primary feature of a BigData system is that it should be elastic in nature: one should be able to add software and hardware resources when needed. Most BigData installations do not want to pre-provision for all the data that they might collect in the future, and the trick to being cost-efficient is to be able to add resources to a production store without incurring downtime. A BigData system typically also has the ability to decommission parts of the hardware and software without taking the service offline, so that obsolete or defective hardware can be replaced dynamically. In my mind, this is one of the most important features of a BigData system, so a benchmark should be able to measure it: we should be able to add and remove resources while the benchmark is executing.

Fault Tolerance
The elasticity feature described above indirectly implies that the system has to be fault-tolerant. If a workload is running on your system and some parts of the system fail, the other parts should reconfigure themselves to share the work of the failed parts, so that the service does not fail even in the face of some component failures. The benchmark should measure this aspect of BigData systems. One simple option is for the benchmark itself to introduce component failures as part of its execution.

Skew in the data set
Many BigData systems take in un-curated data. That means there are always data points that are extreme outliers, which introduces hotspots in the system. The workload on a BigData system is not uniform; some small parts of it are major hotspots and incur tremendously higher load than the rest of the system. Our benchmarks should be designed to operate on datasets that have large skew and should introduce workload hotspots.

There have been a few previous attempts to define a unified benchmark for BigData. DeWitt and Stonebraker touched upon a few areas in their SIGMOD paper. They describe experiments that use a grep task, a join task, and a simple SQL aggregation query. But none of those experiments are done in the presence of system faults, nor do they add or remove hardware while the experiment is in progress. Similarly, the YCSB benchmark proposed by Cooper and Ramakrishnan suffers from the same deficiency.

How would I run the experiments proposed by Dewitt and Stonebraker? Here are some of my early thoughts:
  • Focus on a 100-node experiment only; this is the scale that is appropriate for BigData systems.
  • Increase the number of URLs such that the data set is at least a few hundred terabytes.
  • Make the benchmark run for at least an hour or so. The workload should be a set of multiple queries. Pace the workload so that there are constant fluctuations in the number of in-flight queries.
  • Introduce skew in the data set. The URL data should be such that perhaps 0.1% of the URLs occur 1000 times more frequently than the other URLs.
  • Introduce system faults by killing one of the 100 nodes once every minute, keeping it down for a minute, then bringing it back online, and continuing this process with the remaining nodes until the entire benchmark is done (a sketch of such a fault-injection loop follows this list).
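For the last bullet, the fault-injection step could be driven by a small loop like the sketch below. The host list and the stop/start commands are placeholders that a benchmark harness would supply; nothing here is tied to any particular storage system.

import java.util.List;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// A rough sketch of the fault-injection loop described in the last bullet:
// roughly once a minute, take one node offline, leave it down for a minute,
// then bring it back and move on to the next victim.
public class FaultInjector {
  private final List<String> hosts;   // the 100 benchmark nodes
  private final String stopCmd;       // command that kills the node's server process
  private final String startCmd;      // command that restarts it
  private final Random rand = new Random();

  public FaultInjector(List<String> hosts, String stopCmd, String startCmd) {
    this.hosts = hosts;
    this.stopCmd = stopCmd;
    this.startCmd = startCmd;
  }

  // Keeps injecting faults until the benchmark driver clears the flag.
  public void run(AtomicBoolean benchmarkRunning) throws Exception {
    while (benchmarkRunning.get()) {
      String victim = hosts.get(rand.nextInt(hosts.size()));
      ssh(victim, stopCmd);        // kill the node
      TimeUnit.MINUTES.sleep(1);   // keep it down for a minute
      ssh(victim, startCmd);       // bring it back online, then pick the next victim
    }
  }

  private void ssh(String host, String cmd) throws Exception {
    new ProcessBuilder("ssh", host, cmd).inheritIO().start().waitFor();
  }
}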
My hope is that there is somebody out there who can repeat the experiments with the modified settings listed above and present their findings. This research would greatly benefit the BigData community of users and developers!

On a side note, I am working with some of my esteemed colleagues to document a specific data model and custom workload for online serving of queries from a multi-petabyte BigData system. I will write about it in greater detail in a future post.