Saturday, May 5, 2012

Hadoop and Solid State Drives


Is there a story for the Hadoop Storage Stack (HDFS+HBase) on Solid State Drives (SSDs)? This is a question quite a few people have asked me in the last two days, mostly at the Open Compute Summit. This piece discusses the possible use cases for SSDs with Hadoop and HBase.

Use Case
Currently, there are two primary use cases for HDFS: data warehousing using map-reduce, and a key-value store via HBase. In the data-warehouse case, data is mostly accessed sequentially from HDFS, so there isn't much benefit from using an SSD to store data. In a data warehouse, a large portion of queries access only recent data, so one could argue that keeping the last few days of data on SSDs could make those queries run faster. But most of our map-reduce jobs are CPU bound (decompression, deserialization, etc.) and bottlenecked on map-output fetch; reducing the data-access time from HDFS does not improve the latency of a map-reduce job. Another option would be to put the map outputs themselves on SSDs, which could potentially reduce map-output-fetch times; this is an option that needs some benchmarking, and a configuration sketch follows below.
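
If one wanted to experiment with that last option, the directory that holds intermediate map outputs in Hadoop 1.x is controlled by mapred.local.dir in mapred-site.xml. Here is a minimal sketch; the SSD mount point below is purely hypothetical.

    <!-- mapred-site.xml: place map-output spill files on a local SSD (mount path is hypothetical) -->
    <property>
      <name>mapred.local.dir</name>
      <value>/mnt/ssd0/mapred/local</value>
    </property>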

For the second use case, HDFS+HBase could theoretically use the full potential of SSDs to make online-transaction-processing workloads run faster. This is the use case that the rest of this blog post tries to address.

Background
The read/write latency of data on an SSD is orders of magnitude smaller than the read/write latency of a spinning disk; this is especially true for random reads and writes. For example, a random read from an SSD takes about 30 microseconds while a random read from a spinning disk takes 5 to 10 milliseconds. Accordingly, an SSD device can support 100K to 200K operations/sec while a spinning disk controller can issue only about 200 to 300 ops/sec (a 5 ms seek-plus-rotate per access works out to roughly 1/0.005 ≈ 200 random ops/sec). This means that random reads/writes are not a bottleneck on SSDs. On the other hand, most of our existing database technology is designed to store data on spinning disks, so the natural question is: can these databases harness the full potential of SSDs? To answer this question, we ran two separate artificial random-read workloads, one on HDFS and one on HBase. The goal was to stretch these products to the limit and establish their maximum sustainable throughput on SSDs.

HDFS random-read on cached data
In the first experiment, we created an HDFS cluster with a single NameNode and a single DataNode. We created a 2 GB HDFS file with an HDFS block size of 256 MB and a replication factor of 1. The DataNode ran on a machine with 16 hyper-threaded cores and stored its block data on xfs. Our benchmark program was co-located on the DataNode machine and had hdfs-read-shortcircuit switched on, i.e. the DFSClient bypassed the DataNode and issued read calls directly to the local xfs filesystem. The entire 2 GB of data was cached in the OS buffer cache and the benchmark did not trigger any IO from disk; the fact that all the data was in the OS cache essentially simulated the behavior of an ultra-fast SSD. We varied the number of client threads, and each client thread did a pre-configured number of 16K read calls from HDFS. Since there were only 8 blocks in the file, the DFSClient cached the locations of all 8 blocks and made no repeated calls to the NameNode. The first few iterations of this test showed that HDFS can sustain a maximum random-read throughput of around 50K ops/sec, but surprisingly the CPU was not maxed out. We found that the read-shortcircuit code path spent considerable time in DNS lookup calls and in updating metric counters. We fixed these two pieces of code and observed that HDFS could sustain a peak random-read throughput of around 92K ops/sec, with CPU usage now close to 95%. HdfsPreadImage is a plot that captures this scenario. The takeaway is that a database layered above HDFS would not be able to utilize all the iops offered by a single SSD.
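
For readers who want to approximate this experiment, the sketch below shows what such a multi-threaded random-pread client looks like. This is not the exact benchmark we ran; the file path, thread and read counts, and the short-circuit property name (which differs across HDFS releases) are assumptions.

    import java.util.Random;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRandomReadBench {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The short-circuit property name varies across releases;
        // "dfs.client.read.shortcircuit" is the Apache HDFS name.
        conf.setBoolean("dfs.client.read.shortcircuit", true);

        final FileSystem fs = FileSystem.get(conf);
        final Path file = new Path("/bench/2g-file");   // hypothetical test file
        final long fileLen = fs.getFileStatus(file).getLen();
        final int readSize = 16 * 1024;                 // 16K reads, as in the experiment
        final int readsPerThread = 100000;
        final AtomicLong totalReads = new AtomicLong();
        int numThreads = 32;

        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        long start = System.currentTimeMillis();
        for (int t = 0; t < numThreads; t++) {
          pool.submit(new Callable<Void>() {
            public Void call() throws Exception {
              byte[] buf = new byte[readSize];
              Random rand = new Random();
              FSDataInputStream in = fs.open(file);
              for (int i = 0; i < readsPerThread; i++) {
                // positioned read (pread) at a random offset within the file
                long pos = (long) (rand.nextDouble() * (fileLen - readSize));
                in.readFully(pos, buf, 0, readSize);
                totalReads.incrementAndGet();
              }
              in.close();
              return null;
            }
          });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        double secs = (System.currentTimeMillis() - start) / 1000.0;
        System.out.println("random reads/sec = " + (long) (totalReads.get() / secs));
      }
    }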

A profiled run of the HDFS code shows that the DFSClient's code paths are quite long and cause an appreciable hit to throughput for cached random reads. If data-fetch times are in the millisecond range (from spinning disks), the long code paths in the DFSClient do not add appreciable overhead, but when the data is cached in the OS cache (or sits on SSDs), these code paths need some major rework. Another option would be to write an HDFS read-only client in C or C++, thereby avoiding some of the overhead of the current Java-based DFSClient.

HBase random-get on cached data
In the second experiment, we did something similar with HBase. We created a single table with a single region, and all data was cached in the OS cache of a single HBase regionserver; again, the OS cache simulated a super-fast SSD device. We used a set of 4 client machines to drive random-get calls to the regionserver. The regionserver was configured to use a maximum of 2000 threads. The HBase table had LZO compression and fast-diff delta encoding enabled. Since the data set was cached in OS buffers, this benchmark did not cause any disk IO from spinning disks. We saw that HBase throughput maxed out at around 35K ops/sec, and we were not able to drive CPU usage on that machine above 45%; heavy lock contention and heavy context switching prevent the regionserver from using all the available CPU on the machine. The detailed chart is at Cache4G.
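
For illustration, the skeleton of one such random-get client is sketched below using the HBase client API of that era. It is not the load generator we used; the table name, row-key format, and key count are assumptions, and the real test spread many threads across 4 client machines.

    import java.util.Random;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseRandomGetBench {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "benchtable");   // hypothetical table name
        int numKeys = 4 * 1024 * 1024;                   // assumed key space
        int numGets = 1000000;
        Random rand = new Random();
        long misses = 0;

        long start = System.currentTimeMillis();
        for (int i = 0; i < numGets; i++) {
          // row keys are assumed to have the form "row-<n>"
          byte[] row = Bytes.toBytes("row-" + rand.nextInt(numKeys));
          Result r = table.get(new Get(row));
          if (r.isEmpty()) {
            misses++;   // should not happen against a fully loaded table
          }
        }
        double secs = (System.currentTimeMillis() - start) / 1000.0;
        System.out.println("gets/sec = " + (long) (numGets / secs) + ", misses = " + misses);
        table.close();
      }
    }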

What does this mean
The two experiments show that HBase+HDFS, as it stands today, will not be able to harness the full potential offered by SSDs. It is possible that some code restructuring could improve the random-read throughput of these solutions, but my guess is that it will take significant engineering time to make HBase+HDFS sustain a throughput of 200K ops/sec.

These results are not unique to HBase+HDFS. Experiments on other non-Hadoop databases show that they, too, need to be re-engineered to achieve SSD-capable throughputs. My conclusion is that database and storage technologies would need to be developed from scratch if we want to utilize the full potential of Solid State Devices. The search is on for these new technologies!


31 comments:

  1. Nice write-up Dhruba!

    Have you tried benchmarking write performance in the case of HBase? SSD writes are slower than reads because of garbage collection, and random writes are worse still. It would be good to know your thoughts on that, and whether that aspect has been experimented with.

  2. Thanks Prashant. I have not yet benchmarked the write performance of a database on SSD. Will do so at a future date and report here.

  3. But 35K ops of HBase is how many disk ops? There's got to be IO amplification, right, especially on the write path? I can't imagine 1 op of HBase is exactly 1 op on the disk. Most database technologies have a 1:5 to 1:10 amplification.

  4. IO amplification comes from primary (1-level) and secondary (2-level) index lookups and updates, transaction (commit) logs, defragmentation of data, index rebuilds (defragmentation of indexes), multi-versioning (as in SSTables and Oracle undo logs), checkpoints, archiving, WAN replication, and so on. There are many foreground and background IOs issued by these systems to reliably support a single user op with high performance.

  5. Isn't the I/O amplification heavily workload dependent? For example, consider an insert-only workload, with the key being an increasing sequence number, and only key-based single-row lookups. Isn't the IO amplification for HBase (for reads) close to 1?

  6. I agree that IO amplification applies to writes and is workload dependent. My experiment above is for a pure read workload. I am looking for a database that can drive a read-heavy workload on an SSD.

    The 35K ops to HBase can potentially result in 35K iops to the disk, especially if the workload is purely random-read. For a pure random-read workload, the HBase in-RAM cache does not have any role to play. Does this make sense?

  7. If we consider an HDFS that adds storage-device-type information to volume and block metadata, and an extension of the HDFS API to specify storage-device affinity, then we might see an HBase that stores data on spinning media, but WALs, flush files, and other short-lived and frequently accessed objects on SSD. Do you think we may see enough benefit from this, plus the (theoretical) improvements you mention, to make a hybrid Hadoop+HBase+SSD+SATA storage architecture make sense?

  8. ... of course I am not talking about a pure (cold) read workload, but something more akin to FlashStore: http://www.vldb.org/pvldb/vldb2010/papers/I04.pdf

  9. Thanks for your comments Andrew. It is very much possible to make HDFS expose APIs so that data can go to spinning disks and other metadata objects to SSD. It would no doubt improve HBase performance, and it is worth doing. The one thing I am still doubtful about is whether, as we get machines with more and more cores, some of our existing database technologies will be able to effectively utilize all those cores.

  10. Following on to Andrew's comment about hybrid storage, one potential setup would be to store the primary replica of a file on SSD and the 2 backup replicas on spinning disks. The write path in HBase is always sequential, where spinning media does OK. If the regionserver can work with HDFS to localize and read from the copy of a file that is on an SSD, then you get most of the benefits of the SSD at a third of the cost.

    You could then go further and get more granular about what's stored on the SSDs: certain tables, certain regions, non-datablocks, evicted block cache entries, etc. But it seems like storing just the primary replica on SSD could buy you a lot relative to the implementation complexity.

  11. Agree with you Matt. Hybrid storage reduces the storage cost compared to pure-SSD storage. But the thing is that an application can only do a few thousand 16K random reads per second from an HDFS datanode, so you won't be able to utilize the maximum iops offered by the SSD.

  12. Even if it's poorly utilized, could the SSD still be a big improvement? Maybe you only get 4k reads/second, but if your app was previously bottlenecked by the disks at 400 reads/second then it could make a huge difference to the overall application.

  13. Agree again, Matt. An HBase workload that currently runs on spinning disks could run much better if/when the data is served from SSDs. But when it is time to benchmark/compare HBase with some other database on SSD, then there is lots to be done.

  14. Thanks for the article Dhruba.
    If the SSD could be shared so that a handful of servers share the 200K+ IOPS of an SSD (thereby amortizing the cost), then do you think that a jump from 400 IOPS (from HDDs) to 35-92K IOPS on an SSD would help a real workload on HBase/HDFS?

  15. The last comment was by me (Sujoy)...not sure why it showed up as unknown.

  16. @Sujoy: you are absolutely right. In fact, we currently run multiple server instances per SSD just to be able to utilize all the IOPS. This is kind of a poor man's solution to the problem. Also, you have to have enough CPU power on the server to be able to drive multiple database instances on the same machine.

  17. I am really amazed by this site and really gonna share this site to my friends.

  18. Dhruba,

    Have you tried MapR? It definitely won't help improve HBase numbers, but it should improve raw DFS IO and reduce CPU%.

  19. " We found that the read-shortcircuit code path spent considerable time in DNS lookup calls and updating metric-counters. We fixed these two pieces of code and observed that HDFS could sustain a peak random-read-throughput of around 92K ops/sec, the CPUs was now close to 95% usage. "

    Have you committed your changes back to Apache?

  20. @Vladimir: I have not tried MapR. And we have not yet committed the above-mentioned change to Apache HDFS.

  21. I am relatively new to HDFS administration. I am really struggling to find good information on ideal HDFS block sizing to support the efficient operation of MapReduce jobs as well as HBase datafile/index storage. Any suggestions on good resources? Thanks
    Jon

  22. Any updates on this in the months since May? I am considering SSD for Hadoop but this basically tells me not to bother...

  23. Hi Alex, the FB version of HDFS (https://github.com/facebook/hadoop-20) has some fixes that improve the random-read performance of local short-circuit reads from a mere 92K ops/sec (when I measured it) to 1 million reads/sec. I do not know what those particular fixes are, but I am assuming that HDFS will make some significant improvements to latency going forward.

  24. Quick question on the HBase benchmark - you said 45% CPU at 30K ops per second. Did you try with 2 regionserver processes to see if you could push this to 70K qps and CPU to 90%? Running two processes would give you some freedom from concurrency issues.

    Another comment - is the workload batched multi-gets for HBase?

  25. This comment has been removed by the author.

  26. Dhruba,

    This is one of the most informative, precise, concise and well-written posts on the Hadoop/SSD topic. Kudos and thanks for sharing!

    What are your thoughts on the following regarding harnessing SSDs smartly for Hadoop:

    1. Would server-side SSD caching (of read-only I/Os) be a better and more cost-effective performance alternative to using SSD as the Hadoop/HBase storage device? This could potentially avoid SSD write pitfalls.

    2. Regarding your observation "In a data warehouse, a large portion of queries access only recent data...Another option would be to put the map outputs themselves on SSDs, which could potentially reduce map-output-fetch times" - again, would it make more sense to deploy an SSD read-only cache for frequently accessed 'recent' map output instead of using SSD as a storage device for that data?

    3. Another aspect to look at: let's assume that the SSD itself (whether used as storage or cache) brings a 5X I/O performance boost over HDD when map output harnesses the SSD. In the overall Hadoop workflow and scheme of things, does it matter? Is the I/O boost only an incremental improvement for the overall Hadoop application, or does it translate to, say, a ~5X overall application performance boost?

    Look forward to your response.

    Thanks.

  27. 1. yes
    2. yes
    3. I do not think that overall job latency will decrease by 5X. The copy phase is a big bottleneck that could be partially addressed by faster storage for map outputs.

  28. This comment has been removed by the author.

  29. This comment has been removed by a blog administrator.

  30. Very nice article Dhruba. Hadoop as-is will not get much of a speed-up using SSDs as you described; HBase may do slightly better. But I think Spark / Shark should get a significant speed-up using SSDs.
    I wanted to get your thoughts on Spark / Shark.
