
Top 5 Questions to Ask Flash Storage Vendors
NAND flash storage is the new rage in IT, and it is quickly becoming the focal point of new solution architectures. Where once hundreds or thousands of disk drives, dozens of LUN groups, many shelves, RAID types, unit allocation, hot spots, and complex tiering software had to be managed, all data can now be placed into one – or a small few – all-flash arrays and receive amazing speed with little to no tuning or advance planning. Even for systems with a moderate I/O workload, this new technology can be cheaper once software elimination, power reduction and administration time are factored in. But not all flash storage solutions are the same.
Along with NAND flash comes a new set of terminology, a new set of pros and, most certainly, a new set of cons. New flash-based storage vendors are popping up by the dozens. So what is an IT person to do with all this new technology? Simple: ask questions and, when in doubt, test the solution first.
Storage purchases are expected to live in production for at least three to five years. Knowing which companies are likely to still be around, what makes their technology different, and whether flash storage will be the killer new toy or the aggravating purchase you soon regret comes down to understanding this new technology and how each vendor uses it.
Question #1: Technology Ownership and Support
What parts of your storage solution have you designed and manufactured, and which portions have been purchased from other companies? For each part, who supports it? Are replacement parts stored in a local depot? What is the support process?
Why you should care
With the patent landscape as it is and the amount of time it takes to develop these advanced algorithms, it is easy to see why many flash storage startups have chosen to buy off-the-shelf solid state drives (SSDs) and aggregate them via software. Be aware of which vendors are flash storage developers and which are third-party aggregators. SSDs are designed as hard drive replacements: they are bootable SCSI devices, and that SCSI controller adds unnecessary latency. It also means that parallelization, error correction, wear leveling and garbage collection are not under the control of the full array; each drive performs them individually rather than as part of one chassis-aware system.
Enterprise-class storage systems should have enterprise-class support. You may not want a solution where spare parts are not kept in a local depot, where the storage vendor has to route support calls to a different vendor, or where local staff are not available for upgrades or part replacement.
Question #2: High Availability
Are there any single points of failure in the device? How many flash-aware controllers are there? Does HA require buying two arrays? Does HA affect I/O latency or throughput? Does the HA feature or software cost extra? What happens to I/O performance after a failure (ask for each component in the array)? How are failed parts serviced (hot swap or downtime)?
Why you should care
In the rush to get products to market while keeping costs under control, it is common for vendors to ship solutions whose behavior is severely affected by component failure, up to and including the array itself going offline. An enterprise solution should always be on, lose very little performance after a component failure, and allow full hot-swap of every component. Make sure you understand whether you have to buy two arrays to get full redundancy, whether turning on spanning RAID affects performance, and whether changing out components requires downtime. Any of these can leave your system underperforming or leave your data at risk while waiting for the next scheduled maintenance window.
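One way to make the "performance after a failure" question concrete is a quick back-of-the-envelope calculation. The controller counts and the linear scaling model below are assumptions for illustration only; ask each vendor for measured degraded-mode numbers.

def degraded_fraction(total_controllers, failed_controllers):
    """Naive model: performance scales with the surviving controllers."""
    surviving = total_controllers - failed_controllers
    return surviving / total_controllers

# A dual-controller array that loses one controller drops to ~50% of its
# I/O capability; an array with eight flash-aware controllers loses ~12.5%.
for total in (2, 8):
    remaining = degraded_fraction(total, failed_controllers=1)
    print(f"{total} controllers, 1 failed: {remaining:.0%} of peak remains")

# Ask for measured numbers per component (controller, flash module, power
# supply, fabric link) rather than relying on a linear model like this one.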
Question #3: Normalization
What are your sustained I/O metrics? What are the metrics under real-world workloads (e.g., a 70/30 read/write mix)? Are your quoted performance figures post-calculation, such as post-deduplication or post-compression? Are your sustained metrics measured after the normal flash burn-in period? What I/O size is used in the metrics (512 bytes, 4k, 8k, etc.)?
Why you should care
Determining the source, calculations and conditions behind vendor-supplied metrics is vital to understanding what you are actually buying. Some vendors quote IOPS (I/Os per second) at 512 bytes and some at 4k. Some vendors quote only read IOPS rather than write or mixed workloads. Others quote post-calculation metrics, meaning the array requires compression or de-duplication to achieve the quoted number and cannot meet it on its own. Also, most flash storage products settle into a performance zone lower than their out-of-the-box numbers. Make sure you get the post-burn-in figures, as those are what you will see in production over the coming months and years.
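To see why I/O size and post-reduction math matter, here is a small back-of-the-envelope sketch. The 500,000 IOPS headline and the 4:1 data-reduction ratio are made-up, illustrative numbers, not vendor data.

# Illustrative IOPS normalization math with assumed numbers.
def throughput_mb_s(iops, io_size_bytes):
    """Convert an IOPS figure at a given I/O size into MB/s."""
    return iops * io_size_bytes / 1_000_000

claimed_iops = 500_000                      # assumed vendor headline number
print(throughput_mb_s(claimed_iops, 512))   # ~256 MB/s at 512-byte I/Os
print(throughput_mb_s(claimed_iops, 4096))  # ~2048 MB/s at 4k I/Os

# "Effective" IOPS quoted post-deduplication/compression: the array only
# has to physically service a fraction of the logical I/O.
data_reduction_ratio = 4          # assumed 4:1 dedupe/compression ratio
raw_iops = 125_000                # what the hardware actually sustains
effective_iops = raw_iops * data_reduction_ratio
print(effective_iops)             # 500,000 "effective" IOPS -- but only if
                                  # your data actually reduces 4:1

The same headline number can describe an 8x difference in real throughput, and a post-reduction figure depends entirely on how well your data set reduces.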
Question #4: Parallelization
How many flash-aware controllers are in the solution? Over how many components are wear leveling, error correction, garbage collection and striping performed? Are the controllers SCSI-based or custom flash-aware? How is parallelization affected by component failure?
Why you should care
All but a few flash storage vendors are getting to market quickly by reselling third-party SSDs. Most have no NAND flash storage engineers or controller logic developers on staff. This means the SCSI controller becomes the latency bottleneck, and processes like wear leveling and error correction are out of the hands of the part-aggregating vendor. Flash can deliver incredibly low latencies at incredibly high IOPS. Quick-to-market SSD-based arrays can yield faster-than-disk performance, but they are largely a transition technology whose time is starting to pass as fully chassis-aware flash arrays come onto the market.
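The scope question ("over how many components?") can be sketched with a toy model. The die counts and the skewed write pattern below are invented for illustration; real controllers are far more sophisticated than this.

# Toy model contrasting per-SSD wear leveling with chassis-aware wear
# leveling, using invented die counts and a skewed write pattern.
import random

random.seed(1)
SSD_COUNT, DIES_PER_SSD, WRITES = 8, 16, 100_000

# Skewed workload: most writes land on a "hot" LBA range owned by SSD 0.
writes_per_ssd = [0] * SSD_COUNT
for _ in range(WRITES):
    target = 0 if random.random() < 0.6 else random.randrange(SSD_COUNT)
    writes_per_ssd[target] += 1

# Per-SSD leveling: each drive can only spread its own writes over its own dies.
per_ssd_max_wear = max(w / DIES_PER_SSD for w in writes_per_ssd)

# Chassis-aware leveling: one system spreads all writes over every die it owns.
chassis_max_wear = WRITES / (SSD_COUNT * DIES_PER_SSD)

print(f"hottest die, per-SSD leveling:    {per_ssd_max_wear:,.0f} writes")
print(f"hottest die, chassis-aware model: {chassis_max_wear:,.0f} writes")

In this toy model the hottest die wears several times faster when leveling is confined to each drive, which is the kind of behavior the question is meant to surface.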
Question #5: User-Facing Architecture
Does your storage solution require the user to create RAID groups and unit-based LUN groups, and to follow an "Aggregation and Segregation" architecture model?
Why you should care
Storage architecture has long been based on the Aggregation and Segregation model. Individual storage parts (disks) are aggregated together to service the requested I/O profile, and these groups are then commonly segregated so that one workload does not affect another. This requires someone to collect all of the workload groups, define their I/O profiles, choose the number of units to place in each LUN group, choose the RAID level for each LUN, and then monitor and maintain the system. Common byproducts are hot spots related to data locality and the need to specify workload I/O profiles in advance. It is also common for application developers and database admins not to know what their future I/O profiles will be, which creates additional friction in IT departments.
Distributed block architecture is the way of the future. Because flash is an all-silicon technology with no moving parts, every storage location can be equally accessible, at the same speed, all of the time. This means administrators can place any data, in any format, anywhere on an AFA (all-flash array) and it will always run at the same speed with no tuning or advance planning. The future is zero-risk performance with almost no setup or tuning. Speed comes with the array: each I/O is striped over all of the components, so every I/O runs at the maximum speed of the chassis. Space is consumed as space is needed, and when more space is required another array is purchased. It sounds crazy, but it means solutions engineers will buy space when they need space instead of buying space to get speed. Most transitional SSD-based solutions still require the Aggregation and Segregation model, or internally create a basic RAID 5-like stripe over all of the SSDs, which interferes with wear leveling, error correction, and write-cliff optimizations.
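As a rough sketch of why striping every I/O over all components removes the need for up-front LUN planning, consider the toy bandwidth model below. The component count and per-component speed are assumptions chosen only to show the shape of the comparison.

# Toy comparison of the two placement models described above, using
# assumed component counts and per-component bandwidth.
PER_COMPONENT_MB_S = 250      # assumed speed of one flash module/SSD
TOTAL_COMPONENTS = 32         # assumed components in the chassis

# Aggregation/Segregation: a workload is confined to the LUN group it was
# assigned at design time, so its ceiling is the size of that group.
def lun_group_ceiling(components_in_group):
    return components_in_group * PER_COMPONENT_MB_S

# Distributed block: every I/O is striped over every component, so every
# workload sees the full chassis regardless of when or where it was placed.
def chassis_ceiling():
    return TOTAL_COMPONENTS * PER_COMPONENT_MB_S

print(lun_group_ceiling(4))   # 1,000 MB/s ceiling for a 4-unit LUN group
print(chassis_ceiling())      # 8,000 MB/s ceiling for any workload on the AFA

Under these assumptions, a workload in a four-unit LUN group is capped at a fraction of the chassis, while a distributed-block workload gets the whole chassis without anyone having planned for it.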
Summary
NAND flash storage is a relatively new and quickly growing storage medium that brings wonderful performance to enterprise solutions. Like anything else, new technology comes with a new set of benefits and challenges. Understanding how the technology works and what makes each storage vendor's solution different is the difference between a five-year success and a five-year headache.
About the Author:
For over 15 years, Matt Henderson has been a database and systems architect specializing in Sybase and SQL Server platforms, with extensive experience in high volume transactional systems, large data warehouses and user applications in the telecommunications and insurance industries. Matt is currently an engineer at Violin Memory.