The hard (spinning) disk is a formidable target for cloud-scale enterprise software. It allows enterprises to store massive amounts of data at near-tape costs, and it can offer organizations resilience of up to 14 nines of durability, on-premises. But spinning disks aren't failproof. Sometimes there are issues in the manufacturing process, or entire batches of drives fail. That's when you really want a storage system that can help your organization recover from such failures effectively and quickly, and a solution provider who can support you through the process. Read on for a scary situation one of our customers recently faced, and learn how we kept the wheels turning smoothly. That's the Scality promise: we aren't just providing a product, we are there to be a true partner for the organizations we work with!

The scenario: A company's hard drive fails

A public sector entity in the APAC region runs a six-server RING filled with 8TB HDDs (56x 8TB HDDs per server). They use the RING's native S3 capability as a target for their company-wide backups. Everything was going smoothly until one of the organization's 8TB hard drives failed, and the agency contacted Scality's customer success engineers (CSEs) for help. The good news is that with Scality, a single drive failure isn't a big concern. Their RING is protected by an 8,4 erasure coding schema, which allows either two entire servers (out of the six) or any four disks anywhere in the cluster to fail without data loss.

RING's rebuild process

In a normal hard drive failure scenario, the RING's internal rebuild mechanism takes over and rebuilds the data written to the failed disk onto the remaining disks in that server. This in-RING rebuild process has two key advantages:

- It only rebuilds the data that was actually written, not the entire disk, so no cycles are wasted rebuilding empty space.
- It has no dedicated parity drives to become a bottleneck during rebuilds, because the rebuilt data is spread across all the remaining drives in the server.

In this particular case, the customer's application was writing fairly small objects: the average unique object sent to the RING was about 300KB. After being split into eight data chunks (for the 8,4 EC schema), the average chunk written to disk was 37.5KB. Even at 80% full, an 8TB drive would hold over 170 million of these chunks, all of which would have to be rebuilt one by one. Understandably, this would take some time, but it would ultimately be OK.

A repeat failure

But then, two days later, a second 8TB drive failed on a different server. The first failed disk had almost been fully rebuilt, so the second drive failure wasn't a major concern yet. Still, given the two failures in close succession, and knowing the pattern of batch drive failures, the customer's IT team was anxious and began examining their other drives for any possible additional failures. Again, RING was able to help, reliably flagging the drives most likely to fail next based on their erratic response times and delayed write confirmations.

When the third drive failed, our Scality CSE team was fairly certain we were dealing with the dreaded batch failure. The drives weren't old (they'd been in production for about 2 ½ years) and hadn't been overly stressed: the workload was mostly writes and deletes with the occasional read. The only question in our minds was: how big would this batch of failed drives be? The answer, as we'd soon find out: much bigger.
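For readers who want to check the numbers, here's a minimal Python sketch of the arithmetic behind the 8,4 erasure coding schema and the rebuild workload described above. The inputs are the approximate figures from this story (300KB average objects, 8TB drives assumed 80% full), used purely for illustration.

```python
# Back-of-the-envelope check of the 8,4 erasure coding numbers in this story.
# All inputs are approximate figures from the article, not measured values.

DATA_CHUNKS = 8                    # the "8" in the 8,4 schema
PARITY_CHUNKS = 4                  # the "4" in the 8,4 schema
SERVERS = 6                        # servers in this RING
AVG_OBJECT_BYTES = 300 * 1000      # ~300KB average object
DRIVE_BYTES = 8 * 1000**4          # 8TB drive (decimal)
FILL_RATIO = 0.8                   # drive assumed 80% full

# Any 8 of the 12 chunks can reconstruct an object, so up to 4 lost chunks
# are survivable. With 12 chunks spread over 6 servers (2 per server),
# losing two whole servers still leaves 8 chunks intact.
total_chunks = DATA_CHUNKS + PARITY_CHUNKS      # 12
chunks_per_server = total_chunks // SERVERS     # 2
tolerable_chunk_losses = PARITY_CHUNKS          # 4

chunk_bytes = AVG_OBJECT_BYTES / DATA_CHUNKS                # 37.5KB per chunk
chunks_on_drive = DRIVE_BYTES * FILL_RATIO / chunk_bytes    # chunks to rebuild

print(f"Average chunk size: {chunk_bytes / 1000:.1f} KB")
print(f"Tolerable losses: {tolerable_chunk_losses} chunks, "
      f"or {tolerable_chunk_losses // chunks_per_server} whole servers")
print(f"Chunks to rebuild on one 80%-full 8TB drive: ~{chunks_on_drive / 1e6:.0f} million")
```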
Surviving a batch failure: Bouncing back from a 10-drive failure in just three weeks!

After the initial two drive failures, the customer experienced eight more drive failures in succession. That's a total of 10 drive failures in three weeks! The customer's implementation could handle four simultaneous disk failures anywhere within the RING, but not five, six…and certainly not ten! As the failures accelerated, our engineering team knew they'd have to dig deeper to combat them.

Scality RING gives our engineers the ability to scan all the objects that have been written and report how much data and how many parity chunks each object still has remaining on the RING. Our CSEs took this output after every new disk failure, reviewed the objects most at risk and modified the object rebuild order accordingly. Some rebuild tasks were activated manually to make sure that the objects with the least remaining parity were rebuilt first. This manipulation of the rebuild order, ensuring that the objects with the fewest parity chunks remaining were always rebuilt first, was critically important to the customer.

The result: No data loss

It was at times an arduous and manual process, but we achieved the most important goal: no data was lost. And when you consider the vast amount of precious data that any public sector entity, let alone any organization, is storing today, that's a significant achievement.

Under normal disk failure circumstances, the automated rebuild process is sufficient to return the RING to its nominal (healthy) state. With multiple simultaneous disk failures, however, manual intervention was required to handle circumstances that were changing rapidly, sometimes a couple of times per day! Despite a 10-disk batch failure, no data was left behind. We survived.

This was a real-world test not just of RING's capabilities, but of our support team's hard work and dedication. At Scality, protecting your data is our mission, and we're up to the task, even when multiple disks fail. Carpe datum!
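To make the rebuild prioritization concrete, here's a minimal Python sketch of the idea described above: after each disk failure, scan the object inventory, count how many chunks each object still has available, and feed the most fragile objects to the rebuild workers first. The ObjectStatus record and the example keys are hypothetical stand-ins for illustration; this is not Scality RING's actual tooling or API.

```python
from dataclasses import dataclass
from typing import Iterable, List

DATA_CHUNKS = 8     # in the 8,4 schema, any 8 chunks can reconstruct an object
TOTAL_CHUNKS = 12   # 8 data + 4 parity chunks written per object

@dataclass
class ObjectStatus:
    """Hypothetical record produced by scanning the object inventory."""
    key: str
    chunks_remaining: int   # chunks still readable after the latest disk failure

def prioritize_rebuilds(scan: Iterable[ObjectStatus]) -> List[ObjectStatus]:
    """Order objects so those closest to unrecoverable are rebuilt first.

    An object with only 8 chunks left has zero spare parity: one more lost
    chunk and it is gone. An object with all 12 chunks is untouched and can wait.
    """
    at_risk = [obj for obj in scan if obj.chunks_remaining < TOTAL_CHUNKS]
    return sorted(at_risk, key=lambda obj: obj.chunks_remaining)

# Example: re-run the scan after each new disk failure and hand the most
# fragile objects to the rebuild workers first.
if __name__ == "__main__":
    scan = [
        ObjectStatus("backup/vm-042", chunks_remaining=8),   # no spare parity left
        ObjectStatus("backup/vm-017", chunks_remaining=10),  # 2 spare chunks
        ObjectStatus("backup/vm-108", chunks_remaining=12),  # untouched
    ]
    for obj in prioritize_rebuilds(scan):
        spare = obj.chunks_remaining - DATA_CHUNKS
        print(f"rebuild {obj.key}: {spare} spare chunk(s) remaining")
```

In the incident described here, this kind of ordering was recomputed by our CSEs after every new disk failure, sometimes several times a day.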