A New Rating System for the Data Center Could Prevent Outages
Over the last 12 months, numerous high-profile data center outages have hit providers such as Level3, Telecity, Colo4, Equinix and NTT, taking hundreds of businesses offline and affecting companies that use those locations not only for hosting but also for connectivity. These outages are the latest in a line of slip-ups for data center providers and cloud-computing firms, but could they all have been prevented? The incidents stemmed from a variety of issues: power outages at Level3 and Telecity, electrical component failures at Colo4, and networking issues at NTT. In some cases the outages were so severe that they physically damaged equipment.
Global infrastructure also means global outages, and historical statistics have shown that most data center outages are the result of human error. Some of the recent high-profile outages, however, were caused by failures on the infrastructure side; all of these outages are avoidable. Today a number of systems are used globally to classify the performance and availability of data center facilities, but not all focus on the same components. Most data center rating systems tend to focus on the mechanical and electrical infrastructure of the facility. Important as that infrastructure is, operations and maintenance, including the human-error factor, contribute just as much to the continuous operation of data centers.
Beyond the physical infrastructure lies the information technology (IT) layer, from the physical to the logical layers. Together, these layers serve the ultimate goal of any organization: business continuity and availability. An assessment of just the mechanical and electrical infrastructure does not address all the variables, and that makes any risk assessment inaccurate. For example, the loss of a UPS system may cause downtime for a particular data center, but not necessarily for the business as a whole. The impact also depends on the purpose of the facility: an enterprise data center is different from a colocation or hosting data center, and from a public cloud provider’s facility.
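The gap between site-level and business-level downtime can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from any rating system; it assumes a service that keeps running as long as at least one site is up, and it assumes site failures are independent, which real outages (shared networks, shared operating procedures, human error) often are not. The function name and the 99.9% figure are hypothetical.

```python
from math import prod


def group_availability(site_availabilities):
    """Availability of a service that stays up while at least one site is up.

    Assumption: site failures are statistically independent. Correlated
    failures would make the real figure lower than this estimate.
    """
    # The service is down only when every site is down at the same time.
    all_sites_down = prod(1.0 - a for a in site_availabilities)
    return 1.0 - all_sites_down


single_site = 0.999  # hypothetical availability of one facility
two_sites = group_availability([single_site, single_site])

print(f"one site:  {single_site:.6f}")
print(f"two sites: {two_sites:.6f}")
```

Under these assumptions, a facility-only rating describes the first number, while the organizational risk depends on the second.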
A better understanding of the risk profile, and of the fact that not all data centers are created equal, might have prevented the outages described above, which is why the tiering system currently adopted needs to fundamentally change. The current data center rating systems are unfit for use in the new world of IT delivery, especially where IT purchases are delivered over the cloud. With just four distinct levels separating a basic physical infrastructure design from the most stringent, fail-proof facility offering 99.995% service availability, the current scale is too coarse; a more comprehensive method is needed to understand a data center’s resilience.
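To put the 99.995% figure in concrete terms, an availability percentage can be converted into the downtime it permits per year. The snippet below is a minimal sketch of that arithmetic; the function name is illustrative, the year is taken as 365 days, and the 99.9% comparison value is a hypothetical figure rather than one from any rating system.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def annual_downtime_minutes(availability_percent):
    """Minutes of downtime per year implied by an availability percentage."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


# 99.995 is the figure quoted in the article; 99.9 is a hypothetical comparison.
for pct in (99.995, 99.9):
    minutes = annual_downtime_minutes(pct)
    print(f"{pct}% availability allows about {minutes:.1f} minutes of downtime per year")
```

At 99.995%, the allowance works out to roughly 26 minutes a year, a budget that a single prolonged incident can exhaust.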
PTS Consulting, for example, has been lobbying for change and has introduced its own proprietary STARS methodology and tool to assess the availability of data center facilities more holistically. STARS evaluates a data center’s resilience, redundancy and scale of infrastructure along with the operational maturity of the site. It differs from other systems in the industry and has 21 assessment levels, which provide a more granular understanding and analysis of a data center’s infrastructure.
In the situations outlined previously, the STARS system could have identified areas of weakness and warned the owners that they would be unable to meet the promised SLAs. A holistic approach to the business and services configuration and infrastructure is the best way to rate a data center or group of data centers, rather than a focus on just the mechanical and electrical infrastructure of an individual data center.
The few rating and certification systems in use today tend to focus on the individual data center, in some cases with no consideration of its operation as part of a group, and within that scope they focus only on the mechanical and electrical infrastructure. Some systems take into consideration operations and maintenance and some aspects of the physical IT layer, but none captures all of these in a unified rating system that addresses the key question: what is the organizational risk?
We think it is time for the data center assessment system to change, putting in its place a more robust solution that could make outages a thing of the past—now, wouldn’t that be nice?
Leading article photo courtesy of clayirving