January 23, 2012
Choosing a data center is a big decision for most companies. Your IT infrastructure represents a critical asset for your company, and unless you are an uber dot-com company like Google or Facebook (which spread their gear around the country in tens of locations), you probably only have one or two data centers. Changing data centers is expensive and time-consuming, so choosing the right data center partner is incredibly important.
Unfortunately, data centers don’t make it easy for you to differentiate between them. Everyone says they are “secure,” “highly available,” and “high density.” They all show you their generator farms, their battery rooms, and their security vestibules with bulletproof glass. Tour any three data centers and you’ll be left scratching your head trying to figure out what the difference is. As a result, many people end up using price and proximity as the primary decision points. Or even worse, they treat non-material amenities like free sodas and Xboxes in the break room as the deciding factor.
There are critical differences, however, between data centers. Failing to recognize them can cost you more in the long run than any savings you might glean by choosing the low-cost provider. Having purchased services from a multitude of data centers over the last two decades, and having dealt with even more as an IT consultant, I’ve learned to recognize some of the hard-to-spot differences that can make or break a long-term data center relationship. For simplicity (so you can copy/paste into your next RFP), I’ve listed the 10 questions you should ask your next data center below. A detailed explanation of each question follows, so you know what you should look for. I hope you find this list informative.
10 questions to ask your next data center provider
- Which components of the data center facility are both fault tolerant and concurrently maintainable?
- How are cooling zones provisioned to maintain operating temperatures during maintenance or failures of CRAC/CRAH units?
- What are the average and maximum power densities of the facility on a watts/sq’ and watts/cabinet basis?
- How often does the data center load test its generators?
- What are the highest risk natural disasters for the area, and what has the data center done to mitigate their impact?
- What are the minimum skill sets of the remote hands and eyes staff?
- Does the data center maintain multiple redundant sources of fuel and water?
- What certifications has the data center earned, and do they undergo annual audits to maintain them?
- How does the data center track SLA compliance, and what is their historical track record? Can they provide their last 5 failure reports?
- What is the profile of their top 5 clients, and what percentage of total revenue for the facility do they represent?
Which components of the data center facility are both fault tolerant and concurrently maintainable?
Most data centers claim to be N+1 or N+2 redundant, meaning they have redundant components. Oftentimes, they refer to themselves as “Tier III” or “Tier IV,” citing the Uptime Institute’s definitions of the four tiers of data centers. Tier III is defined as being concurrently maintainable: “A Concurrently Maintainable data center has redundant capacity components and multiple distribution paths serving the computer equipment. Only one distribution path is required to serve the computer equipment at any time.” Tier IV is defined as being fault tolerant: “A Fault Tolerant data center has multiple, independent, physically isolated systems that provide redundant capacity components and multiple, independent, diverse, active distribution paths simultaneously serving the computer equipment. The redundant capacity components and diverse paths shall be configured such that “N” capacity is providing power and cooling to the computer equipment after any infrastructure failure… Complementary systems and distribution paths must be physically isolated from one another (compartmentalized) to prevent any single event from simultaneously impacting both systems or distribution paths.”
Sounds impressive, right? Surely a Tier IV data center has all the redundancy required. What they don’t explain is how those components tie together in the real world to deal with an outage that occurs during maintenance. Take, for example, a data center that has 24 generators, 24 UPSs, and 72 CRAH units. The data center wants to do quarterly maintenance on each of these devices without affecting operations, so they are configured as N+1 or even N+2 for some components. The problem is there isn’t enough time in the quarter to do maintenance on one item at a time, so the data center either has to reduce how often it performs maintenance or do concurrent maintenance on multiple systems.
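The scheduling squeeze comes down to simple arithmetic. Here is a minimal sketch: the device counts come from the example above, but the 65 working days per quarter and the one working day per device are my assumptions for illustration, not figures from any particular facility.

```python
import math

def required_concurrency(device_count, days_per_device, working_days):
    """Minimum number of devices that must be in maintenance at once
    to service every device within the available working days."""
    total_device_days = device_count * days_per_device
    return math.ceil(total_device_days / working_days)

# Device counts from the example: 24 generators, 24 UPSs, 72 CRAH units.
devices = 24 + 24 + 72

# Assumed values: one working day per device, 65 working days per quarter.
concurrent = required_concurrency(devices, days_per_device=1, working_days=65)
print(devices, concurrent)
```

Under those assumptions, at least two devices have to be out of service at the same time on a typical day, and the number climbs quickly if a maintenance window runs longer than a day.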
The scenario you need to understand is what impact concurrent maintenance has on the fault tolerance of the data center. Pose this scenario to your data center: if they are doing maintenance on generator #1, UPS #2, and CRAH unit #3 at the same time, and they suffer an outage of generator #4, UPS #5, and CRAH unit #6, and they lose utility power, what happens to their customers? Tier IV specifies isolation from a single event, but what you really need to protect your critical IT infrastructure is isolation from multiple simultaneous events.
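You can reason through the scenario with a quick capacity check. This is a rough sketch, not a model of any real facility: the 24 installed generators come from the example, while the required counts (23 for N+1, 22 for N+2) are assumptions I chose to show the difference.

```python
def surviving_units(installed, in_maintenance, failed):
    """Units still able to carry load."""
    return installed - in_maintenance - failed

def carries_load(installed, required, in_maintenance, failed):
    """True if the surviving units still cover the N units the load requires."""
    return surviving_units(installed, in_maintenance, failed) >= required

# One generator in maintenance and one failed when utility power drops.
# Assumed: N+1 means 23 of the 24 are required, N+2 means 22 are required.
n_plus_1_ok = carries_load(installed=24, required=23, in_maintenance=1, failed=1)
n_plus_2_ok = carries_load(installed=24, required=22, in_maintenance=1, failed=1)
print(n_plus_1_ok, n_plus_2_ok)
```

With these numbers the N+1 design comes up one generator short, while the N+2 design holds with nothing to spare. The same check applies to the UPSs and CRAH units.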
How are cooling zones provisioned to maintain operating temperatures during maintenance or failures of CRAC/CRAH units?
This question is an offshoot of the first question. Some data centers will claim N+1 or N+2 redundancy for their cooling systems, but that is for the entire room or entire data center. If the CRAH unit near your cage fails, and the redundant CRAH is 500’ away on the other side of the room, it won’t do you any good. Does the data center provision their rooms into cooling zones, and provide sufficient redundancy for each zone, such that it can maintain temperatures for all customers in the event of a failure in each zone? Even better, are they both fault tolerant and concurrently maintainable in each zone (see above)? If they suffer a CRAH unit failure while maintaining other CRAH units, can they still maintain temperatures for all customers?
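To see why room-level redundancy can mislead, here is a small sketch that checks each zone separately. The zone layout and unit counts are entirely hypothetical.

```python
from dataclasses import dataclass

@dataclass
class CoolingZone:
    name: str
    installed: int  # CRAH units physically serving this zone
    required: int   # units needed to hold temperature at full load (N)

def zones_at_risk(zones, in_maintenance=1, failed=1):
    """Names of zones that cannot hold temperature when each zone
    loses `in_maintenance` + `failed` units at the same time."""
    lost = in_maintenance + failed
    return [z.name for z in zones if z.installed - lost < z.required]

# Hypothetical room: 12 CRAH units installed, 10 required, so N+2 overall.
room = [
    CoolingZone("A", installed=3, required=3),
    CoolingZone("B", installed=3, required=3),
    CoolingZone("C", installed=6, required=4),
]
room_spare = sum(z.installed for z in room) - sum(z.required for z in room)
print(room_spare, zones_at_risk(room))
```

On paper this room is N+2, yet both spare units sit in zone C. Zones A and B have no redundancy at all, which is exactly what the room-wide number hides.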
What are the average and maximum power densities of the facility on a watts/sq’ and watts/cabinet basis?
Many data centers are showing their age. When they were designed 8 or 10 years ago, expected power densities were much lower. Oftentimes, data centers will quote their power density in available watts/cabinet, but what they don’t tell you is watts/sq’. As an example, one major data center provider, Digital Realty Trust, claims their data centers support up to 15kW/cabinet, but they don’t tell you what cage size that represents. Is that based on a standard 8-tile pitch (32 sq’ per cabinet) or are they spacing the cabinets farther apart (and charging you more) to achieve that power? Most data centers are built to support an average of 100 – 175 W/sq’. More modern data centers are built to an average of 225 W/sq’ and can support individual customers at up to 400 W/sq’, and higher densities with expensive in-row cooling. SwitchNAP in Las Vegas claims they can cool up to 1500 W/sq’, but they don’t tell you how they calculate that number. It is unlikely they actually can cool 48kW/cabinet (1500 W/sq’ * 32 sq’ per cabinet on an 8-tile pitch). More likely, they are only counting the 8-12 sq’ that the cabinet itself and its cold row occupy, and not counting the hot rows and dead space. If I ever tour there, that’ll be my first question.
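The conversion between the two ways of quoting density is worth running yourself whenever a provider gives you only one number. A minimal sketch, assuming standard 2’ x 2’ raised-floor tiles (4 sq’ each); the function names are mine:

```python
TILE_SQFT = 4  # assumes standard 2' x 2' raised-floor tiles

def cabinet_watts(density_w_per_sqft, sqft_per_cabinet):
    """Watts available to one cabinet at a given floor density."""
    return density_w_per_sqft * sqft_per_cabinet

def floor_density(watts_per_cabinet, sqft_per_cabinet):
    """Floor density in W/sq' implied by a per-cabinet power figure."""
    return watts_per_cabinet / sqft_per_cabinet

pitch_sqft = 8 * TILE_SQFT  # 8-tile pitch: 32 sq' per cabinet

# 1500 W/sq' applied to a full 8-tile pitch, in watts
print(cabinet_watts(1500, pitch_sqft))
# The same density counted over only 8 or 12 sq' per cabinet, in watts
print(cabinet_watts(1500, 8), cabinet_watts(1500, 12))
# 15kW/cabinet spread over a full 8-tile pitch, in W/sq'
print(floor_density(15_000, pitch_sqft))
```

The first figure is the 48kW/cabinet mentioned above. Counting only 8-12 sq’ per cabinet gives 12-18kW instead, and 15kW/cabinet on a full 8-tile pitch works out to roughly 469 W/sq’.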
Cabinet power densities are also important. Can the data center provide redundant 3-phase 30A and 50A 208V power circuits to every cabinet, and do they charge a premium for higher density power circuits? If you need power densities beyond 6kW/cabinet, you’ll need 3-phase power or a lot of power cords, and not every data center provides these circuits. Remember to future-proof your expectations. 10 years ago, 2kW/cabinet was sufficient to power a full 42U of x86 servers. With today’s multi-core, high-density blade servers, 8kW – 10kW is required to power an entire rack. Expect power requirements to continue to climb and make sure your data center has plans to grow densities with you.
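To put rough numbers on those circuits, here is a sketch using the standard 3-phase power formula (the square root of 3 times line voltage times current). The 80% continuous-load derating and the unity power factor are assumptions on my part; confirm with your provider how they rate usable capacity.

```python
import math

def three_phase_kw(line_voltage, breaker_amps, derate=0.8, power_factor=1.0):
    """Usable kW from one 3-phase circuit: sqrt(3) * V * I, then derated.
    derate and power_factor defaults are assumptions, not provider figures."""
    watts = math.sqrt(3) * line_voltage * breaker_amps * derate * power_factor
    return watts / 1000

def single_phase_kw(voltage, breaker_amps, derate=0.8, power_factor=1.0):
    """Usable kW from one single-phase circuit: V * I, then derated."""
    return voltage * breaker_amps * derate * power_factor / 1000

for amps in (30, 50):
    print(f"3-phase 208V {amps}A: {three_phase_kw(208, amps):.1f} kW")
print(f"single-phase 208V 30A: {single_phase_kw(208, 30):.1f} kW")
```

Under these assumptions a single-phase 208V 30A circuit delivers about 5kW, which is why anything past roughly 6kW/cabinet pushes you to 3-phase power or multiple cords. The 3-phase 30A circuit yields about 8.6kW and the 50A circuit about 14.4kW.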
----- UPDATE March 12, 2012 -----
Had a short Twitter conversation with Mark Thiele, the EVP Data Center Technology for SwitchNAP. He generously took the time to give me the data behind their 1500W/sq’ claims. SwitchNAP can provide up to 28kW per cabinet in their pods, so the 1500W/sq’ is the density inside a pod. It isn’t wall-to-wall in the data center. In Mark’s words, “the critical fact is that we can cool 54 U cabs of high density servers. With row after row next to each other.” I agree and their design is very cool (pun intended).
How often does the data center load test its generators?
Most data centers test their generators regularly by spinning them up, but few put them under load for an extended period of time. Load testing generators is expensive because it burns fuel and because load test banks are expensive pieces of equipment. One way data centers will cut corners is to skip regular load testing, figuring they’ll lose utility power often enough that they’ll get their load testing on production loads. This can have several impacts. First, if they aren’t regularly load testing, then they are likely to only identify generator problems when utility power fails, which is precisely the wrong time to find an issue. Second, generators that are started but not put under load can sometimes collect unburned fuel in the exhaust system. This is also known as “wet stacking.” If this happens too often, generator fires can occur, usually when the generator is put under load due to a utility outage. Ask your data center provider if they put every generator on an extended load test at least quarterly, and preferably monthly.
What are the highest risk natural disasters for the area, and what has the data center done to mitigate their impact?
Every data center is subject to natural disasters, but some are more vulnerable than others. California data centers have to worry a lot about earthquakes. Ask the data center what their seismic zone is and what they’ve done to protect themselves from a major earthquake. Also important is what provisions they’ve made with their suppliers, most especially their fuel suppliers. If the data center survives a massive earthquake, but the utility power is out and the fuel supplier can’t get to the data center to refuel the generators, the data center will still go dark in 24 – 48 hours. Midwest data centers have to worry about tornadoes. East Coast and Gulf Coast data centers have to worry about hurricanes. How have they protected the roof and critical facility infrastructure from wind damage? Can the generator enclosures and water tanks survive a tornado or high-speed flying debris?
Winter blizzards are also a major problem for some data centers. What provisions does the data center have if their staff can’t get to the facility due to road conditions? It is critically important that you a) understand what disaster scenarios are likely for the facility; and b) work with your provider to make contingency plans based on likely risks. As an example, in blizzard zones, it may be necessary for you to stock extra spare parts for your equipment on-site if replacement parts are likely to be delayed during the winter.
What are the minimum skill sets of the remote hands and eyes staff?
It is an absolute certainty that at some point your equipment will need to be physically touched. Sometimes servers need to be power cycled, hard drives need to be swapped for spares, tapes need to be rotated for backups, and cables need to be moved. You can either drive to the data center yourself, or use their remote hands and eyes services. Many data centers don’t provide this service and recommend you contract with a third-party outsourcer, who may not be on-site. For data centers that provide remote hands services, some try to cut corners by having the security guards provide the service. Unless you trust a rent-a-cop with your critical infrastructure, you want to make sure the staff provided by the data center consists of IT professionals. Ask for minimum job requirements and arrange to speak to the service manager (typically the NOC manager) to get a feel for who will be helping you at 2 AM. Also, if the data center provides a wide range of IT services with knowledgeable people, you can reduce the importance of proximity in your decision-making process. If your own staff has to visit the data center every week, you’ll want it nearby. But if you can depend on your partner for physical work, you can choose the best provider instead of the local provider.
Does the data center maintain multiple redundant sources of fuel and water?
Data centers always brag about their multiple divergent fiber paths, and their multiple redundant utility power sources, but they rarely talk about their water supplies. Water is just as critical to data centers because it is used to chill the air. Does the data center have divergent sources of water including, if available, their own well? Does their water supply have multiple input valves into the chiller supply? The last thing you want is a multi-million dollar data center getting crippled because a single $100 valve failed to open properly. Additionally, does the facility have multiple sources of fuel? As described in the natural disasters section above, when running on generators the data center will likely need to refuel daily. If their fuel supplier fails to deliver, what’s their backup plan?
What certifications has the data center earned, and do they undergo annual audits to maintain them?
SSAE 16, PCI, LEED Gold, Energy Star, FISMA, HIPAA, SCIF, Tier IV. There are a multitude of certification programs that cover data centers. Each has its own value and its drawbacks, but they are all useful tools in guiding your data center choice. If you process credit cards, you want your upstream providers to support your PCI compliance by maintaining their own PCI compliance. If you are a financial organization, you need your data center to be SSAE 16 audited (SSAE 16 replaced the old SAS 70). If your company is environmentally conscious, especially if you purchase carbon credits to offset your power consumption, you want a highly efficient data center with Energy Star and LEED Gold certifications. Ask your data center for proof of their certifications. In most situations, they will also share their audit criteria under NDA. This information is invaluable because it represents independent analysis of the quality, reliability and security of the services provided.
How does the data center track SLA compliance, and what is their historical track record? Can they provide their last 5 failure reports?
Most data centers claim high levels of availability, “five 9’s” or above. Some even give a 100% availability guarantee. But all data centers have issues. All systems have a risk of failure, so a 100% SLA is a lie. They can’t actually guarantee 100% availability, but they can guarantee to pay you a financial penalty if an outage occurs. Regardless of the SLA, all data centers suffer outages, whether or not the outages have a net impact on the customers. To understand what you’re buying, you need to dig into the details of the SLA and how it is measured. More importantly, what is the level of transparency the data center provides you as a customer when an issue happens? Are they honest and forthright regarding the root cause and future mitigation? Do they proactively notify you, or do they wait for you to call? Ask to see reports of their last 5 major issues. Ideally, ask for examples of all communications regarding the issues, including initial notifications, update messages, resolution details, and after-action reports. The best way to gauge future behavior is to look at what they’ve done in the past.
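It helps to translate an availability percentage into the downtime it actually permits. A minimal sketch of that conversion, assuming a 365-day year and that the SLA is measured annually (many are measured monthly, so check the contract):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years

def allowed_downtime_minutes(availability_pct):
    """Minutes of downtime per year permitted by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for label, pct in [("three 9's", 99.9), ("four 9's", 99.99), ("five 9's", 99.999)]:
    print(f"{label}: {allowed_downtime_minutes(pct):.1f} minutes/year")
```

“Five 9’s” leaves a little over five minutes of downtime per year, which makes the measurement details (what counts as an outage, and over what period) more important than the headline number.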
What is the profile of their top 5 clients, and what percentage of total revenue for the facility do they represent?
Companies that have a client base that is heavily weighted towards a couple of key customers are vulnerable to financial problems if those clients leave. This is especially important for smaller data center providers who may depend on 1 or 2 anchor tenants. Ask the data center for some demographics on their top 5 or top 10 clients. They may be prevented from disclosing the names, but they should be able to tell you the industry and percentage of revenue of each client. If their top 5 customers are all in the same industry and represent more than half their revenue, RUN! You are entering a long-term relationship with your provider, and you want to make sure they have the finances to last for the long term. Their balance sheet and income statement may look great today, but if the loss of a single client could cripple them, think twice. And if their balance sheet and income statements already look bad, take a pass. Remember, operating a data center is expensive, and you’ll be the one who suffers if they cut maintenance or support staff to cover the loss of a critical tenant.
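Once you have the provider’s answers, the “RUN!” test is easy to apply. Here is a sketch with made-up client data; the industries and revenue shares are hypothetical, and the 50% threshold is the “more than half” rule of thumb from above.

```python
def concentration_check(top_clients, revenue_threshold=0.5):
    """top_clients: list of (industry, share_of_revenue) pairs, shares as fractions.
    Returns (combined_share, same_industry, red_flag)."""
    combined = sum(share for _, share in top_clients)
    industries = {industry for industry, _ in top_clients}
    same_industry = len(industries) == 1
    red_flag = same_industry and combined > revenue_threshold
    return combined, same_industry, red_flag

# Hypothetical answers from a provider about its top 5 clients.
top5 = [
    ("social media", 0.22),
    ("social media", 0.15),
    ("social media", 0.09),
    ("social media", 0.06),
    ("social media", 0.04),
]
combined, same_industry, red_flag = concentration_check(top5)
print(f"{combined:.0%} of revenue, same industry: {same_industry}, run: {red_flag}")
```

In this made-up case, five clients in one industry account for 56% of revenue, so the provider fails the test.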
Bonus question #11 – Does their physical security include sharks with frickin’ laser beams attached to their heads?
Why? Because it’d be cool if it did. Look under the raised floor sometime. Maybe there’s something interesting down there…
There are plenty of other questions you could ask your potential data center partner, but the above list will help you cut through the marketing spin. If you narrow your selection down with this list and you still have more than 1 acceptable provider, then you can worry about amenities like free sodas and proximity to your offices. If this list is helpful to you, please leave a comment below. If you think I’ve missed a critical question, please let me know.
Editor’s Note: The author, William Dougherty, is currently the VP of Information Technology for RagingWire Data Centers in Sacramento, CA. In his spare time, he trains sharks for a variety of IT jobs, including data center security. The above article represents his personal views and opinions and does not necessarily reflect the views of his employer.