October 29, 2007
I’ve spent the past few days trying to develop a simple mathematical model to predict the expected availability of complex systems. In IT, we are often asked to develop and commit to service level agreements (SLAs). If the points of failure of the system are not analyzed, and then the system availability calculated, the SLA is flawed from the beginning. To complicate matters further, different people have different definitions of availability. For instance, does scheduled downtime for maintenance count against your system availability calculation?
Common Availability Definitions:
- Availability = MTBF/(MTTR+MTBF) (Mean Time Between Failure, Mean Time To Recover). This is a classic definition of availability and is often used by hardware manufacturers when they publish an availability metric for a given server.
- Availability = (Uptime + Scheduled Maintenance)/(Unscheduled Downtime + Uptime + Scheduled Maintenance). This is an IT centric availability metric where the business can support scheduled downtime after hours. This model works for some types of systems, such as a file server that isn’t needed at night, but it doesn’t work as well for websites, even though many web companies still use this for their SLAs.
- Availability = Uptime/(Uptime + Downtime). This metric best applies to systems that are needed 24×7 such as e-commerce sites.
Availability is most often expressed as a percentage. Sometimes, people will refer to “four nines” (99.99%) or “five nines” (99.999%). To simplify things, the following table shows the minutes of downtime allowed per year for a given availability level:
Based on the above table, you can see there is a big difference between an SLA specifying 99% availability (88 hours of downtime per year) and 99.9% availability (8.8 hours of downtime per year). But how can we be sure what the expected downtime of the system is? In the most simplistic form, the expected availability of a system equals the expected availability of each of the components of the system multiplied together. So if the system was composed of two servers, and each server had an expected availability of 99%, then the expected availability of the system would be 99% * 99% = 98.01%. Note: I have used the term “expected availaibilty.” We are calculating a future expectation of the system over an extended period of time, not a historical availability. For the rest of this article, I will drop the term expected for brevity, but it is always implied.
The simplistic model above is useful for illustrating the point that downtime is cumulative. In other words, if I expect each component to be down 88 hours/year, and a failure of either component is a failure of the system, then the system has an expected downtime of 174 hours. Why not 176 hours? Well, occasionally both components will be down at the same time.
Real world systems are never this simple. Typically, the system will be comprised of multiple components, some with redundancy, and each with different levels of component availability. Modeling these requires slightly more complicated formulas, but once you have the concepts down, then the actual calculations can be quickly performed in a spreadsheet. Before we move forward, we need some basic notations to simplify our formulas:
Availability Component 1 = Ac1
Availability Component 2 = Ac2
Availability Component 3 = Ac3
Availability Component n = Acn
Availability System = As
Now that that is out of the way, we are ready for our first formula. When a system is made up of n number of components that are each single points of failure then the system availability can be calculated as:
EQUATION #1: As = Ac1 * Ac2 * Ac3 * …Acn
Consider a 24×7 e-commerce site with lots of single points of failure. We might model the site as having the following eight components:
If any of these components fails, the website will crash. The expected availability of the site would be 85%*90%*99.9%*98%*85%*99%*99.99%*95% = 59.87%. Note that we are modeling each component as a whole, rather than looking at its parts. We could break down the web service into software (Apache), code (our web site), and hardware (motherboard, hard drives, etc.). For our purposes, complexity does not necessarily improve the model, so we will deal with the service as a whole. Also, for this discussion we will use the third definition of availability above. To our users, it does not matter if the site is down because of maintenance or a hard drive failed.
Assuming we want to keep our jobs, we need to figure out a way to increase this availability. Two obvious targets for improving the site stability would be the web service and the firewall. The question is, what effect would adding another web server have on the availability of this service. This leads us to our second equation. When a system is comprised of two redundant components, then the availability of the system can be calculated as:
EQUATION #2: As = Ac1 + ((1 – Ac1) * Ac2)
Using our example of a web server with an availability of 85%, then adding a second server would increase the availability to: 85% + (1-85%)*85% = 97.75%. The logic behind this is that when the 1st server is down (15% of the time), the second server is still up 85% of the time. This may or may not translate into actual real world availability. For instance, if the web server was down so often because we constantly need to take it offline to deploy new code, then adding a second server should translate to higher availability, because we could deploy code to one offline server, while the other server stays up. In this case, our real world availability increase might be greater than 12.75%. Conversely, if our service is down due to code bugs, then adding a second server could in some cases make the availability worse by exasperating the bug.
The point is that in general if you’ve accurately estimated the availability of the component, then the equation will work. Note also that the equation works even if the components have unequal availability estimates. Assume that the web server has an availability problem because the hardware is undersized. Now assume the second server we purchase has twice the capacity, and we determine that by itself the availability of the new server would be 90%, then our equation changes to: 85% + (1-85%)*90% = 98.5%.
So let’s add this back into our system calculation above. Assume that we add a second web server, and a second firewall, increasing the availability of each component system to 97.75%. Now, the availability of our system would be: 97.75%*90%*99.9%*98%*97.75%*99%*99.99%*95% =79.10%. Better, but still not great. It is hard to achieve any level of high availability when you have single points of failure. So let’s assume that we add a redundant component for all our servers and network equipment. Let’s also assume we add a second ISP for carrier diversity, but we are still within one physical data center. Our availability equation would now be: 97.75%*99%*99.9999%*99.96%*97.75%*99.99%*99.99%*99.75% = 94.3%. Getting better. Removing single points of failure improved our availability of the system from 60% (3506 hours of downtime/year) to 94.3% (500 hours of downtime/year).
Equation #2 above modeled adding a single redundant component. In some cases, we may add more than one redundant component. For example, we may have more than two web servers. In this case, we need to iterate through equation #2 multiple times to find the effect of the additional components, which brings us to our third equation. When trying to calculate the availability of a service with n number of redundant components, we calculate it as:
EQUATION #3: As = Ac(n-1) + ((1 – Ac(n-1)) * Acn)
In the case of our web service, adding a 3rd server would change the availability to: 97.75% + (1-97.75%)*85% = 99.6625%. Adding a fourth server would increase availability to: 99.6625% + (1-99.6625%)*85% = 99.949%. Note that there is a diminishing level of return. Adding a 2nd server increased availability by 12.75%. Adding a 3rd server only gained us 1.9125%. The fourth server bought us a paltry .2865%. And yet, even with 3 more servers than we need to service our load, we still have not yet achieved the elusive four nines of availability. Designing a highly available system requires us to make individual components highly available AND add redundancy of components. If the individual web server in our example had an availability of 90%, instead of 85%, then the availability of two servers would be 99% and three servers would be 99.99%.
Equations 2 & 3 have a simple flaw, in that they assume a single component can handle the load, and that the load is constant. What if under normal operations, one web server can handle the load, but under peak, we need three servers? Then our availability for three servers under normal load would be 99.775%, but under peak load, the availability would drop back to 85%. Under peak load, the failure of one server would cause a loss of the service, so we drop back to the availability of a single box. What if our peak load required 2 servers? In this case, the availability under peak would be 97.75%. If peak requires two servers, and we have three, then we can lose one server and still be operational, therefore our availability is the equivalent of having two servers. The important concept here is there is an inverse relationship between load and availability.
What should be obvious by now, is that achieving truly high levels of availability (99.9% – 99.999%) is very difficult and very expensive. One of the most expensive single points of failure to eliminate is the data center. In most cases, this effectively doubles the cost of your infrastructure, and the cost may be even greater than 2x, because you will often need to invest in technology to keep the servers in each data center in synch with each other.
Consider, however, the impact of adding a fully redundant data center. In our example above, the availability of our data center will redundant servers and ISPs was 94.3%. Adding a second data center with the necessary technology to make both centers work active-active (both data centers take traffic at the same time) would increase our availability to: 94.3% + (1-94.3%)*94.3% = 99.675%. Adding a second data center saved us 471 hours of downtime per year!
In this example, we assumed that each data center was an independent system, so a failure of a service in one data center would be a failure of the entire system in that data center. This is not always the case. For instance, if properly designed, a web server in one data center could connect to a database server in the other data center. In this case, the expected availability of the system would be higher that 99.675%. If you were able to design your site so that each service operated independent of the other services, then the availability in our example would increase from 99.675% to 99.888% (each service would have 3 redundant components, except the data center, which would have 1 redundant component).
These formulas are much easier to play with in Excel. Paste the following table into a spreadsheet, starting at cell A1:
|System Avail %||=b2*b3*b4*b5
Now that you have the basic concepts down, and the beginnings of a spreadsheet for calculating changes to our assumptions, you can focus on applying these theories to your unique situation. Start by breaking down your system, be it a web site, an accounting system, or a file server, into individual component services. For each service, determine the minimum number of units required for the system to work, and the expected availability of a unit.
Estimating availability can be a challenge. One method would be to look at historical data. If you don’t have access to good data, you can form an estimate based upon your standard operating parameters. For instance, if you release new code to your web server twice per month, and each release causes 2 hours of downtime, that would translate to 48 hours of downtime per year. If you expect to perform operating system maintenance once per quarter, at an estimated downtime of 2 hours per quarter, that would equal another 8 hours per year. If you also anticipate one hardware failure per year, and had a next business day warranty, that would translate to on average 41 hours of downtime per year (Friday outages get repaired on Monday. Saturday and Sunday outages get repaired on Tuesday). Adding these numbers up, we get: 48 + 8 + 41 = 98 hours of downtime per year, or an estimated 98.882% availability.
With a little work, you can estimate a realistic level of availability of your system. This is the cornerstone of creating realistic and attainable SLAs. These formulas can help IT negotiate SLAs with the business, and can help determine the comparative ROI of different solutions. For instance, let’s say you were trying to choose a web server solution, and you had two choices:
- Choice number 1 consists of 4 servers using cheap hardware with no internal redundancy. Each server costs $3,000. You estimate the availability of each server to be 75%.
- Choice number 2 consists of 2 servers using expensive hardware with redundant hard drives and power supplies. Each server costs $20,000. You estimate the availability of each server to be 99%.
You estimate the cost of downtime to be $500/hour, and you expect these servers to support your site load with a single server for the next 3 years, after which they will be replaced. Using the above numbers, Solution #1 has an expected availability of 99.6%, at a cost of $12,000. Solution #2 has an expected availability of 99.99% at a cost of $40,000. Solution #1 would experience 34 hours/year, or 102 hours over three years of downtime more than solution #2. Over three years, this extra downtime would cost $51,000. So by spending $28,000 upfront, you would get a three year ROI of 182%. Note that the model is only as good as your estimates. If the servers in solution #2 only had 95% availability, then their combined availability would be 99.75%, which would only provide 13 hours less downtime per year. In this case, you would only save $20,000 in downtime over three years for your $28,000 investment, so you would be better off with solution #1.
Designing and operating highly available systems is complicated work, but with a few simple formulas, it is possible to understand and predict their behavior at a macro level. This will allow you to make better decisions when choosing between multiple options, and give more realistic predictions when negotiating SLAs.