October 29, 2007

In Search of Five 9s – Calculating Availability of Complex Systems

I’ve spent the past few days trying to develop a simple mathematical model to predict the expected availability of complex systems. In IT, we are often asked to develop and commit to service level agreements (SLAs). If the system's points of failure are never analyzed and its availability never calculated from them, the SLA is flawed from the beginning. To complicate matters further, different people have different definitions of availability. For instance, does scheduled downtime for maintenance count against your system availability calculation?

Common Availability Definitions:

  1. Availability = MTBF/(MTTR+MTBF) (Mean Time Between Failure, Mean Time To Recover). This is a classic definition of availability and is often used by hardware manufacturers when they publish an availability metric for a given server.
  2. Availability = (Uptime + Scheduled Maintenance)/(Unscheduled Downtime + Uptime + Scheduled Maintenance). This is an IT-centric availability metric for businesses that can tolerate scheduled downtime after hours. This model works for some types of systems, such as a file server that isn’t needed at night, but it doesn’t work as well for websites, even though many web companies still use this for their SLAs.
  3. Availability = Uptime/(Uptime + Downtime). This metric best applies to systems that are needed 24×7 such as e-commerce sites.

Availability is most often expressed as a percentage. Sometimes, people will refer to “four nines” (99.99%) or “five nines” (99.999%). To simplify things, the following table shows the minutes of downtime allowed per year for a given availability level:

Availability    Minutes Downtime/Year    Hours Downtime/Year
95.000%         26,298                   438
98.000%         10,519                   175
98.500%          7,889                   131
99.000%          5,260                    88
99.500%          2,630                    44
99.900%            526                     8.8
99.990%             52.6                   0.88
99.999%              5.26                  0.088
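
If you want to reproduce or extend this table, the conversion is a one-liner. Below is a minimal sketch in Python (the function name is just illustrative), assuming the 365.25-day, 8,766-hour year used to compute the table above:

    # Convert an availability level into allowed downtime per year.
    # Assumes a 365.25-day year (8,766 hours), matching the table above.
    HOURS_PER_YEAR = 8766.0

    def downtime_per_year(availability):
        """availability is a fraction, e.g. 0.999 for 99.9%."""
        hours = (1.0 - availability) * HOURS_PER_YEAR
        return hours * 60.0, hours  # (minutes/year, hours/year)

    for a in (0.95, 0.98, 0.985, 0.99, 0.995, 0.999, 0.9999, 0.99999):
        minutes, hours = downtime_per_year(a)
        print(f"{a:.3%}  {minutes:10,.2f} min/yr  {hours:8,.3f} hr/yr")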

Based on the above table, you can see there is a big difference between an SLA specifying 99% availability (88 hours of downtime per year) and one specifying 99.9% availability (8.8 hours of downtime per year). But how can we be sure what the expected downtime of the system is? In the most simplistic form, the expected availability of a system equals the expected availability of each of its components multiplied together. So if the system were composed of two servers, and each server had an expected availability of 99%, then the expected availability of the system would be 99% * 99% = 98.01%. Note that I have used the term “expected availability.” We are calculating a future expectation of the system over an extended period of time, not a historical availability. For the rest of this article, I will drop the word “expected” for brevity, but it is always implied.

The simplistic model above is useful for illustrating the point that downtime is cumulative. In other words, if I expect each component to be down 88 hours/year, and a failure of either component is a failure of the system, then the system has an expected downtime of 174 hours. Why not 176 hours? Well, occasionally both components will be down at the same time.
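
A quick numeric check of that point, again assuming an 8,766-hour year: the combined downtime comes to about 174 hours because the roughly 0.9 hours per year when both servers happen to be down at once only counts once against the system.

    # Two components in series, each 99% available.
    HOURS_PER_YEAR = 8766.0
    a1 = a2 = 0.99

    system = a1 * a2                                    # 0.9801 -- both must be up
    downtime = (1.0 - system) * HOURS_PER_YEAR          # ~174.4 hours/year
    overlap = (1.0 - a1) * (1.0 - a2) * HOURS_PER_YEAR  # ~0.9 hours both down at once
    print(round(downtime, 1), round(overlap, 1))        # 174.4 0.9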

Real world systems are never this simple. Typically, the system will be composed of multiple components, some with redundancy, and each with different levels of component availability. Modeling these requires slightly more complicated formulas, but once you have the concepts down, the actual calculations can be quickly performed in a spreadsheet. Before we move forward, we need some basic notation to simplify our formulas:

Basic Nomenclature
Availability Component 1 = Ac1
Availability Component 2 = Ac2
Availability Component 3 = Ac3
Availability Component n = Acn
Availability System = As

Now that that is out of the way, we are ready for our first formula. When a system is made up of n components that are each single points of failure, the system availability can be calculated as:

EQUATION #1: As = Ac1 * Ac2 * Ac3 * …Acn

Consider a 24×7 e-commerce site with lots of single points of failure. We might model the site as having the following eight components:

Component      Availability
Web            85%
Application    90%
Database       99.9%
DNS            98%
Firewall       85%
Switch         99%
Data Center    99.99%
ISP            95%

If any of these components fails, the website will crash. The expected availability of the site would be 85%*90%*99.9%*98%*85%*99%*99.99%*95% = 59.87%. Note that we are modeling each component as a whole, rather than looking at its parts. We could break down the web service into software (Apache), code (our web site), and hardware (motherboard, hard drives, etc.). For our purposes, that extra complexity does not necessarily improve the model, so we will deal with the service as a whole. Also, for this discussion we will use the third definition of availability above: to our users, it does not matter whether the site is down because of maintenance or because a hard drive failed.
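
Equation #1 is trivial to put into code. Here is a minimal Python sketch using the example figures from the table above (the function and variable names are just illustrative):

    # EQUATION #1: a chain of single points of failure -- multiply everything.
    components = {
        "Web": 0.85,
        "Application": 0.90,
        "Database": 0.999,
        "DNS": 0.98,
        "Firewall": 0.85,
        "Switch": 0.99,
        "Data Center": 0.9999,
        "ISP": 0.95,
    }

    def serial_availability(avails):
        """Every item is a single point of failure: As = Ac1 * Ac2 * ... * Acn."""
        system = 1.0
        for a in avails:
            system *= a
        return system

    print(f"{serial_availability(components.values()):.2%}")  # ~59.87%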

Assuming we want to keep our jobs, we need to figure out a way to increase this availability. Two obvious targets for improving the site's stability are the web service and the firewall. The question is, what effect would adding another web server have on the availability of this service? This leads us to our second equation. When a system is composed of two redundant components, the availability of the system can be calculated as:

EQUATION #2: As = Ac1 + ((1 – Ac1) * Ac2)

Using our example of a web server with an availability of 85%, adding a second server would increase the availability to: 85% + (1-85%)*85% = 97.75%. The logic behind this is that when the 1st server is down (15% of the time), the second server is still up 85% of the time. This may or may not translate into actual real world availability. For instance, if the web server was down so often because we constantly need to take it offline to deploy new code, then adding a second server should translate to higher availability, because we could deploy code to one offline server while the other server stays up. In this case, our real world availability increase might be greater than 12.75%. Conversely, if our service is down due to code bugs, then adding a second server could in some cases make the availability worse by exacerbating the bug.

The point is that in general if you’ve accurately estimated the availability of the component, then the equation will work. Note also that the equation works even if the components have unequal availability estimates. Assume that the web server has an availability problem because the hardware is undersized. Now assume the second server we purchase has twice the capacity, and we determine that by itself the availability of the new server would be 90%, then our equation changes to: 85% + (1-85%)*90% = 98.5%.
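
In code, Equation #2 is a one-line function. A sketch using the 85%/85% and 85%/90% pairs from the examples above:

    # EQUATION #2: one redundant component; the two availabilities need not be equal.
    def redundant_pair(a1, a2):
        """Up if unit 1 is up, or if unit 1 is down (1 - a1) and unit 2 is up."""
        return a1 + (1.0 - a1) * a2

    print(f"{redundant_pair(0.85, 0.85):.4%}")  # 97.7500%
    print(f"{redundant_pair(0.85, 0.90):.4%}")  # 98.5000%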

So let’s add this back into our system calculation above. Assume that we add a second web server and a second firewall, increasing the availability of each of those components to 97.75%. Now, the availability of our system would be: 97.75%*90%*99.9%*98%*97.75%*99%*99.99%*95% = 79.17%. Better, but still not great. It is hard to achieve any level of high availability when you have single points of failure. So let’s assume that we add a redundant component for all our servers and network equipment. Let’s also assume we add a second ISP for carrier diversity, but we are still within one physical data center. Our availability equation would now be: 97.75%*99%*99.9999%*99.96%*97.75%*99.99%*99.99%*99.75% = 94.3%. Getting better. Removing single points of failure improved the availability of the system from 60% (3,506 hours of downtime/year) to 94.3% (500 hours of downtime/year).

Equation #2 above modeled adding a single redundant component. In some cases, we may add more than one redundant component. For example, we may have more than two web servers. In this case, we need to iterate through Equation #2 multiple times to find the effect of the additional components, which brings us to our third equation. When trying to calculate the availability of a service with n redundant components, we calculate it as:

EQUATION #3: As = Ac(n-1) + ((1 – Ac(n-1)) * Acn)

In the case of our web service, adding a 3rd server would change the availability to: 97.75% + (1-97.75%)*85% = 99.6625%. Adding a fourth server would increase availability to: 99.6625% + (1-99.6625%)*85% = 99.949%. Note that there is a diminishing level of return. Adding a 2nd server increased availability by 12.75%. Adding a 3rd server only gained us 1.9125%. The fourth server bought us a paltry 0.2869%. And yet, even with 3 more servers than we need to service our load, we still have not achieved the elusive four nines of availability. Designing a highly available system requires us to make individual components highly available AND add redundancy of components. If the individual web server in our example had an availability of 90%, instead of 85%, then the availability of two servers would be 99% and three servers would be 99.9%.
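
Equation #3 is easy to iterate in a loop, and for identical units it collapses to 1 - (1 - a)^n, the chance that all n units are down at once. A sketch reproducing the web server numbers above:

    # EQUATION #3: fold Equation #2 over n identical units.
    def redundant_n(a, n):
        system = a
        for _ in range(n - 1):
            system = system + (1.0 - system) * a
        return system  # equivalent to 1 - (1 - a) ** n for identical units

    for n in range(1, 5):
        print(n, f"{redundant_n(0.85, n):.4%}")
    # 1 85.0000%   2 97.7500%   3 99.6625%   4 99.9494%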

Equations 2 & 3 have a simple flaw, in that they assume a single component can handle the load, and that the load is constant. What if under normal operations, one web server can handle the load, but under peak, we need three servers? Then our availability for three servers under normal load would be 99.6625%, but under peak load, the availability would drop back to 85%. Under peak load, the failure of one server would cause a loss of the service, so we drop back to the availability of a single box. What if our peak load required 2 servers? In this case, the availability under peak would be 97.75%. If peak requires two servers, and we have three, then we can lose one server and still be operational, therefore our availability is the equivalent of having two servers. The important concept here is that there is an inverse relationship between load and availability.
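
One way to make the load question concrete is to ask how many of the n units must be up to carry the load. The sketch below uses a binomial sum for "at least k of n identical, independent units up"; note that this is a refinement of the rough figures in the paragraph above (the comment thread below works through the same numbers), not one of the article's three equations.

    # Availability when at least k of n identical units must be up to carry the load.
    from math import comb

    def k_of_n_availability(a, n, k):
        """Sum the binomial probabilities of having k, k+1, ..., n units up."""
        return sum(comb(n, up) * a**up * (1.0 - a)**(n - up) for up in range(k, n + 1))

    print(f"{k_of_n_availability(0.85, 3, 1):.4%}")  # any 1 of 3 suffices: 99.6625%
    print(f"{k_of_n_availability(0.85, 3, 2):.4%}")  # peak needs 2 of 3:   93.9250%
    print(f"{k_of_n_availability(0.85, 3, 3):.4%}")  # peak needs all 3:    61.4125%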

What should be obvious by now is that achieving truly high levels of availability (99.9% – 99.999%) is very difficult and very expensive. One of the most expensive single points of failure to eliminate is the data center. In most cases, this effectively doubles the cost of your infrastructure, and the cost may be even greater than 2x, because you will often need to invest in technology to keep the servers in each data center in sync with each other.

Consider, however, the impact of adding a fully redundant data center. In our example above, the availability of our data center with redundant servers and ISPs was 94.3%. Adding a second data center with the necessary technology to make both centers work active-active (both data centers take traffic at the same time) would increase our availability to: 94.3% + (1-94.3%)*94.3% = 99.675%. Adding a second data center saved us 471 hours of downtime per year!

In this example, we assumed that each data center was an independent system, so a failure of a service in one data center would be a failure of the entire system in that data center. This is not always the case. For instance, if properly designed, a web server in one data center could connect to a database server in the other data center. In this case, the expected availability of the system would be higher than 99.675%. If you were able to design your site so that each service operated independently of the other services, then the availability in our example would increase from 99.675% to 99.888% (each service would have 3 redundant components, except the data center, which would have 1 redundant component).
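
The arithmetic for that independent-services case is just Equation #3 applied service by service. A sketch, assuming four independent units of every service and two data centers (the 99.888% figure above):

    # Two data centers with services designed to operate independently across them:
    # four units of each service, two data centers, all assumed independent.
    def redundant_n(a, n):
        return 1.0 - (1.0 - a) ** n  # at least one of n identical units up

    services = {"Web": 0.85, "Application": 0.90, "Database": 0.999, "DNS": 0.98,
                "Firewall": 0.85, "Switch": 0.99, "ISP": 0.95}

    system = redundant_n(0.9999, 2)  # the two data centers themselves
    for a in services.values():
        system *= redundant_n(a, 4)  # four units of each service
    print(f"{system:.3%}")           # ~99.888%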

These formulas are much easier to play with in Excel. Paste the following table into a spreadsheet, starting at cell A1:

Avail % | 1 Component | 2 Components | 3 Components | 4 Components
Web | 85% | =B2+((1-B2)*$B2) | =C2+((1-C2)*$B2) | =D2+((1-D2)*$B2)
Application | 90% | =B3+((1-B3)*$B3) | =C3+((1-C3)*$B3) | =D3+((1-D3)*$B3)
Database | 99.9% | =B4+((1-B4)*$B4) | =C4+((1-C4)*$B4) | =D4+((1-D4)*$B4)
DNS | 98% | =B5+((1-B5)*$B5) | =C5+((1-C5)*$B5) | =D5+((1-D5)*$B5)
Firewall | 85% | =B6+((1-B6)*$B6) | =C6+((1-C6)*$B6) | =D6+((1-D6)*$B6)
Switch | 99% | =B7+((1-B7)*$B7) | =C7+((1-C7)*$B7) | =D7+((1-D7)*$B7)
Data Center | 99.99% | | =B8+((1-B8)*$B8) |
ISP | 95% | =B9+((1-B9)*$B9) | =C9+((1-C9)*$B9) | =D9+((1-D9)*$B9)
System Avail % | =B2*B3*B4*B5*B6*B7*B8*B9 | =C2*C3*C4*C5*C6*C7*B8*C9 | =D2*D3*D4*D5*D6*D7*D8*D9 | =E2*E3*E4*E5*E6*E7*D8*E9

(Two of the Data Center cells are intentionally left blank; the System Avail % formulas reference B8 or D8 for the data center because the first two scenarios assume one data center and the last two assume two.)

Now that you have the basic concepts down, and the beginnings of a spreadsheet for calculating changes to our assumptions, you can focus on applying these theories to your unique situation. Start by breaking down your system, be it a web site, an accounting system, or a file server, into individual component services. For each service, determine the minimum number of units required for the system to work, and the expected availability of a unit.

Estimating availability can be a challenge. One method would be to look at historical data. If you don’t have access to good data, you can form an estimate based upon your standard operating parameters. For instance, if you release new code to your web server twice per month, and each release causes 2 hours of downtime, that would translate to 48 hours of downtime per year. If you expect to perform operating system maintenance once per quarter, at an estimated downtime of 2 hours per quarter, that would equal another 8 hours per year. If you also anticipate one hardware failure per year, and had a next business day warranty, that would translate to on average 41 hours of downtime per year (Friday outages get repaired on Monday. Saturday and Sunday outages get repaired on Tuesday). Adding these numbers up, we get: 48 + 8 + 41 = 97 hours of downtime per year, or an estimated 98.89% availability.
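
That back-of-the-envelope estimate is easy to keep in a few lines of code. A sketch using the illustrative figures above and the 8,766-hour year:

    # Roll up planned and unplanned downtime estimates into an availability figure.
    HOURS_PER_YEAR = 8766.0

    code_releases   = 2 * 12 * 2.0  # two releases per month, ~2 hours each = 48 h/yr
    os_maintenance  = 4 * 2.0       # quarterly OS maintenance, ~2 h each   =  8 h/yr
    hardware_repair = 41.0          # one failure/yr on a next-business-day warranty

    downtime = code_releases + os_maintenance + hardware_repair  # 97 hours
    availability = (HOURS_PER_YEAR - downtime) / HOURS_PER_YEAR
    print(f"{downtime:.0f} h down/yr -> {availability:.3%}")     # ~98.893%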

With a little work, you can estimate a realistic level of availability of your system. This is the cornerstone of creating realistic and attainable SLAs. These formulas can help IT negotiate SLAs with the business, and can help determine the comparative ROI of different solutions. For instance, let’s say you were trying to choose a web server solution, and you had two choices:

  • Choice number 1 consists of 4 servers using cheap hardware with no internal redundancy. Each server costs $3,000. You estimate the availability of each server to be 75%.
  • Choice number 2 consists of 2 servers using expensive hardware with redundant hard drives and power supplies. Each server costs $20,000. You estimate the availability of each server to be 99%.

You estimate the cost of downtime to be $500/hour, and you expect these servers to support your site load with a single server for the next 3 years, after which they will be replaced. Using the above numbers, Solution #1 has an expected availability of 99.6%, at a cost of $12,000. Solution #2 has an expected availability of 99.99%, at a cost of $40,000. Solution #1 would experience roughly 34 more hours of downtime per year, or 102 more hours over three years, than Solution #2. Over three years, this extra downtime would cost $51,000. So by spending $28,000 more upfront, you would get a three year ROI of 182%. Note that the model is only as good as your estimates. If the servers in Solution #2 only had 95% availability, then their combined availability would be 99.75%, which would only provide 13 fewer hours of downtime per year. In this case, you would only save about $20,000 in downtime over three years for your $28,000 investment, so you would be better off with Solution #1.
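
Here is the same comparison in code, a sketch in which every figure comes from the two bullets and the paragraph above (the exact savings differ slightly from the rounded per-year numbers in the text):

    # Compare the two web server solutions on hardware cost plus expected downtime.
    HOURS_PER_YEAR = 8766.0
    YEARS = 3
    COST_PER_DOWN_HOUR = 500.0

    def redundant_n(a, n):
        return 1.0 - (1.0 - a) ** n

    cheap     = {"avail": redundant_n(0.75, 4), "cost": 4 * 3_000}   # ~99.61%, $12,000
    expensive = {"avail": redundant_n(0.99, 2), "cost": 2 * 20_000}  # 99.99%, $40,000

    extra_hours = (expensive["avail"] - cheap["avail"]) * HOURS_PER_YEAR * YEARS
    downtime_saved = extra_hours * COST_PER_DOWN_HOUR
    extra_spend = expensive["cost"] - cheap["cost"]
    print(f"~{extra_hours:.0f} fewer hours of downtime over {YEARS} years, "
          f"worth ${downtime_saved:,.0f}, for ${extra_spend:,} of extra hardware")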

Designing and operating highly available systems is complicated work, but with a few simple formulas, it is possible to understand and predict their behavior at a macro level. This will allow you to make better decisions when choosing between multiple options, and give more realistic predictions when negotiating SLAs.


Thanks for stopping by.
If you found this article useful, please leave a tip.

20 Comments

  1. Bill Highleyman said,

    October 31, 2007 @ 1:24 pm

    I really enjoyed your article on five 9s. High availability is a field that I specialize in. You might want to check out the Availability Digest, a free monthly digest of high availability issues, at http://www.availabilitydigest.com. Go to the Article Archives and look under The Geek Corner. There you will find a series of papers on calculating availability using effectively the same equations you use, though in a different form.

    - Bill Highleyman

  2. bill said,

    October 31, 2007 @ 3:33 pm

    Mr. Highleyman,

    Great tip! For anyone who liked my article, but wants to see the math expressed more eloquently, go immediately to this site and read: http://www.availabilitydigest.com./public_articles/0101/calculating_availability.pdf

    The basic calculations are the same, but his equations are much better than mine.

    -Bill

  3. Chris Loosley said,

    October 31, 2007 @ 5:42 pm

    Bill,
    Nice analysis!

    I responded to your comment on my blog http://www.webperformancematters.com/performance-forum/post/296647 listing some of my own posts on this topic. My analysis was at a much higher level than yours, more like an introduction to the concepts you are working with in your formulas.

    –Chris

  4. Hisham Ghanem said,

    June 15, 2008 @ 7:13 am

    Hi,
    Great article, and great formulas posted on the site:
    http://www.availabilitydigest.com./public_articles/
    article “Calculating Availability – Redundant Systems”.
    I noticed though that the “Multiple Nodes, Multiple Spares” formula has an issue with the (-1)!, when # of nodes = # of spares.

    Did I miss a point when reading the article?

    Thanks.

  5. Bill Highleyman said,

    February 23, 2009 @ 5:12 pm

    Hisham -

    I’m sorry that I am more than a half-year late in responding to your question. I wasn’t aware of this great blog until my researcher found it.

    Your confusion is in the definition of spares. If we have n nodes and s spares, it takes the failure of (s+1) nodes to take the system down. The equation in question is the number of ways that you can select (s+1) systems out of n systems. If s = n, selecting (s+1) = (n+1) systems is meaningless, so the equation does not apply. In all cases, s < n, or else the system will fail if all spares fail, which is not the definition of spares.

    If you want to continue this discussion, please feel free to contact me at editor@availabilitydigest.com. Thanks for your interest in the Digest. And thank you, Bill, for the plug.

    Bill Highleyman
    Managing Editor
    Availability Digest

  6. Omar Jaber said,

    May 20, 2010 @ 11:54 am

    I could not have done a better job than Bill even if I tried to write on Sunday 10 am. This is a great article. I enjoyed it:)

    Thanks
    Omar

  7. Michael P. said,

    September 22, 2010 @ 7:43 am

    Great article. A complex topic but the logical approach works very nicely and makes it very readable. One question – how did you determine / calculate the availability of each individual component (e.g. DB=99.9%)?

  8. bill said,

    September 23, 2010 @ 6:14 am

    Michael,

    For individual components you can measure their history, i.e. how many times did this server go down last year? If they are new, you make educated guesses based on experience, manufacturer specs, and overall design.

    From the article: “Estimating availability can be a challenge. One method would be to look at historical data. If you don’t have access to good data, you can form an estimate based upon your standard operating parameters. For instance, if you release new code to your web server twice per month, and each release causes 2 hours of downtime, that would translate to 48 hours of downtime per year. If you expect to perform operating system maintenance once per quarter, at an estimated downtime of 2 hours per quarter, that would equal another 8 hours per year. If you also anticipate one hardware failure per year, and had a next business day warranty, that would translate to on average 41 hours of downtime per year (Friday outages get repaired on Monday. Saturday and Sunday outages get repaired on Tuesday). Adding these numbers up, we get: 48 + 8 + 41 = 97 hours of downtime per year, or an estimated 98.89% availability.”

    -Bill

  9. performance testing said,

    November 3, 2010 @ 10:26 pm

    Hi,
    Nice posting. Keep it up nice posting like this.

    Thanks & Regards,
    Katerinak.

  10. Neeraj Kumar said,

    August 16, 2011 @ 9:25 pm

    Hello Bill,

    Excellent article. It really helped me to compute an SLA for a new business opportunity, instantly.

    Thanks!
    - Neeraj

  11. mohamed eltahan said,

    October 26, 2011 @ 11:26 am

    Of course excellent work, but as a point of debate…
    Availability calculations in deterministic systems may give a logical prediction, while the same availability work might not be logical in non-deterministic systems.
    For example, a private communication system deployed in a real-time SCADA system makes engineering sense under such availability calculations…
    Utilizing public communication systems (MPLS, 3G, etc.), on the other hand, would imply an undefined risk even if the operator signs a contract or agreement to guarantee service continuity. In that case it will not be so easy to guarantee the expected availability figures…
    I think availability is not that magical an engineering term to validate designs…
    I need your comments and suggestions on real availability calculation in a typical SCADA system implementation.
    Kind regards

    mohamed eltahan

  12. bill said,

    October 26, 2011 @ 1:28 pm

    Mohamed,

    I’m not sure I understand your question, although I’d really like to. I’d consider all IT systems non-deterministic, but there is enough determinism in the constituent components that, when designed properly, you can have a reasonable chance at accurate prediction. You used the example of a private communication system deployed with SCADA. Under that scenario, you would estimate an expected availability (not necessarily the SLA contracted for) to derive its impact on the total availability of the system. If its expected availability is not sufficient for the desired end result, you would then need to add redundancy, possibly with a second independent communication system.

    So, let’s posit that we have a multi-site SCADA system, which uses MPLS to communicate under nominal operations. We have an SLA of 99.999% from the carrier, but we only believe it is likely to be 99% in the real world. We could then choose an alternate, such as a cellular system, that would take over if MPLS is down. It would be important to make sure the cellular system doesn’t rely on the same MPLS network, i.e. it’s with another carrier and uses different backhaul fiber. We would then estimate its availability also at 99%. This means that there is a small likelihood that both systems will be down at the same time. This is the same calculation as Equation #2 above and would give us 99.99% expected availability. If this isn’t enough, we might choose a 3rd option. It depends on the cost of downtime versus the cost to mitigate the expected risk. I hope this answers your question. If not, please restate it. Thanks for commenting!

  13. Jupp Choi said,

    January 3, 2012 @ 7:29 pm

    Bill,

    Excellent read for anyone embarking on system availability calculation.
    Thanks for sharing.

  14. Raphael Alvarez Teixeira said,

    January 25, 2012 @ 7:37 am

    Hi my friend! Good afternoon!
    How are you ?

    Sorry, but do you have any spreadsheet to calculate SLA? For example: I opened a ticket yesterday at 16:59 and the ticket was closed today at 10:00, and my SLA only counts working hours (08:00 to 17:00). By the way, if you calculate this, you find 3 hours and 1 minute. All right?! So can you help me with this please? If you don’t understand my problem, contact me please :D

    Have a nice day! See you later!!!

    Best Regards,

    Raphael Teixeira

  15. bill said,

    January 25, 2012 @ 12:24 pm

    Raphael,

    I don’t think I understand your question. It depends on your SLA. If your SLA is a next business day response and you closed the ticket before 17:00 the next day, you were within your SLA so that is 100%. Are you asking if I have a spreadsheet that tells you for any given day, what the next business day is?

  16. Stuart said,

    April 12, 2012 @ 5:39 am

    Thanks Bill, a most useful and thought-provoking article. It seems to me that any model such as this still has the potential for SPOFs screwing the whole thing up, and would need to be strongly caveated.
    Also, in consultancy, when making recommendations, customers always expect 97%+ availability with minimal cost; having a table such as the one you have above will make it easier for me to convince them to either increase their hardware budget or expect less in terms of availability.

    Many thanks for sharing with us.

  17. Ming said,

    July 19, 2012 @ 8:28 pm

    Nice article.

    To make sure I understand this correctly, here are some questions regarding the paragraph on load & availability:

    1. “… but under peak, we need three servers? … but under peak load, the availability would drop back to 85%. …”. Shouldn’t the availability be 85% * 85% * 85% in this case?
    2. “… What if our peak load required 2 servers? In this case, the availability under peak would be 97.75%… “. In this case, we need at least two of the three servers to be up. In my calculation, it is 85% * 97.75% + ( 1 – 85% ) * 85% * 85% = 93.93%; when the first server is up, we just need at least one server up from the other two; when the first server is down, we need both of the other two up.

  18. bill said,

    July 20, 2012 @ 12:08 am

    Ming,

    You might be right on both cases, but I need to spend some time thinking about it.

  19. bill said,

    July 20, 2012 @ 12:22 am

    Ming,

    I think you are correct on both points. The equation is still correct:
    As = Ac(n-1) + ((1 – Ac(n-1)) * Acn)

    What you are doing is producing a more accurate calculation for Ac(n-1) during peak periods. My paragraph was in fact too simplistic. You could get more refined than that if you have good data for your traffic patterns and loads. You could calculate it based on 30-minute or 1-hour time slices. It depends on how sophisticated you want your model to be. Thanks for pointing that out!

  20. Chris Baechle said,

    March 28, 2013 @ 5:55 am

    This was a good article because it explained the equations. I think a lot of the comments here just want a plug ‘n play formula. Without understanding the concepts from which these were derived, you can’t understand its limitations. All these models assume uniform distribution. This is fine for purposes of the article and demonstration, but a more realistic approach would be to use normal distribution for down times.

    You also can’t understand how to apply this to new concepts. For example, this same article could be applied to a RAID 1 drive configuration, but not directly to RAID 5. However, only a slight modification would be needed to do so if you understood what’s going on.

    The most realistic model would need to include Bayesian probability incorporating each type of fault that could cause down time. A code bug has a high probability of taking all servers to a down state, but a motherboard damaged due to ESD would most likely only affect a single server. All the things everyone is mentioning can be accounted for by other much more complex models.

    Take this for what it is. It’s a good engineering approximation to demonstrate that adding servers to create redundant systems eventually has diminishing returns.

