October 29, 2007
I’ve spent the past few days trying to develop a simple mathematical model to predict the expected availability of complex systems. In IT, we are often asked to develop and commit to service level agreements (SLAs). If the system’s points of failure are not analyzed first, and the expected availability calculated from them, the SLA is flawed from the beginning. To complicate matters further, different people have different definitions of availability. For instance, does scheduled downtime for maintenance count against your system availability calculation?
Common Availability Definitions:
- Availability = MTBF/(MTTR+MTBF), where MTBF is the Mean Time Between Failures and MTTR is the Mean Time To Recover. This is the classic definition of availability and is often used by hardware manufacturers when they publish an availability metric for a given server.
- Availability = (Uptime + Scheduled Maintenance)/(Unscheduled Downtime + Uptime + Scheduled Maintenance). This is an IT-centric availability metric for cases where the business can tolerate scheduled downtime after hours, so scheduled maintenance is counted as if the system were available. This model works for some types of systems, such as a file server that isn’t needed at night, but it doesn’t work as well for websites, even though many web companies still use it for their SLAs.
- Availability = Uptime/(Uptime + Downtime). This metric best applies to systems that are needed 24×7 such as e-commerce sites.
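To make the difference between these definitions concrete, here is a minimal Python sketch of all three formulas. The function names and the sample numbers (a 30-day month with 8 hours of scheduled maintenance and 2 hours of unscheduled outage) are my own assumptions for illustration, not measurements from a real system.

```python
def availability_mtbf(mtbf, mttr):
    """Classic definition: MTBF / (MTTR + MTBF). Both arguments use the same time unit."""
    return mtbf / (mttr + mtbf)


def availability_it_centric(uptime, scheduled_maintenance, unscheduled_downtime):
    """Scheduled maintenance is counted as if the system were available."""
    total = unscheduled_downtime + uptime + scheduled_maintenance
    return (uptime + scheduled_maintenance) / total


def availability_24x7(uptime, downtime):
    """All downtime counts against availability, scheduled or not."""
    return uptime / (uptime + downtime)


# Hypothetical 30-day month, in hours (illustrative numbers only).
HOURS_IN_MONTH = 30 * 24          # 720
scheduled = 8.0                   # planned maintenance window
unscheduled = 2.0                 # unplanned outage
uptime = HOURS_IN_MONTH - scheduled - unscheduled

it_centric = availability_it_centric(uptime, scheduled, unscheduled)
strict = availability_24x7(uptime, scheduled + unscheduled)

print(f"IT-centric: {it_centric:.3%}")
print(f"24x7:       {strict:.3%}")
```

With these sample numbers the same month comes out at roughly 99.72% under the IT-centric definition but only about 98.61% under the 24×7 definition, which is why an SLA needs to state which formula it uses.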
Availability is most often expressed as a percentage. Sometimes, people will refer to “four nines” (99.99%) or “five nines” (99.999%). To simplify things, the following table shows the minutes of downtime allowed per year for a given availability level: