<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>edgeblog &#187; Data Center Design</title>
	<atom:link href="http://www.edgeblog.net/category/data-center-design/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.edgeblog.net</link>
	<description>Notes from the edge</description>
	<lastBuildDate>Mon, 25 Jan 2010 03:51:37 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=abc</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>IronScale &#8211; The Future of Web Hosting?</title>
		<link>http://www.edgeblog.net/2008/ironscale-the-future-of-web-hosting/</link>
		<comments>http://www.edgeblog.net/2008/ironscale-the-future-of-web-hosting/#comments</comments>
		<pubDate>Tue, 09 Sep 2008 14:47:07 +0000</pubDate>
		<dc:creator>bill</dc:creator>
				<category><![CDATA[Data Center Design]]></category>
		<category><![CDATA[General]]></category>
		<category><![CDATA[Popular]]></category>
		<category><![CDATA[colo]]></category>
		<category><![CDATA[colocation]]></category>
		<category><![CDATA[data center]]></category>
		<category><![CDATA[hosting]]></category>
		<category><![CDATA[ironscale]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[managed server]]></category>
		<category><![CDATA[ragingwire]]></category>
		<category><![CDATA[Security]]></category>
		<category><![CDATA[unix]]></category>
		<category><![CDATA[windows]]></category>

		<guid isPermaLink="false">http://www.edgeblog.net/?p=114</guid>
		<description><![CDATA[
<p><a title="IronScale - The Future of Web Hosting" href="http://www.ironscale.com" target="_self"><img class="alignleft" style="float: left;" src="http://www.ironscale.com/images/headers/logo.gif" alt="IronScale" width="230" height="69" /></a>It was the shot heard round the hosting world. Last month, my good friends at <a title="RagingWire" href="http://www.ragingwire.com" target="_blank">RagingWire</a> announced their latest offering, <a title="IronScale - The Future of Web Hosting" href="http://www.ironscale.com" target="_blank">IronScale</a>, which has the potential to fundamentally change the hosting business. At least, that&#8217;s what the <a href="http://www.reuters.com/article/pressRelease/idUS136867+13-Aug-2008+MW20080813" target="_blank">press release</a> and the voice mail I received from Doug Adams, their head of sales, claimed. Now, I&#8217;ve been doing business with RagingWire for almost 8 years, and I often tell people they have the best designed/built/run data center in Northern California, so I know they offer great services. I&#8217;m one of their only three-peat customers (I&#8217;ve put three different companies into their facility) and I&#8217;ve never been disappointed. Still, I tend to discount terms like &#8220;game-changing&#8221; as marketing fluff. I&#8217;m a &#8220;show-me&#8221; kind of guy. <a title="Check out the Flash Demo" href="http://www.ironscale.com/videos/demo" target="_self">So they did</a>.</p>
<p>Today I had the pleasure of an on-site demonstration and walk-through of the IronScale service. I am impressed. On the surface, it is a typical managed server hosting offering. You rent one or more dedicated servers in their data center and they provide the operating system, network, internet bandwidth, security, etc. Pretty common stuff, and pretty boring. Why did I drive to Sacramento on one of the hottest days of the year for this (110F)? Well, you have to look beneath the surface, which I did, to see what they are really offering. And what I saw was awesome.</p>
<p>IronScale makes some <a href="http://www.ironscale.com/Compare" target="_blank">bold claims</a> for their services. Among them:</p>
<ul>
<li>Dedicated physical hardware (not VMware or Xen virtualization)</li>
<li>Less than 5 minute server deployment</li>
<li>Instant reconfiguration of servers</li>
<li>Instant backups</li>
<li>No cost to reload the operating system</li>
<li>Everything easily manageable through a web portal, including provisioning, console access, network configuration, firewall rules, backups, and bandwidth provisioning</li>
<li>On-demand RAID 50 storage</li>
<li>Enterprise class network security</li>
</ul>
<p>The first two bullet points are what instantly set IronScale apart from the competition. Most hosting providers focus either on rapid deployment or dedicated hardware, but not both. If you want dedicated hardware, then provisioning takes 24-48 hours, because it takes time for an engineer to Ghost/JumpStart a server and put it on the network. If you want rapid deployment, you use virtualization technologies like VMware. RagingWire figured out a Door #3.</p>
<p>Some of what I learned today is under NDA, and the product is in Beta and patent pending, so I need to tread carefully, but here is my best explanation of what they are doing. The core of the product offering is commodity servers connected to an enterprise-class storage area network. Rather than laying an OS down on the physical server, they SAN-boot it. Provisioning a new server is as simple as taking a SAN snapshot of an existing server image and assigning it to a new server. Each server is connected to a high-end Cisco switch with an integrated firewall that provides security and isolation from every other server at the port level.</p>
<p>Simple concepts. The special sauce here is not the mix of technologies, but the software they&#8217;ve written to control it all, and do so simply and securely. Want to provision a new server? Grab one out of the pool assigned to you, pick the OS baseline image you want on it, name it, give it an IP address from the available pool, and boot it. Click the next tab and you can write firewall rules in an easy to understand format. The IronScale software translates into the appropriate Cisco commands and applies the configs to the firewall. Need more storage? Pick the server, assign more space, and the software updates the SAN to allocate space to that image. The running OS instantly recognizes it as new available space (no reboot required).</p>
<p>Need a bigger server? Shut down the running server. Assign the image to a bigger box, and reboot. Want to backup your servers before applying the latest updates from Microsoft on Patch Tuesday (you know, just in case&#8230;)? Take an instant SAN snapshot. If something goes wrong, revert back. No calls to customer service. No waiting for days or weeks. It is fast, and it is easy.</p>
<p>Basically, this IS virtualization but not like what you&#8217;ve seen before. Instead of VMware, Xen, Cloud, Grid, or pick-your-buzzword virtualization, this is storage-based virtualization combined with some really kick butt management software. The servers are dedicated but the storage is not. The beauty of this is, among other things, it isn&#8217;t OS restricted. Although the initial offering is based on Windows and RedHat, there is no reason they can&#8217;t support Solaris (x86), BSD, or other flavors of Linux in the future. They don&#8217;t need to hook into the OS to perform their services, so any OS that will run natively on the hardware will work. And because the servers they are running are very generic, driver support should not be much of an issue. All you need is the 1st OS image, and the SAN snapshots do the rest.</p>
<p>This service is clearly designed for the mid market. I would say their sweet spot is customers wanting 5 &#8211; 50 servers, but the service can scale up to hundreds, if not thousands, of servers. For bigger companies, it would make a good platform for proof-of-concepts, development environments, or any situation where you need to be able to scale capacity up or down rapidly and make frequent changes. What would push this into uber-cool status is if IronScale works out billing based on time-slices. So, if I want a pool of 50 servers to generate load for testing purposes, but I only need them 1 hour per night, or I need a compute farm for end-of-month processing, I could pay just for the time I use the servers and shut them down the rest of the time. Other companies are offering capacity on demand, but most of those are grid-based, and you need to modify your applications to take advantage. IronScale could do capacity on demand with zero customer modifications.</p>
<p>This is a Beta product, because they are still adding features and I&#8217;m sure fixing bugs in the management software, but the core offering is fully baked and ready for prime time. I expect in the future, they will layer on a plethora of additional services, such as advanced security scanning, server virtualization to increase utilization of the hardware, and database server clustering. I sincerely hope they also package their management software and start selling it to enterprises in the future. Based on what I saw today, I would buy it with some minor tweaks.</p>
<p>I am planning to beta test this offering in the next few weeks, so hopefully I&#8217;ll have more information and some screenshots soon. In the meantime, I encourage you to check out the online demo and give IronScale a serious look. <a href="http://www.ironscale.com/videos/demo" target="_blank">http://www.ironscale.com/videos/demo</a></p>
]]></description>
		<wfw:commentRss>http://www.edgeblog.net/2008/ironscale-the-future-of-web-hosting/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>In Search of Five 9s &#8211; Calculating Availability of Complex Systems</title>
		<link>http://www.edgeblog.net/2007/in-search-of-five-9s/</link>
		<comments>http://www.edgeblog.net/2007/in-search-of-five-9s/#comments</comments>
		<pubDate>Mon, 29 Oct 2007 12:30:42 +0000</pubDate>
		<dc:creator>bill</dc:creator>
				<category><![CDATA[Data Center Design]]></category>
		<category><![CDATA[General]]></category>
		<category><![CDATA[Systems]]></category>

		<guid isPermaLink="false">http://www.edgeblog.net/2007/in-search-of-five-9s/</guid>
		<description><![CDATA[<p>I&#8217;ve spent the past few days trying to develop a simple mathematical model to predict the expected availability of complex systems. In IT, we are often asked to develop and commit to service level agreements (SLAs). If the points of failure of the system are not analyzed, and then the system availability calculated, the SLA is flawed from the beginning. To complicate matters further, different people have different definitions of availability. For instance, does scheduled downtime for maintenance count against your system availability calculation?</p>
<p>Common Availability Definitions:</p>
<ol>
<li>Availability = MTBF/(MTTR+MTBF) (Mean Time Between Failure, Mean Time To Recover). This is a classic definition of availability and is often used by hardware manufacturers when they publish an availability metric for a given server.</li>
<li>Availability = (Uptime + Scheduled Maintenance)/(Unscheduled Downtime + Uptime + Scheduled Maintenance). This is an IT centric availability metric where the business can support scheduled downtime after hours. This model works for some types of systems, such as a file server that isn&#8217;t needed at night, but it doesn&#8217;t work as well for websites, even though many web companies still use this for their SLAs.</li>
<li>Availability = Uptime/(Uptime + Downtime). This metric best applies to systems that are needed 24&#215;7 such as e-commerce sites.</li>
</ol>
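<p>To make the difference between the three definitions concrete, here is a minimal Python sketch. The function names and the sample hours (8,700 up, 40 planned, 26 unplanned) are made up for illustration and do not come from any real system.</p>

```python
def availability_mtbf(mtbf_hours, mttr_hours):
    """Definition 1: MTBF / (MTTR + MTBF)."""
    return mtbf_hours / (mttr_hours + mtbf_hours)

def availability_excluding_maintenance(uptime, scheduled, unscheduled):
    """Definition 2: scheduled maintenance is not counted as downtime."""
    return (uptime + scheduled) / (unscheduled + uptime + scheduled)

def availability_strict(uptime, downtime):
    """Definition 3: every outage counts, planned or not."""
    return uptime / (uptime + downtime)

# Hypothetical year: 8,700 h up, 40 h planned maintenance, 26 h unplanned outage.
up, planned, unplanned = 8700.0, 40.0, 26.0
print(f"{availability_excluding_maintenance(up, planned, unplanned):.4%}")
print(f"{availability_strict(up, planned + unplanned):.4%}")
```

<p>The same year of operation scores higher under the second definition than under the third, which is why the definition needs to be agreed on before the SLA number is.</p>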
<p>Availability is most often expressed as a percentage. Sometimes, people will refer to &#8220;four nines&#8221; (99.99%) or &#8220;five nines&#8221; (99.999%). To simplify things, the following table shows the minutes of downtime allowed per year for a given availability level:</p>
<table border="1" align="center">
<tr>
<th>
<p align="center">Availability</p>
</th>
<th>Minutes Downtime/Year</th>
<th>
<p align="center">Hours Downtime/Year</p>
</th>
</tr>
<tr>
<td align="center">95.000%</td>
<td align="center">26,298</td>
<td align="center">438</td>
</tr>
<tr>
<td align="center">98.000%</td>
<td align="center">10,519</td>
<td align="center">175</td>
</tr>
<tr>
<td align="center">98.500%</td>
<td align="center">7,889</td>
<td align="center">131</td>
</tr>
<tr>
<td align="center">99.000%</td>
<td align="center">5,260</td>
<td align="center">88</td>
</tr>
<tr>
<td align="center">99.500%</td>
<td align="center">2,630</td>
<td align="center">44</td>
</tr>
<tr>
<td align="center">99.900%</td>
<td align="center">526</td>
<td align="center">8.8</td>
</tr>
<tr>
<td align="center">99.990%</td>
<td align="center">52.6</td>
<td align="center">0.88</td>
</tr>
<tr>
<td align="center">99.999%</td>
<td align="center">5.26</td>
<td align="center">0.088</td>
</tr>
</table>
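<p>The table can be regenerated with a few lines of Python. This is only a sketch; it assumes a year of 365.25 days (525,960 minutes), which is the year length that reproduces the 26,298 minutes shown for 95%.</p>

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960; assumed year length, matches the table

def downtime_minutes_per_year(availability):
    """Minutes of downtime allowed per year at the given availability (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR

for level in (0.95, 0.99, 0.999, 0.9999, 0.99999):
    minutes = downtime_minutes_per_year(level)
    print(f"{level:.3%}  {minutes:10,.2f} min  {minutes / 60:8.3f} h")
```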
<p>Based on the above table, you can see there is a big difference between an SLA specifying 99% availability (88 hours of downtime per year) and 99.9% availability (8.8 hours of downtime per year). But how can we be sure what the expected downtime of the system is? In its simplest form, the expected availability of a system equals the expected availability of each of the components of the system multiplied together. So if the system was composed of two servers, and each server had an expected availability of 99%, then the expected availability of the system would be 99% * 99% = 98.01%. Note: I have used the term &#8220;expected availability.&#8221; We are calculating a future expectation of the system over an extended period of time, not a historical availability. For the rest of this article, I will drop the term expected for brevity, but it is always implied.</p>
<p>The simplistic model above is useful for illustrating the point that downtime is cumulative. In other words, if I expect each component to be down 88 hours/year, and a failure of either component is a failure of the system, then the system has an expected downtime of 174 hours. Why not 176 hours? Well, occasionally both components will be down at the same time.</p>
<p>Real world systems are never this simple. Typically, the system will be comprised of multiple components, some with redundancy, and each with different levels of component availability. Modeling these requires slightly more complicated formulas, but once you have the concepts down, then the actual calculations can be quickly performed in a spreadsheet. Before we move forward, we need some basic notations to simplify our formulas:</p>
<p id="code"><strong>Basic Nomenclature</strong><br />
Availability Component 1 = Ac<sub>1</sub><br />
Availability Component 2 = Ac<sub>2</sub><br />
Availability Component 3 = Ac<sub>3</sub><br />
Availability Component n = Ac<sub>n</sub><br />
Availability System = As</p>
<p>Now that that is out of the way, we are ready for our first formula. When a system is made up of n number of components that are each single points of failure then the system availability can be calculated as:</p>
<p><strong>EQUATION #1: As = Ac<sub>1</sub> * Ac<sub>2</sub> * Ac<sub>3</sub> * &#8230; * Ac<sub>n</sub></strong></p>
<p>Consider a 24&#215;7 e-commerce site with lots of single points of failure. We might model the site as having the following eight components:</p>
<table border="1" align="center">
<tr>
<th>
<p align="center">Component</p>
</th>
<th>
<p align="center">Availability</p>
</th>
</tr>
<tr>
<td align="center">Web</td>
<td align="center">85%</td>
</tr>
<tr>
<td align="center">Application</td>
<td align="center">90%</td>
</tr>
<tr>
<td align="center">Database</td>
<td align="center">99.9%</td>
</tr>
<tr>
<td align="center">DNS</td>
<td align="center">98%</td>
</tr>
<tr>
<td align="center">Firewall</td>
<td align="center">85%</td>
</tr>
<tr>
<td align="center">Switch</td>
<td align="center">99%</td>
</tr>
<tr>
<td align="center">Data Center</td>
<td align="center">99.99%</td>
</tr>
<tr>
<td align="center">ISP</td>
<td align="center">95%</td>
</tr>
</table>
<p>If any of these components fails, the website will crash. The expected availability of the site would be 85%*90%*99.9%*98%*85%*99%*99.99%*95% = 59.87%. Note that we are modeling each component as a whole, rather than looking at its parts. We could break down the web service into software (Apache), code (our web site), and hardware (motherboard, hard drives, etc.). For our purposes, complexity does not necessarily improve the model, so we will deal with the service as a whole. Also, for this discussion we will use the third definition of availability above. To our users, it does not matter if the site is down because of maintenance or a hard drive failed.</p>
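<p>Equation #1 applied to the eight components is a one-line product. A minimal Python sketch follows; the dictionary and function names are just illustrative.</p>

```python
from math import prod

components = {"Web": 0.85, "Application": 0.90, "Database": 0.999, "DNS": 0.98,
              "Firewall": 0.85, "Switch": 0.99, "Data Center": 0.9999, "ISP": 0.95}

def series_availability(availabilities):
    """Equation #1: every component is a single point of failure."""
    return prod(availabilities)

system = series_availability(components.values())
print(f"{system:.2%}")  # 59.87%
```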
<p>Assuming we want to keep our jobs, we need to figure out a way to increase this availability. Two obvious targets for improving the site stability would be the web service and the firewall. The question is, what effect would adding another web server have on the availability of this service? This leads us to our second equation. When a system is comprised of two redundant components, then the availability of the system can be calculated as:</p>
<p><strong>EQUATION #2: As = Ac<sub>1</sub> + ((1 &#8211; Ac<sub>1</sub>) * Ac<sub>2</sub>)</strong></p>
<p>Using our example of a web server with an availability of 85%, adding a second server would increase the availability to: 85% + (1-85%)*85% = 97.75%. The logic behind this is that when the 1st server is down (15% of the time), the second server is still up 85% of the time. This may or may not translate into actual real world availability. For instance, if the web server was down so often because we constantly need to take it offline to deploy new code, then adding a second server should translate to higher availability, because we could deploy code to one offline server, while the other server stays up. In this case, our real world availability increase might be greater than 12.75%. Conversely, if our service is down due to code bugs, then adding a second server could in some cases make the availability worse by exacerbating the bug.</p>
<p>The point is that in general if you&#8217;ve accurately estimated the availability of the component, then the equation will work. Note also that the equation works even if the components have unequal availability estimates. Assume that the web server has an availability problem because the hardware is undersized. Now assume the second server we purchase has twice the capacity, and we determine that by itself the availability of the new server would be 90%, then our equation changes to: 85% + (1-85%)*90% = 98.5%.</p>
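<p>Here is Equation #2 as a small Python function, with the two worked examples from the text. Like the equation itself, the sketch assumes the two units fail independently of each other; the function name is illustrative.</p>

```python
def redundant_pair(a1, a2):
    """Equation #2: the second unit covers the time the first is down."""
    return a1 + (1.0 - a1) * a2

print(f"{redundant_pair(0.85, 0.85):.2%}")  # 97.75%
print(f"{redundant_pair(0.85, 0.90):.2%}")  # 98.50%
# Same result as one minus the chance that both units are down together.
print(f"{1.0 - (1.0 - 0.85) * (1.0 - 0.90):.2%}")
```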
<p>So let&#8217;s add this back into our system calculation above. Assume that we add a second web server, and a second firewall, increasing the availability of each component system to 97.75%. Now, the availability of our system would be: 97.75%*90%*99.9%*98%*97.75%*99%*99.99%*95% = 79.17%. Better, but still not great. It is hard to achieve any level of high availability when you have single points of failure. So let&#8217;s assume that we add a redundant component for all our servers and network equipment. Let&#8217;s also assume we add a second ISP for carrier diversity, but we are still within one physical data center. Our availability equation would now be: 97.75%*99%*99.9999%*99.96%*97.75%*99.99%*99.99%*99.75% = 94.3%. Getting better. Removing single points of failure improved our availability of the system from 60% (3506 hours of downtime/year) to 94.3% (500 hours of downtime/year).</p>
<p>Equation #2 above modeled adding a single redundant component. In some cases, we may add more than one redundant component. For example, we may have more than two web servers. In this case, we need to iterate through equation #2 multiple times to find the effect of the additional components, which brings us to our third equation. When trying to calculate the availability of a service with n number of redundant components, we calculate it as follows, where Ac<sub>(n-1)</sub> is the combined availability of the first n-1 components and Ac<sub>n</sub> is the availability of the component being added:</p>
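<p>Mixing the two equations is easier to follow in code: each service becomes a redundant group, and the groups are multiplied together. The sketch below is one way to write it in Python. The helper names are made up, and the closed form 1 - (1 - a)^n is simply Equation #2 applied repeatedly to identical units.</p>

```python
from math import prod

def redundant(a, units):
    """Availability of `units` identical redundant components."""
    return 1.0 - (1.0 - a) ** units

base = {"Web": 0.85, "Application": 0.90, "Database": 0.999, "DNS": 0.98,
        "Firewall": 0.85, "Switch": 0.99, "Data Center": 0.9999, "ISP": 0.95}

def system(units_per_component):
    """Multiply the redundant groups together; unlisted services keep one unit."""
    return prod(redundant(a, units_per_component.get(name, 1))
                for name, a in base.items())

web_fw = system({"Web": 2, "Firewall": 2})
everything = system({name: 2 for name in base if name != "Data Center"})
print(f"{web_fw:.2%}  {everything:.2%}")
```

<p>Changing a unit count in the dictionary is all it takes to test a different design.</p>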
<p><strong>EQUATION #3: As = Ac<sub>(n-1)</sub> + ((1 &#8211; Ac<sub>(n-1)</sub>) * Ac<sub>n</sub>)</strong></p>
<p>In the case of our web service, adding a 3rd server would change the availability to: 97.75% + (1-97.75%)*85% = 99.6625%. Adding a fourth server would increase availability to: 99.6625% + (1-99.6625%)*85% = 99.949%. Note that there is a diminishing level of return. Adding a 2nd server increased availability by 12.75%. Adding a 3rd server only gained us 1.9125%. The fourth server bought us a paltry 0.2869%. And yet, even with 3 more servers than we need to service our load, we still have not yet achieved the elusive four nines of availability. Designing a highly available system requires us to make individual components highly available AND add redundancy of components. If the individual web server in our example had an availability of 90%, instead of 85%, then the availability of two servers would be 99% and three servers would be 99.9%.</p>
<p>Equations 2 &amp; 3 have a simple flaw, in that they assume a single component can handle the load, and that the load is constant. What if under normal operations, one web server can handle the load, but under peak, we need three servers? Then our availability for three servers under normal load would be 99.6625%, but under peak load, the availability would drop back to 85%. Under peak load, the failure of one server would cause a loss of the service, so we drop back to the availability of a single box. What if our peak load required 2 servers? In this case, the availability under peak would be 97.75%. If peak requires two servers, and we have three, then we can lose one server and still be operational, therefore our availability is the equivalent of having two servers. The important concept here is <strong>there is an inverse relationship between load and availability</strong>.</p>
<p>What should be obvious by now, is that achieving truly high levels of availability (99.9% &#8211; 99.999%) is very difficult and very expensive. One of the most expensive single points of failure to eliminate is the data center. In most cases, this effectively doubles the cost of your infrastructure, and the cost may be even greater than 2x, because you will often need to invest in technology to keep the servers in each data center in sync with each other.</p>
<p>Consider, however, the impact of adding a fully redundant data center. In our example above, the availability of our data center with redundant servers and ISPs was 94.3%. Adding a second data center with the necessary technology to make both centers work active-active (both data centers take traffic at the same time) would increase our availability to: 94.3% + (1-94.3%)*94.3% = 99.675%. Adding a second data center saved us 471 hours of downtime per year!</p>
<p>In this example, we assumed that each data center was an independent system, so a failure of a service in one data center would be a failure of the entire system in that data center. This is not always the case. For instance, if properly designed, a web server in one data center could connect to a database server in the other data center. In this case, the expected availability of the system would be higher than 99.675%. If you were able to design your site so that each service operated independently of the other services, then the availability in our example would increase from 99.675% to 99.888% (each service would have 3 redundant components, except the data center, which would have 1 redundant component).</p>
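<p>The iteration in Equation #3 can be written as a short loop, which also makes the diminishing return easy to see. This is a sketch with an illustrative function name; it assumes every added unit has the same availability.</p>

```python
def add_redundant_units(unit_availability, units):
    """Apply Equation #3 repeatedly, one extra unit at a time."""
    pool = unit_availability
    history = [pool]
    for _ in range(units - 1):
        pool = pool + (1.0 - pool) * unit_availability
        history.append(pool)
    return history

history = add_redundant_units(0.85, 4)
for count, value in enumerate(history, start=1):
    gain = value - history[count - 2] if count > 1 else 0.0
    print(f"{count} server(s): {value:.4%}  gain {gain:.4%}")
```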
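<p>The load rule described above can be sketched as a function of installed units and units required to carry the load. It deliberately mirrors the simplification in the text (only the spare units count as redundancy) rather than a full k-of-n calculation, and the function name is made up.</p>

```python
def availability_under_load(unit_availability, installed, required):
    """Rule of thumb from the text: only the spare units count as redundancy."""
    spares = installed - required
    if spares >= 0:
        return 1.0 - (1.0 - unit_availability) ** (spares + 1)
    return 0.0  # not enough units to carry the load even when all are up

# Three 85% servers, with the load needing one, two, or all three of them.
for required in (1, 2, 3):
    print(f"needs {required}: {availability_under_load(0.85, 3, required):.4%}")
```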
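<p>The data center arithmetic is Equation #2 applied to whole sites. A quick Python check, assuming the 8,766-hour year (365.25 days) that the downtime table implies:</p>

```python
HOURS_PER_YEAR = 365.25 * 24  # 8,766; assumed year length, matches the downtime table

def downtime_hours(availability):
    """Expected hours of downtime per year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR

one_site = 0.943
two_sites = one_site + (1.0 - one_site) * one_site  # Equation #2 applied to whole sites
saved = downtime_hours(one_site) - downtime_hours(two_sites)
print(f"{two_sites:.3%}, about {saved:.0f} hours saved per year")
```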
<p>These formulas are much easier to play with in Excel. Paste the following table into a spreadsheet, starting at cell A1:</p>
<table border="1" align="center">
<tr>
<td>
<p align="center"><strong>Avail %</strong></p>
</td>
<td>
<p align="center"><strong>1 Component</strong></p>
</td>
<td>
<p align="center"><strong>2 Components</strong></p>
</td>
<td>
<p align="center"><strong>3 Components</strong></p>
</td>
<td>
<p align="center"><strong>4 Components</strong></p>
</td>
</tr>
<tr>
<td align="center">Web</td>
<td align="center">85%</td>
<td align="center">=B2+((1-B2)*$B2)</td>
<td align="center">=C2+((1-C2)*$B2)</td>
<td align="center">=D2+((1-D2)*$B2)</td>
</tr>
<tr>
<td align="center">Application</td>
<td align="center">90%</td>
<td align="center">=B3+((1-B3)*$B3)</td>
<td align="center">=C3+((1-C3)*$B3)</td>
<td align="center">=D3+((1-D3)*$B3)</td>
</tr>
<tr>
<td align="center">Database</td>
<td align="center">99.9%</td>
<td align="center">=B4+((1-B4)*$B4)</td>
<td align="center">=C4+((1-C4)*$B4)</td>
<td align="center">=D4+((1-D4)*$B4)</td>
</tr>
<tr>
<td align="center">DNS</td>
<td align="center">98%</td>
<td align="center">=B5+((1-B5)*$B5)</td>
<td align="center">=C5+((1-C5)*$B5)</td>
<td align="center">=D5+((1-D5)*$B5)</td>
</tr>
<tr>
<td align="center">Firewall</td>
<td align="center">85%</td>
<td align="center">=B6+((1-B6)*$B6)</td>
<td align="center">=C6+((1-C6)*$B6)</td>
<td align="center">=D6+((1-D6)*$B6)</td>
</tr>
<tr>
<td align="center">Switch</td>
<td align="center">99%</td>
<td align="center">=B7+((1-B7)*$B7)</td>
<td align="center">=C7+((1-C7)*$B7)</td>
<td align="center">=D7+((1-D7)*$B7)</td>
</tr>
<tr>
<td align="center">Data Center</td>
<td align="center">99.99%</td>
<td align="center"> </td>
<td align="center">=B8+((1-B8)*$B8)</td>
<td align="center"> </td>
</tr>
<tr>
<td align="center">ISP</td>
<td align="center">95%</td>
<td align="center">=B9+((1-B9)*$B9)</td>
<td align="center">=C9+((1-C9)*$B9)</td>
<td align="center">=D9+((1-D9)*$B9)</td>
</tr>
<tr>
<td align="center"><strong>System Avail %</strong></td>
<td align="center">=b2*b3*b4*b5<br />
*b6*b7*b8*b9</td>
<td align="center">=c2*c3*c4*c5<br />
*c6*c7*b8*c9</td>
<td align="center">=d2*d3*d4*d5<br />
*d6*d7*d8*d9</td>
<td align="center">=e2*e3*e4*e5<br />
*e6*e7*d8*e9</td>
</tr>
</table>
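<p>If you would rather script it than use a spreadsheet, the same grid can be computed in Python. This sketch follows the layout of the sheet as I read it: the first two columns use one data center and the last two use two. The names are illustrative only.</p>

```python
unit = {"Web": 0.85, "Application": 0.90, "Database": 0.999, "DNS": 0.98,
        "Firewall": 0.85, "Switch": 0.99, "Data Center": 0.9999, "ISP": 0.95}

def column(units, data_centers):
    """System availability with `units` of each service and `data_centers` sites."""
    total = 1.0
    for name, a in unit.items():
        n = data_centers if name == "Data Center" else units
        total *= 1.0 - (1.0 - a) ** n
    return total

# The sheet uses one data center in the first two columns and two in the last two.
for units, sites in ((1, 1), (2, 1), (3, 2), (4, 2)):
    print(f"{units} component(s), {sites} data center(s): {column(units, sites):.3%}")
```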
<p>Now that you have the basic concepts down, and the beginnings of a spreadsheet for calculating changes to our assumptions, you can focus on applying these theories to your unique situation. Start by breaking down your system, be it a web site, an accounting system, or a file server, into individual component services. For each service, determine the minimum number of units required for the system to work, and the expected availability of a unit.</p>
<p>Estimating availability can be a challenge. One method would be to look at historical data. If you don&#8217;t have access to good data, you can form an estimate based upon your standard operating parameters. For instance, if you release new code to your web server twice per month, and each release causes 2 hours of downtime, that would translate to 48 hours of downtime per year. If you expect to perform operating system maintenance once per quarter, at an estimated downtime of 2 hours per quarter, that would equal another 8 hours per year. If you also anticipate one hardware failure per year, and had a next business day warranty, that would translate to on average 41 hours of downtime per year (Friday outages get repaired on Monday. Saturday and Sunday outages get repaired on Tuesday). Adding these numbers up, we get: 48 + 8 + 41 = 97 hours of downtime per year, or an estimated 98.893% availability.</p>
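<p>The estimate above is simple enough to keep in a small script, so the assumptions stay visible and easy to change. A minimal sketch, again assuming an 8,766-hour year; the labels are mine.</p>

```python
HOURS_PER_YEAR = 365.25 * 24  # 8,766; assumed year length

# Estimated hours of downtime per year, taken from the paragraph above.
downtime_sources = {
    "code releases (24 x 2 h)": 48,
    "OS maintenance (4 x 2 h)": 8,
    "hardware failure, next-business-day repair": 41,
}
total_hours = sum(downtime_sources.values())
availability = 1.0 - total_hours / HOURS_PER_YEAR
print(f"{total_hours} hours down, {availability:.3%} available")
```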
<p>With a little work, you can estimate a realistic level of availability of your system. This is the cornerstone of creating realistic and attainable SLAs. These formulas can help IT negotiate SLAs with the business, and can help determine the comparative ROI of different solutions. For instance, let&#8217;s say you were trying to choose a web server solution, and you had two choices:</p>
<ul>
<li>Choice number 1 consists of 4 servers using cheap hardware with no internal redundancy. Each server costs $3,000. You estimate the availability of each server to be 75%.</li>
<li>Choice number 2 consists of 2 servers using expensive hardware with redundant hard drives and power supplies. Each server costs $20,000. You estimate the availability of each server to be 99%.</li>
</ul>
<p>You estimate the cost of downtime to be $500/hour, and you expect these servers to support your site load with a single server for the next 3 years, after which they will be replaced. Using the above numbers, Solution #1 has an expected availability of 99.6%, at a cost of $12,000. Solution #2 has an expected availability of 99.99% at a cost of $40,000. Solution #1 would experience 34 hours/year, or 102 hours over three years of downtime more than solution #2. Over three years, this extra downtime would cost $51,000. So by spending $28,000 upfront, you would get a three year ROI of 182%. Note that the model is only as good as your estimates. If the servers in solution #2 only had 95% availability, then their combined availability would be 99.75%, which would only provide 13 hours less downtime per year. In this case, you would only save $20,000 in downtime over three years for your $28,000 investment, so you would be better off with solution #1.</p>
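<p>The comparison can be scripted as well. This sketch uses the prices, availabilities, $500/hour cost and 3-year life from the example; the helper names are illustrative. Because it does not round the availabilities before multiplying, its figures come out slightly lower than the rounded ones in the text.</p>

```python
HOURS_PER_YEAR = 365.25 * 24  # 8,766; assumed year length
COST_PER_HOUR = 500
YEARS = 3

def pool(unit_availability, units):
    """Availability of `units` identical redundant servers."""
    return 1.0 - (1.0 - unit_availability) ** units

def downtime_cost(availability):
    """Expected cost of downtime over the whole service life."""
    return (1.0 - availability) * HOURS_PER_YEAR * YEARS * COST_PER_HOUR

cheap = {"price": 4 * 3000, "avail": pool(0.75, 4)}
premium = {"price": 2 * 20000, "avail": pool(0.99, 2)}
extra_spend = premium["price"] - cheap["price"]
savings = downtime_cost(cheap["avail"]) - downtime_cost(premium["avail"])
print(f"extra spend ${extra_spend:,}, downtime avoided ${savings:,.0f}, "
      f"ratio {savings / extra_spend:.0%}")
```

<p>Swapping 0.99 for 0.95 in the second pool reproduces the reversal described above, where the cheaper solution wins.</p>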
<p>Designing and operating highly available systems is complicated work, but with a few simple formulas, it is possible to understand and predict their behavior at a macro level. This will allow you to make better decisions when choosing between multiple options, and give more realistic predictions when negotiating SLAs.</p>
]]></description>
		<wfw:commentRss>http://www.edgeblog.net/2007/in-search-of-five-9s/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
	</channel>
</rss>

