The hidden costs of chasing five nines

Achieving 99.999% availability, the "five nines," is a benchmark for excellence in distributed systems. This level of availability allows only about 5 minutes of downtime per year and promises near-constant reliability. But that resilience raises an obvious question: at what cost?

The pursuit of constant availability entails significant financial, operational, and human costs. These costs are often overlooked in the chase for five nines, and examining them raises the question of whether the benefits justify the investment, particularly in light of the law of diminishing returns.

Financial costs: infrastructure and more

Although most mid-sized companies now use the cloud, it is important to consider the capital expenditure for redundancy with on-premises high availability infrastructure. Companies must deploy multiple data centers in different geographic locations to ensure continuous availability.

These multiple data centers protect against local outages, but there are costs associated with maintaining these data centers, which relate to:

  1. Real estate and facilities: Leasing and purchasing of land and buildings. According to a report by the Uptime Institute, the potential cost of setting up a Tier III data center ranges from $7,000 to $12,000 per square foot. For a 10,000 square foot facility, that works out to roughly $70 million to $120 million.
  2. Hardware: Servers, storage systems, and network devices must be duplicated across multiple sites for a highly available infrastructure. A single server costs at least $2,000, and a company may need a thousand or more of them, representing an investment on the order of $2 million to $5 million.
  3. Network: Fast, redundant network connections between data centers to facilitate failover handling and load balancing. High-speed network connections can cost between $100,000 and $1 million per year, depending on bandwidth and support.
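The headline figures above can be sanity-checked with simple arithmetic. A quick sketch, using the illustrative ranges quoted in this article (the 10,000 sq ft facility size is from the Uptime Institute example; the variable names are assumptions):

```python
# Back-of-the-envelope ranges for redundant on-prem infrastructure.
# All figures are the illustrative ranges quoted in the article,
# not vendor quotes.

SQFT = 10_000                        # hypothetical Tier III facility size
FACILITY_PER_SQFT = (7_000, 12_000)  # USD per square foot (Uptime Institute range)
HARDWARE = (2_000_000, 5_000_000)    # duplicated servers, storage, network gear
NETWORK_PER_YEAR = (100_000, 1_000_000)  # redundant inter-site connectivity

# Facility build-out scales linearly with floor space.
facility = tuple(SQFT * cost for cost in FACILITY_PER_SQFT)
print(f"Facility build-out: ${facility[0]:,} to ${facility[1]:,}")
# -> Facility build-out: $70,000,000 to $120,000,000
```

Even before hardware and networking, the facility alone dominates the capital expenditure, which is why most mid-sized companies lean on cloud providers instead.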

Operating costs

The cost of managing complex and multi-layered redundancy levels, sophisticated failover mechanisms, and a multitude of interconnected services can lead to operational challenges for the following reasons:

  1. Increased monitoring needs: As infrastructure grows, so does the need to monitor anomalies in real time. Large enterprises spend around $50,000 to $200,000 annually on solutions such as Datadog, New Relic or Splunk. In some scenarios, there is also a need for custom monitoring solutions, which increases costs.
  2. Increased need for incident management: As monitoring increases, it is critical to manage incidents through robust incident management processes such as runbooks, escalation protocols, and communication strategies. According to the Ponemon Institute's Cost of a Data Breach Report, the average cost of a data breach in 2023 was approximately $4.45 million. While not all incidents result in breaches, the costs of downtime, investigation, and remediation can be significant. Even minor incidents can disrupt services and require many resources, especially in systems designed for high availability.
  3. Increased need for quality assurance and testing: A multi-tiered infrastructure requires rigorous testing, including:
     1. Disaster recovery exercises: Regularly simulating disaster scenarios to ensure recovery processes are active and functional. Costs depend on staff time, resource allocation, and potential disruptions to normal operations.
     2. Penetration testing: Running regular scans and tests to confirm that all services are available and free of known vulnerabilities.
     3. Performance testing: Continuously monitoring system performance under peak traffic loads and verifying that the system can scale dynamically without compromising availability.
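The failover mechanisms these tests exercise follow a common pattern: probe endpoints in priority order and route traffic to the first healthy one. A minimal sketch of that logic, where the endpoint names and the `check` callable are hypothetical stand-ins for a real health probe (HTTP ping, TCP connect, and so on):

```python
# Minimal failover sketch: probe endpoints in priority order and
# route traffic to the first one whose health check passes.

def pick_endpoint(endpoints, check):
    """Return the first endpoint whose health check passes, or None."""
    for ep in endpoints:
        if check(ep):
            return ep
    return None

# Simulated probe results: the primary is down, the replica is healthy.
status = {"primary.db.internal": False, "replica.db.internal": True}
active = pick_endpoint(
    ["primary.db.internal", "replica.db.internal"],
    lambda ep: status[ep],
)
print(active)  # -> replica.db.internal
```

Disaster recovery drills exist precisely to confirm that this kind of switch-over happens correctly when the primary really does go dark, not just in simulation.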

Figure: a hypothetical bar chart comparing the costs of disaster recovery, penetration testing, and performance testing across availability levels for a mid-sized organization.

Human costs

When you think about high availability, financial and operational considerations are always at the forefront; however, the human cost of maintaining these systems is also critical. The impact on IT operations teams is significant enough to affect their overall well-being and job satisfaction. These factors include, but are not limited to:

  1. Stress and burnout: The drive to achieve five nines means a significant number of staff are on call 24/7 to resolve any issue immediately. The expectation of instant response creates a high-pressure environment, and fear of serious consequences if availability slips, such as reputational damage or job loss, adds further pressure. This carries serious mental health implications.
  2. Human factor in error rate: In a high-pressure environment, the likelihood of human-caused errors increases.
  3. Employee turnover: A high-stress environment also leads to a higher turnover rate as developers avoid high-pressure environments and seek a better work-life balance.

Law of diminishing returns

According to this economic principle, as investment in a given area increases, performance or power gains eventually diminish. Applying this principle to the pursuit of high availability in distributed systems suggests that beyond a certain point, the additional investment will only provide marginal improvements in uptime.

Going from 99% to 99.9% often yields significant improvements in customer satisfaction and reliability. This step is also the most cost-effective, because it can be achieved with standard practices in infrastructure redundancy, incident management, and monitoring. However, moving from 99.9% to 99.99% or 99.999% drives costs up dramatically due to all of the factors previously mentioned: deeper redundancy, heavier monitoring, and extensive testing.

As investment increases, the return, measured in downtime avoided, shrinks. For example:

  1. Improving from 99.9% to 99.99% availability results in a reduction from 8.76 hours/year to 52.6 minutes/year. While this improvement may sound remarkable, the cost is only justified in certain industries such as finance and healthcare.
  2. Increasing availability from 99.99% to 99.999% reduces downtime further, to about 5 minutes/year. However, this final step carries the steepest cost, and most companies would struggle to justify it.
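The downtime figures above follow directly from the availability fraction. A minimal check in Python (using a 365-day year; leap years add a few seconds):

```python
MIN_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

def downtime_minutes(availability: float) -> float:
    """Maximum downtime per year allowed at a given availability fraction."""
    return (1 - availability) * MIN_PER_YEAR

for a in (0.999, 0.9999, 0.99999):
    print(f"{a * 100:g}%: {downtime_minutes(a):,.1f} min/year")
# 99.9%: 525.6 min/year (8.76 hours)
# 99.99%: 52.6 min/year
# 99.999%: 5.3 min/year
```

Note how each extra nine shrinks downtime by a factor of ten while, as argued above, the cost of achieving it grows far faster than that.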

As you can see, it is essential to balance these costs against system criticality. Companies must differentiate between mission-critical and non-critical systems and pursue five nines only where downtime would significantly damage revenue or reputation.

Conclusion

While a five-nines strategy makes sense for organizations in certain sectors, most organizations must weigh the costs against the benefits.

In most cases, it may make sense to aim for a lower availability target to achieve a better balance and enable more sustainable operations and a healthier work culture. As systems evolve, it becomes equally important to evolve the need for availability while prioritizing resilience, flexibility and the wellbeing of the people behind these systems.
