Cloud Catastrophes Part 1: Geo Redundancy

Shawn Anderson and German Martinez

The worldwide public cloud services market is forecast to grow 6.3% in 2020 to a total of $257.9 billion, according to Gartner. As cloud adoption grows, so does our reliance on public cloud services. What happens when a major cloud provider suffers an outage?

September 28, 2020: Users and applications that relied on Microsoft Teams, Office 365, or Outlook services were unable to authenticate, essentially locking them out for over five hours.
June 2, 2019: Consumer services such as Google’s Gmail, G Suite, YouTube, and Nest were unavailable. Customers such as Apple’s iCloud, Snapchat, and Discord also experienced disruptions lasting over four hours.
February 28, 2017: The S3 storage service in the AWS US-EAST-1 region went offline, impacting AWS’s own health dashboard, other AWS services, and multiple customers based in that region for over four hours.

In this multipart series, “Lessons Learned from Cloud Catastrophes,” we will review some past public cloud disasters and discuss what we can do to mitigate the impact of these events. In part one of this series, we will focus on some of the key aspects of protecting your IT systems against a geographic-based outage.

Geo Redundancy

When adopting cloud services, organizations hand over the responsibilities of computing, storage, and networking to a provider whose goal is to commoditize these services on a massive scale. These services are designed to be resilient, but that does not make them foolproof. As consumers, organizations are still responsible for designing and implementing geo redundancy within their own ecosystems and services.

It would not be fair to say that every outage is caused by negligence, nor is every outage caused by a natural disaster. Usually, an outage results from a combination of circumstances colliding in an unforeseen way.

August 23, 2019: AWS experienced an outage affecting multiple servers in the ap-northeast-1 region. The servers overheated after the failure of the data center control system used to optimize cooling.
September 18, 2020: Azure’s Premium File services in the East US region experienced an outage due to the rollout of a feature intended to optimize IO operations. The new feature contained a bug that caused all front-end storage systems to return faulty responses to users and applications trying to access data.

While both outages were contained to a specific region within their respective ecosystems, thousands of customers still experienced downtime in their IT operations. Customers that had designed and implemented geo redundancy were able to continue operating with little to no downtime for their organization or their own customers.

Geo Redundant Design & Implementation

Multiple factors should be considered when designing for geo redundancy: cloud platform, application architecture, acceptable risk tolerance, and cost, to name a few. A recommended starting point is to build out a complete set of diagrams that includes the infrastructure, applications, data flows, and third-party integration points. These diagrams will help identify dependencies and determine which cloud services are necessary when designing a geo redundant environment. Regardless of whether you are starting with a greenfield environment or modifying an existing one, you should focus on these three key areas:

1. Architecture

Each public cloud provider has its own solution for creating geo redundant environments. These solutions will vary in composition of services, naming conventions, and architectures.

Microsoft’s Azure: Depending on your application and business requirements, your environment will consist of various platform services. Some of these services, such as Azure Active Directory, DNS, and CDN, are already built on a highly available architecture and will replicate the data located within them. Other services, such as virtual machines, databases, storage, and web apps, require configuration and sometimes process changes in your DevOps methodology to enable data and application replication. Services such as Geo-Redundant Storage (GRS) and Azure Site Recovery are specifically designed to support and enable the replication of your data, applications, and virtual machines between locations.

*This cloud native architecture leverages Azure CDN for static file content replication and geo-replication (GRS) services for the databases (Microsoft).

Amazon’s AWS: Similarly, AWS offers a highly available architecture by default in its IAM, Route 53, and CDN services. S3 already stores objects redundantly across multiple Availability Zones (AZs) within a region at no additional cost, and Same-Region Replication is now available for copying objects between buckets in the same region. For full geo redundancy, you would use Cross-Region Replication (CRR) to replicate objects into different AWS regions. There is also CloudEndure, which allows for the fast, automatic launch of thousands of virtual machines at failover in mere minutes.

*AWS CloudEndure does not require a reboot process and can be ready in a matter of minutes. It does this using lightweight replicated EC2 instances, as pictured in the diagram above (AWS).

2. Monitoring & Alerting

Businesses have expectations of their cloud services just as they do of their on-premises infrastructure. Traditional on-premises monitoring focuses on vertical layers of infrastructure (such as compute, storage, and networking), whereas cloud-native applications require a horizontal view. Cloud-native applications are often composed of numerous microservices and platform components spread across several hosts, and a problem on any one of those hosts can impact performance and availability. To bridge the gap and monitor these new types of services, there are several cloud-native and third-party tools capable of delivering a full picture of your environment, spanning both on-premises and cloud-based infrastructure.
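To make the idea of a cross-region view more concrete, here is a minimal Python sketch that classifies per-region health probes and picks a region to serve traffic. Everything in it is an assumption made for illustration: ProbeResult, summarize, pick_active_region, the region labels, and the 500 ms latency threshold are hypothetical and do not correspond to any cloud provider's API. In practice, the probe data would come from whichever monitoring tool you have chosen.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ProbeResult:
    """One health probe against one region (hypothetical structure)."""
    region: str
    healthy: bool
    latency_ms: float


def summarize(results: List[ProbeResult], max_latency_ms: float = 500.0) -> Dict[str, str]:
    """Classify each region as 'ok', 'degraded', or 'down'."""
    status: Dict[str, str] = {}
    for result in results:
        if not result.healthy:
            status[result.region] = "down"
        elif result.latency_ms > max_latency_ms:
            status[result.region] = "degraded"
        else:
            status[result.region] = "ok"
    return status


def pick_active_region(status: Dict[str, str], preference: List[str]) -> Optional[str]:
    """Return the first preferred region that is 'ok', then the first that is
    'degraded', or None when every region is down."""
    for wanted in ("ok", "degraded"):
        for region in preference:
            if status.get(region) == wanted:
                return region
    return None


# Example with made-up probe data: the primary region is failing.
probes = [
    ProbeResult("east", healthy=False, latency_ms=0.0),
    ProbeResult("central", healthy=True, latency_ms=900.0),
    ProbeResult("west", healthy=True, latency_ms=120.0),
]
status = summarize(probes)
print(status)  # {'east': 'down', 'central': 'degraded', 'west': 'ok'}
print(pick_active_region(status, ["east", "central", "west"]))  # west
```

The same status summary could just as easily drive an alert instead of a failover decision; the point is that each region is evaluated separately rather than treating the application as a single host.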

3. Disaster Recovery Testing

A false sense of security is worse than no security at all. Once the organization has implemented a site resilience strategy, it needs to be tested on a regular basis (at least annually). You don’t want to wait for a disaster to find out your disaster recovery plan does not work. These exercises are also a great way to expose hidden dependencies and technical debt. Organizations often shy away from performing these exercises because they can be painful, but the lessons learned are invaluable and the exercises get easier the more often you do them.
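Before running a live exercise, a tabletop simulation can surface some hidden dependencies on paper. The Python sketch below assumes you have recorded where each service is deployed and what it depends on; the function unavailable_services, the service names, and the region labels are all hypothetical. It marks a service as down when it has no surviving region, then propagates the failure to anything that depends on it. It is a simplification, not a substitute for a real failover test.

```python
from typing import Dict, List, Set


def unavailable_services(
    deployments: Dict[str, Set[str]],
    depends_on: Dict[str, List[str]],
    failed_region: str,
) -> Set[str]:
    """Return the services that would be unavailable if failed_region went offline."""
    # A service with no region left after the failure is down outright.
    down = {
        service
        for service, regions in deployments.items()
        if not (regions - {failed_region})
    }
    # Propagate: a service is down if any of its dependencies is down.
    changed = True
    while changed:
        changed = False
        for service, dependencies in depends_on.items():
            if service not in down and any(dep in down for dep in dependencies):
                down.add(service)
                changed = True
    return down


# Made-up inventory: everything is in two regions except one forgotten service.
deployments = {
    "web": {"east", "west"},
    "api": {"east", "west"},
    "db": {"east", "west"},
    "license-server": {"east"},
}
depends_on = {
    "web": ["api"],
    "api": ["db", "license-server"],
}

print(sorted(unavailable_services(deployments, depends_on, "east")))
# ['api', 'license-server', 'web']
print(sorted(unavailable_services(deployments, depends_on, "west")))
# []
```

In this made-up inventory, the single-region license server takes the API and the website down with it when the east region fails, even though both are deployed in two regions.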

Count the Cost

Before making the decision to implement a geo redundant solution, we need to understand there are two types of costs involved. The first is the cost of purchasing and running the services necessary to have your ecosystem be geo redundant, which is easier to calculate. The second is the cost of downtime, which is much harder to calculate. The financial impact incurred when your environment is down due to a disaster requires us to look at four key areas:

1. Downtime (duration) - This is measured as the time elapsed between when the monitoring system first identified the outage and when all systems are restored and services are once again accessible.

2. Targets - Each of an application’s dependencies needs to be listed and rated on a scale from high (critical) to low (minimal) impact. This should be performed for both internal-facing and external (customer-facing) services. Use the impact level to prioritize which services should be geo redundant.

3. Hard Costs - These are the costs associated with employee and/or manufacturing productivity lost while systems are not in operation. This calculation needs to include the average hourly revenue lost plus the cost of any breach of service level agreements (SLAs).

4. Soft Costs - This calculation is more difficult to perform because it covers the “squishy” monetary costs associated with areas such as opportunity loss, damage to the brand, and loss of consumer confidence. These costs may not be realized or calculable for months or even years.
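The duration and hard cost items above lend themselves to a simple calculation. The Python sketch below is one way to express it, assuming hard cost equals outage hours multiplied by the sum of hourly revenue and hourly productivity loss, plus any SLA penalties. The Outage class, the hard_cost function, and all of the dollar figures and timestamps are hypothetical and for illustration only. Soft costs are deliberately left out because they cannot be reduced to a formula.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Outage:
    """Outage window, from first detection to full restoration."""
    detected_at: datetime
    restored_at: datetime

    @property
    def hours(self) -> float:
        return (self.restored_at - self.detected_at).total_seconds() / 3600.0


def hard_cost(
    outage: Outage,
    hourly_revenue: float,
    hourly_productivity_loss: float,
    sla_penalty: float = 0.0,
) -> float:
    """Estimate hard cost: lost revenue and productivity over the outage,
    plus any SLA penalties (all inputs are your own estimates)."""
    return outage.hours * (hourly_revenue + hourly_productivity_loss) + sla_penalty


# Made-up example: a five-hour outage.
outage = Outage(
    detected_at=datetime(2020, 9, 28, 21, 25),
    restored_at=datetime(2020, 9, 29, 2, 25),
)
print(outage.hours)  # 5.0
print(hard_cost(outage, hourly_revenue=10_000, hourly_productivity_loss=2_000, sla_penalty=15_000))
# 75000.0
```

Comparing a figure like this against the annual cost of running the redundant services gives a first, rough sense of whether geo redundancy pays for itself.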

For more examples and additional details, you can refer to a previous blog post about The True Cost of Downtime.

Lessons Learned

Public cloud outages will continue to happen because hardware fails, humans are not perfect, and Mother Nature does not check in before striking. Building in geo redundancy will minimize productivity and financial losses during a disaster. With a proper review and a little financial analysis, your organization can understand its potential exposure to the next cloud catastrophe.

Do you need help analyzing your exposure or planning and implementing a geo redundant environment? Credera has experience helping organizations in public, private, and hybrid cloud scenarios. You can reach us at findoutmore@credera.com.