Myths About Cloud High Availability

When organizations weigh the advantages of migrating to the public cloud, they often assume, wrongly, that the migration by itself will increase system availability.

In this post, we will review several categories of cloud services and debunk the myth that cloud infrastructure is highly available by default.

Compute

By default, a new virtual machine deployed in the cloud is not highly available. When the services installed on the server require high availability, we need to deploy at least two separate servers, each in a different Availability zone in the same Region, and plan a highly available access solution for the service. Options range from DNS-based traffic redirection (in round robin mode) to placing the servers behind a load-balancer service that checks the health of each server before redirecting traffic to it.
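The load-balancer behavior described above can be illustrated with a minimal sketch. It is not any provider's implementation: the `Backend` class, the server names, the zone labels, and the injected `is_healthy` callback are all hypothetical, and a real load balancer would probe health over the network on a schedule rather than on every request.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Backend:
    name: str  # hypothetical server name
    zone: str  # hypothetical Availability zone label


class HealthAwareRoundRobin:
    """Rotate over backends, skipping any that fail the health check."""

    def __init__(self, backends: Iterable[Backend],
                 is_healthy: Callable[[Backend], bool]) -> None:
        self._backends = list(backends)
        self._is_healthy = is_healthy
        self._ring = cycle(self._backends)

    def next_backend(self) -> Optional[Backend]:
        # Try each backend at most once per call, starting where the
        # previous call stopped, and return the first healthy one.
        for _ in range(len(self._backends)):
            candidate = next(self._ring)
            if self._is_healthy(candidate):
                return candidate
        return None  # every backend is down: the service is unavailable


# Two servers in different zones; `down` simulates failed health checks.
web_a = Backend("web-a", "zone-1")
web_b = Backend("web-b", "zone-2")
down = set()
balancer = HealthAwareRoundRobin([web_a, web_b],
                                 lambda backend: backend.name not in down)
```

While both servers are healthy, requests alternate between them. Once `web-a` is added to `down`, every request goes to `web-b`, which is the property plain DNS round robin lacks unless it is combined with health checks.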

Availability issues of a virtual machine are not always caused by unpredictable outages. They can also be caused by planned customer maintenance, such as security patches that require a server restart, or by planned maintenance by the cloud service provider, such as firmware or security patch updates on the host machines that run multiple guest servers belonging to different customers.
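To get a rough feel for what a second server buys, the arithmetic can be sketched as follows. This is an idealized estimate under two assumptions that are not from the original text: the servers fail independently of each other, and either server alone can carry the full load. The 99.5% figure is purely illustrative and is not any provider's SLA.

```python
HOURS_PER_YEAR = 365 * 24


def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` independent servers, where the service
    is up as long as at least one server is up."""
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The service is down only when every replica is down at once.
    return 1.0 - (1.0 - single) ** replicas


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    single = 0.995  # illustrative value only, not a provider SLA
    for replicas in (1, 2):
        availability = combined_availability(single, replicas)
        hours = expected_downtime_hours(availability)
        print(f"{replicas} server(s): availability {availability:.6f}, "
              f"about {hours:.2f} hours of downtime per year")
```

Real deployments tend to do worse than this estimate, because failures are often correlated: a bad patch applied to both servers, or a Region-wide incident, takes out both replicas together.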

For a cluster or highly available architecture to function correctly, we need to make sure there is a mechanism for syncing data between the servers (usually at the application layer).

It is also important not to store session state inside the servers, but in an external service such as Redis.
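Why external session state matters can be shown with a small sketch. Everything here is hypothetical: `InMemorySharedStore` is only a stand-in for an external service such as Redis so that the example runs without a network, and `AppServer` represents one of the servers behind the load balancer.

```python
import uuid
from typing import Dict, Optional, Protocol


class SessionStore(Protocol):
    """Minimal interface an external session service would need to offer."""

    def get(self, session_id: str) -> Optional[dict]: ...

    def put(self, session_id: str, data: dict) -> None: ...


class InMemorySharedStore:
    """Stand-in for an external store; NOT suitable for production use."""

    def __init__(self) -> None:
        self._data: Dict[str, dict] = {}

    def get(self, session_id: str) -> Optional[dict]:
        value = self._data.get(session_id)
        return dict(value) if value is not None else None

    def put(self, session_id: str, data: dict) -> None:
        self._data[session_id] = dict(data)


class AppServer:
    """A stateless server: it keeps no session data of its own."""

    def __init__(self, name: str, store: SessionStore) -> None:
        self.name = name
        self._store = store

    def login(self, user: str) -> str:
        session_id = uuid.uuid4().hex
        self._store.put(session_id, {"user": user})
        return session_id

    def whoami(self, session_id: str) -> Optional[str]:
        session = self._store.get(session_id)
        return session["user"] if session else None


# Both servers share one store, so either can serve any session.
store = InMemorySharedStore()
server_a = AppServer("web-a", store)
server_b = AppServer("web-b", store)
session_id = server_a.login("alice")
```

A session created on `server_a` is visible to `server_b`, so the user stays logged in if the load balancer sends the next request to the other server or if `server_a` is restarted for patching. The external store then becomes a dependency that needs its own high availability.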

Storage

Object storage services (such as Amazon S3, Azure Blob storage, Google Cloud Storage, Oracle Cloud Object Storage) are managed storage solutions for storing files online and in archive mode.

For high availability, the cloud provider syncs files between several Availability zones in the same Region (and sometimes between Regions).

Block storage services (such as Amazon EBS, Azure Disk Storage, Google Persistent Disk, Oracle Cloud Block Volumes) are managed storage solutions that allow us to mount a volume inside the OS.

Block storage has a limitation: a volume belongs to a specific Availability zone, so by default it is not synced to other Availability zones in the same Region.
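The practical consequence of this zone binding can be sketched as a simple planning check. The `Volume` class, the volume IDs, and the zone labels are hypothetical and do not correspond to any provider's API.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Volume:
    volume_id: str  # hypothetical identifier
    zone: str       # Availability zone the volume was created in


def attachable_volumes(volumes: Iterable[Volume],
                       server_zone: str) -> List[Volume]:
    """Return only the volumes that live in the server's own zone."""
    return [volume for volume in volumes if volume.zone == server_zone]


volumes = [
    Volume("vol-data", "zone-1"),
    Volume("vol-logs", "zone-1"),
    Volume("vol-archive", "zone-2"),
]

# A replacement server started in zone-2 can attach only vol-archive.
reachable = attachable_volumes(volumes, "zone-2")
```

If the server in `zone-1` fails and its replacement starts in `zone-2`, the data on `vol-data` and `vol-logs` is out of reach unless it was copied or replicated by some other means beforehand.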

There are managed file sharing services based on the NFS protocol (such as Amazon EFS, Google Cloud Filestore, Oracle File Storage).

For high availability purposes, the cloud providers sync volumes between Availability zones in the same Region.

There are also managed file sharing services based on the SMB/CIFS protocol (such as Amazon FSx for Windows File Server, Azure Files).

In the case of Azure Files, the cloud provider is in charge of high availability.

For Amazon FSx for Windows File Server, customers choose between a highly available deployment method (Multi-AZ) and a regular deployment method (Single-AZ).

Networking

To implement highly available services, it is recommended to select managed load-balancing services (such as Amazon ELB, Azure Load Balancer, Google Cloud Load Balancing, Oracle Cloud Load Balancing).

Because they are managed services, the cloud provider is in charge of scale (based on load) and availability (spreading the load over multiple Availability zones).

Databases

There are managed relational database services (such as Amazon RDS, Azure SQL Database, Azure Database for MySQL, Google Cloud SQL, Oracle MySQL Database Service).

It is important to understand that even though these are managed services, which include security patches and backups, by default they are deployed as a single instance and are not considered highly available solutions.

Sometimes (depending on database engine support) we need to deploy the database in cluster mode as part of the managed environment, and sometimes, in order to deploy a highly available solution, we need to choose a specific licensing tier (such as a premium license).
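One way to catch single-instance databases is a small audit over the provider's inventory output. The sketch below uses Amazon RDS as one example and assumes the input dictionary has the shape returned by the `describe_db_instances` call of the AWS SDK for Python (a `DBInstances` list whose entries carry `DBInstanceIdentifier` and `MultiAZ`). The sample data and database names are invented, and other providers expose this information differently.

```python
from typing import Any, Dict, List


def single_instance_databases(response: Dict[str, Any]) -> List[str]:
    """Return identifiers of databases that are not deployed Multi-AZ."""
    return [
        db["DBInstanceIdentifier"]
        for db in response.get("DBInstances", [])
        # A missing flag is treated as "not highly available".
        if not db.get("MultiAZ", False)
    ]


# Invented sample data in the assumed response shape.
sample_response = {
    "DBInstances": [
        {"DBInstanceIdentifier": "orders-db", "MultiAZ": False},
        {"DBInstanceIdentifier": "billing-db", "MultiAZ": True},
        {"DBInstanceIdentifier": "reports-db"},
    ]
}

at_risk = single_instance_databases(sample_response)
```

Here `at_risk` contains `orders-db` and `reports-db`. Treating a missing flag as a finding is a deliberate choice in this sketch: for an availability audit, a false alarm costs less than a missed single point of failure.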

Summary

As you can see, there is a diverse range of factors to consider when migrating to public cloud services.

It is recommended not to fall into the trap of believing the myth that “the cloud is always highly available compared to on-premises,” and to always read the cloud provider’s official service documentation in order to understand the pricing models and to design architectures based on organizational and end-user demands.

Additional References

AWS Well-Architected Framework – Reliability Pillar

Microsoft Azure Well-Architected Framework – Reliability Pillar

Google Cloud Architecture Framework – Reliability

Oracle Cloud Infrastructure – Reliability and Resilience
