Applications that demand the highest levels of availability and quality of experience (QoE) require geographic redundancy, or geo-redundancy, which duplicates applications and infrastructure across geographic regions, or at least across availability zones. This article looks at geo-redundant architectures, their challenges and advantages, and how a multi-cloud orchestrator like Cloudify can help manage them.
Geographic redundancy seeks to avoid application downtime by distributing applications and infrastructure across significant distances. A typical use of a geo-redundant architecture is surviving natural disasters that may cripple certain areas. The ability to seamlessly reconnect clients to functioning regions is critical in availability-sensitive domains such as Unified Communications as a Service (UCaaS), and desirable in most other domains as well. A service might have installations in two or three geographically diverse regions (both for proximity and availability), such as Atlanta, Dallas, and San Jose (or many other potential combinations).
Geo-redundancy usually takes one of two forms: disaster recovery (DR), or distributed (active/active).
Disaster Recovery: In the disaster recovery scenario, a ‘hot standby’ runs with no user traffic and is only activated when the main site fails or can’t be reached. The standby isn’t cold, particularly for applications that have significant amounts of data, because it needs to be receiving data from the main site in order to be ready to accept traffic at any moment. Synchronization of data with the main site is usually eventually consistent, meaning that there is a potential for data loss. The client itself is usually responsible for failing over to the backup site. When the main site comes back online, it must be reconciled with the data on the backup site.
Distributed Scenario: In the distributed scenario, clients are able to access all sites at all times. This makes more effective use of resources, but can complicate availability. The client may pick a site randomly, by proximity, or by some other criterion. The sites must replicate to each other and so may contain stale data, which the application must be able to deal with. To avoid data consistency issues, client sessions must be pinned to a single server, failing over much as in the DR scenario above. Post-disaster recovery is complicated by having multiple sites to synchronize in order to recover the failed site.
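As a rough illustration of client-driven failover, the sketch below tries sites in priority order and returns the first response it gets. The function name `call_with_failover`, the `send` callback, and the host names used in the usage note are hypothetical, chosen only for this example; a real client would also need timeouts, retries, and a policy for failing back.

```python
from typing import Callable, List, Sequence, Tuple


class AllSitesDown(Exception):
    """Raised when neither the main site nor any standby answered."""


def call_with_failover(
    sites: Sequence[str], send: Callable[[str], str]
) -> Tuple[str, str]:
    """Try each site in priority order (main first, then standbys).

    `send` is a caller-supplied function that performs the request against
    one site and raises ConnectionError if that site cannot be reached.
    Returns (site, response) for the first site that answers.
    """
    errors: List[str] = []
    for site in sites:
        try:
            return site, send(site)
        except ConnectionError as exc:
            # Remember why this site failed, then move on to the next one.
            errors.append(f"{site}: {exc}")
    raise AllSitesDown("; ".join(errors))
```

A caller would pass something like `["main.example.com", "standby.example.com"]` along with its own transport function. Data written to the main site but not yet replicated is still absent from the standby after failover, which is the data-loss window described above.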
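Site selection by proximity combined with session pinning can be sketched as follows. This is a minimal model under stated assumptions: the `SessionRouter` class, the coordinates, and the externally supplied set of healthy sites are all hypothetical, and proximity is approximated by great-circle distance rather than measured network latency.

```python
import math
from typing import Dict, Set, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in degrees

# Approximate, illustrative coordinates for the example regions.
SITES: Dict[str, Coord] = {
    "atlanta": (33.75, -84.39),
    "dallas": (32.78, -96.80),
    "san_jose": (37.34, -121.89),
}


def distance_km(a: Coord, b: Coord) -> float:
    """Great-circle distance between two points (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


class SessionRouter:
    """Pins each session to one site; re-pins only if that site is unhealthy."""

    def __init__(self, sites: Dict[str, Coord]) -> None:
        self.sites = dict(sites)
        self.pinned: Dict[str, str] = {}

    def route(self, session_id: str, client: Coord, healthy: Set[str]) -> str:
        site = self.pinned.get(session_id)
        if site in healthy:
            return site  # keep the existing pin to avoid reading stale data
        candidates = [s for s in self.sites if s in healthy]
        if not candidates:
            raise RuntimeError("no healthy site available")
        # First request, or the pinned site failed: pick the nearest healthy site.
        site = min(candidates, key=lambda s: distance_km(client, self.sites[s]))
        self.pinned[session_id] = site
        return site
```

The router does not move a session back when its original site recovers. That choice keeps the session on a single copy of the data, at the cost of leaving some clients on a more distant site until their session ends.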
The Limits of Geo-redundancy
While having geo-redundant infrastructure is important for high availability, it isn’t a complete remedy. The introduction of asynchronous data replication has necessary but undesirable side effects like:
- Operational Complexity – The replication channels are critical and must be monitored. Even a functional but degraded link can cause problems: the queue of unsent messages consumes space, and the sites can drift toward a “split brain”, where each system has a different view of the global state.
- Data Loss – Depending on the nature of system failure, unsent data can be lost permanently. The further the replication link falls behind system changes, the greater the impact of that loss.
- Recovery Complexity – In the event of failure, the post-failure reconstruction of the failed system can be complex, particularly if replication isn’t limited to relational data (which tends to have quality replication and recovery support). Consider that any reasonable high availability strategy involves at least three sites, since, once a failure occurs, there will still be a redundant pair of sites. If you only have two and one fails, you have no safety net left.
In addition to replication-related issues, geo-redundancy only addresses inter-site user experience issues. The larger issue of quality of experience requires that the individual sites also be highly available. Modern cloud native architectures address this very well, but may require significant effort to realize.
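The first two side effects can be made concrete with a small model of an asynchronous replication link. The sketch below is illustrative only: `ReplicationChannel`, its `max_backlog` threshold, and the helper `has_redundancy_after_failure` are hypothetical names, and a real system would measure lag from the replication technology it actually uses.

```python
from collections import deque
from typing import Deque, List


class ReplicationChannel:
    """Models an asynchronous replication link as a queue of unsent changes."""

    def __init__(self, max_backlog: int) -> None:
        self.max_backlog = max_backlog  # alert threshold, in queued changes
        self.pending: Deque[str] = deque()

    def record_change(self, change: str) -> None:
        """A local write happened; it must eventually reach the peer site."""
        self.pending.append(change)

    def ship(self, count: int) -> List[str]:
        """Simulate the link delivering up to `count` queued changes."""
        shipped: List[str] = []
        while self.pending and len(shipped) < count:
            shipped.append(self.pending.popleft())
        return shipped

    @property
    def backlog(self) -> int:
        return len(self.pending)

    def is_degraded(self) -> bool:
        """True when the link is working but falling too far behind."""
        return self.backlog > self.max_backlog

    def changes_at_risk(self) -> List[str]:
        """Changes that would be lost if this site failed right now."""
        return list(self.pending)


def has_redundancy_after_failure(total_sites: int, failed: int = 1) -> bool:
    """True if at least two sites remain after `failed` sites are lost."""
    return total_sites - failed >= 2
```

In this model the backlog is both the operational signal to monitor and the measure of potential data loss, which is why a link that is merely slow deserves an alert. The helper at the end restates the three-site argument: two sites leave a single survivor after one failure, while three leave a redundant pair.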
The Role of Orchestration
Geo-redundancy orchestration has some features in common with multi/hybrid cloud, regardless of whether the individual sites are hosted on the same cloud. Cloudify, being cloud agnostic, can represent each site as a separate blueprint, while representing the overall orchestration as a containing blueprint, using the service composition feature. The individual blueprints represent and manage each site separately, whether it is on a public cloud, private cloud, or Kubernetes cluster, or some combination.
Once defined, each site’s initial installation can be performed individually, or in one fell swoop by the parent blueprint. Via blueprint capabilities, the individual sites can expose VPN connection details if needed, which the parent blueprint can use to initiate data replication. Of course, such replication initiation is not needed if the service in question is using a geo-redundant cloud data store. The parent blueprint can also configure a cloud-based load balancer to distribute traffic, if dumb clients are used.
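The parent/child relationship can be pictured with a plain Python model. This is not the Cloudify API or blueprint syntax: `SiteDeployment`, `ParentDeployment`, the `vpn_endpoint` capability name, and the `example.com` host names are all hypothetical, and the sketch only shows the control flow of installing sites, reading their exposed capabilities, and wiring replication between them.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SiteDeployment:
    """Stand-in for one per-site deployment (hypothetical model)."""
    name: str
    installed: bool = False
    capabilities: Dict[str, str] = field(default_factory=dict)

    def install(self) -> None:
        # A real site blueprint would provision infrastructure here.
        self.installed = True
        # Expose connection details for the parent to consume.
        self.capabilities["vpn_endpoint"] = f"vpn.{self.name}.example.com"


@dataclass
class ParentDeployment:
    """Stand-in for the containing deployment that composes the sites."""
    sites: List[SiteDeployment]
    replication_links: List[Tuple[str, str]] = field(default_factory=list)

    def install(self) -> None:
        # Install any site that was not already installed individually.
        for site in self.sites:
            if not site.installed:
                site.install()
        # Use the exposed capabilities to connect every pair of sites.
        endpoints = [s.capabilities["vpn_endpoint"] for s in self.sites]
        self.replication_links = list(itertools.combinations(endpoints, 2))
```

The full-mesh pairing is one possible replication topology, picked here for simplicity. With three sites it produces three links, and the parent would skip this step entirely when a geo-redundant cloud data store handles replication.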
A Concrete Example Using AWS
In the example below, Cloudify is shown orchestrating a geo-redundant application using AWS infrastructure. The use of AWS is incidental, and other providers can be used as needed.
Starting from the top of the diagram, Cloudify’s portal/console presents a list of predefined environments available for creation. The console also has a graphical map interface that shows where individual sites are located. Once the geo-redundant service topology (blueprint) is selected and executed, deployment of the three sites (availability zones) proceeds.
Rather than have a database in each site, this design utilizes AWS RDS to provide geo-redundant data storage. Also, to provide geo-redundant request routing, Amazon Route 53 is used to reliably route traffic to the various sites.
Cloudify, through its extensive portfolio of infrastructure and service integrations, as well as hierarchical templates, is able to tie together all the components of the architecture into a seamless whole. Besides day 0/1 provisioning, Cloudify blueprints can be extended to support arbitrary day 2 tasks that can span all sites as needed.
Geo-redundancy is an important piece of the puzzle for assuring an uninterrupted user experience, but it adds complexity and by itself isn’t sufficient. Quality of experience is something that needs to be designed in from the ground up. In order to manage multi-cloud geo-redundant complexity, sophisticated automation is necessary. Ideally that automation platform is not tied to specific vendors, yet is compatible with any and all of them, and is built to be extended. It should also be open, in both source and architecture. Cloudify is a platform that can check all the boxes and has been used in production by some of the biggest players in the business.