In this post we outline an approach for implementing the Warm Disaster Recovery pattern across two Amazon EC2 regions.
Disaster Recovery in the Cloud
A suitable disaster recovery strategy is crucial to many enterprises. However, achieving an appropriate level of redundancy is challenging given the stark financial and technical tradeoffs it imposes. Every disaster recovery planning initiative is bounded by financial constraints, and must optimize infrastructure recovery while achieving suitable data recovery without an adverse impact on latency. Thus, it boils down to a cost trade-off: balancing recovery time and recovery point against the cost of achieving them. In other words, at what point does the cost of downtime become greater than the cost of managing a fully operational disaster recovery environment, or vice versa? As an antidote to these constraints, I’ll present a solution architecture that aims to optimally balance cost and recovery time via three core principles that are germane to the cloud world:
On-Demand: The disaster recovery cloud can be provisioned on any availability zone, region, or public/private cloud through Cloudify’s cloud-agnostic bootstrapping mechanism.
Elastic: Resources in the recovery cloud are provisioned automatically in case of disaster, eliminating the need for idle resources in normal operation and thereby taking full advantage of the pay-per-use pricing model of clouds.
Flexible RTO/RPO: The architecture can be easily extended from a warm DR to a hot DR pattern by enabling or disabling application recipes. This lets us exploit the economies of scale that the cloud provides by matching the number of recipes/tiers provisioned in the recovery cloud to the recovery time/point objective of our disaster recovery strategy.
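The cost trade-off described above can be made concrete with a small back-of-the-envelope model. The following is a minimal sketch, not part of the original implementation: the `DrOption` type, the function names, and all figures (monthly costs, recovery hours, outage frequency) are illustrative assumptions chosen only to show the shape of the calculation.

```python
from dataclasses import dataclass


@dataclass
class DrOption:
    name: str
    monthly_cost: float    # assumed cost of running the DR environment per month
    recovery_hours: float  # assumed time to restore service after a disaster


def expected_monthly_cost(option, outages_per_year, downtime_cost_per_hour):
    """Running cost plus expected downtime loss, averaged per month."""
    expected_downtime_hours = option.recovery_hours * outages_per_year / 12
    return option.monthly_cost + expected_downtime_hours * downtime_cost_per_hour


def break_even_downtime_cost(cheap, fast, outages_per_year):
    """Hourly downtime cost above which the faster, pricier option is cheaper overall."""
    extra_cost = fast.monthly_cost - cheap.monthly_cost
    hours_saved = (cheap.recovery_hours - fast.recovery_hours) * outages_per_year / 12
    return extra_cost / hours_saved


# Hypothetical numbers: warm DR runs only the data tier, hot DR runs the full stack.
warm = DrOption("warm", monthly_cost=500.0, recovery_hours=1.0)
hot = DrOption("hot", monthly_cost=3500.0, recovery_hours=0.1)

threshold = break_even_downtime_cost(warm, hot, outages_per_year=2)
print(f"Hot DR pays off above ${threshold:,.0f} per hour of downtime")
```

Under these assumed figures, warm DR is the cheaper choice for any business whose downtime costs less than the computed threshold per hour. In practice the inputs would come from your own cloud bill and outage history.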
Solution Architecture: Elastic On-Demand DR
Disaster recovery patterns come in many different types, depending on the desired recovery time objective (RTO) and recovery point objective (RPO). For instance, in a Warm DR approach, only the data is replicated to another data center, while stateless servers (typically application servers) are started in the “failover” data center only once a disaster occurs. While this deployment is resource-efficient, it increases recovery time. A Hot DR pattern, on the other hand, deploys the entire stack in two separate data centers and fails over to the other during a disaster. In this case the recovery time is much shorter; however, resources in the standby data center mostly sit idle. Our implementation is based on the warm disaster recovery architecture, yet with the ability to automatically provision resources in the recovery site in the event of a disaster. While this architecture is applicable across any public cloud (HP Cloud, Rackspace, AWS), this is what it would look like on Amazon EC2:
In the primary cloud (west region), all the application tiers serve read-write operations. One or more additional clouds serve as DR clouds, in which only the PostgreSQL tiers are running. Heartbeat polling between the master and recovery clouds is provided by Cloudify agents running on the instances in the recovery clouds. The polling code is implemented in Groovy within Cloudify lifecycle events and can easily be changed to poll over different protocols and at arbitrary time intervals. When the primary cloud fails, the system enters failover mode, in which the recovery cloud (DR site) takes over the full application in read-write mode by spinning up instances that host the remaining application tiers.
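The heartbeat-and-failover logic described above can be sketched as follows. This is a simplified illustration in Python, not the Groovy lifecycle code used in the actual recipes: the `HeartbeatMonitor` class, the `max_misses` threshold, and the `on_failover` callback are all hypothetical names, and the source does not say how many missed heartbeats trigger failover.

```python
import time
from typing import Callable


class HeartbeatMonitor:
    """Declares failover after a number of consecutive missed heartbeats."""

    def __init__(self, probe: Callable[[], bool], max_misses: int = 3):
        self.probe = probe            # returns True when the primary answers
        self.max_misses = max_misses  # assumed threshold, tune to your RTO
        self.misses = 0
        self.failed_over = False

    def poll_once(self) -> bool:
        """Run one probe; return True if this poll triggered failover."""
        if self.failed_over:
            return False
        try:
            alive = self.probe()
        except Exception:
            alive = False  # a probe error (timeout, refused connection) counts as a miss
        self.misses = 0 if alive else self.misses + 1
        if self.misses >= self.max_misses:
            self.failed_over = True
            return True
        return False


def run(monitor, on_failover, interval_seconds=5.0, sleep=time.sleep):
    """Poll until failover is declared, then invoke the failover action once."""
    while not monitor.failed_over:
        if monitor.poll_once():
            on_failover()
        else:
            sleep(interval_seconds)


# Usage with a scripted probe: one transient miss, then a sustained outage.
responses = iter([True, False, True, False, False, False])
events = []
monitor = HeartbeatMonitor(probe=lambda: next(responses), max_misses=3)
run(monitor, on_failover=lambda: events.append("provision remaining tiers"),
    interval_seconds=0, sleep=lambda seconds: None)
print(events)
```

Requiring several consecutive misses keeps a single dropped packet from triggering an expensive failover. The trade-off is that detection time, and therefore recovery time, grows with the threshold multiplied by the polling interval.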
Design and Implementation
The application tier in our application contains transient data that can be easily recreated and doesn’t require any form of backup policy. Thus, all application tiers can be provisioned on demand through Cloudify service recipes on the DR cloud. The data tier, however, uses streaming replication in PostgreSQL to ensure data consistency. Such an application architecture is a natural fit for a warm DR scenario, since only a partial set of resources is required to implement data replication, without having to run the entire application in the DR cloud.
In our implementation, we provide two application recipes. The first recipe, petclinic-master, is responsible for onboarding the application deployed on the primary cloud. The second recipe extends the first with additional disaster recovery requirements such as data replication and automatic failover.
Since the cloud provider specifics are abstracted away by Cloudify’s cloud driver, all we have to do is pass in a cloud overrides file that describes whether we want our cloud deployed on a specific availability zone, region, cloud provider, private cloud, or even a bare metal data center. In other words, switching between regions, zones, and cloud providers is simply a matter of supplying a different configuration file.
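The idea of an overrides file can be illustrated with a generic sketch: a base cloud configuration stays fixed, and a small set of overrides changes only what differs for the DR site. This is not Cloudify's actual file format or merge logic. The `apply_overrides` function and every key and value below (`provider`, `region`, `compute`, the zone and hardware names) are hypothetical and serve only to show the principle.

```python
import copy


def apply_overrides(base: dict, overrides: dict) -> dict:
    """Return a new config where override values replace base values, merging nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical base configuration for the primary (west) deployment.
base_cloud = {
    "provider": "aws-ec2",
    "region": "us-west-1",
    "compute": {"hardware": "m1.small", "zone": "us-west-1a"},
}

# Hypothetical overrides for the DR site: only the location changes.
dr_overrides = {
    "region": "us-east-1",
    "compute": {"zone": "us-east-1a"},
}

dr_cloud = apply_overrides(base_cloud, dr_overrides)
print(dr_cloud)
```

Because the overrides contain only the differences, the same recipes can target another region or provider without being edited. That is the property the warm DR setup relies on.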
To show how all of this works in the real world, the screencast below provides an illustration: