The Need for Automation at Scale
The use cases for automation have evolved over the years, from automating the configuration management of a relatively static data center on a single machine to automating the entire infrastructure stack across many sites and even edge devices.
According to a Flexera report, the top three growing cloud services are IoT, containers, and machine learning. These use cases tend to be highly distributed.
In the DevOps world, automation has most commonly been driven by a simple command-line tool, as in the case of Ansible or Terraform.
As we move to automating IoT, machine learning, and multiple Kubernetes clusters across multiple clouds, driving the automation from a CLI becomes insufficient, particularly because we have to coordinate distributed processes that are interlinked. Handling automation at this new scale requires a different approach.
Overcoming the Challenges
Orchestration of a distributed system at scale creates many challenges:
- Large-scale resource management – Managing millions of deployments across multiple sites.
- Concurrent execution – Allowing workflows to execute concurrently across deployments.
- Latency and unreliable networks – Sites can be running in high-latency environments and over unreliable networks. The workflow execution needs to be tolerant enough to handle this situation.
- Distributed execution – Workflows can span environments and sites.
- Security – Managed environments often run in secured, and sometimes air-gapped, environments with no remote access.
- Offline – The orchestrator should be able to run all resources from a local repository, without dependency on remote resources from the internet.
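One common way to make a workflow step tolerant of high latency and unreliable links, as the list above calls for, is to retry it with exponential backoff. The following is a minimal, generic sketch of that idea and is not Cloudify's implementation; `run_with_retry` and its parameters are hypothetical names introduced only for illustration.

```python
import time


def run_with_retry(step, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Run `step`, retrying on ConnectionError with exponential backoff.

    Hypothetical helper for illustration; not part of any Cloudify API.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ... between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter is injectable so the backoff schedule can be inspected without actually waiting, for example by passing `delays.append` in place of `time.sleep`.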
Orchestration at Scale Using Cloudify
Cloudify has been designed to deal with scale by utilizing a server architecture that can handle many concurrent distributed executions. It achieves this by using RabbitMQ as the workflow execution engine and a Postgres cluster to manage deployment and execution state.
This worked well at the time, but as use cases became more demanding, the architecture had to improve so that scale would never be an issue. Cloudify’s original goal was to grow from handling thousands of deployments on a single manager to tens of thousands or even hundreds of thousands; as versions rolled out, we achieved far more than planned.
It’s true to say that with scalability, you’re only as strong as your weakest link. In order to improve scalability, we had to optimize every layer of the Cloudify server. This included the following:
- Active / Active manager cluster – Previous versions were based on an active/standby cluster, so scalability was limited to the capacity of a single manager, and adding more managers had diminishing returns because each one added synchronization overhead. With the new release, Cloudify moved to an active/active cluster in which each server takes a share of the overall load, so a single cluster can scale close to linearly as more manager instances are added.
- Active / Active MQ – Cloudify uses RabbitMQ as the message broker. Previous versions used an active/standby RabbitMQ cluster architecture, simply because earlier releases of active/active clustering were not mature. The new release moves the broker to an active/active cluster as well.
- Standard DB clustering based on Patroni – In previous versions, Cloudify used a proprietary Postgres clustering mechanism. With the new release, Cloudify moved to Patroni, a popular Postgres cluster management framework.
- External DBaaS – Most public clouds already provide a managed Postgres DBaaS. In this release Cloudify allows users to take advantage of the managed DBaaS on AWS and Azure. This significantly simplifies the operational management of a Cloudify cluster on public cloud.
- New cluster health based on Prometheus – The previous release of Cloudify used relatively simplistic cluster health monitoring. The new release uses Prometheus to monitor the state of each cluster service and now reports a more granular cluster health state.
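The difference between active/standby and active/active can be sketched as several managers all pulling work from one shared queue, rather than a single manager doing everything. The example below is only a toy model under stated assumptions: an in-memory `queue.Queue` stands in for the message broker, threads stand in for manager nodes, and `run_active_active` is a hypothetical name, not a Cloudify API.

```python
import queue
import threading
from collections import Counter


def run_active_active(tasks, manager_count):
    """Let every manager pull from one shared queue; return tasks handled per manager."""
    work = queue.Queue()  # stand-in for the message broker (assumption)
    for task in tasks:
        work.put(task)

    handled = Counter()
    lock = threading.Lock()

    def manager(name):
        while True:
            try:
                task = work.get_nowait()
            except queue.Empty:
                return  # queue drained, this manager is done
            task()  # execute the workflow (a plain callable in this sketch)
            with lock:
                handled[name] += 1

    threads = [
        threading.Thread(target=manager, args=(f"manager-{i}",))
        for i in range(manager_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return handled
```

Calling `run_active_active(tasks, 3)` returns a per-manager count whose values sum to `len(tasks)`. How the work splits between managers depends on thread scheduling, so the sketch says nothing about real throughput.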
With its latest release, Cloudify went through a series of performance optimizations, homing in on DB query handling and the management UI. These optimizations ensure linear scale even when there are already many concurrent deployments in the system. A summary of the results is provided below:
- Tested with over 2 million deployed nodes
- 1000 workflows per hour on a single box
- Over 5000 workflows per hour on a Cloudify cluster
Read more details on this benchmark here
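To put the two throughput figures above on a common footing, the short calculation below converts them to workflows per minute and compares them. It uses only the numbers quoted in the summary. Because the cluster figure is stated as "over 5000", the results for the cluster are lower bounds, and the size of the benchmark cluster is not given here, so no per-node figure is derived.

```python
SINGLE_BOX_PER_HOUR = 1000  # from the benchmark summary
CLUSTER_PER_HOUR = 5000     # "over 5000", so treat this as a lower bound


def per_minute(workflows_per_hour):
    """Convert an hourly workflow rate to a per-minute rate."""
    return workflows_per_hour / 60


speedup = CLUSTER_PER_HOUR / SINGLE_BOX_PER_HOUR

print(f"single box: {per_minute(SINGLE_BOX_PER_HOUR):.1f} workflows/min")
print(f"cluster:    at least {per_minute(CLUSTER_PER_HOUR):.1f} workflows/min")
print(f"cluster vs. single box: at least {speedup:.0f}x")
```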
With Cloudify 5 we broke up the monolithic, all-in-one manager architecture of previous versions into individual services: the manager, the MQ, and the DB. Each of these services can be maintained and scaled independently. To balance sizing against robustness, we support a 3-node cluster, in which all the services share the same compute nodes, as well as 6- and 9-node clusters, in which each service can run on its own dedicated compute cluster.
Infinite Scale with Cloudify Spire
Cloudify Spire enables users to scale across multiple clusters and sites. Because each site is completely independent and runs autonomously, total capacity is not capped by any single cluster: adding capacity becomes a matter of bringing up more clusters and spreading the load between them.
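Spreading load between independent clusters can be illustrated with a simple placement rule: send each new deployment to the cluster with the most free capacity, and add a cluster when none has room. This is a hedged sketch of the general idea only. `Cluster`, `place`, and the capacity numbers are hypothetical and do not describe how Spire actually schedules work.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical model of one independent cluster or site."""
    name: str
    capacity: int  # maximum deployments this cluster accepts
    deployments: list = field(default_factory=list)

    @property
    def free(self):
        return self.capacity - len(self.deployments)


def place(deployment_id, clusters):
    """Place a deployment on the cluster with the most free capacity."""
    target = max(clusters, key=lambda c: c.free)
    if target.free <= 0:
        raise RuntimeError("no capacity left; add another cluster")
    target.deployments.append(deployment_id)
    return target.name
```

For example, with `clusters = [Cluster("site-a", 2), Cluster("site-b", 1)]`, three calls to `place` fill both sites. A fourth call raises until another `Cluster` is appended to the list, which mirrors the idea that capacity grows by adding clusters.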
What’s Coming Next?
Our goal is to continue the current journey in a few ways:
- Significantly simplify cluster management and maintenance in public cloud by leveraging built-in managed cloud services such as storage and MQ services, in addition to the DBaaS (already supported).
- Complete the transition to a full cloud-native architecture and allow users to run Cloudify on a management Kubernetes cluster as non-privileged containers.
- SaaS – Provide a fully managed Cloudify manager cluster.
- Simplify ongoing cluster management and upgrade processes to allow more granular updates of individual components without disrupting the running manager.
The new architecture and scale are proving particularly useful for the following use cases:
- Manager consolidation – Leveraging the scale and performance improvements to consolidate multiple Cloudify managers into a shared cluster.
- Managing multi-cloud environments through a common platform – Managing many individual deployments across multiple clouds. In this case each deployment can be fairly small, but the scale is driven by the number of concurrent deployments and the number of users accessing the system, as is the case in most enterprise CI/CD environments.
- 5G and industrial IoT – Managing workloads across multiple environments, sites, and Kubernetes clusters.
- Multi-site edge/IoT with Azure vWAN – Cloud providers such as Azure provide rich IoT and machine learning environments. Managing such environments can be a fairly complex and distributed task. Cloudify integrates with these cloud services to simplify management and automation, as can be seen in the Azure vWAN use case.