Deployments at Scale – How to Scale to 1000+ Node Instances Effortlessly

Once upon a time, large scale deployment required a large IT team to provision and organize resources.They were needed for example, to buy machines, set up hosting, guarantee network bandwidth, and so forth. Bearing in mind that this was all before the actual application code could even be deployed to the machines, and the whole thing could be started up.

The cloud world, and the DevOps tools that have come about, have significantly simplified large scale deployment. Though, some of the fundamental issues surrounding large scale deployments are yet to be solved.

Undoubtedly, building an application aimed to handle 1000 concurrent users is considerably different to building an application that can handle 100 million concurrent users. With users and data growing at an exponential rate, these days, applications just can’t afford to not meet the load. It’s not just the raw computing and storage powers that are required to handle such volumes, but also the system’s inherent design and algorithms that must “scale” to meet the requirements to manage application deployment.

Cloudify 3.0 – deploy thousands of node instances without losing control.  Go

Application orchestration and application deployment automation often calls for intimate and continuous integration with the application. This integration is needed in order for it to detect failure in sub-seconds and take corrective actions when necessary. This generally enforces a more complex scalability challenge. 

Cloudify 2.x uses a hierarchy of managers whereby each manager controls 100+ nodes as a scaling architecture. One thing we experienced over the last year is a growing class of services such as big data, and Network Function Virtualization (NFV) in which a single service can consist of hundreds and potentially even thousands of instances. This growth meant that manager of managers’ approach didn’t suit those use cases.

To enable scaling of thousands of nodes per manager we decided to use a message broker approach based on AMQP as well as separating the provisioning, logging and real-time monitoring of tasks into separate services that can scale independently. This allows us to also control the level of intimacy between the Cloudify Manager and the application services. For example, we can tune Cloudify to handle the provisioning and logging only, and use the real-time information already gathered by the infrastructure. This is specifically important in the context of OpenStack since we expect that a large part of the metrics can be gathered through the infrastructure itself.

With Cloudify 3.0, large scale deployment was very much one of our core targets, when we chose to redesign our product.  We constantly faced the question, what devops deployment strategy is really needed to deploy an application that requires 1000 compute instances?  And we found that the main points of this ultimate question were:

  • What does it take to provision the machines correctly?
  • What does it take to install them correctly?
  • What does it take to monitor the deployment and lifecycle of such applications?

All of these are core elements that need to be resolved for any large system to work in the real world, and with Cloudify 3.0 we wanted to make as many of these issues as simple as possible (but no simpler than that).

The components

With this in  mind, we rolled up our sleeves and got to work on Cloudify 3.0, constructing it out of simple yet very scalable components. Each use DevOps to scale up to meet the requirements of a large deployment, and can be deployed in a manner that matches the performance requirements on such a scale.

  • Elasticsearch – Cloudify uses Elasticsearch both to store runtime information (it’s actually a little known fact that Elasticsearch can serve as a NoSQL database), as well as, in its more classic capacity to index system logs and events, making them available over API calls.
  • RabbitMQ – The poster child for scalable messaging systems, RabbitMQ has been used in countless large scale systems as the messaging backbone of any large cluster, and has proven itself time and again in many production settings.
  • GUnicorn/Flask – This lightweight web framework enables quick REST API development while enabling easy clustering, and the usage of multiple worker processes required to scale the REST API service to meet high volume requirements.
  • NGinx – This functions as our load balancer and traffic cop directing API and incoming requests to the correct component.

Scaling out agents

It is interesting to note that the development of the client-oriented components (i.e. the agent software that is optionally installed on machines running client software), was also impacted by the core components upon which Cloudify 3.0 was built.  By limiting the traffic from agent machines to the management servers to the use of RabbitMQ messages only, it was possible to use RabbitMQ’s many features to make sure the management servers do not get flooded by peaks of incoming data from agents. The ability to scale out RabbitMQ also ensures that we are able to handle a large number of agents without compromising the overall stability of management services.

Scaling log and event collection

Using Elasticsearch and Logstash for the processing of system logs and events also turned out to profit from the RabbitMQ integration, as logs and events from agent machines are also relayed over RabbitMQ, again preventing Elasticsearch and logstash from being flooded by using RabbitMQ’s messaging facilities.

Scaling the provisioning process

In many use cases, the process of provisioning a large number of compute instances can also be a challenge, depending on the cloud you’re using. Some cloud infrastructures do not support the batch allocation of compute instances. Others have ‘finicky’ APIs that may require some requests to be retried. As a result you oftentimes have to send out separate API requests for each compute instance (or otherwise use very small batches).

Moreover, orchestrating the creation of the correct number of instances, managing errors and retries, and finally managing the deallocation for these resources, is in itself a challenge. To handle this we again use RabbitMQ and the Celery Project framework, in conjunction with a self-developed workflow engine (which will be described in a separate post).

So in summary what you receive, through the combination of the tools and techniques  and using devops to scale up is:

  1. Managed orchestrated allocation of resources,
  2. Orchestrated installation of client software, including the handling of dependent services, parallel installations (fork/join behaviors), etc.
  3. Monitoring the application lifecycle including log and event collection, and
  4. Orchestrated managed de-allocation of resources.

This makes the deployment of a large application that much easier.

By delegating all of these tasks to Cloudify the application developer can focus on core functionality and leave the ‘plumbing’ to us.  Or as mentioned in our previous post – focus on what they love best – programming.

comments

    Leave a Reply

    Your email address will not be published.

    Back to top