Deployments at Scale – How to Scale to 1000+ Node Instances Painlessly

DevOps at Scale

Once upon a time, deploying a large scale application required a large IT team to provision and organize resources, for example to buy machines, set up hosting, guarantee network bandwidth, and so forth. All this before the actual application code can even be deployed to the machines, and the whole thing can be started up.
The cloud world, and the DevOps tools that have sprung up around it, have simplified this process significantly, although some of the fundamental issues surrounding large scale deployments have still not changed much to date.
Obviously, building an application that can handle 1,000 concurrent users is very different than building an application that can handle 100 million concurrent users, and with the growth of users and data on an exponential scale, applications these days, just can’t afford to not be able to meet the load.  It’s not just the raw computing and storage powers needed to handle such volumes, it’s the system’s inherent design and algorithms that must “scale” to meet the requirements, as well.

Cloudify 3.0 – deploy thousands of node instances without losing control.  Go

Application orchestration often requires intimate and continuous integration with the application in order for it to detect failure in sub-seconds and take corrective actions as needed. That often imposes a more complex scalability challenge.
Cloudify 2.x used a hierarchy of managers where each manager controls 100+ nodes as a scaling architecture. One of the things that we experienced over the last year is that there is a growing class of services such as  big data, and Network Function Virtualization (NFV) in which a single service can be comprised of hundreds and potentially even thousands of instances, and therefore the manager of managers approach didn’t fit well with those use cases.
To enable scaling of thousands of nodes per manager we decided to use a message broker approach based on AMQP, as well as, to separate the provisioning, logging and real-time monitoring tasks into separate services that can scale independently. This allows us to also control the level of intimacy between the Cloudify Manager and the application services. As an example we can tune Cloudify to handle the provisioning and logging only, and use the real-time information already gathered by the infrastructure. This is specifically important in the context of OpenStack as we expect that a large part of the metrics can be gathered through the infrastructure itself.
With Cloudify 3.0, deploying large scale apps was very much one of our core targets, when we chose to redesign our product.  We constantly faced the question, what does it really take to correctly deploy an application that requires 1000 compute instances?  And we found that the main points of this ultimate question were:

  • What does it take to provision the  machines correctly?
  • What does it take to install them correctly?
  • What does it take to monitor the deployment and lifecycle of such applications?

All of these are core elements that need to be resolved for any large system to work in the real world, and with Cloudify 3.0 we wanted to make as many of these issues as simple as possible (but no simpler than that).
With this in  mind, we rolled up our sleeves and got to work on Cloudify 3.0, constructing it out of simple yet very scalable components. Each of these can separately scale up to meet the requirements of a large deployment, and can be deployed in a manner that matches the performance requirements on such a scale.

  • Elasticsearch – Cloudify uses Elasticsearch both to store runtime information (it’s actually a little known fact that Elasticsearch can serve as a NoSQL database), as well as, in its more classic capacity to index system logs and events, making them available over API calls.
  • RabbitMQ – The poster child for scalable messaging systems, RabbitMQ has been used in countless large scale systems as the messaging backbone of any large cluster, and has proven itself time and again in many production settings.
  • GUnicorn/Flask – This lightweight web framework enables quick REST API development while enabling easy clustering, and the usage of multiple worker processes required to scale the REST API service to meet high volume requirements.
  • NGinx – This functions as our load balancer and traffic cop directing API and incoming requests to the correct component.

Scaling out agents

It is interesting to note that the development of the client-oriented components (i.e. the agent software that is optionally installed on machines running client software), was also impacted by the core components upon which Cloudify 3.0 was built.  By limiting the traffic from agent machines to the management servers to the use of RabbitMQ messages only, it was possible to use RabbitMQ’s many features to make sure the management servers do not get flooded by peaks of incoming data from agents. The ability to scale out RabbitMQ also ensures that we are able to handle a large number of agents without compromising the overall stability of management services.

Scaling log and event collection

Using Elasticsearch and logstash for the processing of system logs and events also turned out to profit from the RabbitMQ integration, as logs and events from agent machines are also relayed over RabbitMQ, again preventing Elasticsearch and logstash from being flooded by using RabbitMQ’s messaging facilities.

Scaling the provisioning process

In many use cases, the process of provisioning a large number of compute instances can also be a challenge, depending on the cloud you’re using. Some cloud infrastructures do not support the batch allocation of compute instances. Others have ‘finicky’ APIs that may require some requests to be retried. As a result you often times have to send out separate API requests for each compute instance (or otherwise use very small batches).
On top of all of this, orchestrating the creation of the correct number of instances, managing errors and retries, and finally managing the deallocation for these resources, is in itself a challenge. To handle this we again use RabbitMQ and the Celery Project framework, in conjunction with a self-developed workflow engine (which will be described in a separate post).
So what you get, at the end of the day through the combination of the tools and techniques described above is:

  1. Managed orchestrated allocation of resources,
  2. Orchestrated installation of client software, including the handling of dependent services, parallel installations (fork/join behaviors), etc.
  3. Monitoring the application lifecycle including log and event collection, and
  4. Orchestrated managed de-allocation of resources.

This makes the deployment of a large application that much easier, out of the box.
By delegating all of these tasks to Cloudify the application developer can focus on core functionality and leave the ‘plumbing’ to us.  Or as mentioned in our previous post – focus on what they love best – programming.


    Leave a Reply

    Your email address will not be published. Required fields are marked *

    Back to top