Workflow Management – How to Build a Customizable, Scalable and Multi-Tenant Workflow Engine

When architecting our workflow engine for Cloudify 3.0, our primary goal was to give our users a framework for writing custom workflows that integrate seamlessly with our orchestration framework. Given that goal, we knew our workflow engine had to be aligned with our blueprint object model.
It is part of our DNA to believe that the speed of innovation these days is often driven by how quickly new and exciting tools, and open source tools in particular, are adopted.  Open source projects have grown rapidly over the past few years; however, the process of exploring, testing, configuring, and ultimately integrating them into production is often still quite complex and time consuming.
After our initial research we chose to begin building our workflow engine with Ruote, a workflow engine written in Ruby that has its own workflow language called Radial. However, we quickly found the learning curve too steep, in terms of both ease of use and debugging. We then ran into issues with its ability to scale to the needs of our model for deployments with many node instances.  Our architecture requires the execution of many tasks in parallel on many nodes, and this simply wasn’t possible for us with Ruote. So while Ruote is a great tool, it didn’t suit our needs. Despite receiving a lot of help from the community, we found ourselves in danger of not meeting our deadlines and had to make the decision to discontinue our development with Ruote.
Python, generally speaking, is a very rich language with a huge selection of tools and modules. That said, while any number of the options in the Python world could potentially have suited our needs, we just didn’t have the time to go through the process of researching and adopting a new tool again.
This is when we had to make a decision.  We concluded that the smartest course of action was to write the minimal framework that would definitely suit our needs and meet the requirements of our product.  So we set out to work on our own custom workflow engine that would be able to scale and execute any number of tasks on any number of node instances in parallel.

This is when we decided to leverage Celery even further.  Celery was already part of our stack, where it is used as a task broker for ops purposes, and had proven itself there; what’s more, it is a tool that has been proven in production time and again in many projects.  So it was really a no-brainer to move ahead with development on top of this excellent tool.  This decision also meant fewer moving parts and less of a learning curve for our users, which simplifies the experience significantly.
The way we use Celery within our stack is basically as the broker for executing tasks.  Workflow executions are pushed to a queue and are consumed by a Celery worker dedicated to these workflows.  This is done per deployment, which keeps each tenant's workflows isolated and allows as many tasks as necessary to run in parallel.  On top of this, Celery itself is highly scalable by design, and it is very easy to add more workers on multiple machines.  As a result, the entire architecture of the management layer is extremely modular.
Once a workflow task is executing, the API available to it is based on the blueprint model.  All the methods, contexts and data provided to you when you write a workflow map easily onto the existing blueprint model.

When we talked about why we chose to build our own workflow engine, another major driving factor besides the time constraints was being able to write the engine exactly as we wanted it to be.  As a result, workflows can be written much like Cloudify plugins, so anyone who writes plugins can move on to writing workflows rather easily.  This makes it pretty simple to write complex workflows that map to this specific model, essentially sparing you the boilerplate.
Just as with plugins, where everything maps to a context object that serves as the main API integration point, custom workflows allow you to access everything from the blueprint, including all its services.  We built pretty much the exact same thing for workflows, so it’s easily readable and looks the same to the user.  There is a workflow context object offering context data and services, just like the one we have for our plugins.
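To make the idea of a workflow context object concrete, here is a small self-contained sketch. It is not the Cloudify API: the classes WorkflowContext, Node and NodeInstance, and the methods send_event and execute_operation, are hypothetical stand-ins that only mimic the shape of a context mapping onto blueprint nodes and their instances.

```python
from dataclasses import dataclass, field


@dataclass
class NodeInstance:
    """Hypothetical runtime instance of a blueprint node."""
    id: str
    executed: list = field(default_factory=list)

    def execute_operation(self, operation):
        # A real engine would dispatch a task; here we only record the call.
        self.executed.append(operation)
        return f"{self.id}:{operation}"


@dataclass
class Node:
    """Hypothetical blueprint node holding its instances."""
    id: str
    instances: list


@dataclass
class WorkflowContext:
    """Hypothetical context object: blueprint data plus simple services."""
    deployment_id: str
    nodes: list
    events: list = field(default_factory=list)

    def send_event(self, message):
        self.events.append(message)


def start_all(ctx):
    """Example workflow: run a 'start' operation on every node instance."""
    results = []
    for node in ctx.nodes:
        for instance in node.instances:
            ctx.send_event(f"starting {instance.id}")
            results.append(instance.execute_operation("start"))
    return results


ctx = WorkflowContext(
    deployment_id="demo",
    nodes=[
        Node("web", [NodeInstance("web_1"), NodeInstance("web_2")]),
        Node("db", [NodeInstance("db_1")]),
    ],
)
results = start_all(ctx)
```

The workflow function receives a single context argument and reaches everything else through it, which is the same pattern a plugin author would already be used to.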
On top of this, you can use an advanced task graph framework that helps you schedule tasks and define the dependencies between them.  This enables the use of the exact same standard API for more advanced scenarios.  Besides simplifying task scheduling, it also gives you basic support for various aspects of workflows, such as cancellation, which you would otherwise have to implement yourself.
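The idea of a task graph with dependencies and cancel support can be sketched with the standard library alone. This is an illustration under stated assumptions, not Cloudify's task graph framework: the TaskGraph class and its add_task, cancel and execute methods are hypothetical, and the sketch relies on graphlib, which requires Python 3.9 or later.

```python
import threading
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


class TaskGraph:
    """Hypothetical task graph: runs tasks in dependency order, in parallel
    where possible, and stops scheduling new tasks once cancelled."""

    def __init__(self):
        self._tasks = {}
        self._deps = {}
        self._cancelled = threading.Event()

    def add_task(self, name, func, depends_on=()):
        self._tasks[name] = func
        self._deps[name] = set(depends_on)

    def cancel(self):
        self._cancelled.set()

    def execute(self, max_workers=4):
        sorter = TopologicalSorter(self._deps)
        sorter.prepare()
        finished = []
        running = {}  # future -> task name
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            while sorter.is_active():
                if not self._cancelled.is_set():
                    # Submit every task whose dependencies are all done.
                    for name in sorter.get_ready():
                        running[pool.submit(self._tasks[name])] = name
                if not running:
                    break  # cancelled and nothing left in flight
                done, _ = wait(running, return_when=FIRST_COMPLETED)
                for future in done:
                    name = running.pop(future)
                    future.result()  # re-raise any error from the task
                    finished.append(name)
                    sorter.done(name)
        return finished


order = []
order_lock = threading.Lock()


def step(name):
    # Build a trivial task that records when it ran.
    def run():
        with order_lock:
            order.append(name)
    return run


graph = TaskGraph()
graph.add_task("create_vm", step("create_vm"))
graph.add_task("install_db", step("install_db"), depends_on=["create_vm"])
graph.add_task("install_app", step("install_app"), depends_on=["create_vm"])
graph.add_task("start_app", step("start_app"), depends_on=["install_db", "install_app"])
finished = graph.execute()
```

Here install_db and install_app may run in parallel once create_vm completes, while start_app waits for both. Calling cancel() lets tasks already in flight finish but prevents any further tasks from being scheduled.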
When we talk about TOSCA in general, and about having a templating standard, this is something we need to make sure holds true across our entire product.  We wanted to provide an easy and seamless user experience with the smallest learning curve possible, basically minimizing the friction in the adoption of our open source tool.

About the Authors

Dan & Ran are a software engineering powerhouse on the Cloudify team at GigaSpaces.  When they’re not busy bringing you Cloudify goodness, they’re heads down working on awesome blog posts like these.  Catch them on GitHub – dankilman | ran-z.

