TOSCA, APM, SLA, KPIs, OMG…! All About Topology Driven Monitoring

Today, like pretty much everything else in the IT world, monitoring is siloed. It falls into two general types of tools: infrastructure monitoring and application performance monitoring (AKA APM).  For the most part these tools don’t interact, so it can be quite a challenge to get a holistic view of your entire application, from the infrastructure, through the middleware tier, up to the application level.
Just for context, I’ll give a brief overview of both types of monitoring.

Infrastructure monitoring covers technical assets such as hosts, networks, and storage, where you typically monitor availability and basic KPIs like memory consumption, disk space, network speed, and network connectivity.  These monitoring tools have existed for a very long time, and are largely owned by IT operations.  Their strength lies in providing a good green/red, up/down indication, but they don’t provide much in terms of application context. This is especially true if no manual labor has been invested in mapping the infrastructure to the application context.  What’s more, it is very difficult to know how your business is doing from those KPIs. A metric like high CPU utilization, for example, doesn’t really tell you whether your application is doing OK or not; it is just an absolute value that can have any number of causes under the hood.

Next, we have a second set of tools that monitor application performance, or APM.  These tools have evolved in different directions to cater to diverse and growing needs, and they are typically owned by application people.  The nice thing about them is that they measure the speed of user activity and business service activity, and then, depending on the tool, they can sometimes point to the area where a business transaction slows down (sometimes even to the line of code that slows it down).  These tools are there so that application support people can connect with application development people and achieve quick problem determination and remediation.

Although these tools give you the runtime flow of the business service, sometimes even across machines and technologies, they too have no knowledge of the application’s entire topology, nor of the application model.  What we found when designing Cloudify 3.0 is that the tools in both of these worlds are lacking a very important aspect.  They both only start to monitor your application after it is up and running, that is, after the application’s installation.

Since they are unable to monitor the installation of the application, they are also unable to tell whether the actual application topology meets the planned and required topology.  If you’re talking about a traditional data center, this is a negligible problem, because these environments are quite static: you really only set up an environment once, and it usually stays the way you initially set it up.  To overcome such challenges, you would generally only need to do some manual mapping, and for the most part you wouldn’t need to monitor changes in the deployment itself, only changes in performance.

But how about a more elastic deployment where everything is created by automation tools or APIs (clouds being the most prominent example)?  Such a deployment can change whenever a virtual machine crashes and is replaced by another machine, or whenever a scale-out process is triggered, whether automatically or manually.  In such an environment (virtualized and elastic), application topology becomes an ever-changing creature, and so it becomes pretty much mission-critical to understand how the topology changes, where these changes have stemmed from, and how they apply to the application model.  Basically, you need to know what each node’s function is in the topology model.
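To make the idea of comparing a planned topology with what is actually running a bit more concrete, here is a minimal Python sketch. It is not Cloudify code: the function name `topology_drift`, the role names, and the instance IDs are all made up for illustration, and it assumes you already know which role each running instance plays.

```python
from collections import Counter


def topology_drift(planned, observed):
    """Compare planned instance counts per role with what is actually running.

    planned  -- dict mapping a role name to the desired number of instances
    observed -- iterable of (instance_id, role) pairs seen in the environment
    Returns a dict of role -> (actual - planned) for every role that differs.
    """
    actual = Counter(role for _, role in observed)
    drift = {}
    for role in sorted(set(planned) | set(actual)):
        delta = actual.get(role, 0) - planned.get(role, 0)
        if delta != 0:
            drift[role] = delta
    return drift


# Hypothetical data: one web server has crashed, and an unplanned cache appeared.
planned = {"web_server": 2, "database": 1}
observed = [("vm-1", "web_server"), ("vm-7", "database"), ("vm-9", "cache")]
drift = topology_drift(planned, observed)
```

A negative number means instances are missing relative to the plan, and a positive number means there are more than the model calls for. Without the application model, there is nothing to compare the running machines against.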

This is yet another reason why your application setup and automation need to be handled by an application-centric orchestration tool that works from an application model standpoint.  Such a tool can create and automate the application knowing what the application is and how the topology should look, and it can also support monitoring of these installation and change processes.

That is exactly what we tried to accomplish with Cloudify 3.0.  Cloudify’s most fundamental concept is application topology.  Following the TOSCA standard, Cloudify blueprints model application components on all levels, starting from infrastructure (hosts, load balancers, networks, routers, storage volumes), going up the stack to middleware containers like databases, application servers, web servers, and message buses, and then up through the application level, modeling different application logic, deployments, application database schemas, etc.

In addition to this, Cloudify understands the relationships and dependencies between the different components: what’s connected to what, and what’s dependent on what.  Lastly, the blueprint also holds the SLA for each one of these components, i.e. the number of instances, where they should be located, availability zones, and so forth.

On top of this, each automation process is modeled as a workflow that interacts with the topology model, translating it into a series of commands to be executed.  Coming from this kind of thinking and modeling, it is very easy for Cloudify to give the user a full status of the application topology, whether it’s still being created, undergoing change, or just up and running.
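As a rough illustration of how a topology model with dependencies and instance counts can be turned into an ordered series of commands, here is a small Python sketch. It is an assumption-laden toy, not the Cloudify blueprint format or workflow engine: the `Node` class, the `install_plan` function, and the command strings are all hypothetical, and the ordering simply uses a topological sort from the standard library.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class Node:
    """One component in a toy topology model (hypothetical structure)."""
    name: str
    type: str
    instances: int = 1                      # desired count, a stand-in for the SLA
    depends_on: list = field(default_factory=list)


def install_plan(nodes):
    """Return install commands ordered so dependencies come first.

    Every name listed in depends_on must also appear as a node.
    """
    by_name = {n.name: n for n in nodes}
    graph = {n.name: set(n.depends_on) for n in nodes}
    order = TopologicalSorter(graph).static_order()
    return [f"install {name} x{by_name[name].instances}" for name in order]


# A three-tier example: the app needs both the host and the database.
blueprint = [
    Node("vm", "host", instances=2),
    Node("db", "database", depends_on=["vm"]),
    Node("app", "app_server", instances=2, depends_on=["vm", "db"]),
]
plan = install_plan(blueprint)
```

Because the order is derived from the model rather than written by hand, the same model can drive other workflows too, such as uninstalling in reverse order or scaling out a single tier.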

Cloudify does this by reporting fine-grained events for each command that installs, starts, or otherwise manipulates a component, and it even traces the success or failure of each command execution.  The result is a graphical topology with statuses, as well as an events feed in the UI, along with a progress panel for each and every component.
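The way an events feed can be folded into a per-component status might be sketched like this in Python. The event fields (`node`, `operation`, `outcome`) and the status names are invented for the example and are not Cloudify’s actual event schema.

```python
def component_status(events):
    """Fold a chronological list of events into one status per component.

    Each event is a dict with hypothetical keys:
      node      -- the component the command ran against
      operation -- e.g. "create", "configure", "start"
      outcome   -- "succeeded" or "failed"
    """
    status = {}
    for event in events:
        node = event["node"]
        if event["outcome"] == "failed":
            status[node] = "failed"
        elif event["operation"] == "start":
            status[node] = "started"
        else:
            status[node] = "in progress"
    return status


# Hypothetical feed: the database came up, the app failed during creation.
events = [
    {"node": "db", "operation": "create", "outcome": "succeeded"},
    {"node": "db", "operation": "start", "outcome": "succeeded"},
    {"node": "app", "operation": "create", "outcome": "failed"},
]
statuses = component_status(events)
```

Later events overwrite earlier ones, so a component that fails and is then retried successfully ends up with the status of its most recent command.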

But all of this is just the beginning. Cloudify’s next version, 3.1, will also analyze all of the metrics reported by the different monitoring tools, whether traditional ops tools or APM tools, and will put those metrics in the context of the application components and model.  It will also provide the option to customize policies that will trigger the type of scale and change/remediation processes mentioned in this post.  This will ultimately break down the silos between the existing monitoring tools and integrate them into the same tool chain, providing the entire picture of your application from infrastructure through the application itself.
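To give a feel for what putting metrics in the context of the model could look like, here is a hedged Python sketch. It does not describe how Cloudify 3.1 implements policies: the `evaluate_policy` function, the instance-to-node mapping, and the CPU threshold are all assumptions made for illustration.

```python
from statistics import mean


def evaluate_policy(samples, instance_to_node, threshold):
    """Return the model nodes whose average metric exceeds the threshold.

    samples          -- list of (instance_id, value) pairs from any monitoring tool
    instance_to_node -- dict mapping an instance ID to its node in the model
    threshold        -- value above which a node is flagged (e.g. for scale-out)
    """
    per_node = {}
    for instance, value in samples:
        node = instance_to_node.get(instance)
        if node is not None:  # ignore instances the model does not know about
            per_node.setdefault(node, []).append(value)
    return sorted(node for node, values in per_node.items() if mean(values) > threshold)


# Hypothetical CPU readings (percent) grouped by the component they belong to.
samples = [("vm-1", 92.0), ("vm-2", 88.0), ("vm-7", 35.0)]
instance_to_node = {"vm-1": "web_server", "vm-2": "web_server", "vm-7": "database"}
to_scale = evaluate_policy(samples, instance_to_node, threshold=80.0)
```

The point of the grouping step is that a raw reading from one machine says little on its own, while the average across every instance of a component tells you something about that tier of the application.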

