Agentless Auto-Healing Of VMs in Cloudify

One of Cloudify Manager’s distinctive features is the ability to perform automated day 2+ operational tasks. These typically boil down to analyzing an event stream produced by agents and reacting by running workflows. Auto-healing is one such task: a metric is identified as a trigger, and when it fails to be delivered to the manager for a period of time, the heal workflow is triggered. This post is about adding a capability to Cloudify that emulates the simpler liveness-probe style of functionality found elsewhere (e.g. Kubernetes). Note that this plugin currently works only on Linux systems.

User’s Guide

The agentless autohealing capability is packaged as a plugin. The plugin defines a single type, and a single relationship. The type, cloudify.nodes.Healer, mainly serves as a configuration node. The relationship does the heavy lifting. The idea is that the Healer node is connected to a compute node via the cloudify.relationships.healer_connected_to relationship.

 

 

The Healer has a few built-in kinds of “probes”:

  • ping – Simple ICMP liveness check to the IP of the connected compute node.
  • http – HTTP GET to a URL on the IP of the connected compute node.
  • port – TCP connection attempt to a port on the IP of the connected compute node.
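
To make the three probe kinds concrete, here is a minimal sketch of what each check amounts to. This is not the plugin's actual code: the function names, the timeout defaults, and the choice to treat any 2xx response as healthy are assumptions made for illustration, and the ping probe assumes the Linux ping command is available.

```python
import http.client
import socket
import subprocess
import urllib.request


def ping_probe(ip, timeout=2):
    """ICMP check: send one packet with the Linux ping command."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout), ip],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # The ping binary is missing or could not be executed.
        return False
    return result.returncode == 0


def http_probe(url, timeout=2):
    """HTTP GET: a 2xx response counts as alive (an assumed policy)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # URLError and HTTPError are subclasses of OSError.
        return False


def port_probe(ip, port, timeout=2):
    """TCP check: succeeds if something accepts a connection on the port."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Each function returns a plain boolean, so a caller can treat all three probe kinds the same way.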

Configuration

All probes are based on the same operational model. Given an IP address (determined by the target of the cloudify.relationships.healer_connected_to relationship), the plugin runs the probe at a specified frequency. When the probe fails the configured number of times (count), the plugin launches the heal workflow in Cloudify for the target compute node. The heal is handled by Cloudify, and the probe resumes. Note that since the probe has to trigger a workflow, it must have Cloudify credentials. These should be put in the secret store in the format "username,password,tenant_name".
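
The operational model above can be sketched in a few lines of Python. This is an illustration rather than the plugin's implementation: the names monitor, parse_credentials, and Credentials are hypothetical, the probe and heal actions are passed in as plain callables, and the sketch assumes that "fails count times" means consecutive failures, which the plugin may or may not do.

```python
import time
from typing import NamedTuple


class Credentials(NamedTuple):
    username: str
    password: str
    tenant_name: str


def parse_credentials(secret_value):
    """Split a "username,password,tenant_name" secret into its three fields."""
    parts = secret_value.split(",")
    if len(parts) != 3:
        raise ValueError("expected 'username,password,tenant_name'")
    return Credentials(*parts)


def monitor(probe, heal, frequency, count, sleep=time.sleep):
    """Run probe every `frequency` seconds; call heal after `count` failures.

    Assumption: failures must be consecutive, so one success resets the tally.
    """
    failures = 0
    while True:
        if probe():
            failures = 0
        else:
            failures += 1
            if failures >= count:
                heal()
                return  # the loop ends once the heal has been triggered
        sleep(frequency)
```

Note that a password containing a comma would not survive this simple split; that is a limitation of the sketch.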

The plugin.yaml file has a more detailed description of the configuration of each type of probe. The plugin has an “examples” directory with some simple OpenStack blueprints that illustrate the use of the different probe types.

Logging

During installation of the containing blueprint, the plugin starts a daemon on Cloudify Manager that runs the configured probe. The logging from that daemon cannot be pushed to a meaningful location in Cloudify Manager, because it isn’t associated with any execution. So currently, the log is placed in the /tmp directory with the filename pattern of healer_<deployment_id>_<pid>.log.
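
As a rough illustration of that naming pattern, the following sketch builds the log path and attaches a file handler to a logger. The helper names, the logger name, and the log format are assumptions; only the /tmp location and the healer_<deployment_id>_<pid>.log pattern come from the plugin's behavior described above.

```python
import logging
import os


def healer_log_path(deployment_id, pid=None, directory="/tmp"):
    """Build the path healer_<deployment_id>_<pid>.log in the given directory."""
    pid = os.getpid() if pid is None else pid
    return os.path.join(directory, "healer_{}_{}.log".format(deployment_id, pid))


def make_logger(deployment_id, directory="/tmp"):
    """Return a logger that writes to the per-daemon log file."""
    logger = logging.getLogger("healer.{}".format(deployment_id))
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(healer_log_path(deployment_id, directory=directory))
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Because the PID is part of the filename, each daemon spawned after a heal writes to a new file rather than appending to the old one.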

A Little Deeper Dive

The plugin operates by using the Cloudify programmable relationships feature to trigger the creation of what amounts to an agent (or daemon) of sorts on the manager. The only purpose of this daemon is to run the configured probe and trigger the heal. The daemon is implemented in the cloudify_healer/healer.py file. When the configured probe fails, healer.py runs the heal workflow and exits.

The heal workflow operates by tearing down and relinking relationships. When the compute node is restarted, the manager will relink the relationship from the Healer type to the compute node. The relationship establishment causes a new daemon to be spawned.

  cloudify.relationships.healer_connected_to:
    derived_from: cloudify.relationships.connected_to
    source_interfaces:
      cloudify.interfaces.relationship_lifecycle:
        establish:
          implementation: healer.cloudify_healer.launcher.launch
        unlink:
          implementation: healer.cloudify_healer.stopper.stop

Note the launcher.launch implementation, which is triggered during the initial install and after every heal. Also note the unlink implementation (stopper.stop). This is also triggered during a heal, but there it is basically a no-op, since by definition the heal caused the daemon to exit. Its purpose is to clean up the daemon during a normal deployment uninstall process.
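
The launch and stop behavior described above can be approximated with plain process management. The sketch below is not the plugin's launcher or stopper code: launch_daemon and stop_daemon are hypothetical names, and how the real plugin tracks the daemon's PID between operations is not shown here. It assumes a POSIX system, which matches the plugin's Linux-only scope.

```python
import os
import signal
import subprocess


def launch_daemon(command):
    """Start the probe process detached from the caller and return its PID."""
    process = subprocess.Popen(
        command,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # detach from the spawning session
    )
    return process.pid


def stop_daemon(pid):
    """Ask the daemon to terminate.

    Returns False if no such process exists, which is the expected case
    after a heal, because the daemon has already exited on its own.
    """
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return False
    return True
```

Treating a missing process as a normal outcome rather than an error is what makes the stop step safe to run both after a heal and during an ordinary uninstall.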

Future And Limitations

One glaring limitation of the current version is the simplicity: a single probe heals a single compute node. There is no support for healing groups, or multiple relationships triggering more complex probes. This is a possible future enhancement.

Another limitation is that you can only use three types of probes. Ideally, to fully mimic a typical script-based approach, the plugin would allow users to supply their own probes. This is on the short list.

Another limitation is scale. You don’t want thousands of these probes running on the manager. It is possible to imagine more full-featured, multi-threaded “probe servers”, perhaps sprinkled throughout large deployments. I think that’s probably a bridge too far, and if you need that kind of scale, you should probably use the modern agent-based, event-driven approach rather than a polling approach.

Conclusion

Even given current limitations, I think this plugin can be a useful part of the Cloudify problem-solving toolkit, especially for situations where agents are undesirable but auto-healing is desirable. When zooming out from the approach I used here, one might ask if it can be generalized. The answer is certainly yes. Auto-scaling is an obvious next step, although trickier than auto-healing. Zooming out further, what the plugin does is run workflows in response to the output of a function. Generalizing the plugin to that point would bring it closer to (and in some ways beyond) the policy model in Cloudify, which does allow arbitrary workflows to be executed automatically, but is limited to metrics. We’ll see where developments lead. The source is here. Comments welcome.


