- February 27, 2014
- Posted by: Uri Cohen
- Category: Uncategorized
Automation as We Know It Today
If you haven’t been hiding in a cave in the last year or two, you’ve probably heard the terms devops and infrastructure automation more than once… But even today, infrastructure automation is mostly focused on setup and deployment of complex systems. For example, if you’d like to deploy your application to the cloud, you would likely automate the steps of provisioning the cloud resources, installing the right components on top of these, and then orchestrating the startup of your components. Take even the simplest application that has a web server and database. After installing and configuring everything, you’d first need to ensure that the database is started, and only then the web server. You’d also need to propagate specific runtime information from the database to the web server, such as the database’s host and port. This stage, for the most part, is where most automation processes focus on today.
DevOps: Stage Two
Automation, however, can do much more for you. Let’s imagine, following your initial automation of the setup and deployment of our your application, that you want to make changes to your infrastructure or application. For example if you have a patch you’d like to install to the database or operating system, or if you want to update a piece of your code on your cluster of web servers. Both of these concerns involve pretty complex processes that you need to automate.
For an infrastructure upgrade, you would typically roll out the upgrade server by server, take down one server at a time, update it, then restart it. Even though it doesn’t happen very often, there’s still a lot of benefit in automating this process and minimizing the chances of human errors.
Cloudify – monitoring like a boss of your entire app lifecycle. Try it out. Go
Pushing new code is a bit different. For starters, you do it much more often, even a few times a day. Also, (no offence), it is likely that the stability and quality of your code is lower than that of an approved patch to the OS or database. So the risk of something going wrong if you don’t automate is even higher with this process.
The former derives its name from the mining days, when a canary bird was sent out into the mine before the miners, to check if the air was toxic or not. So in this fashion, a “canary instance” is chosen, the new code is pushed to that instance alone, and then a number of sanity tests are performed. If these pass, the code is then pushed to a few more instances. If all goes well again, the code is then sent to the rest of the infrastructure.
In the latter strategy (A/B), the code is sent to only a part of the servers. In an even more sophisticated approach, the code is only enabled to a portion of the users, for example, a certain geography or time of day. All of this ties into how to setup and manage an ongoing system.
But although this process, as well the infrastructure upgrade, can get quite complex, especially when dealing with large scale, live systems, in most organization code pushes and upgrades are still not automated. But there are more than a few companies that are blazing the trail for the rest of us.
The Plot Thickens – Monitoring Like a Boss
So once you finally saw the light, put in your sweat and tears and managed to automate your setup, upgrades and code pushes. What’s next? This where it really becomes interesting, and where you can really take your business to the next level, with the monitoring of the entire system, and eventually, what you do with the aggregated data.
But let’s first discuss what it means to monitor your application, and what kind of data you can extract from it.
At the most basic level, you want to know whether your application is available for your users. It sounds very basic and simple, but setting it up properly is actually not an easy task. It is intended to give you an answer to the most important question to your business: can my users use the application? Think of it as a big red or green light on your dashboard. Although the answer may seem obvious and simple, but it is not at all. There are all kinds of parameters to take into account when answering this question, for example, what if the system is apparently up and running but the response time is very slow, or what if only parts of the system are actually running. At the end of day, you want to have that kind of an alert “traffic light” that tells you whether your system is functioning – yes/no, up/down.
System, Application, and Business Metrics
The green-red traffic light is an aggregation of multiple levels of monitoring. The most basic one is system related. It contains indicators like process availability, CPU and memory utilization, etc.
The next level of monitoring is application related and is composed of application level KPIs (key performance indicators). KPIs can be anything from the average response time for a user for a certain request, to the number of concurrent database connections. In other words, anything that’s specific to the application and its architecture.
And then above these, comes the highest level of metrics, which are business metrics. Ultimately these are what’s really important. If your business metrics are ok, then you can assume everything is running fine. Sample business metrics include the rate of failure to register to your website, how many users did a certain operation in the website or how many users executed a specific transaction. If we want to be really specific, then take Google hangout for example, a business metric of theirs would be how many people in a certain timeframe joined or started a hangout.
Logs to the Rescue
Logs are often used here as means to collect your metrics and alert about erroneous conditions. Logs usually contain a lot of useful information, but you need to be able to extract it and make sense of it. This means gathering all of the logs emitted from all of your servers, parsing them and looking for specific patterns that will help you generate the KPIs over time. For example, you look for a log message that says “User X signed in”, and by counting these messages you can produce a business level metric for how many users register over time. If you add some more data to the log message, such as the user’s location, or operating system, or anything else for that matter, you can slice and dice data and get deeper insights into how your application is behaving. When logs are emitted in a structured and consistent way (e.g. in a JSON format), it becomes very easy to analyze them and produce the relevant KPIs. There are many tools that can help with that.
From Simple Automation to Orchestration
That’s where orchestration comes into the picture. At the most basic level, orchestration is a higher form of automation, which helps you setup all the pieces that are related to your application, starting from the infrastructure (VMs, networks, block storage volumes, security groups, etc.), to the platforms your app runs on (database, web server, etc.), and all the way up to the application modules and code. This entire setup is often referred to as a topology. The role of an orchestration framework is to materialize a certain topology. More advanced orchestrators go beyond materializing the topology, and change it to meet the current workloads and needs of the application.
As you probably gathered by now, monitoring and log gathering are an essential part of any running app, so they should be an integral part of the orchestrated application topology. Since the orchestration process is topology aware, it can wire and configure monitoring for your application components very easily, which is one of its greatest benefits. Going through these process without a global view of the topology can often be time consuming and error prone. Moreover, as the topology changes , you need to reconfigure and rewire your monitoring tools, and a good orchestrator will that for you as well.
Next Up: Reactive and Proactive Orchestration, The Devops Holy Grail
Hopefully by now you see the value of orchestration as a higher form of automation. In the next post I’ll dive deeper into the post-setup phase, and discuss how orchestration tools can react to events and monitoring data and adjust the application’s runtime topology to best fit the current workloads.