Cloud Native Transformation and Multi Cluster Management
Adoption of cloud computing has been growing at a fast clip, and public cloud infrastructure (IaaS) spending has crossed the $150B mark. On this infrastructure run the familiar cloud platforms, dominated by Amazon AWS and Microsoft Azure. On the application front, the cloud native movement has been rising, adding a layer of abstraction on top of the virtual machines traditionally dispensed by the cloud providers. The cloud native platform of choice is Kubernetes, and the major cloud providers offer Kubernetes clusters on demand (e.g. AKS, EKS, GKE). In addition to these developments, the use of multiple cloud platforms has become commonplace in the industry. This article discusses the foundational technologies behind cloud computing, containers, Kubernetes, the meaning of cloud native, and the challenges related to delivering an operating application in a multi-cloud, multi-Kubernetes-cluster environment.
The journey to cloud computing came with the advent of the virtual machine (VM). VMs are essentially computer hardware emulated in software. The concept of a virtual machine is almost as old as computing, and early versions of it began appearing in the 1960s. The compute power of the time was limited, and VMs were really only practical on mainframe and minicomputers. As the compute and memory capacity of computers grew over the years, especially microcomputers, the capacity growth greatly outpaced the needs of typical software applications and VMs became practical. A single machine could masquerade as multiple machines, and maximize its utilization (cost efficiency). Because they emulate hardware, and are supported by chipsets, virtual machines provide excellent isolation when sharing a compute host.
If VMs emulate machines, containers just slice one up. “Container” is a name for a collection of technologies that can provide isolation for ordinary processes. An early form of this technology was provided by the chroot system call, which set the effective file system root for a process and its children. This provided a ‘jail’, limiting the damage the process could do if it had access to the real ‘/’ filesystem. Containers use many modern operating system kernel features to provide sophisticated file systems, isolation (various kinds of namespaces), and resource consumption limits (cgroups). A container is an environment for running ordinary processes on a compute host (including VMs), and generally has access to shared system libraries and sensitive areas on the host filesystem. As such, containers are less secure than VMs.
Cloud Native Definition
The term “cloud native” entered the mainstream with the founding of the Cloud Native Computing Foundation (CNCF) in 2015. It refers to a technological and cultural shift in the way applications are designed, deployed and managed, with the dynamism of cloud computing as an inspiration. What is a cloud native application? A cloud native application takes a service oriented approach, where applications are composed from independently operated services (microservices), and scalability and high availability are fundamental features. The idea (and tradeoffs) behind microservices is very similar to Service Oriented Architecture, which was bound up with the web services craze of bygone years. The benefit is decoupling, and the costs are performance and complexity.
The granularity of a cloud native application is not a one size fits all proposition. Service decomposition directly serves the needs of agile teams to deliver frequently with minimum disruption. Updating a single service in a cloud native app is the equivalent of changing a tire on a running car. An adopter of cloud native principles can expect higher service availability and agility, as services can be added, removed, and repaired on demand. Microservices also offer more flexibility when it comes to fault tolerance and scaling. But these advantages come with a cost in terms of complexity and performance, and must be considered carefully on a case by case basis. For a deeper dive into cloud native concepts, see one of the definitive posts by Martin Fowler.
Kubernetes is a platform for managing cloud native applications. Kubernetes is written in Go and was open-sourced in 2014 by Google. Since then it has become one of the most popular open source projects on the internet, and really has no significant competition. Kubernetes is organized in a hub/spoke architecture, using a control loop operational model. Work is done by multiple worker compute hosts at the behest of controllers running on the master. Managed entities (like containers) are described in manifests which are submitted to the master node, which triggers controllers that asynchronously fulfill the requirements of the manifest.
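To make the control loop model concrete, here is a minimal Python sketch of the idea: submitting a manifest only records desired state, and a controller then repeatedly compares desired state with observed state and closes the gap. The `Cluster` class and the function names are illustrative assumptions, not part of the Kubernetes API, and real controllers are event driven and far more involved.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy stand-in for cluster state (hypothetical, not a Kubernetes type)."""
    desired: dict[str, int] = field(default_factory=dict)  # manifest: name -> replicas
    running: dict[str, int] = field(default_factory=dict)  # observed: name -> replicas


def submit_manifest(cluster: Cluster, name: str, replicas: int) -> None:
    """Record the desired state only; no work is done at submission time."""
    cluster.desired[name] = replicas


def reconcile_once(cluster: Cluster) -> list[str]:
    """One pass of the loop: move each workload one step toward its desired count."""
    actions = []
    for name, want in cluster.desired.items():
        have = cluster.running.get(name, 0)
        if have < want:
            cluster.running[name] = have + 1
            actions.append(f"start {name}")
        elif have > want:
            cluster.running[name] = have - 1
            actions.append(f"stop {name}")
    return actions


def reconcile_until_stable(cluster: Cluster, max_passes: int = 100) -> int:
    """Repeat passes until a pass finds nothing to do; return the passes that did work."""
    passes = 0
    while passes < max_passes and reconcile_once(cluster):
        passes += 1
    return passes
```

The same loop that performs the initial rollout also repairs failures: if two of three replicas disappear, the next passes simply start two more, with no separate recovery code path.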
To address the popularity of Kubernetes, and the complexities of Kubernetes cluster management, several companies offer hosted Kubernetes as a service, such as Amazon EKS, Google GKE, Azure AKS, Red Hat ROSA, and others. Kubernetes, which includes a cloud controller manager, is now positioned as a cloud neutral platform for applications on public (and private) clouds, as well as edge computing environments. Applications can be written for Kubernetes (cloud native) and the portability aspects are delegated to the infrastructure layer.
The recent growth of multi-cloud computing and the emergence of edge computing extend to the cloud native approach as well. In this approach, workloads can be allocated to different clusters, much as a multi-cloud deployment can target multiple clouds. A benefit of the cloud native approach targeting Kubernetes is that the cloud differences can be dealt with at the platform level. This is particularly attractive when edge computing is involved. Since Kubernetes is just as happy running without a hypervisor (bare metal) as running in the cloud, the same platform can be scaled from lightweight edge, to massive VMs in the cloud without special provisions in the application stack. This is precisely the value proposition of VMware Tanzu, which layers a standard Kubernetes distribution on top of ESXi.
Multi Cluster Use Cases
It’s important to understand that there isn’t a direct relationship between the number of clusters needed and the number of applications hosted. On Kubernetes, multiple applications can be deployed into isolated namespaces, providing the application with a kind of “virtual private cluster” experience. In other words, the need for workload isolation doesn’t necessarily require adding clusters. The need for multiple clusters can be grouped into a few categories.
Any company delivering services on Kubernetes will by necessity have multiple Kubernetes clusters. Advanced configurations will have automation to assist in software delivery across clusters designated for development, testing, potentially UAT, and production. Production itself may employ multiple clusters to enable canary rollouts and other reversible deployment schemes. The ability to automate the software delivery process is essential for establishing a repeatable, high quality process and delivering maximum value to customers as quickly as possible. An approach growing in popularity is GitOps, which integrates delivery with source code release processes. Multi-cluster configurations can be managed and versioned centrally in Git, and used to produce repeatable deployments.
Operationally, high availability is perhaps the most obvious application for multiple clusters. Kubernetes workloads, properly deployed, are highly available by their nature, because of the closed loop approach Kubernetes takes to workload management. The “high availability” referred to here is the availability of entire clusters: the ability to place clusters in different “availability zones”, or fault independent locations. Doing so naturally requires a highly available means of load balancing traffic across the clusters, as well as assessing their health on an ongoing basis.
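As a rough illustration of health-aware load balancing across clusters, the Python sketch below rotates through a list of clusters and skips any that fail a health probe. The `ClusterBalancer` class, the zone names, and the probe callable are hypothetical; a production setup would more likely rely on a global load balancer or DNS-based failover than on application code.

```python
from typing import Callable


class ClusterBalancer:
    """Round-robin over clusters, skipping those whose health probe fails."""

    def __init__(self, clusters: list[str], probe: Callable[[str], bool]) -> None:
        self._clusters = list(clusters)
        self._probe = probe      # returns True when the named cluster is healthy
        self._next = 0           # index at which the next search starts

    def pick(self) -> str:
        """Return the next healthy cluster, or raise if none is available."""
        count = len(self._clusters)
        for offset in range(count):
            index = (self._next + offset) % count
            name = self._clusters[index]
            if self._probe(name):
                self._next = (index + 1) % count
                return name
        raise RuntimeError("no healthy cluster available")


# Usage: zone-b is down, so traffic alternates between zone-a and zone-c.
down = {"zone-b"}
balancer = ClusterBalancer(["zone-a", "zone-b", "zone-c"], lambda name: name not in down)
picks = [balancer.pick() for _ in range(3)]
```

Each call tries every cluster at most once, so a total outage surfaces as an error instead of an endless retry loop.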
Software as a Service (SaaS) companies that expand beyond a small scale require flexibility of workload placement based on security, regulations, and performance. The ability to easily migrate or establish new operations between the major Kubernetes cloud providers and/or datacenters is a must to respond to market demands. Other considerations are high availability and dynamic scaling, which are covered elsewhere on this list.
This category encompasses platform awareness/sensitivity and location sensitivity. Examples include:
- Edge computing, 5G, CPE (and others) that need workload proximity to service consumers, generally based on latency requirements. Besides latency, there is the need for service continuity at the edge, regardless of the health of upstream services.
- The ability to scale workloads beyond a specific infrastructure provider.
- Provider feature sensitivity for workloads better suited to specific providers (e.g. security features).
Multi Cluster Management
A Kubernetes cluster can be a complex beast to manage, especially at scale. And any organization using Kubernetes will likely need to manage multiple clusters, possibly including development, testing, staging, and multiple production clusters. The usual tasks frequently associated with operating system deployments are needed, such as location-aware installation, upgrades, patching, and security configuration. Besides the clusters themselves, the cloud native applications need to be deployed and managed, a task made more complicated when deploying on multiple clusters. As with most highly complex systems, a layer of automation is required to orchestrate complex workloads in a way that avoids chaos, as well as to manage day to day operations.
Controlling ingress and egress network policies, both inter-cluster (if needed) and per cluster. Managing role based authentication. The coordination and updating of security configurations is one of the primary challenges of managing multiple clusters (or managing any complex system). A standardized, policy based approach is the primary tool for managing the potential combinatorial explosion of systems x users x roles x permissions.
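A small Python sketch can show why a policy based approach tames this growth: roles and user bindings are defined once, and per-cluster bindings are generated from that single definition instead of being maintained by hand. The policy layout, the user names, and the verbs are assumptions made for illustration and do not correspond to any particular tool's schema.

```python
from itertools import product

# One central policy (hypothetical layout): roles map to permitted verbs,
# and each user is bound to exactly one role.
POLICY = {
    "roles": {
        "viewer": {"get", "list"},
        "operator": {"get", "list", "update"},
    },
    "bindings": {"alice": "operator", "bob": "viewer"},
}


def render_bindings(policy: dict, clusters: list[str]) -> list[tuple[str, str, str]]:
    """Expand the central policy into (cluster, user, role) bindings for every cluster."""
    users = sorted(policy["bindings"].items())
    return [(cluster, user, role) for cluster, (user, role) in product(clusters, users)]


def is_allowed(policy: dict, user: str, verb: str) -> bool:
    """Check a verb against the role the user is bound to; unknown users are denied."""
    role = policy["bindings"].get(user)
    return role is not None and verb in policy["roles"][role]
```

Adding a cluster then means regenerating bindings from the same policy, so the number of objects a person has to edit stays proportional to roles and users, not to their product with clusters.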
Scaling can be handled by the Kubernetes cluster autoscaler in many cases on a per cluster basis, assuming a supported cloud based hosting environment. The need to autoscale clusters on bare metal is a possible but less common requirement, and requires sophisticated external orchestration. Responsive multi-cluster automation opens the possibility of adding entire new cluster instances based on demand or rules/policies, along with decommissioning based on load or other criteria.
Cluster selection by workload requirements, which might include latency, CPU requirements, physical location, or other functional and business considerations. This is a common requirement in edge scenarios, but can arise in many other common situations including selecting the cluster with the lowest current load. A key enabler of dynamic workload placement is the ability to deal with clusters as abstractions for workloads. This enables workloads to be fungible between different Kubernetes clusters, and so lets an orchestrator bind to individual clusters at deploy time. For example, a single workload can be deployed to multiple clusters in a single operation, or moved between clusters when conditions dictate.
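The placement logic described above can be sketched in a few lines of Python: filter clusters by hard requirements (CPU, latency, location), then choose the least loaded of the survivors. The `ClusterInfo` and `Requirements` types and their fields are hypothetical, chosen only to mirror the criteria named in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClusterInfo:
    name: str
    region: str
    latency_ms: float   # latency from the service consumers to this cluster
    free_cpus: int
    load: float         # 0.0 (idle) to 1.0 (saturated)


@dataclass(frozen=True)
class Requirements:
    cpus: int
    max_latency_ms: float
    region: Optional[str] = None   # None means any location is acceptable


def place(workload: Requirements, clusters: list[ClusterInfo]) -> Optional[ClusterInfo]:
    """Return the least loaded cluster meeting every requirement, or None if none does."""
    candidates = [
        c for c in clusters
        if c.free_cpus >= workload.cpus
        and c.latency_ms <= workload.max_latency_ms
        and (workload.region is None or c.region == workload.region)
    ]
    return min(candidates, key=lambda c: c.load, default=None)


# Usage with three made-up clusters.
fleet = [
    ClusterInfo("edge-1", "eu", latency_ms=5, free_cpus=4, load=0.7),
    ClusterInfo("edge-2", "eu", latency_ms=8, free_cpus=8, load=0.3),
    ClusterInfo("core", "us", latency_ms=60, free_cpus=64, load=0.1),
]
chosen = place(Requirements(cpus=2, max_latency_ms=10), fleet)
```

Because the workload names only requirements and never a cluster, the binding to a concrete cluster happens at deploy time, which is the late binding the paragraph describes.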
The coordination of intercluster networking as the result of cluster deployments or scaling and migration activities. This could involve configurations of ingress and egress, or the construction and management of NFV networks to facilitate communication between workloads. An example is provisioning the connection between edge workloads and centralized management/visibility. This doesn’t apply only to green field, pure Kubernetes applications, but also more common hybrid applications. In such applications networking between containerized workloads in Kubernetes and legacy systems must be configured in a secure and standardized way.
High level visibility into overall multi-cluster health, with the ability to drill down, discover causes, and effect remediation. Kubernetes’ closed loop approach to resource management has within it an effective approach to handling and recovering from many routine operational problems (i.e. auto-healing). However, error recovery is only as competent as the workload support for it, and still requires oversight. The adding of multiple clusters only increases this requirement. Root cause analysis on a widely distributed system is very challenging, especially ones with ephemeral services and based on asynchronous message passing. There is no magic bullet for this problem, but tools like service meshes and log analysis can help.
An emerging class of multi-cluster management tools is addressing the need to manage multiple clusters. These tools attempt to provide visibility and control of clusters and the cloud native applications that run on them. They can be grouped into two classes: tools provided by cloud vendors to manage the “cluster as a service” instances that they offer, and tools that aren’t limited to a single vendor. Some examples:
Amazon’s Elastic Kubernetes Service provides a management dashboard that presents a list of configured clusters that you can drill down into. From this top level pane, you can also create and destroy clusters. Upon drilling down you are presented with a Kubernetes Dashboard that includes cluster overview metrics, including CPU, memory, and storage consumption, along with workload statistics. From here you can drill further down into individual cluster nodes. All clusters are created on AWS using the EKS service.
Google GKE, like EKS, presents a list of clusters from where you can drill into the details. GKE provides multi-cluster management via its MCS offering. From the top level you can add, modify, upgrade, and delete clusters. When drilling down, you are presented with a Google Kubernetes dashboard from where you can see and manage workloads, services, configuration, and storage.
MCS extends the Kubernetes service concept so it can span multiple clusters, providing a native way to describe multi-cluster services. To support access to multi-cluster services, it also provides a multi-cluster ingress. Cluster health and workload monitoring is provided by integrated dashboards. Like EKS, the management service (MCS) applies to GKE clusters only.
Microsoft Azure’s entry into the Kubernetes arena is called Azure Kubernetes Service, and provides features very similar to the other big providers. For multi-cluster management, Azure offers Arc, which provides multi-cluster Kubernetes management beyond the Azure cloud. A prerequisite for this capability is the use of Azure Arc Enabled Kubernetes, which is to say a CNCF supported Kubernetes cluster running an Arc agent.
The dashboard is integrated with the Azure console, and displays resources at multiple levels, and displays configuration parameters. Cluster and workload monitoring is integrated with Azure Monitor.
OpenShift is a self-managed Kubernetes platform from Red Hat derived from the OKD open source Kubernetes distribution. It extends the Kubernetes project with several other projects to enhance enterprise readiness. Red Hat also offers managed OpenShift deployments on Azure, AWS, and IBM Cloud. Multi-cluster management is provided by the Advanced Cluster Management tool. The tool lets users deploy workloads, apply security policies, and visualize operations across multiple Kubernetes clusters.
The dashboard provides a high level overview of health and usage, and enables users to dive into details from there. Clusters can be deployed on public and private clouds.
Rancher is an open source software product focused on managing Kubernetes clusters. Rancher itself has no Kubernetes as a service offering. Instead Rancher offers automated Kubernetes installation and management on arbitrary compute nodes (Linux and Windows) and via managed service providers like EKS and GKE. Rancher provides multi-cluster management via a series of dashboards that span on-premises and cloud providers, offers workload deployment, and can unify RBAC across all. Rancher also offers a SaaS version of its product.
Rancher is a pioneer in Kubernetes cluster management and puts more emphasis on management and visibility of clusters themselves, and less emphasis on workload management across clusters. A reasonable metaphor is that it is the equivalent of a provider neutral cloud management platform (CMP) for Kubernetes.
Like Rancher, Cloudify is an open source multi-cluster management tool. Unlike Rancher, Cloudify doesn’t limit its management and orchestration capabilities to Kubernetes. Cloudify supports the major public cloud Kubernetes as a service providers like GKE, EKS and AKS, along with Helm 3, Kubespray, and OpenShift support for home grown clusters. By virtue of its unopinionated orchestration pedigree, it can manage multiple Kubernetes clusters, as well as a mixture of Kubernetes and non-Kubernetes services. Cloudify supports multi-cluster management by CLI and REST API, as well as a non-Kubernetes specific dashboard where Kubernetes clusters (and other managed entities) are presented in a drillable deployment list.
Cloudify provides multi-tenant and role based authentication capabilities, as well as support for day 2 operations across all managed services (Kubernetes and otherwise), including cascading and recurring workflows, state synchronization, and policy based workload placement. It also supports workload portability across Kubernetes platforms, enabling late binding of target environments.
Ready to see Cloud Native Transformation in action?
Download Cloudify now!
Cloud Native Migration Challenges
Moving from monolithic application architectures to cloud native architectures can be daunting. Cloud native doesn’t mean package your monolith in a container, although many claiming the cloud native title do just that.
Cloud native architecture can be thought of as a three legged stool, each leg representing an operational goal:
- High Availability: the application should be built to tolerate failure. Elements of a system should be able to fail (or be shut down, or moved) randomly without causing system downtime. Many cloud migration challenges are a consequence of this goal.
- Live Upgrades: the application should be able to be upgraded, downgraded, and patched without downtime. The ability to deliver customer value efficiently is a primary goal of cloud native architecture.
- Business/Product Orientation: the application and related teams should be organized around business deliverables, not technology layers. The idea is that customer deliverables are correlated with the runtime services that deliver them, as well as the teams that provide them. This can be a challenge for organizations that organize teams based on technical expertise (database, UI, network, server, etc.).
For Kubernetes, the answer to support these goals is containers and a container management platform. In the abstract, containers are not required for a cloud native architecture, but Kubernetes is opinionated regarding containers, largely because of the agility they provide (extremely low overhead in both speed and memory domains).
Cloud native Kubernetes architecture organizes applications into loosely coupled services (the somewhat misnamed “microservices”), each with an independent life cycle. The granularity of these services is bespoke, based on the best organization to support the cloud native goals above. In general, the fewest number of services that are adequate to support the goals is ideal. Service proliferation is a source of great complexity (note the rise of the service mesh), and inter-service communication is slow, involving marshaling and network communication.
The typical migration process involves pursuing all three goals above in parallel, along with the learning curve associated with Kubernetes itself, which is significant even if you are outsourcing the management of your clusters. As adopting cloud native ultimately involves adjustments across the organization (imagine delivering software daily, or continuously), it can be daunting to consider.
A popular initial strategy to get various teams’ feet wet is to identify a candidate service in the monolith for cloud native implementation. Ideally this is a library or grouping of functionality that fits well behind a simple interface. Core logic for the service need not necessarily change (although it may), but the packaging will. You’ll need to define an API (probably a REST API) for the service, and make sure it can be relocated and restarted with full recovery. Then you’ll need to write a client SDK. The monolith can then be altered to use the service. This exercise will build confidence and skills for the complete transformation.
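The extraction steps above can be sketched in Python. The monolith is changed to depend on an interface; the existing logic sits behind a request handler; and a client SDK implements the same interface by sending requests over a transport. Everything here (the pricing example, `PriceService`, `PricingClient`, the JSON shape) is hypothetical, and the transport is an injected callable standing in for the HTTP call a real REST client would make.

```python
import json
from typing import Callable, Protocol


class PriceService(Protocol):
    """The simple interface the candidate functionality fits behind."""
    def quote(self, sku: str, qty: int) -> float: ...


class InProcessPricing:
    """Core logic as it exists in the monolith today; unchanged by the move."""
    def __init__(self, prices: dict[str, float]) -> None:
        self._prices = prices

    def quote(self, sku: str, qty: int) -> float:
        return self._prices[sku] * qty


def handle_request(service: PriceService, body: str) -> str:
    """Service side: decode a JSON request, call the core logic, encode the reply."""
    request = json.loads(body)
    return json.dumps({"total": service.quote(request["sku"], request["qty"])})


class PricingClient:
    """Client SDK: same interface, but every call goes over the transport."""
    def __init__(self, send: Callable[[str], str]) -> None:
        self._send = send   # stand-in for an HTTP POST to the new service

    def quote(self, sku: str, qty: int) -> float:
        reply = self._send(json.dumps({"sku": sku, "qty": qty}))
        return json.loads(reply)["total"]


def checkout(pricing: PriceService, cart: dict[str, int]) -> float:
    """Monolith code: depends only on the interface, not on where pricing runs."""
    return sum(pricing.quote(sku, qty) for sku, qty in cart.items())


# Usage: the same checkout code works before and after the extraction.
local = InProcessPricing({"a": 2.0, "b": 0.5})
remote = PricingClient(lambda body: handle_request(local, body))
```

Because `checkout` accepts either implementation, the switch from in-process to remote can be made, and reverted, without touching the calling code.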
General purpose tools like Cloudify can aid here, making it possible to incrementally migrate capabilities. This is made possible by the nature of a tool that can automate current tech stacks under the same umbrella as Kubernetes, and facilitate them working together. Perhaps the web tier goes first, as it is often easiest to make the cloud native transition. Later, other services can be added until the entire application is cloud native. It should be noted that the nature of cloud native applications also aids in making this possible, because of their fault tolerance: there should never be a “big bang” event when deploying a cloud native app. Microservices can be spun off from the monolith until the migration is complete.
Cloudify Kubernetes Multi-Cluster Use Cases
Cloud Native Application
Edge computing is a natural fit for the agility and high density of containerized solutions. Recently Cloudify demonstrated a Kubernetes based multi-cluster edge computing solution managing containerized network functions from F5, Intel QAT hardware acceleration, SR-IOV, CPU core pinning for pods, and more.
Beyond managing multiple Kubernetes clusters, Cloudify facilitates intelligent workload placement based on policies, for example placing workloads with very high performance requirements on clusters and nodes with appropriate hardware acceleration. Based on criteria such as location, resource availability, and special resource requirements, Cloudify provisions a workload to the correct Kubernetes cluster. Cloudify also supports deploying workloads across multiple clusters using a labeling system reminiscent of Kubernetes itself, the configuration of networking between clusters, and the execution of potentially complex, user defined, day 2 operations across clusters, both scheduled and ad hoc.
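To illustrate the label-driven targeting mentioned above, here is a minimal Python sketch of equality-based label matching in the style of Kubernetes label selectors. It does not reproduce Cloudify's actual labeling syntax or API; the cluster names, the label keys, and the `select_clusters` helper are assumptions made for the example.

```python
def matches(labels: dict[str, str], selector: dict[str, str]) -> bool:
    """True when every key/value pair in the selector is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def select_clusters(inventory: dict[str, dict[str, str]], selector: dict[str, str]) -> list[str]:
    """Return the names of all clusters whose labels satisfy the selector, sorted."""
    return sorted(name for name, labels in inventory.items() if matches(labels, selector))


# Hypothetical inventory: cluster name -> labels.
INVENTORY = {
    "edge-paris": {"tier": "edge", "region": "eu", "accel": "qat"},
    "edge-lyon": {"tier": "edge", "region": "eu"},
    "core-virginia": {"tier": "core", "region": "us"},
}

# A workload needing hardware acceleration at the edge targets one cluster;
# a selector on tier alone fans the same workload out to every edge cluster.
accelerated = select_clusters(INVENTORY, {"tier": "edge", "accel": "qat"})
all_edge = select_clusters(INVENTORY, {"tier": "edge"})
```

In this sketch an empty selector matches every cluster, so callers that want "no targets" need to handle that case explicitly.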
Driven by the low latency requirements of 5G wireless technology, and the resulting need for high density edge deployments, containers and Kubernetes have become a popular option for edge computing. The nature of 5G (short range: about 2% of 4G, obstacle sensitivity generally requiring line of sight) produces a need for far more antenna coverage than 4G, and anticipated applications (e.g. self driving cars, augmented reality) are forcing compute resources closer to the edge (i.e. users). If Kubernetes is running in those edge environments, then multi-cluster management is a critical part of the picture.
Cloudify and AT&T have collaborated on a cloud native edge solution for 5G along with the Linux Foundation ONAP project. The solution places Kubernetes clusters in edge sites running the ONAP Akraino stack for edge computing.
The architecture puts Cloudify in the role of multi-cluster manager for potentially large numbers of edge sites, providing edge stack deployment, self healing, and connecting edge sites with aggregated views provided by Grafana and Prometheus.
5G wireless communication requires edge computing, and pushes services closer to customers at the edge. From the operator perspective, it is desirable to deploy end to end isolated networks for service workloads. This need, and the rise of network function virtualization (NFV), has led to the concept known as 5G network slicing. Network slicing requires an orchestration software layer to manage the many configurations and deployments of virtual network functions (VNFs), and potentially related hardware configurations. In order to produce a 5G network slicing implementation, a broad range of services must be orchestrated, from Radio Access Network (RAN) to edge and core networks.
The ETSI MANO NFV architecture defines a layered orchestration model (NFVO, VNFM) to manage network functions. Cloudify assumed these roles and more in the implementation of 5G network slicing in AWS.
Cloudify, triggered by the ONAP SDC, orchestrated the end to end 5G network slicing implementation using AWS services including CodePipeline, CloudWatch, Lambda, and EKS hosted containerized network functions (CNFs) running on AWS Outposts. The user enters some parameters such as the slice edge location, slice differentiator, and slice domain. Slice orchestration included EKS cluster deployment, CNF instantiation/licensing, baseline configuration and day 2 operation.
The migration from monolithic to cloud native architecture and microservice oriented applications is steadily gaining steam. This transformation produces many benefits, including availability, scalability, portability, and seamless integration into modern agile development practices. These benefits come with costs though, mainly due to the increased complexity of both the platform (usually Kubernetes) and the microservices architecture itself. Layers of cloud native automation have emerged in industry to attempt to contain this complexity, including cluster deployment and management platforms from major cloud providers as well as cloud agnostic options.
As adoption of Kubernetes continues apace, the realization is growing that multi-cluster management is not an obscure edge case, but rather a standard feature of any reasonable delivery pipeline, independent of the applications being deployed. Beyond the single cluster application lies the multi-cloud/multi-cluster application that leverages multiple clouds and/or availability zones to exploit cloud provider features, lower costs, and achieve greater availability. Finally, few cloud native applications exist in a vacuum; most are integrated with non-cloud native legacy applications, which themselves require automation services.
For those with hybrid environments, Cloudify provides a unique value proposition because of its agnostic approach to orchestration. Multiple Kubernetes clusters, non-cloud-native services, cloud native infrastructure, Kubernetes lifecycle management, Kubernetes application lifecycle management, and legacy applications can be orchestrated under a single umbrella, with built in GitOps integration for end to end automation. Even for those who use a single cloud provider that provides multi-cluster Kubernetes tools, the use of an agnostic orchestrator like Cloudify reduces lock-in, and leaves the door open to future multi-cloud opportunities when they arise.
Cloudify is a declarative, extensible, standards based, open source orchestrator built to provide automation for heterogeneous infrastructure and services. It integrates with all major cloud platforms out of the box (AWS, Azure, GCP, VMware, OpenStack, etc.) and is highly customizable and adaptable for future technologies. Likewise, it integrates with existing tool chains beyond infrastructure, such as CI/CD, Ansible, Kubernetes, and Terraform. Its extensible DSL vocabulary makes it easy to tailor to arbitrary use cases and levels of complexity, delivering a declarative, everything-as-code experience. Also, the DSL is composable (not monolithic), allowing for a microservices approach to orchestration automation.
Features and Benefits
Effortless transition to public cloud and cloud-native architecture by automating existing infrastructure alongside cloud native resources.
Break automation silos using an ‘orchestrator of orchestrators’ approach which can house all automation platforms (Kubernetes, Ansible, AWS Cloud Formation, Azure ARM, Terraform, etc.) under a common management layer. And be ready to incorporate new platforms and technologies as they appear.
Cost optimization by having end-to-end modeling of the entire infrastructure that enables cost-saving policies, such as decommissioning of resources. The decoupling of model from implementation in Cloudify makes systems both more understandable and maintainable.
Reduced deployment time by bringing infrastructure, networking and security into reusable and ‘templatized’ environments, allowing deployment of a variety of tasks in hours rather than weeks, for applications that run on similar configurations. The composability of Cloudify templates makes it possible to abstract elements to their essentials so they can be substituted for each other; a form of polymorphism at the system level.
A highly customizable catalog and portal framework, built to provide a self-service experience that is tenant-aware. Not only customizable by selecting prebuilt “widgets” for inclusion in the console, but customizable via user defined widgets that can supply any user experience desired.
A horizontally scalable architecture that can support almost unlimited numbers of orchestrated resources. Open source: not a black box that hides poor implementation, security holes, and bugs, or that is closed to user contributions.
Hybrid-Cloud – a cloud solution or application that spans multiple public and/or private clouds.
Multi-Cloud – a cloud solution or strategy that utilizes multiple cloud platforms, but doesn’t combine them.
Public cloud – a (probably vast) collection of physical servers running virtual machines, that present the individual hardware as a single pool of compute, storage, and networking. Made available by the internet to the public via API and GUI. Examples include AWS, Azure, GCP, etc…
Private cloud – Like a public cloud, a collection of servers running virtual machines, presented as a single resource, but only made available internally to the enterprise. Examples include OpenStack, Azure Stack, and vSphere.
Cloud Infrastructure – the “I” in “IaaS”, cloud infrastructure refers to virtual networking, storage, and virtual machines (or containers) made available on demand via software (UI and API).
Continuous Integration (CI) – the practice of performing automated testing frequently on a code base. Testing is often triggered automatically by events in a version control system (e.g. developer contributions/merges).
Continuous Delivery (CD) – the practice of automated delivery of software to staging and/or production. Generally paired with continuous integration, but triggered by significant milestones (releases & bug fixes).
Declarative – As contrasted with imperative (or procedural), a way of programming by means of specifying desired outcomes via a model, as opposed to supplying a sequence of commands.
Service Orchestration – The deployment, updating, operation (scaling/healing), and decommissioning of software onto virtual or physical servers.
Container – an isolated execution environment in an operating system, a lightweight alternative to a virtual machine.
Virtual machine – a software emulation of a physical server.
Kubernetes – An open source container management platform that abstracts underlying hardware (virtual or physical), networking, and storage.
Data Center – A facility that hosts computer hardware. A data center centralizes compute capacity, and so is typically subject to stringent physical security and disaster/failover protections. Data centers are the building blocks of the cloud, and a failover pair of them can roughly map to availability zones in public clouds.
Availability Zone – A “virtual” data center that provides failover protection for high availability. An availability zone will typically consist of geographically separated data centers in a particular region.