Summarizing a heated debate on the state of DevOps and a rational path forward
We’re all used to spicy social media debates producing more heat than light. But occasionally, the script is flipped and something useful is illuminated.
Such is the case with a recent debate about the state of DevOps. It was started in the comments on a post by Leon Wright titled, “No one should write Terraform.” That spawned threads on Twitter (Sid Palas), with more conversation on Reddit, here and here.
The bottom line? Developers have strong and often conflicting opinions about the “you build it, you run it” paradigm first advanced by Werner Vogels. All this led me to a question posed on LinkedIn:
“Are we moving from ‘everything as code’ to ‘no one should write Terraform’? Is this becoming a trend?”
I was surprised by how quickly this went viral. And just a few weeks ago, Aeris Stewart took this a step further with her article, “DevOps Is Dead. Embrace Platform Engineering,” arguing that we need to rethink DevOps practice as we know it today.
All this clearly indicates a growing frustration between developers and DevOps engineers over some of the most fundamental questions in the field, questions that go back to 2006, when Vogels first put forward the “you build it, you run it” concept and launched the DevOps movement.
Let’s set the provocative titles and click-inducing comments aside and think about all these discussions as a way for developers to shout loudly, “Hey DevOps folks, wake up! Something isn’t working with ‘you build it, you run it’ and making the infrastructure look like code isn’t solving it!” So let’s look at why this frustration is surfacing now and, more importantly, what we can do about it to chart a path forward that still lets us ship software fast.
Why Now? A Growing Infrastructure Complexity Reality Check
If we examine the progress of the DevOps toolchain over the past decade, a focus on automating infrastructure management jumps out, from the introduction of CI/CD pipelines and infrastructure as code to containers and microservices. Recently the focus has shifted to how we do all this more reliably and securely with concepts like DevSecOps and FinOps. All of this became known as Shift Left, and developers gradually found themselves responsible for things that used to take place in production.
Most recently, the focus shifted to simplifying the developer experience in this brave new world. Interestingly, the same Werner Vogels whose prophecy lit the DevOps fuse recognized the need for this more than a year ago.
It Gets Worse at Scale
“The further you grow, the more fragmented your ecosystem becomes, and then everything slows down again” – Spotify
Many thought the solution to the shift left complexity was a nifty set of new tools to automate infrastructure. But these grew like an algae bloom; each tool specialized in automating a specific domain, and soon we were choking under a new complexity.
Throwing more DevOps at the problem doesn’t scale. The ratio of platform engineers to developers can start at 1:10 at a small scale and quickly reach 1:100 or beyond. This leads to a continuous degradation in the level of service that platform engineers can provide to support developers. As a result, development teams take on a growing chunk of the responsibility for managing the infrastructure needed to support their applications.
This raises an important and largely unforeseen challenge to consistency and governance. When each development team becomes responsible for managing its own infrastructure, it becomes harder to enforce consistent policies and practices across teams.
An (Imperfect) Analogy: Software in Auto Manufacturing
Car manufacturing provides a useful analogy to what’s happening here. Let’s look at a hypothetical Tesla manufacturing pipeline. Imagine that a developer is equivalent to an engineer writing an onboard navigation application, and a DevOps engineer is equivalent to a pipeline engineer responsible for getting this navigation application into the manufacturing pipeline with minimal friction.
Now let’s imagine that a new version of the navigation application introduces a bug that drains the car battery. Whose responsibility is it to detect and fix it? Or, imagine that a new version of the navigation system requires an upgrade of the existing GPS system. What should be the process of introducing the GPS upgrade into the pipeline and ensuring that it is merged before the new navigation version gets deployed?
According to the “you build it, you run it” camp, the expectation is that the navigation application developer should be responsible for detecting the battery drain issue during development. The developer should also ensure that the GPS system comes with the right version before the new application gets deployed.
Maybe this makes sense. But it raises a couple of questions. First, do we expect each developer to know how to operate all the infrastructure that their application touches in order to test it? Second, how much time should the developer spend on creating the testing and development infrastructure, as opposed to actually writing the new nav app features that customers want?
Developer Experience Should Draw the Boundary between Devs and Platform Engineers
Yes, it’s an imperfect analogy. SaaS services that are pure software are significantly more agile than a car manufacturing pipeline. Cloud infrastructure is mostly composed of software units and not physical units as with cars. Still, it’s instructive and useful in considering where to draw the line between developer and platform responsibility.
It all comes down to efficiency at scale. The current reality is that developers unskilled in managing and operating infrastructure are spending roughly 80-89% of their time fixing pipelines, managing dev/test environments, and writing glue code. And now, we’re piling on by asking them to manage infrastructure as code.
We don’t need high-powered analysts to tell us this doesn’t scale and that it’s not an efficient use of a developer’s time. Clearly, something has to change.
Let’s look at some principles that might satisfy devs, ops, and platform engineers.
The Developer Perspective
Developers shouldn’t spend time on repeatable infrastructure tasks. At the same time, developers who want to push new services or updates should be able to do that without getting stuck in the platform team’s backlog.
The Platform Engineer Perspective
By contrast, platform engineers should be able to continuously update and evolve the developer’s infrastructure platform without worrying about breaking the developer’s pipeline. They should be able to govern the security, cost, and general efficiency of infrastructure utilization through automation without becoming a bottleneck in that process, by defining a set of policies as code. The question is, how do we get there?
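To make “policies as code” a little more concrete, here is a minimal sketch in Python. It is only an illustration under assumed names: the request fields, the three policies, and the 500 USD limit are all hypothetical and are not taken from any specific tool. The idea is that a self-service request is checked automatically, so the platform team governs without sitting in the approval path.

```python
from dataclasses import dataclass, field


@dataclass
class EnvironmentRequest:
    """A developer's self-service request (all fields are hypothetical)."""
    team: str
    instance_count: int
    monthly_cost_estimate_usd: float
    public_ingress: bool = False
    tags: dict = field(default_factory=dict)


# Each policy returns a violation message, or None when the request passes.
def cost_policy(req, limit_usd=500.0):
    if req.monthly_cost_estimate_usd > limit_usd:
        return f"estimated cost {req.monthly_cost_estimate_usd:.2f} exceeds limit {limit_usd:.2f}"
    return None


def security_policy(req):
    if req.public_ingress:
        return "public ingress requires a platform team review"
    return None


def tagging_policy(req):
    if "owner" not in req.tags:
        return "missing required 'owner' tag"
    return None


# The platform team owns this list; developers never edit it per request.
POLICIES = [cost_policy, security_policy, tagging_policy]


def evaluate(req, policies=POLICIES):
    """Return all violations; an empty list means the request can be auto-approved."""
    violations = []
    for policy in policies:
        message = policy(req)
        if message is not None:
            violations.append(message)
    return violations


if __name__ == "__main__":
    ok = EnvironmentRequest("nav-app", 2, 120.0, tags={"owner": "nav-team"})
    bad = EnvironmentRequest("nav-app", 20, 900.0, public_ingress=True)
    print(evaluate(ok))
    print(evaluate(bad))
```

In this sketch a compliant request passes with no human involved, while a non-compliant one comes back with every reason listed at once, so the developer can fix it without a round trip through the platform backlog.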
Platform as a Product
Let’s come back to our Tesla reference. We want the navigation application engineer to be responsible for ensuring that the battery doesn’t get drained during the development cycle and that the GPS system comes in the right version before his new feature gets rolled into production. This should not require the developer to know how to operate the battery or GPS systems, even if they technically can.
In other words, the platform team needs to provide the developer with a set of tools and a framework that will allow them to test code in a production-like environment without throwing the entire operational responsibility of creating this environment back on the developer. That’s exactly where the platform comes into play as defined here:
“Platform engineering is a process organizations can use to leverage their cloud platform as efficiently as possible so that engineers can deliver value to production quickly and reliably. It consists of removing obstacles between developers and production.”
The key to delivering the right platform in a way that fulfills both the developer’s and the platform engineer’s goals, as mentioned above, is to first treat it as a product in its own right. The concept is outlined by Max Griffiths in his post, Infrastructure as Product: Accelerating time to market through platform engineering.
“With re-usable, self-service solutions available on a platform that’s easy to understand and engage with, cloud infrastructure teams can spend less time configuring individual solutions, and give the business a rapid path to market for new digital services and capabilities.”
So, how do we achieve better collaboration between platform engineers and developers?
The open source community has a well-established process for enabling efficient collaboration and governance between product owners and external contributors. If we adopt those principles for our infrastructure platform, the platform team remains responsible for the platform while still accepting contributions from other development teams that enhance it to meet the needs of each application team. Developers should also be able to create a local testing environment of the platform and develop and test their contributions without any significant boundaries or manual blockers, which shortens the contribution cycle.
It’s Complex, not Impossible
Most teams are capable of creating a platform that will satisfy the current needs of their developers. Quite often, however, they fail miserably in handling continuous changes and updates as the system evolves, as the requirements change, or as new cloud services become available. In this context, we need to differentiate between two types of updates that are equally important but are fundamentally different.
First, a platform update adds new services to the platform, like a new database or a messaging system. This might also include updating the configuration of an existing service, recovering from failure, or detecting and recovering from drift. Second, an application update pushes new features or versions of our application to a staging environment prior to promoting them into production. These two types of updates often require different update workflows and special handling.
About half of the solution is understanding the challenge from both the developer and DevOps angles. Based on the recent debate, we’re getting there. Perhaps another 30% is agreeing, at least in principle, on what the desired solution should be. The remaining 20% or so is figuring out how to do it. This last part is where we tend to fail, mostly because we underestimate the complexity of the problem and fail to realize that a platform is a product in its own right.
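One way to picture the difference between the two update types is to route each through its own workflow. The sketch below is only an assumption about how that might look: the step names are hypothetical placeholders, and only the staging-then-production order of the application workflow comes from the description above.

```python
from enum import Enum


class UpdateKind(Enum):
    PLATFORM = "platform"        # e.g. a new database or messaging service
    APPLICATION = "application"  # e.g. a new feature or application version


# Hypothetical step names; a real platform would define its own.
PLATFORM_WORKFLOW = [
    "validate_policies",
    "apply_to_test_platform",
    "check_compatibility_with_running_apps",
    "roll_out_gradually",
    "verify_no_drift",
]

APPLICATION_WORKFLOW = [
    "build",
    "deploy_to_staging",
    "run_tests",
    "promote_to_production",
]


def plan_update(kind):
    """Return the ordered steps for the given kind of update."""
    if kind is UpdateKind.PLATFORM:
        return list(PLATFORM_WORKFLOW)
    if kind is UpdateKind.APPLICATION:
        return list(APPLICATION_WORKFLOW)
    raise ValueError(f"unknown update kind: {kind!r}")


if __name__ == "__main__":
    for kind in UpdateKind:
        print(kind.value, "->", " > ".join(plan_update(kind)))
```

Keeping the two workflows separate means the platform team can change the platform steps without touching the path that developers use to ship application changes.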
The light at the end of this particular tunnel is that new solutions are showing up with different ways to solve this problem. (Gartner: How to Scale DevOps Workflows in Multicluster Kubernetes Environments).
You Build It, You Run It: The Bottom Line
Shift left is still fundamentally right, but at the same time it throws the entire responsibility at the developers, and that does not scale. Also, making infrastructure accessible as code doesn’t address the heart of the problem. The common ground is that to simplify the developer experience we need a layer of abstraction in the form of platform engineering that turns infrastructure into self-service environments, which is why Gartner marked it as one of the top strategic trends in 2023.
The question, however, remains: what should that platform be, and what is the right handshake between developers and platform engineers? This is still a moving target, and I can’t say that there is a perfect answer yet.
Recently, I shared how we dealt with those challenges in this post. It outlines a super simple developer experience with Backstage. Learning from others about what worked and what didn’t is the best thing we can do to shorten the learning cycle. This is precisely what we did in a set of interviews with technical leads in Cloudify Tech Talk podcasts. We also started a new local platform engineering meetup community with Daniel Ben Zvi, Roi Ezra, Mattan Cohen, Hila Fish, and Guy Naor, all in collaboration with our friends at Spotify. Another great resource is Humanitec’s Platform Engineer Slack channel, which provides access to an entire community of platform engineers.
I hope more people who are passionate about the topic will join the conversation and share their experiences. Importantly, we need to get these topics onto the speaking agendas at DevOps events, which too often are still stuck somewhere in 2019. So if you have a relevant story to share, please join the meetup group or contact me directly and we can cover it in the next podcast.
Join the Community
Platform Engineer (By Humanitec)
You Build It, You Run It
Devs don’t want to do Ops (reddit thread)