Developers and ops people sometimes have different perspectives on why a deployment went wrong. Was it the code? Or the infrastructure? In the end, it’s the user who suffers, and that user doesn’t care why your product was broken. They just want it to work, so they can do their work.
It really did work in staging
Everyone on the team wants to get things right the first time, deploying a quality product with every feature release. That’s why development teams adhere to deployment best practices. They write tests. They write code. They test that code in staging environments that mimic the production environments that code will run in… as closely as possible. Every member of the team approaches projects not only with best practices, but with the best intentions. But, as the saying goes, we’re all only human.
Sometimes humans miss things
A slightly different network configuration here, an older version of a PHP extension or npm module there, and suddenly the environments where code is tested before launch aren’t quite in sync. Code that ran fine in QA falls down in production. It gets worse when teams hit bottlenecks: a limited set of environments or, worse, a single staging environment that all code has to pass through. Who hasn’t pushed a ‘hotfix’ to resolve a critical problem in prod, without the usual testing process, because we couldn’t wait for a QA or staging environment that was in use by others?
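As a minimal sketch of how this kind of drift could be caught before it bites, the Python below compares two version manifests and lists every mismatch. The function name `find_drift`, the manifest contents, and the version numbers are all hypothetical, chosen only for illustration; in practice the manifests would be generated from each environment rather than typed by hand.

```python
from typing import Dict, List, Tuple


def find_drift(
    staging: Dict[str, str], production: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (name, staging_version, production_version) for every mismatch."""
    drift = []
    # Walk the union of names so a component missing from one side is reported too.
    for name in sorted(set(staging) | set(production)):
        staging_version = staging.get(name, "<missing>")
        production_version = production.get(name, "<missing>")
        if staging_version != production_version:
            drift.append((name, staging_version, production_version))
    return drift


# Hypothetical manifests (assumed data, not taken from any real system).
staging_manifest = {"php-gd": "2.3.3", "left-pad": "1.3.0", "nginx": "1.24.0"}
production_manifest = {"php-gd": "2.1.0", "left-pad": "1.3.0"}

for name, staging_version, production_version in find_drift(
    staging_manifest, production_manifest
):
    print(f"{name}: staging={staging_version} production={production_version}")
```

Run on a schedule or as a deployment gate, a check like this turns "aren’t quite in sync" from a surprise in production into a visible, reviewable list.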
It really was a code change
Or maybe it was a different version of the XYZ module after all? How do we know? Code and infrastructure configuration are often managed in silos, sometimes by separate teams. Though software development has come a long way, with test-driven development (TDD), rigorous code review processes, automated testing, and more, it remains rare to manage the whole system, including infrastructure and data, with the same tools and process. As a result, it’s difficult to tell at a glance which change to the system was the direct cause of a fault, and sometimes it’s hard to know that the system state has changed at all. That’s due, in large part, to having separate tools and processes to manage change in code versus infrastructure, dev versus ops.
How to drive DevOps alignment
Developers are measured by the features they ship. Ops is measured by uptime and performance. The most important person, the user, doesn’t care who’s at fault if the product, site, service, or experience you’re offering falls short of their expectations. How do we address the user’s concern and deliver both the features they want and the reliability they expect?
It’s time to stop working in silos
Development and ops need to take a unified approach to managing change. The scripts and tools that manage infrastructure should be subject to the same process and rigor as application code, and the whole should be managed as one system instead of as independent parts. At a glance, you should be able to tell when a runtime version was updated or an application code change was made.
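To make the "at a glance" idea concrete, here is a small Python sketch of a single change history that records code, runtime, and configuration changes side by side. The `Change` record, the `changes_between` helper, and the sample entries are hypothetical; a real team would more likely get this view from keeping infrastructure definitions in the same version control history as the application.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass(frozen=True)
class Change:
    when: datetime
    kind: str  # e.g. "code", "runtime", or "config" (assumed categories)
    summary: str


def changes_between(
    history: Iterable[Change], start: datetime, end: datetime
) -> List[Change]:
    """Return every change in [start, end], oldest first, regardless of kind."""
    return sorted(
        (change for change in history if start <= change.when <= end),
        key=lambda change: change.when,
    )


# Hypothetical history mixing application and infrastructure changes.
history = [
    Change(datetime(2024, 3, 4, 16, 30), "code", "Add checkout discount field"),
    Change(datetime(2024, 3, 4, 9, 15), "runtime", "Upgrade PHP runtime"),
    Change(datetime(2024, 3, 1, 11, 0), "config", "Raise worker memory limit"),
]

# What changed on the day the fault appeared?
for change in changes_between(
    history, datetime(2024, 3, 4, 0, 0), datetime(2024, 3, 4, 23, 59)
):
    print(f"{change.when:%Y-%m-%d %H:%M} [{change.kind}] {change.summary}")
```

Because both kinds of change live in one timeline, the question "was it the code or the infrastructure?" starts from a shared record instead of two separate investigations.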
Trust the machines
When humans are involved, drift happens. Changes that are made to a production configuration don’t always make it back to staging and development environments. Humans take shortcuts. Humans forget things. Sometimes it’s a cost concern, sometimes it’s a technical one, sometimes it’s just human error. But when differences emerge, systems break, and tests are no longer valid. Automation is the answer. Replicate every change made to production, from infrastructure configuration to code, in your staging and dev environments. Oh, and let’s not forget about the impact of data. You’ll want to sync your production data (scrubbed as needed) back to your other environments to get a true “like for like” test. All of this automation can be difficult and time-consuming, not to mention expensive, so it’s uncommon. Further, due to organizational, process, and tooling silos, “dev,” “stage,” and “prod” are only loosely linked in many organizations. So it’s a challenge just to get access to the systems that need to match.
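As one illustration of what "scrubbed as needed" might look like, the Python sketch below replaces sensitive fields with stable placeholders before records are copied out of production. The field list, the `scrub_record` helper, and the sample record are assumptions made for the example; which fields count as sensitive, and whether hashing is an adequate treatment, depends on your data and your compliance requirements.

```python
import hashlib
from typing import Any, Dict

# Assumed list of sensitive fields; a real list comes from your own data model.
SENSITIVE_FIELDS = ("email", "name")


def scrub_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the record with sensitive fields replaced by placeholders."""
    scrubbed = dict(record)  # copy, so the production record is never mutated
    for field in SENSITIVE_FIELDS:
        if field in scrubbed:
            digest = hashlib.sha256(str(scrubbed[field]).encode("utf-8")).hexdigest()
            # The same input always yields the same placeholder, so records that
            # shared a value in production still share one after scrubbing.
            scrubbed[field] = f"{field}-{digest[:12]}"
    return scrubbed


# Hypothetical production row.
production_row = {"id": 42, "email": "jane@example.com", "plan": "pro"}
print(scrub_record(production_row))
```

Non-sensitive fields pass through unchanged, so staging keeps the shape and volume of real data while the personal details stay in production.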
Get non-linear
System stability is often negatively impacted by human activity. We humans break rules. We find workarounds. We do things ever-so-slightly differently each time, especially when we’re tired or stressed or hungry. When we’re under pressure to ship, we may skip some testing, or work around a congested staging environment and commit straight to prod. To prevent this, dev and ops teams should again work towards greater automation and adopt tools that allow them to work in parallel, rather than in series, on environments that closely mimic production systems.
Try to get more predictable
The mantra of a DevOps culture should be “automate everything.” Trust the machines. Once deployments become sufficiently automated, they become non-events. You should be able to deploy your system at any time, even on a Friday afternoon. Getting there requires both organizational alignment around the goal of predictability with speed and the right toolkit. That combination enables teams to deliver what users care about: reliable, feature-rich apps that run well, all the time.