80% of Recovery Time Is Wasted Asking "What Changed?"

Incident Response | The OpsTrails Team | December 22, 2025 | 5 min read

The incident isn't the expensive part. The detective work is.

Here's a scenario every on-call engineer knows. The alert fires. PagerDuty wakes you up. You open your laptop, bleary-eyed, and stare at a dashboard full of red. Something's broken. The clock is ticking. And the first thing you do isn't fix the problem — it's try to figure out what the problem actually is.

You check Slack. You scroll through the deployment channel. You look at the CI/CD pipeline history. You SSH into a box and grep some logs. You message the person who was on call before you. You open three different dashboards. Twenty minutes pass. Forty minutes. An hour. You still don't know what changed.

This isn't a failure of engineering skill. It's a failure of operational visibility. And the data shows it's shockingly common.

The MTTR Black Hole: Why 80% of Recovery Time Is Non-Productive

The authors of The Visible Ops Handbook — based on their experience working with hundreds of IT organisations — found that 80% of Mean Time To Recovery (MTTR) is wasted on non-productive activities. The dominant time sink? Determining which change is responsible for the outage.

Let that sink in. Four-fifths of your incident response time isn't spent fixing the problem. It's spent finding the problem. The actual remediation — rolling back a deployment, reverting a config change, scaling a resource — is often trivial once you know what went wrong. The expensive part is the detective work.

Why Incident Investigation Takes Longer Than the Fix

In most organisations, operational knowledge is fragmented across dozens of systems and people's heads. A deployment might be logged in GitHub Actions. A config change might be noted in a Terraform plan output. A data load might be tracked in an internal tool. A feature flag change might not be tracked anywhere at all.

When an incident occurs, the on-call engineer has to mentally reconstruct a timeline from these scattered sources. They're essentially doing investigative journalism under pressure, at 2am, with a ticking SLA clock. It's no wonder 80% of the time is wasted.
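The manual reconstruction described above amounts to merging several per-tool logs into one chronological list. Here is a minimal sketch of that step, assuming each tool can export its records already sorted by time. The Event fields, the source names, and the sample records are all hypothetical.

```python
import heapq
from datetime import datetime
from typing import Iterable, NamedTuple


class Event(NamedTuple):
    at: datetime
    source: str        # which tool recorded the change (hypothetical names)
    description: str


def merge_timelines(*sources: Iterable[Event]) -> list[Event]:
    """Merge already time-sorted per-tool logs into one chronological list."""
    return list(heapq.merge(*sources, key=lambda e: e.at))


# Hypothetical sample data standing in for exports from different tools.
ci_log = [Event(datetime(2025, 1, 1, 1, 5), "ci", "deploy api v42")]
infra_log = [Event(datetime(2025, 1, 1, 0, 50), "terraform", "resize db instance")]
flag_log = [Event(datetime(2025, 1, 1, 1, 20), "flags", "enable new-checkout")]

timeline = merge_timelines(ci_log, infra_log, flag_log)
for event in timeline:
    print(event.at.isoformat(), event.source, event.description)
```

The merge itself is trivial. The hard part in practice is that these exports rarely exist in one place or one format, and that is the gap the responder fills by hand.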

The Visible Ops authors also identified what they called the absence of a "culture of causality": people manage and work by intuition and gut feel rather than by systematic problem-solving tied to change data. When there's no structured record of what changed, responders fall back on guesswork, tribal knowledge, and whoever happens to be awake.

The Compounding Cost of Slow Incident Recovery

This isn't just about one bad night on call. The MTTR waste compounds in several ways.

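First, there's the direct cost of extended outages. Gartner's widely cited estimate puts the average cost of downtime at $5,600 per minute, though the real figure varies considerably from one business to another. Every minute spent in detective mode rather than remediation mode is money lost.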
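As a rough back-of-the-envelope illustration, the sketch below combines the two figures quoted in this article: the $5,600-per-minute estimate and the 80% investigation share. The 60-minute incident length is a hypothetical input, not a measured value.

```python
# Figures quoted in this article; the incident length is a hypothetical input.
COST_PER_MINUTE_USD = 5_600      # Gartner estimate cited above
INVESTIGATION_SHARE = 0.80       # share of MTTR spent finding the change


def investigation_cost(mttr_minutes: float) -> float:
    """Estimated downtime cost attributable to the 'what changed?' phase."""
    return mttr_minutes * INVESTIGATION_SHARE * COST_PER_MINUTE_USD


if __name__ == "__main__":
    mttr = 60  # hypothetical one-hour incident
    print(f"Total downtime cost:     ${mttr * COST_PER_MINUTE_USD:,.0f}")
    print(f"Spent on detective work: ${investigation_cost(mttr):,.0f}")
```

Both inputs are averages, so treat the result as an order-of-magnitude estimate and substitute your own per-minute cost where you have one.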

Second, there's the human cost. Engineers who regularly endure long, stressful incident responses burn out faster. They develop the "pager culture" that Visible Ops describes — a belief that true control simply isn't possible and they're doomed to an endless cycle of break/fix.

Third, there's the repeat cost. Without a clear operational timeline, post-incident reviews are less effective. If you couldn't determine what changed during the incident, your retrospective is unlikely to produce meaningful preventive actions. The same category of problem recurs.

How an Operational Timeline Eliminates MTTR Waste

The solution isn't a more sophisticated monitoring tool or a better alerting threshold. It's a structured, queryable record of what changed in your environment and when.

When every deployment, rollback, config change, data load, and infrastructure modification is captured as an event in a single timeline, the question "what changed?" goes from a 45-minute investigation to a 10-second query.
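To make the "query" idea concrete, here is a minimal sketch of a lookup over a single in-memory timeline. It is not OpsTrails' implementation or API. The ChangeEvent fields, the changes_since helper, and the sample events are hypothetical names and data chosen for illustration.

```python
from bisect import bisect_left
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    at: datetime
    kind: str          # e.g. "deployment", "rollback", "config"
    environment: str   # e.g. "production"
    summary: str


def changes_since(timeline, now, window, environment):
    """Return events for `environment` within the last `window`, newest first.

    `timeline` must already be sorted by `at`, oldest first.
    """
    times = [e.at for e in timeline]
    start = bisect_left(times, now - window)   # first event inside the window
    recent = [e for e in timeline[start:]
              if e.at <= now and e.environment == environment]
    return recent[::-1]


# Hypothetical sample events, oldest first.
timeline = [
    ChangeEvent(datetime(2025, 1, 1, 0, 50), "config", "production", "resize db instance"),
    ChangeEvent(datetime(2025, 1, 1, 1, 5), "deployment", "staging", "deploy api v42"),
    ChangeEvent(datetime(2025, 1, 1, 3, 20), "deployment", "production", "deploy api v42"),
    ChangeEvent(datetime(2025, 1, 1, 3, 40), "rollback", "production", "roll back api to v41"),
]

now = datetime(2025, 1, 1, 4, 0)
for event in changes_since(timeline, now, timedelta(hours=2), "production"):
    print(event.at.isoformat(), event.kind, event.summary)
```

Because the timeline is kept sorted by time, the lookup is a binary search plus a filter. That is why the question becomes cheap to answer once the events live in one place.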

OpsTrails builds exactly this timeline. It captures operational events from your existing tools — CI/CD pipelines, deployment systems, databases, infrastructure — and exposes them through a Model Context Protocol (MCP) server. That means your AI assistant can answer "what changed in production in the last 2 hours?" instantly, without anyone needing to open a dashboard or send a Slack message. See our MCP setup guide to connect your AI assistant.

The 80% waste in MTTR exists because the information is scattered. OpsTrails centralises it. The rest is just asking the right question.

OpsTrails eliminates the detective work. Your team and your AI assistants get instant answers to "what changed?", so recovery time goes to the actual fix.


Sources: The Visible Ops Handbook (Behr, Kim, Spafford; IT Process Institute, 2005); Gartner (downtime cost analysis); IT Process Institute (MTTR research).