The fastest way to reduce Mean Time To Resolve (MTTR) in enterprise operations is to compress the diagnosis phase, the middle stretch between an alert firing and an engineer’s first action. MTTR is the headline number most operations leaders report up, and it’s also the number that hides where the time actually goes.
When we instrument it properly across customer environments, the breakdown is consistent and uncomfortable: detection is fast, resolution is fast, and the middle, the diagnosis phase, eats most of the clock. That’s the part of operational intelligence almost no one measures separately, and it’s also the part with the largest controllable upside. This post is the math, the framework, and the conversation you can take to your CFO when you’re trying to justify the next ops investment.
If you can’t show your CFO where the 45 minutes of MTTR actually go, you can’t get the budget to compress them. Let’s fix that.
What is MTTR actually made of?
MTTR has three distinct phases, and they don’t behave the same way under pressure. Detection time is the gap between problem occurrence and alert firing. Diagnosis time is the gap between alert firing and the engineer forming a confident hypothesis. Resolution time is the gap between first remediation action and service restored. Each one has a different cause, a different cost curve, and a different fix. Collapsing them into one number is why most ops investments aim at the wrong phase.
Here’s how a typical P0 incident on a heterogeneous enterprise stack actually breaks down, based on incident timelines we’ve reviewed in telco, banking, and CPG environments:
- Detection: 30 seconds to 5 minutes. Modern monitoring is genuinely good at this part.
- Diagnosis: 25 to 45 minutes. The engineer is awake, the laptop is open, the page is in their hand, and they’re correlating across six tools by memory.
- Resolution: 5 to 15 minutes. Once the hypothesis is right, the fix is usually fast.
Diagnosis is the longest phase by a wide margin. It’s also the phase with the least dedicated tooling. Detection has a category called monitoring. Resolution has a category called runbooks and automation. Diagnosis has Slack and adrenaline.
Why is diagnosis the largest controllable phase?
Because detection and resolution have already been heavily optimized, while diagnosis still depends on a human stitching together signals from systems that don’t talk to each other. The marginal minute of detection improvement at this point costs more than it saves. The marginal minute of resolution improvement requires runbook investment that already exists for most P0 services. Diagnosis, in contrast, is a 30-minute block per incident with almost no dedicated infrastructure aimed at it.
That makes the math straightforward. If your average P0 incident is 45 minutes of MTTR and 30 of those minutes are diagnosis, compressing diagnosis by 70% takes 21 minutes off every P0 incident. That’s not a marginal gain. That’s a step change in customer-facing impact per incident, and it doesn’t require touching any of your existing monitoring or remediation tools.
Most operations leaders don’t see this opportunity because their dashboard reports MTTR as a single number. The fix is to instrument the three phases separately, even crudely. You only need:
- Alert firing timestamp (already in PagerDuty or equivalent)
- “First action taken” timestamp (engineer-reported or inferred from first command)
- “Service restored” timestamp (already in your incident record)
Those three timestamps give you alert-to-first-action (diagnosis) and first-action-to-restored (resolution). The gap most ops dashboards leave unmeasured is the first one. Close that measurement gap and the case for compressing diagnosis writes itself.
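Here’s a minimal sketch of that split, assuming you can export the three timestamps into a simple list of records; the field names are placeholders rather than any particular tool’s schema:

```python
from datetime import datetime

# One record per P0 incident, using the three timestamps described above.
# Field names are placeholders; map them to whatever PagerDuty, ServiceNow,
# or your incident spreadsheet actually exports.
incidents = [
    {
        "alert_fired":      datetime(2024, 3, 4, 2, 17),
        "first_action":     datetime(2024, 3, 4, 2, 49),  # engineer-reported or first command
        "service_restored": datetime(2024, 3, 4, 2, 58),
    },
    # ... one entry per incident in the review window
]

def phase_minutes(incident):
    """Split one incident into diagnosis and resolution minutes."""
    diagnosis  = (incident["first_action"] - incident["alert_fired"]).total_seconds() / 60
    resolution = (incident["service_restored"] - incident["first_action"]).total_seconds() / 60
    return diagnosis, resolution

diagnosis, resolution = zip(*(phase_minutes(i) for i in incidents))
print(f"avg diagnosis:  {sum(diagnosis) / len(diagnosis):.1f} min")
print(f"avg resolution: {sum(resolution) / len(resolution):.1f} min")
```

Even a one-off export into a chart of diagnosis minutes per incident is usually enough to make the gap visible.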
What does each minute of diagnosis actually cost?
The cost of a diagnosis minute is the sum of three things: customer-facing revenue impact, escalation cost, and the long-tail cost of senior engineer attrition. Most ops budget conversations cover the first one. Almost none cover the other two, even though they’re often larger over a 12-month horizon.
Here’s how to build a defensible number for each, using inputs you already have:
1. Customer-facing minutes
For each P0 service, the business already has a cost-per-minute estimate (revenue impact, SLA penalty, brand cost). Multiply that by your average diagnosis time per incident, times your annual P0 incident count. For a payments service at $50K/minute and 12 P0 incidents per year with 30-minute diagnosis phases, the diagnosis phase alone costs $18M/year in customer-facing impact. Compressing it by 70% recovers $12.6M.
2. Escalation cost
Every wrong-direction escalation has two costs: the senior engineer’s time and the slower fix. NOC operators escalate when they can’t tell whether an alert is a blip or a major incident, which is exactly the question diagnosis answers. If your NOC over-escalates 30% of the time and your senior on-call engineers carry a loaded rate of $200/hour, every 100 P0 incidents includes 30 wrongly-escalated investigations averaging 90 minutes of senior time each. That’s $9,000 of senior bandwidth wasted on avoidable escalations per 100 P0 incidents, before counting the slower fix. For a high-volume environment, multiply accordingly.
3. Senior engineer attrition
The same two or three senior engineers get woken up over and over because they’re the only ones who can navigate the tool maze quickly. Senior SRE replacement cost is typically 1.5 to 2 times annual salary, plus six months of degraded incident response while the replacement ramps. If diagnosis tooling lets junior operators handle 50% of the incidents that currently require seniors, you cut the senior escalation load in half, and that materially changes retention.
A defensible 12-month case combines those three numbers; the sketch below pulls them into one back-of-envelope model. In our experience, the first one alone usually justifies the investment. The second and third are why operations leaders who already know the first one still struggle to keep talent.
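A minimal version of that model, using nothing beyond the example figures above; the attrition inputs (salary and the change in attrition odds) are illustrative assumptions you would replace with your own:

```python
# Back-of-envelope model for the three cost components above. Every number is
# either an input you already have or one of the example figures from this post;
# the attrition inputs are illustrative assumptions, not benchmarks.

# 1. Customer-facing minutes
cost_per_minute       = 50_000   # $/minute for the affected P0 service
p0_incidents_per_year = 12
avg_diagnosis_minutes = 30
compression           = 0.70     # fraction of diagnosis time you expect to remove

diagnosis_cost = cost_per_minute * avg_diagnosis_minutes * p0_incidents_per_year
recovered      = diagnosis_cost * compression

# 2. Escalation cost
over_escalation_rate = 0.30      # share of P0 incidents escalated unnecessarily
senior_rate_per_hour = 200       # loaded senior on-call rate
wasted_hours_each    = 1.5       # 90 minutes of senior time per wrong-direction escalation

escalation_cost = (p0_incidents_per_year * over_escalation_rate
                   * wasted_hours_each * senior_rate_per_hour)

# 3. Senior engineer attrition (expected exposure, not a cash line item)
senior_salary        = 200_000   # illustrative; use your own loaded figure
replacement_multiple = 1.75      # midpoint of the 1.5 to 2x replacement-cost range
attrition_risk_delta = 0.10      # assumed change in annual attrition odds from on-call burnout

attrition_exposure = senior_salary * replacement_multiple * attrition_risk_delta

print(f"diagnosis-phase customer impact: ${diagnosis_cost:,.0f}/yr (${recovered:,.0f} recoverable)")
print(f"avoidable escalation cost:       ${escalation_cost:,.0f}/yr")
print(f"attrition exposure (expected):   ${attrition_exposure:,.0f}/yr")
```

With the example inputs this prints $18M of customer-facing impact, $12.6M of it recoverable, which is why the first component usually carries the case on its own.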
What does “good” diagnosis time look like?
A reasonable target for an enterprise environment is 5 minutes from alert firing to a structured, evidence-backed hypothesis, with sources cited. Against the typical 25 to 45-minute baseline, that’s roughly an 80 to 90% compression. The number isn’t theoretical. Teams achieve it once an operational intelligence layer is reading from their existing monitoring, deployment, and topology tools and producing the briefing humans were producing by hand.
The benchmark to set internally has four components:
- Time to first hypothesis: target under 5 minutes
- Hypothesis accuracy on first attempt: target above 75%, measured against the root cause in the post-mortem
- Junior-operator-only resolution rate: target above 60%, meaning the on-call engineer fixed it without paging a senior
- Cross-shift context loss: target under 5 minutes of catch-up time per handover
These four numbers don’t show up in most ops dashboards today. They should. They’re the operational metrics that determine whether your team scales or burns out.
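One way to get them onto a dashboard is to compute them straight from post-incident records. A minimal sketch, assuming you add a few fields to your post-mortem template; the field names are illustrative and the targets are the ones listed above:

```python
# Sketch of the four benchmark metrics, computed from post-incident records.
# Field names are illustrative; each metric needs only data you already capture,
# or could capture with a single extra field in the post-mortem template.
incidents = [
    {
        "minutes_to_first_hypothesis": 4.0,
        "first_hypothesis_correct": True,   # judged against the post-mortem root cause
        "resolved_without_senior": True,
        "handover_catchup_minutes": 3.0,
    },
    # ... one entry per P0 incident in the review window
]

n = len(incidents)
metrics = {
    "time_to_first_hypothesis_min": sum(i["minutes_to_first_hypothesis"] for i in incidents) / n,
    "first_hypothesis_accuracy":    sum(i["first_hypothesis_correct"] for i in incidents) / n,
    "junior_only_resolution_rate":  sum(i["resolved_without_senior"] for i in incidents) / n,
    "handover_catchup_min":         sum(i["handover_catchup_minutes"] for i in incidents) / n,
}

# Targets from the list above: under 5 min, above 75%, above 60%, under 5 min.
targets = {
    "time_to_first_hypothesis_min": ("<", 5),
    "first_hypothesis_accuracy":    (">", 0.75),
    "junior_only_resolution_rate":  (">", 0.60),
    "handover_catchup_min":         ("<", 5),
}

for name, value in metrics.items():
    op, target = targets[name]
    met = value < target if op == "<" else value > target
    print(f"{name}: {value:.2f} (target {op} {target}) {'met' if met else 'missed'}")
```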
Where this changes the budget conversation
The standard ops budget conversation pits monitoring vendors against each other for the same dollar. The diagnosis-phase reframe changes the conversation entirely. You’re no longer asking “which observability tool should we buy more of.” You’re asking “what compresses the largest controllable phase of MTTR for the least disruption to the stack we already run.”
That’s a different question, and it has a different answer. The real opportunity isn’t another monitoring layer; it’s the connection layer above the monitoring you already have. The investment thesis goes from “incremental visibility improvement” to “step-change compression of the longest, most expensive phase of incident response.”
Most CFOs will fund the second pitch. Almost none will fund the first one again.
We built Argus as that connection layer. It reads from Datadog, Grafana, Elastic, GitHub, Jenkins, PagerDuty, ServiceNow, and the other tools an enterprise on-call engineer already touches, and produces a structured 90-second briefing the moment an alert fires. The compression we see across customer environments is 40 to 70% in diagnosis time, with the math above as the underlying justification.
If you want to run this math on your own environment, two practical steps:
- This week: split your MTTR dashboard into the three phases. Even rough timestamps reveal the diagnosis gap immediately.
- This quarter: book a 30-minute walkthrough, bring one real alert from your environment, and see the structured briefing a connection layer would have produced.
The 45-minute MTTR isn’t the cost of doing enterprise IT. It’s the cost of leaving diagnosis unaddressed because it didn’t have its own category on the dashboard. Once it does, the investment case writes itself.
For the longer argument about why the connection layer is a different category than monitoring, see our reframe on the operational data problem. For the field-level story behind the math, see the 45 minutes no one talks about.
FAQ
Do we need to rebuild our incident tooling to measure diagnosis time?
You don’t need to rebuild anything. The three timestamps you need (alert fired, first action taken, service restored) are already captured in PagerDuty or your equivalent. Pull them into a simple report. The gap between alert fired and first action is your diagnosis time. That single chart is usually enough to change the budget conversation.
Which services see the biggest gains?
The biggest gains come on services with heterogeneous monitoring footprints and recent deployment activity, which describes most enterprise P0 services. Simple, mature services with one obvious dashboard see smaller compression because the manual diagnosis is already fast. The high-value compression is on the complex services that consume disproportionate senior on-call time today.
Does a connection layer change the on-call workflow?
No, and that matters. A connection layer fits inside the existing on-call workflow. The engineer still receives the page from the existing alerting system, still makes the call, still owns the resolution. The change is what they see when they open their laptop: a structured briefing instead of a blank screen. Workflow stays the same. Starting position improves.
What does a proof of concept look like?
Two or three connectors into your highest-value monitoring sources, scoped to one service portfolio, run for 30 days against your real incidents. That’s enough volume to measure diagnosis-time compression against a baseline and produce the number your CFO will want to see. We typically structure this as a 30-day proof rather than a multi-quarter evaluation.