Created on 2025-04-14 08:29
Published on 2025-05-05 10:00
It’s 3:47 AM. You’ve been asleep for maybe two hours when your phone buzzes with a familiar notification tone: “High CPU usage on production node 18.” You roll out of bed, shuffle to your laptop, and try to remember the difference between “prod-node-18” and “prod-node-81.” After a quick investigation, you realize it’s a false alarm. Again. You silence the alert, close the lid, and try to go back to sleep.
You don’t.
Welcome to the world of 24/7 on-call.
Burnout in Site Reliability Engineering isn’t new, but it is accelerating. As companies push harder for digital uptime, and as systems get more complex, the humans behind those systems are fraying at the edges. And nowhere is this more obvious than in the practice of always-on, always-ready on-call rotations.
On-call is foundational to SRE. After all, someone has to respond when production melts down at 2 AM. SREs are trained for this. They have runbooks, paging systems, and incident response frameworks. But for all the tooling, there’s one thing the industry hasn’t fully reckoned with: the emotional and physiological cost of being perpetually responsible for keeping things alive.
The Toll of On-Call
Ask any SRE who’s been doing this for a few years, and you’ll hear the stories: missed dinners, interrupted vacations, panic attacks from phantom alerts, insomnia on Sunday nights before their on-call shift starts. The toll is cumulative. It’s not just the nights you get woken up—it’s the nights you might. It’s the inability to disconnect fully. The way your brain stays half-alert even when nothing is wrong. It’s the muscle tension that never quite goes away.
Studies in other high-stakes professions—like medicine or emergency services—show that this kind of chronic anticipation leads to heightened cortisol levels, disrupted sleep cycles, and eventually, burnout. And yet, in tech, we often treat on-call as just part of the job.
But here’s the thing: SREs aren’t robots. And if we want to retain them, and more importantly keep them healthy, we have to rethink how we do on-call.
The Case for 24/7 On-Call
Let’s be fair: there are good reasons on-call exists.
Production incidents don’t wait for business hours. Customers expect reliability around the clock. And modern systems, built on microservices, third-party dependencies, and global user bases, can fail in unexpected ways at any time.
Having trained responders available 24/7 ensures fast resolution. It minimizes impact. It demonstrates a commitment to operational excellence. And when done well, it fosters a sense of ownership and trust.
On-call can also be a powerful learning experience. You see how systems behave under pressure. You learn what really matters in production. You develop empathy for users and for fellow engineers.
But these benefits only materialize when the system around on-call is healthy.
When On-Call Becomes Exploitative
Too often, on-call is broken.
Rotations are too small, leading to frequent shifts.
Alerts are noisy and unactionable.
Escalation policies are unclear.
There’s no compensation, or worse, there’s a guilt-based culture that discourages speaking up.
In these environments, on-call stops being a duty and becomes a burden. It drives people away. It creates silent resentment. And worst of all, it leads to errors—because tired engineers make mistakes.
Signs You’re Burning Out Your SREs
People swap shifts without telling management.
Incidents get acknowledged but not resolved.
Engineers dread their on-call weeks.
Postmortems are shallow or avoided.
Resignations follow on-call cycles.
Sound familiar? Then it’s time to act.
What Good On-Call Looks Like
Healthy on-call isn’t just about alerts—it’s about the system around the alerts.
Here’s what works:
Smart Alerting: Use SLO-based alerting. Page only when user impact is happening or imminent.
Generous Rotations: Limit on-call to once every 6-8 weeks per engineer. Use follow-the-sun models when possible.
Tooling: Automate diagnostics. Provide rich context in pages. Make it easy to respond fast.
Support: Provide backup engineers. Ensure incidents are debriefed. Protect recovery time.
Compensation: Pay for on-call. Period. Time, money, or both. It’s not optional labor.
Psychological Safety: Normalize “I need a break.” Remove shame from stepping back.
Leadership Modeling: When leaders take on-call, it sends a powerful message.
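To make the "Smart Alerting" item concrete, here is a minimal sketch of an SLO burn-rate check in Python. It is an illustration under stated assumptions, not a drop-in for any particular monitoring product: the `Window` type, the function names, the 99.9% target, and the 14.4 threshold (a commonly cited fast-burn value for a one-hour window against a 30-day budget) are all choices made for this example. The idea is to page only when the error budget is burning fast over both a long and a short window, so a brief blip or an already-recovered spike doesn't wake anyone up.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one time window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def burn_rate(window: Window, slo_target: float) -> float:
    """Return how fast the error budget is being spent.

    1.0 means errors arrive at exactly the rate the SLO allows;
    higher values mean the budget runs out early.
    """
    if window.total_requests == 0:
        return 0.0  # no traffic, nothing to burn
    error_rate = window.failed_requests / window.total_requests
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_rate / error_budget


def should_page(long_window: Window, short_window: Window,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only if BOTH windows burn faster than the threshold.

    The long window shows the problem is significant; the short
    window shows it is still happening right now.
    """
    return (burn_rate(long_window, slo_target) >= threshold
            and burn_rate(short_window, slo_target) >= threshold)


# Sustained 2% errors against a 99.9% SLO: worth a page.
outage = should_page(Window(100_000, 2_000), Window(10_000, 300))

# A handful of errors well inside the budget: let people sleep.
quiet = should_page(Window(100_000, 50), Window(10_000, 5))
```

Anything that falls below the paging threshold can still open a ticket for business hours. The point is that the 3:47 AM page is reserved for real user impact.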
The Debate: Opt-In vs. Mandatory On-Call
One of the more contentious topics is whether on-call should be mandatory. Some argue yes: if you build it, you run it. Reliability is everyone’s responsibility. On-call ensures shared ownership. Others say no: not everyone is wired for late-night stress. Forcing introverted, neurodivergent, or highly anxious engineers into high-stress roles damages both people and systems.
The best compromise? A tiered model.
Allow engineers to opt into different levels of responsibility. Offer incentives for higher tiers. And always provide a clear off-ramp. Let people grow into on-call—not be thrown into it.
On-Call in a Remote World
Remote work has added complexity. The boundaries between work and life are blurrier. The temptation to check in during “off” hours is strong. Some SREs now live in rural areas or across time zones, making synchronous response harder.
This demands new thinking:
Async runbooks.
Distributed paging.
Local failover teams.
The remote SRE era needs empathy, flexibility, and better documentation.
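As one way to picture distributed paging, here is a small Python sketch that routes a page to whoever is currently in their local working day. Everything in it is hypothetical: the `Responder` type, the names, the 9-to-18 working window, and the use of fixed UTC offsets (which ignores daylight saving time, so a real setup would want proper time zone data and your paging tool's own scheduling).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Responder:
    """A hypothetical on-call engineer with a fixed UTC offset."""
    name: str
    utc_offset_hours: int


def local_hour(responder: Responder, now_utc: datetime) -> int:
    """Hour of day (0-23) on the responder's local clock."""
    return (now_utc + timedelta(hours=responder.utc_offset_hours)).hour


def pick_responder(responders: list[Responder], now_utc: datetime,
                   day_start: int = 9, day_end: int = 18) -> Optional[Responder]:
    """Return the first responder whose local time is within working hours.

    Returns None when nobody is in their daytime, which is the signal
    to fall back to a conventional escalation policy.
    """
    for responder in responders:
        if day_start <= local_hour(responder, now_utc) < day_end:
            return responder
    return None


team = [
    Responder("dublin", 0),
    Responder("singapore", 8),
    Responder("seattle", -8),
]

# 03:00 UTC is 11:00 in Singapore, so nobody in Dublin gets woken up.
night_page = pick_responder(team, datetime(2025, 5, 5, 3, 0, tzinfo=timezone.utc))
```

With these three offsets, every UTC hour falls inside someone's working day. A smaller or more clustered team will have gaps, and those gaps are exactly where compensation and backup coverage matter most.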
A Final Story
A senior SRE I knew once said, “The worst part of on-call isn’t the wakeups—it’s the anxiety on the days you weren’t paged.” That stuck with me. Because it captured the essence of burnout: not the spikes, but the constant hum. The background dread. The weight of being always on.
SREs are some of the most resilient, adaptable, and brilliant engineers in tech. But they’re also human. And no amount of caffeine, automation, or incident retros can fix a broken culture around on-call.
If we want reliability in our systems, we need sustainability in our teams.
Let’s build both.