In SRE, toil is the enemy of scale. It is the manual, repetitive, tactical work that provides no long-term value and grows as the service grows.
What Exactly is Toil?
According to Google's SRE book, toil has specific characteristics: it is manual, repetitive, automatable, and tactical, it lacks enduring value, and it scales linearly with service size. Common examples include resetting passwords by hand, restarting a service that leaks memory, and manually running a deployment script.
Why Reducing Toil is Essential
Toil causes burnout, decreases productivity, and slows down innovation. If an SRE team spends all of its time "feeding the machines" with manual tasks, it has no time left for the engineering work that makes the system better, more scalable, and more reliable.
Strategies for Eliminating Toil
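- Measurement: You can't fix what you don't measure. SRE teams should track how much time they spend on toil versus project work. As a reference point, Google's SRE book sets a goal of keeping toil below 50% of each SRE's time.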
- Automation: Identify repetitive tasks and build tools or scripts to handle them automatically.
- Self-Service: Empower developers to perform common tasks themselves (e.g., through an Internal Developer Platform).
- Standardization: Reduce complexity by standardizing configurations and workflows.
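To make the measurement step concrete, here is a minimal sketch of how a team might calculate its toil share from a week of time-tracking entries and rank the toil tasks by hours, so the biggest automation candidates surface first. The `WorkEntry` record, the function names, and the sample hours are all hypothetical and only for illustration. Real data would come from your own ticketing or time-tracking system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkEntry:
    """One logged block of work (hypothetical record shape)."""
    task: str
    hours: float
    is_toil: bool


def toil_fraction(entries):
    """Return the share of logged hours spent on toil (0.0 to 1.0)."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0
    toil = sum(e.hours for e in entries if e.is_toil)
    return toil / total


def toil_by_task(entries):
    """Sum toil hours per task, largest first, to rank automation candidates."""
    totals = {}
    for e in entries:
        if e.is_toil:
            totals[e.task] = totals.get(e.task, 0.0) + e.hours
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Illustrative sample data, not real measurements.
week = [
    WorkEntry("manual password resets", 3.0, True),
    WorkEntry("restart leaking service", 5.0, True),
    WorkEntry("run deployment script by hand", 4.0, True),
    WorkEntry("build deploy pipeline", 12.0, False),
    WorkEntry("restart leaking service", 2.0, True),
    WorkEntry("capacity planning", 14.0, False),
]

print(f"Toil share: {toil_fraction(week):.0%}")  # Toil share: 35%
for task, hours in toil_by_task(week):
    print(f"{hours:5.1f} h  {task}")
```

In this made-up week, restarting the leaking service tops the list at 7 hours, which suggests that fixing the leak or automating the restart would pay back fastest. Tracking the same numbers week over week shows whether automation and self-service work is actually shrinking the toil share.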
Reducing operational toil is a primary focus of our SRE Consulting. By leveraging Observability and clear SLOs, we help teams identify where automation will have the biggest impact.
MeloMar IT helps organisations improve reliability through practical SRE and platform engineering guidance.