Barking Up the Right Tree: Why Every SRE Needs to Unleash Themselves from Toil

What Exactly Is “Toil”? (The “Bath Time” of Engineering)

  • Manual: Like hand-brushing a shedding Golden Retriever every single day.
  • Repetitive: If you’re solving the same problem for the tenth time, you’re just a dog chasing its tail.
  • Automatable: If a machine (or an automatic ball launcher) could do it, a human shouldn’t be stuck doing it.
  • Tactical & Reactive: It’s “firefighting” (or barking at the mailman). It’s interrupt-driven, not strategy-driven.
  • No Enduring Value: Once the task is done, the state of the world hasn’t improved. You’re just back to where you started.

The 50% Rule: Keeping the Pack Healthy
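
The 50% rule is usually stated as a cap: toil should take up no more than half of each engineer's time. As a rough sketch of how a team might check itself against that cap, the snippet below totals logged hours per engineer and flags anyone over budget. The `WorkEntry` record, the names in the sample week, and the idea that someone tags each entry as toil or not are all assumptions made for illustration, not a standard tool or format.

```python
from dataclasses import dataclass

TOIL_BUDGET = 0.5  # the 50% cap on toil


@dataclass
class WorkEntry:
    """One logged chunk of work (hypothetical record format)."""
    engineer: str
    hours: float
    is_toil: bool


def toil_fraction(entries):
    """Return {engineer: toil hours / total hours} for engineers with logged time."""
    total = {}
    toil = {}
    for entry in entries:
        total[entry.engineer] = total.get(entry.engineer, 0.0) + entry.hours
        if entry.is_toil:
            toil[entry.engineer] = toil.get(entry.engineer, 0.0) + entry.hours
    return {
        name: toil.get(name, 0.0) / hours
        for name, hours in total.items()
        if hours > 0
    }


def over_budget(entries, budget=TOIL_BUDGET):
    """Return the sorted names of engineers whose toil fraction exceeds the budget."""
    fractions = toil_fraction(entries)
    return sorted(name for name, frac in fractions.items() if frac > budget)


if __name__ == "__main__":
    # Made-up sample week for two engineers.
    week = [
        WorkEntry("rex", 24, True),
        WorkEntry("rex", 16, False),
        WorkEntry("luna", 10, True),
        WorkEntry("luna", 30, False),
    ]
    print(toil_fraction(week))
    print(over_budget(week))
```

Measuring per engineer rather than per team is a deliberate choice here: a healthy team average can hide one person who is carrying most of the toil.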

Why Too Much Toil is a “Bad Dog”

  1. Career Stagnation: You can’t make a career out of “grunge.” If you only do manual work, your skills won’t grow.
  2. Low Morale: Even the best-behaved pup gets grumpy if they never get to play. Too much toil leads to burnout.
  3. Confusion: Too much toil makes people think SRE is just an “ops” team that handles manual tasks, rather than an engineering organization.
  4. Attrition: Your best engineers, the “Best in Show” types, will leave for a more rewarding job if they’re stuck doing boring, manual work.

Is Toil Always Bad?

The “Is Your SRE Team Chasing Its Tail?” Toil Indicators Checklist

  • Is your team performing the same task more than once a week (e.g., repeatedly restarting a service or manually checking logs for common issues)?
  • Are SREs spending significant time on “ticket ops”: triaging, assigning, and resolving basic, predictable tickets that don’t require novel problem-solving?
  • Do SREs manually copy-paste data between systems or manually generate reports that could be automated?
  • When onboarding a new team member, is there a long list of manual steps they must perform to get access or set up their environment?
  • Are SREs frequently interrupted by alerts that are easily resolved with a standard, manual procedure (e.g., “We just restart it when X happens”)?
  • Does your team spend a large percentage of their time “firefighting” urgent, unexpected issues rather than planning and executing proactive improvements?
  • Are SREs frequently responding to requests for information or actions that could be self-serviced by other teams (e.g., “Can you restart my service?” instead of them having a button to do it)?
  • Is your on-call rotation dominated by incidents that don’t lead to long-term solutions or post-mortem action items?
  • Do tasks feel like “busy work” that, once completed, don’t fundamentally change the system or improve its long-term reliability/efficiency?
  • Are SREs maintaining legacy systems or processes that offer diminishing returns for the effort invested?
  • Are there frequent “one-off” requests from other teams that require manual intervention and don’t contribute to scalable solutions?
  • Does your team spend excessive time on administrative tasks, meetings about basic operational issues, or detailed status reports for routine activities?
  • Are SREs frequently involved in tasks outside their core SRE responsibilities (e.g., frontline customer support, manual QA, project management for non-SRE projects)?
  • Is the documentation for operational procedures constantly out of date, requiring SREs to “figure things out” each time?
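
Several of the items above come down to the same pattern: a known alert with a known manual fix. As a minimal sketch of teaching that trick to a machine, the class below restarts an unhealthy service on its own, but pages a human if restarts pile up, since a service that keeps needing restarts has a real bug that deserves a long-term fix. The `RestartRemediator` name, the three callbacks, and the default limits are hypothetical; a real setup would wire them to your own health checks, orchestration, and paging tools.

```python
import time


class RestartRemediator:
    """Restart an unhealthy service automatically; escalate if restarts pile up."""

    def __init__(self, is_healthy, restart, page_human,
                 max_restarts=3, window_seconds=3600, clock=time.monotonic):
        # All three callbacks are supplied by the caller (assumed interfaces).
        self.is_healthy = is_healthy
        self.restart = restart
        self.page_human = page_human
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self.clock = clock
        self._restart_times = []

    def check(self):
        """Run one check cycle; return 'healthy', 'restarted', or 'escalated'."""
        if self.is_healthy():
            return "healthy"

        now = self.clock()
        # Keep only the restarts that happened inside the sliding window.
        self._restart_times = [
            t for t in self._restart_times if now - t < self.window_seconds
        ]

        if len(self._restart_times) >= self.max_restarts:
            # Too many recent restarts: stop masking the problem, get a human.
            self.page_human()
            return "escalated"

        self.restart()
        self._restart_times.append(now)
        return "restarted"


if __name__ == "__main__":
    # Stand-in callbacks that only print; the service is always "unhealthy".
    remediator = RestartRemediator(
        is_healthy=lambda: False,
        restart=lambda: print("restarting service"),
        page_human=lambda: print("paging on-call"),
        max_restarts=2,
    )
    for _ in range(3):
        print(remediator.check())
```

The escalation guard matters as much as the restart: without it, automation turns a visible, annoying problem into an invisible one.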

Time to Unleash Your Team!

The Conclusion: Let’s Invent More and Toil Less
