Blue-Collar Engineering Dispatch #4: Runbooks That Don’t Suck
Hi there, and welcome!
This week we’re talking about runbooks. Not the dusty wiki pages that haven’t been updated since your last engineer quit. The ones that actually work.
The Tale of ProcedurePro: The Runbook That Cried Wolf
ProcedurePro had a runbook for everything. That was the whole point. The company sold procedure management software to healthcare companies, and their internal wiki was their proof of concept. Every system had documentation. Every alert had a response guide. Investors loved it!
Markus had been there three weeks when the Redis cluster decided to stop accepting connections at 2:47 AM. His phone screamed. He pulled up the wiki, feeling confident. They had a runbook for this. The wiki page linked to another document:
Redis Emergency Recovery Procedures v2.3 (FINAL) (FINAL-FINAL) (USE THIS ONE).docx
Markus thought this was a bit strange but followed the link and opened up the doc just the same. There were steps laid out and numbered. A small bit of relief started washing over Markus.
Step 1: “SSH into the primary Redis node.”
Markus squinted at the screen. Which one was primary? The doc didn’t say. He dug through six months of Slack, found a thread where someone mentioned “redis-prod-1” was primary. Or maybe “redis-primary-1”? The thread was inconclusive.
He guessed wrong.
Step 4: “Run the recovery script located in the usual place.”
“The usual place?” Markus muttered. He checked /scripts. Empty. He checked /opt/redis/scripts. Nothing. He tried every pattern he could think of and fired off ripgrep in a thousand directions. He finally found /home/deploy/scripts with one file: old_recovery_DO_NOT_USE.sh. There was nothing else and the runbook did say to run it, so he executed it. The script referenced a config file that had been deleted and a Redis version two major releases behind. Errors plastered the terminal like graffiti in a different language. Markus started panicking. He was new and wanted to make a good impression, but things were not going well. Markus checked the runbook again. There was another step!
Step 7: “If issues persist, contact the Redis team lead.”
Markus started looking in the company wiki for the Redis team. Surely they would know what to do. Upon further investigation, the Redis team was someone named Jaden. Jaden had left eight months ago. His email bounced. His Slack was deactivated. Legend had it he moved to Portugal to raise goats. “Why is nothing working?” Markus yelled at his laptop. “We have documentation for EVERYTHING.”
Finally, Sarah from the platform team joined the incident call at 3:15 AM. Markus explained what he had been doing for the past 30 minutes. “Oh, that runbook? Nobody’s used that in over a year. Let me walk you through what we actually do,” Sarah said. Twenty minutes later, with Sarah narrating the real steps, Redis was back. Markus took notes furiously. The next morning, he updated the wiki with what actually happened: the real hostnames, the actual commands, his own phone number instead of Jaden’s.
Two months later, Markus left for a startup that paid better. His updated runbook sat in a folder nobody checked. The next engineer to get paged would find the original doc, still labeled FINAL-FINAL, and start the cycle all over again.
True story? Does it matter? This pattern is brutally familiar to anyone who has been on call. In a recent Transposit survey, 74% of incident responders said outdated or incomplete runbooks directly contributed to longer incident resolution times. Runbooks don’t fail because engineers don’t care; they fail because the way we write, review, and maintain them is fundamentally broken, especially under constant change. With AI now threaded through every tool and workflow, having a dependable, human-validated source of truth to anchor that automation has never been more critical.
The Concept: Why Runbooks Rot
Runbooks have an incident-response reputation, but that’s selling them short. Any task a human does more than once deserves a runbook: deployments, credential rotation, customer onboarding, environment setup. If you’ve explained it twice in Slack, it should be a runbook. I know, automating these tasks is “better” long term, but then who knows how to fix the automation when it breaks?
So why do they keep failing?
Written from Memory, Not Observation
Someone fixes a problem. Days later, someone suggests documenting it. The engineer sits down and writes what they remember doing. Memory lies. We skip steps that feel obvious. We forget the weird workaround we did because the standard approach didn’t work. The doc describes the ideal procedure, not the real one with all its quirks. A new engineer tries to follow it, gets stuck, and asks in Slack. Now you’re explaining it all over again.
Static in a World That Changes Weekly
Your infrastructure evolves. Dependencies update. That endpoint moved. The dashboard got renamed. The runbook stays frozen, describing a system that no longer exists. According to PagerDuty’s 2024 State of Digital Operations report, teams spend an average of 9.4 hours per week on documentation maintenance, yet 68% still report their operational docs are out of date. There’s no feedback loop. No mechanism to fold lessons back in. Updating docs feels like homework, so it doesn’t happen.
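A crude feedback loop beats none. Here is a minimal sketch that flags runbooks nobody has touched in a while. It assumes your runbooks are Markdown files under a single directory; the function name `stale_runbooks` and the 90-day default are my own placeholders. File modification time is only a proxy for "still accurate", so treat the output as a list of docs to re-test, not a verdict.

```bash
# List runbooks that have not been modified in more than N days.
# Assumes runbooks are *.md files under one directory (hypothetical layout).
stale_runbooks() {
  local dir="${1:?usage: stale_runbooks DIR [DAYS]}"
  local days="${2:-90}"
  # -mtime +N matches files last modified more than N days ago.
  find "$dir" -type f -name '*.md' -mtime "+$days" | sort
}
```

Run it from a weekly cron job or a CI step, for example `stale_runbooks ./runbooks 90`, and post the result wherever your team will actually see it.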
Formatted for Reading, Not Doing
Most runbooks live in wikis optimized for writing, not for executing under pressure. Walls of text. Vague instructions like “check the logs” with no specifics. Links to login pages instead of actual dashboards. When you’re tired at 3 AM or juggling three things at 2 PM, none of this works.
The Mindset Shift: Runbooks as Recordings
Here’s what changes everything: think of runbooks as recordings of real work, not documentation of imagined work. Capture how tasks actually get done. Document as you execute, not after. Treat it like screen recording for ops. Include the detours, the “wait, I also had to do this” moments, the step where you waited 30 seconds for something to propagate. Every execution becomes a revision opportunity. The person who just ran the procedure is the best possible editor. They know exactly what was unclear and what was missing.
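If you work in a terminal, one way to approximate "recording" is to wrap each command so it gets logged as it runs. This is a rough sketch, not a polished tool: the `step` and `note` helpers and the `RUNBOOK_LOG` variable are names I made up for illustration.

```bash
# Where the session log is written. RUNBOOK_LOG is a hypothetical variable.
RUNBOOK_LOG="${RUNBOOK_LOG:-./runbook-session.log}"

# Run a command, then append a UTC timestamp, the command, and its exit
# code to the log. The command's own exit code is passed through.
step() {
  local started status=0
  started="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  "$@" || status=$?
  # "$*" flattens quoting; the log is meant for humans, not for replay.
  printf '%s\t%s\texit=%d\n' "$started" "$*" "$status" >> "$RUNBOOK_LOG"
  return "$status"
}

# Free-form note for the detours, e.g. "waited 30 seconds for propagation".
note() {
  printf '%s\tNOTE: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$RUNBOOK_LOG"
}
```

During a procedure you would type `step redis-cli --bigkeys` instead of the bare command, and `note "had to wait for failover first"` when reality diverges from the doc. The log then becomes the raw material for the next revision.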
Make the update tax nearly zero. If updating takes 30 minutes, it won’t happen. A quick note saying “as of March 2025, also restart the worker service” is infinitely more valuable than a pristine doc missing that step.
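To make that concrete, here is a small sketch of a one-line "update tax". It assumes runbooks are Markdown files and that dated notes live in a `## Notes` section kept at the end of the file; both the convention and the `runbook_note` name are assumptions for illustration.

```bash
# Append a dated one-line note to a runbook, adding a "## Notes" heading
# at the end of the file the first time it is needed.
runbook_note() {
  local file="${1:?usage: runbook_note FILE TEXT...}"
  shift
  [ "$#" -gt 0 ] || { echo "runbook_note: note text required" >&2; return 1; }
  # Only add the heading if an exact "## Notes" line is not already present.
  grep -qx '## Notes' "$file" 2>/dev/null || printf '\n## Notes\n' >> "$file"
  printf -- '- %s: %s\n' "$(date +%Y-%m-%d)" "$*" >> "$file"
}
```

Something like `runbook_note redis-memory.md also restart the worker service` takes five seconds, which is about the budget a tired on-call engineer has.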
For onboarding new engineers, this is transformative. A new engineer with a good “set up your dev environment” runbook isn’t blocked waiting for someone’s calendar. “How we deploy” becomes something they can follow in week one. Tribal knowledge becomes transferable on day one.
When Static Documentation Still Makes Sense
Not everything needs to be a living runbook. Static documentation works fine for:
- Conceptual overviews: Architecture diagrams, system design docs, and ADRs explain the “why” and don’t change with every deployment.
- Reference material: API docs, configuration options, and schema definitions are updated with code changes, not procedure executions.
- Compliance artifacts: Some audits require point-in-time documentation that shouldn’t be casually edited.
The distinction: if it’s a procedure someone follows step-by-step, make it a living runbook. If it’s reference material someone reads for understanding, static docs are fine.
Hands-On: The Five Sections Every Runbook Needs
Regardless of what tool you use, these five sections should be non-negotiable.
1. Overview
What this runbook is for and when to use it. One paragraph max.
## Overview
Use this runbook when the Redis memory usage alert fires (>80% utilization) or when the application logs show "Redis connection refused" errors.
2. System Context
What systems does this touch? What access do you need? What could break?
## System Context
**Systems involved:** Redis cluster (redis-prod-1 through redis-prod-3), API servers, background workers
**Prerequisites:**
- SSH access to production hosts (request via IT if needed)
- Membership in `ops-oncall` PagerDuty group
- Access to Grafana dashboards
**What could break:** Restarting Redis will cause ~30 seconds of increased latency. Background jobs will retry automatically.
3. Procedure
Step-by-step instructions. Commands should be copy-pasteable.
## Procedure
1. Check current memory usage:
```bash
ssh redis-prod-1.internal.example.com
redis-cli INFO memory | grep used_memory_human
```
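2. If above 80%, ask the allocator to release memory it is holding but no longer using (`MEMORY PURGE` reclaims dirty pages; it does not evict keys):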
```bash
redis-cli --no-auth-warning MEMORY PURGE
```
3. If memory doesn’t drop, check for large keys:
```bash
redis-cli --bigkeys
```
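Step 1 above asks a human to eyeball a number against a threshold. A small helper can do the arithmetic instead. This sketch assumes "80% utilization" means `used_memory` divided by `maxmemory`, both of which appear in `INFO memory` output; the function name `redis_memory_percent` is hypothetical.

```bash
# Read `redis-cli INFO memory` output on stdin and print used memory as a
# whole-number percentage of maxmemory (rounded down).
# Prints "unlimited" when maxmemory is 0, i.e. no limit is configured.
redis_memory_percent() {
  # Redis ends INFO lines with \r\n, so strip carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 == 0) { print "unlimited"; exit }
      printf "%d\n", (used * 100) / max
    }'
}
```

Usage would look like `redis-cli INFO memory | redis_memory_percent`, which gives the runbook a single number to compare against 80.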
4. Validation
How do you know it worked? Be specific.
## Validation
- [ ] Memory usage below 70% (check Grafana: [dashboard link])
- [ ] No "connection refused" errors in app logs for 5 minutes
- [ ] Background job queue processing normally (Sidekiq dashboard shows no backlog)
5. Rollback
How to undo if things go wrong.
## Rollback
If Redis fails to restart:
1. Check logs: `journalctl -u redis -n 100`
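2. Restore from the last snapshot: confirm the RDB file is in the Redis data directory, then start Redis so it loads the snapshot on startup: `systemctl start redis` (`DEBUG RELOAD` needs a running server, so it won’t help here)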
3. If still broken, escalate to #platform-oncall (Sarah, Mike, or Ajai)
This procedure cannot make things worse than a non-responsive Redis, but escalate immediately if restart fails.
The Takeaway
Runbooks fail when they describe how we think work happens instead of how it actually happens. They fail when updating them feels like homework. They fail when they’re formatted for reading instead of doing.
The fix: treat runbooks as recordings of real work. Capture procedures as you execute them. Update after every use. Structure them for someone who’s tired, distracted, or new to the team. Five sections, every time: Overview, System Context, Procedure, Validation, Rollback. Whether it’s incident response, onboarding, or any task your team does twice.
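The five-section rule is easy to check mechanically. Here is a minimal sketch of a lint, assuming your runbooks are Markdown files that use the same `## ` headings as the example above; the `missing_sections` name is made up for illustration.

```bash
# Print each of the five required "## " headings that is missing from a
# runbook file. Returns 0 when all are present, 1 otherwise.
missing_sections() {
  local file="${1:?usage: missing_sections FILE}"
  local section missing=0
  for section in "Overview" "System Context" "Procedure" "Validation" "Rollback"; do
    # -x requires the whole line to match, so "## Rollback plan" would not count.
    if ! grep -qx "## $section" "$file"; then
      echo "$section"
      missing=1
    fi
  done
  return "$missing"
}
```

Wire it into a pre-commit hook or CI job if your runbooks live in a repo. It won’t tell you whether the content is any good, but it will catch the runbook that has no Rollback section before someone needs one.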
Reader Challenge
Pick one task your team explains repeatedly in Slack. Next time someone does it, capture every step as they go. Turn it into a runbook with all five sections. Then hand it to someone who’s never done that task and see if they can follow it cold.
Fix what breaks. That’s your first real runbook.
Until next time,
Bradley
Chief Advocate for Keeping It Simple