Blue-Collar Engineering Dispatch #2: The Overmonitoring Trap
When Simple Metrics Are Enough
Hi there, and welcome!
Welcome to the latest issue of Blue-Collar Engineering Dispatch! This newsletter is all about building reliable, maintainable systems without falling into the trap of over-engineering. If you've ever spent weeks setting up a system only to realize it was way more than you needed, this newsletter is for you.
Today, we're tackling another industry sacred cow: complex observability stacks and the myth that you need to monitor absolutely everything. As I like to do, we'll start off with a tale of things going wrong.

The Tale of Schedulicious
Schedulicious was a promising startup with a scheduling app for freelancers to manage client bookings. With a mighty team of six engineers and $2 million from demo day, they were determined to build a system that would never, ever go down. "Reliability is our competitive advantage," their CTO Jaden would say in every meeting, usually followed by, "we need more metrics."
Their MVP was built on a simple Ruby on Rails monolith with a PostgreSQL database, running on two load-balanced EC2 instances. It worked flawlessly, serving their initial 300 users with near-perfect uptime. But after Jaden attended KubeCon (and a few drinks), he returned with a gleam in his eye and a new buzzword: "observability."
"How can we improve what we can't observe?" he asked the team during a Friday meeting. "We need distributed tracing, comprehensive metrics, and centralized logging. We need to monitor everything." When senior engineer VJ suggested that their current CloudWatch metrics were sufficient for their needs, Jaden dismissed him with a wave. "That's old-school monitoring. We need to be on the cutting edge."
What followed was three months of observability hell. Cue the montage.
The team set up:
A Prometheus server with hundreds of custom metrics
A full ELK stack for log aggregation
Grafana with 27 different dashboards (most created without version control and never viewed again)
Jaeger for distributed tracing (despite having only one service)
Custom health check probes that hit every endpoint every 5 seconds, including DB connectivity
PagerDuty alerts for any deviation from baseline metrics
The infrastructure costs quintupled. Three of the six engineers were now dedicated to maintaining the observability stack. And the alerts? They never stopped. One engineer quit after being woken up at 3 AM because CPU utilization spiked from 15% to 20% for two minutes.
The breaking point came during an actual outage. One Friday afternoon, users couldn't log in. The system was down, and the team frantically searched through hundreds of dashboards and thousands of metrics trying to find the problem. "Why can't we find anything?" scowled another engineer on the incident bridge. After four hours, an embarrassed VJ discovered the issue: they had simply hit their PostgreSQL connection limit. It was a metric that wasn't even on their fancy dashboards, but it would have been immediately obvious with basic monitoring.
"This is ridiculous," Jaden finally admitted. "We're spending more time watching the system than improving it."
The next day, they tore down most of their observability stack. They kept Prometheus for basic metrics, simplified their logging, and reduced their alerts to just what required human intervention. Their monthly AWS bill dropped by 70%. Engineer happiness improved by approximately 110%.
"Sometimes the best way to watch a system is to just let it run," VJ noted as they deleted their 27th unused Grafana dashboard.
The Observability Arms Race
While this story is a little over the top, it's true that in today's tech landscape, the pressure to adopt complex monitoring solutions is intense. Conference talks, blog posts, and tool vendors all push the same message: if you're not collecting every possible metric, you're falling behind. Which came first, the complexity of the system or the need for complicated observability?
This has created an observability arms race where organizations compete to build the most comprehensive monitoring systems, often forgetting the original purpose: identifying and fixing problems quickly. The result? Elaborate monitoring setups that cost a fortune, make onboarding new engineers harder, and generate so much noise that actual issues get lost in the shuffle. It's like installing a thousand smoke detectors in your house. When one goes off, you'll have no idea where the fire is, or if there's a fire at all.
Overmonitoring, which may be a made-up word, isn't just a theoretical problem; it creates very real costs:
Infrastructure overhead: Storing and processing metrics and logs requires significant computing resources. It is more expensive to manage your own log and metric store than you think.
Engineering time: Complex monitoring systems require constant maintenance and updates. That's time not spent improving your product.
Alert fatigue: When everything triggers an alert, engineers start ignoring them, which defeats the purpose of alerting in the first place.
Cognitive overload: Having to check dozens of dashboards creates mental fatigue and makes it harder to spot actual problems and correlations.
False positives: Complex monitoring inevitably leads to alerts triggered by normal system behavior, causing unnecessary stress and context-switching.
As a former colleague once put it: "When everything is critical, nothing is critical."
The Blue-Collar Approach to Monitoring
Instead of monitoring everything, what if we monitored the right things? Here's a more practical approach:
Start with the Four Golden Signals
Google's SRE book popularized these four critical metrics that apply to almost any system:
Latency: How long it takes to serve a request
Traffic: How much demand is placed on your system
Errors: The rate of failed requests, as a fraction of all requests
Saturation: How "full" your system is (CPU, memory, disk I/O, etc.)
For most applications, these four metrics will tell you 90% of what you need to know about system health.
Here's a simple Prometheus query to capture error rates:
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
Here we divide the rate of 5xx responses by the rate of all HTTP requests over a 5-minute window. The result is the error rate as a fraction; multiply by 100 to get a percentage.
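Latency, the first golden signal, can be captured in the same spirit. As a rough sketch, assuming your framework exports a standard Prometheus histogram (the metric name http_request_duration_seconds_bucket here is a common convention, not something specific to your app), a 95th-percentile latency query looks like this:
histogram_quantile(
  0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
This estimates the p95 request duration over the last 5 minutes; how accurate it is depends on how your histogram buckets are configured.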
Add Business Metrics That Matter
Beyond system health, track metrics that directly impact users or business outcomes:
Conversion rates
Active users
Core functionality usage
Cart abandonment (for e-commerce)
Revenue per hour
Here is a Prometheus query that could represent active users:
count(
  user_last_seen_seconds > (time() - 300)
)
Here we count how many users, tracked by the metric user_last_seen_seconds, have been seen in the last 5 minutes (300 seconds).
Implement Meaningful Alerts
Only alert on conditions that:
Indicate a real problem (not just a deviation from baseline)
Require human intervention. This is key.
Need attention now (vs. issues that can wait until morning)
Use This Simple Decision Tree for Alerts:
Before creating any alert, ask yourself these three questions in order:
Is this metric critical to system operation?
If NO → Don't alert on it at all
If YES → Continue to question 2
Does a deviation require immediate action?
If NO → Create a report or dashboard item, not an alert
If YES → Continue to question 3
Can it be fixed automatically?
If YES → Implement auto-remediation, only alert on remediation failure
If NO → Create an actionable alert with clear next steps
This framework will dramatically reduce alert noise while ensuring you're notified about issues that truly matter.
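To make that concrete, here's a rough sketch of where the tree can land for the error-rate query from earlier, written as a Prometheus alerting rule. The threshold, rule names, and runbook URL are placeholders, not a prescription:
groups:
  - name: meaningful-alerts
    rules:
      - alert: HighErrorRate
        # Critical to operation, needs a human now, no safe auto-fix:
        # so it pages, and it tells the responder where to start.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "5xx error rate above 1% for 5 minutes"
          runbook_url: "https://example.com/runbooks/high-error-rate"
The for: 5m clause does much of the decision-tree work: a brief blip never pages anyone, only a sustained problem does.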
A Practical Monitoring Recipe
Here's a bare-minimum monitoring setup that works for most small to medium applications:
Infrastructure
CPU utilization (alert at sustained > 80%)
Memory usage (alert at sustained > 85%)
Disk usage (alert at > 85%)
Network I/O (for baseline, not typically alerting)
Application
Request rate (requests per second)
Error rate (e.g. percentage of 4xx/5xx responses)
95th percentile response time
Active sessions/users
Database
Connection pool utilization
Query performance (95th percentile query time)
Replication lag (if applicable)
Index hit rates
Alerting
High error rates (>1% over 5 minutes)
Sustained high latency (2x baseline for >10 minutes)
Host unavailable
Database connections approaching limit
Disk space critical (<15% free)
That's it. For most applications, these 15-20 metrics will tell you everything you need to know about system health. You can build this entire setup with basic Prometheus and Grafana configurations in a day or two. Or you can use your cloud provider's built-in tooling to do the same.
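As one concrete sketch from that alerting list, and a callback to the Schedulicious outage: if you happen to run the standard postgres_exporter, a "connections approaching the limit" check might look roughly like this (assuming a single database instance, with metric names from the exporter's defaults):
# Fires when more than 80% of PostgreSQL connections are in use.
sum(pg_stat_activity_count)
  / scalar(pg_settings_max_connections)
> 0.8
One expression, one alert, and the four-hour dashboard safari from the story becomes a single page pointing at the actual problem.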
Here's what this looks like in a simple dashboard:

One dashboard, nine critical metrics. Everything you need at a glance.
When to Add Complexity
Of course, as systems grow, monitoring needs evolve. Here are legitimate triggers for adding monitoring complexity:
Multiple interdependent services: When you have real microservices that interact, distributed tracing becomes valuable.
Complex user journeys: For applications with multi-step workflows, tracking success rates through the entire funnel helps identify problems.
SLA obligations: If you've promised specific performance targets to customers, you need the metrics to prove you're meeting them.
Highly variable traffic patterns: Systems with unpredictable load spikes benefit from more detailed capacity metrics.
Security requirements: Regulated industries may need additional monitoring for compliance and threat detection.
The key is to add complexity incrementally, as actual needs arise—not preemptively based on what might happen.
Alternatives to Complex Observability Stacks
Before building a custom monitoring solution, consider these simpler alternatives:
Cloud provider metrics: AWS CloudWatch, Google Cloud Monitoring, and Azure Monitor all provide basic metrics out of the box.
Application Performance Monitoring (APM) tools: Services like New Relic, Datadog, or Dynatrace offer comprehensive monitoring with minimal setup.
Managed logging services: Instead of running your own ELK stack, consider services like Loggly, Papertrail, or CloudWatch Logs.
These managed services handle the infrastructure complexity for you, letting you focus on the metrics that matter.
The Takeaway
Monitoring, like all engineering disciplines, benefits from simplicity. Start with the minimum viable monitoring that gives you visibility into critical issues. Add complexity only when you have concrete evidence that you need more information to solve real problems. And when a customer-facing incident does happen, add the metric or alert that would have caught it, so the same incident is harder to repeat.
Remember, the goal of monitoring isn't to collect data; it's to ensure system reliability. If your monitoring stack is more complex than the system it's monitoring, you've probably gone too far.
Reader Challenge
Take a look at your monitoring setup and ask yourself these questions:
How many alerts have you received in the past month that required immediate action?
How many dashboards do you actually look at regularly?
What percentage of your metrics are actually used to make decisions?
Try deleting one unused dashboard or disabling one noisy alert that doesn't provide value. Then see if anyone notices.
Reply to this email with your monitoring simplification stories—I might feature them in a future issue!
Until next time,
Bradley
Chief Advocate for Keeping It Simple