Blue-Collar Engineering Dispatch #5: "Measure First, Automate Second"

Measure First, Automate Second

Hi there, and welcome!

It’s hard to escape AI right now. Every tool claims to save hours, replace workflows, and supercharge your team.

But the trap underneath the noise isn’t an AI hype problem. It’s an engineering problem we already know how to avoid. So in this dispatch, I want to talk about something foundational: measurement. This isn’t about which tool to use. It’s about the mistake teams make before they even pick one: they skip the baseline.

[Image: pushing buttons blindly]

The Tale of PromptPilot: The AI That Saved Everything (Allegedly)

PromptPilot built software for insurance brokers. Not glamorous, but it worked. Their core product helped brokers generate client quotes by pulling data from a handful of carrier APIs and formatting the results into a PDF report. The process was manual: a broker would log in, enter client details, kick off the quote request, review the results, and send the PDF. It took a few minutes per quote, and brokers ran maybe thirty quotes a day. The product worked. Customers paid. Renewal rates were solid. Then someone went to a conference.

Diego came back from that conference a changed man. He had sat through four talks about AI-powered document workflows, watched three live demos of agent-based automation, and had his badge scanned by eight different vendors with AI in their pitch decks. By the time his flight landed, he had already drafted a Slack message to the engineering team.

"We're automating the quote workflow with AI," he announced. "Brokers shouldn't have to touch this. The model reads the client data, queries the carriers, interprets the results, and drafts the report. Full automation. This is our competitive moat."

Maya, the lead engineer, asked a reasonable question: "How long does the current process take, end to end?"

Diego waved his hand. "Too long. That's the whole point."

"Right, but do we have a number? Average time per quote? Error rate on the PDFs? How often do brokers request revisions?"

Diego checked his conference notes. Nothing. He checked the product analytics dashboard. Also nothing. They had event tracking on signups and payment flows but had never instrumented the quote workflow itself. "We'll measure the improvement after we ship," Diego said. Maya flagged it in the project doc. The flag was ignored.

The team spent eleven weeks building the AI pipeline. LangChain for orchestration, a fine-tuned prompt chain for interpreting carrier responses, a PDF generation step that kept hallucinating coverage limits, and a human review queue for the cases the model flagged as uncertain. Which, early on, was about 40% of quotes.

They shipped. Brokers tried it. Some liked it. Some didn't trust the AI-generated summaries and kept using the old workflow in parallel, which the team hadn't anticipated and couldn't measure because they still had no instrumentation on the original process.

Three months after launch, Diego asked Maya for the impact report. She stared at her screen for a long moment. "We don't have a before," she said. "We have an after. We have no idea what we improved." Diego asked if they could just survey the brokers. Maya sent the survey. The results were mixed enough to be useless. Some brokers said it was faster. Some said it felt less reliable. One wrote a paragraph about how the old way was "just fine, actually."

PromptPilot had shipped a real AI feature, spent real money building it, and had no defensible answer to whether it had made anything better.

I know the story is familiar enough to sting a little. The pressure to ship AI features is real right now, and measuring the thing you're replacing feels like a delay. It isn't. It's the only way you'll know if you succeeded.

The Concept: Automation Without a Baseline Is Just a Guess

The AI hype cycle has produced one particularly expensive habit: teams automating processes they've never measured. You know the workflow is slow because people complain about it. You know it has errors because you've seen a few. So you build the automation, ship it, and then stand in front of leadership trying to explain the ROI with nothing but vibes and a demo video.

This isn't a new problem. It predates AI by decades. But AI projects have a way of amplifying it, because building is so much cheaper, the expectations are louder, and the failure modes are stranger.

The Overhead You're Not Counting

Manual processes look expensive because the labor is visible. An engineer spends two hours a day on a task; you see two hours. What you often don't see: how long the AI version actually takes when you include prompt latency, retry logic, human review of low-confidence outputs, and the edge cases the model handles badly. Without a measured baseline, you can't compare any of that. You're doing arithmetic with one number.
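
To make that arithmetic concrete, here's a back-of-envelope sketch. Every number in it is made up for illustration; the point is that the comparison needs several measured inputs, not one.

# Hypothetical numbers, for illustration only.
manual_minutes_per_quote = 6.0   # this is what a baseline would give you

ai_generation_minutes = 0.75     # model latency plus retries
review_rate = 0.40               # share of quotes flagged for human review
review_minutes = 4.0             # human time per reviewed quote

ai_effective_minutes = ai_generation_minutes + review_rate * review_minutes

print(f"Manual:         {manual_minutes_per_quote:.1f} min/quote")
print(f"AI (effective): {ai_effective_minutes:.1f} min/quote")
# If the 6.0 is a guess rather than a measurement, the comparison is, too.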

Baseline measurement also forces you to understand what you're actually automating. You find the edge cases before the model does. You discover the informal quality check someone does mid-process that nobody documented. You realize the three-minute task is three minutes on a good day and fifteen minutes when the upstream API is slow. These details matter for designing the automated version.

The Invisible Error Rate

Manual processes have error rates too. People make mistakes, miss fields, apply outdated rules. But teams often don't know their manual error rate because nobody measured it. Then the AI ships with a 5% error rate and everyone panics, without knowing whether the manual process had an 8% error rate that everyone had silently accepted. You cannot evaluate the AI's accuracy without knowing the baseline accuracy. This sounds obvious. It almost never happens in practice.

The Perception Problem

Even when AI automation performs better by every objective measure, user perception often lags. Brokers who trusted their own judgment for ten years will not immediately trust a model's output, even if the model is more consistent. If you measured nothing before launch, you can't counter that perception with data. You're stuck responding to feelings with vibes.

Measurement gives you a story you can actually tell. Before, the average quote took six minutes and had a 7% revision request rate. After three months with the new workflow, you're at three minutes and 4%. That is a conversation you can have. Without the baseline, you have a shrug.

When Skipping the Baseline Is Actually Fine

To be fair, not every AI project needs rigorous instrumentation before you start.

  • Greenfield features. If there's no existing process, there's no baseline to measure. Build it, instrument it from day one, and treat your first month of data as your baseline.

  • Low-stakes internal tooling. If you're automating something that's annoying but not business-critical, a rough before-and-after is probably sufficient. Time it yourself a few times, note it down, move on.

  • Experiments with short runways. If you're running a two-week proof of concept before committing to a full build, spending a week on instrumentation first is probably the wrong ratio.

The further a project is from these conditions, the more you need a real baseline. Customer-facing workflows, revenue-adjacent processes, anything that replaces a task your team has been doing for years: measure it before you touch it.

Hands-On: Capturing a Baseline Before You Build

The goal here is a simple, durable record of how the current process performs. You don't need a data warehouse. You need enough data to answer three questions after launch: Is it faster? Is it more accurate? Are people actually using it?

Tools:

  • A spreadsheet or notes (seriously, start here)

  • Optional: a lightweight logging wrapper in your existing code

  • Optional: a simple timing script for CLI or API-based tasks

Step 1: Define what you're measuring

Write down three to five metrics before you start collecting anything. Keep them specific and tied to outcomes, not implementation details.

Process: insurance quote generation
Metrics:

  • Time from form submission to PDF delivered (seconds)

  • Revision request rate (% of quotes that get sent back)

  • Error rate on required fields (% of PDFs missing or incorrect data)

  • Broker satisfaction (1-5 rating, sampled weekly)

  • Volume per broker per day (quotes completed)

If you can't name your metrics before you build, you don't understand what you're trying to improve well enough to automate it yet.

Step 2: Instrument the current process

If the manual process runs through your product, add timing and outcome logging before you build anything new. A minimal Python example:

import time
import logging
import uuid

logger = logging.getLogger("baseline")

def track_quote_workflow(broker_id: str, client_id: str):
    trace_id = str(uuid.uuid4())
    start = time.monotonic()

    # Existing workflow runs here
    result = run_existing_quote_workflow(broker_id, client_id)

    duration_ms = (time.monotonic() - start) * 1000

    logger.info({
        "event": "quote_completed",
        "trace_id": trace_id,
        "broker_id": broker_id,
        "client_id": client_id,
        "duration_ms": round(duration_ms),
        "pdf_field_errors": result.field_errors,
        "revision_requested": result.revision_requested,
        "workflow_version": "manual_v1"
    })

    return result

The workflow_version field is important. When you ship the AI version, you change that field to ai_v1. Now your logs let you compare both versions with the same queries.

Step 3: If there's no code to instrument, time it manually

Sometimes the process is a person doing things in a browser. That is fine. Sample it.

# Baseline Observation Log
Process: Quote generation (manual)
Observer: Maya
Period: Oct 1-14, 2025

Date       | Broker | Quote ID | Start     | End       | Duration | Errors | Revision?
-----------|--------|----------|-----------|-----------|----------|--------|----------
2025-10-01 | B-114  | Q-4421   | 10:02 AM  | 10:08 AM  | 6m 10s   | 0      | No
2025-10-01 | B-099  | Q-4422   | 11:15 AM  | 11:23 AM  | 7m 45s   | 1      | Yes
2025-10-02 | B-114  | Q-4430   | 2:30 PM   | 2:35 PM   | 4m 55s   | 0      | No
...

Thirty observations over two weeks is enough to establish a working baseline for most workflows. You want the mean, the median, and the range. The range matters because AI systems often reduce average time while increasing worst-case time, and you won't catch that trade-off without it.
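
Once you have the observations, the summary statistics are a few lines of standard-library Python. The first three durations below come from the log above; the rest are placeholders standing in for your remaining observations.

from statistics import mean, median

# Durations in seconds, transcribed from the Duration column of the observation log.
durations = [370, 465, 295, 410, 350, 540, 900, 330]

print(f"n      = {len(durations)}")
print(f"mean   = {mean(durations):.0f}s")
print(f"median = {median(durations):.0f}s")
print(f"range  = {min(durations)}s to {max(durations)}s")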

Step 4: Store the baseline somewhere permanent

This part matters more than people think.

Once the new system ships, the story will start changing. People will remember the old process as worse than it was. Or better than it was. Or they’ll just forget entirely.

If the baseline isn’t easy to find, it effectively doesn’t exist.

Put it somewhere your team will actually look:

  • the project repo

  • the PRD

  • the dashboard you’ll use after launch

The goal isn’t permanence for its own sake. It’s visibility.

If someone asks, “Did this make things better?” you should be able to answer in under a minute, without digging through logs or Slack threads.
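
For the PromptPilot example, the stored record can be as small as a dozen lines. The numbers below are placeholders; yours come from the observation log and the instrumentation above.

BASELINE: Quote generation (manual)
Period:   Oct 1-14, 2025 (30 observations)

Mean time per quote:     6m 05s
Median time per quote:   5m 50s
Range:                   4m 10s to 15m 30s
Revision request rate:   7%
Field error rate:        3%

Source: baseline observation log + manual_v1 events in the quote logs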

Step 5: Add the same instrumentation to the AI version on day one

logger.info({
    "event": "quote_completed",
    "trace_id": trace_id,
    "broker_id": broker_id,
    "client_id": client_id,
    "duration_ms": round(duration_ms),
    "pdf_field_errors": result.field_errors,
    "revision_requested": result.revision_requested,
    "workflow_version": "ai_v1",          # Changed
    "model_confidence": result.confidence, # New
    "human_review_required": result.flagged_for_review  # New
})

Now, after thirty days in production, you can pull both versions from the same log stream and compare them directly. You have an actual answer.
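
Here's a sketch of that comparison, assuming you've pulled the quote_completed events out of your log store as a list of dicts shaped like the payloads above. The function name is mine, not part of any library.

from statistics import mean, median

def summarize_by_version(events):
    # Group events by workflow_version and print a small summary for each.
    by_version = {}
    for e in events:
        by_version.setdefault(e["workflow_version"], []).append(e)

    for version, rows in sorted(by_version.items()):
        durations_s = [r["duration_ms"] / 1000 for r in rows]
        revision_rate = sum(1 for r in rows if r["revision_requested"]) / len(rows)
        print(
            f"{version}: n={len(rows)}, "
            f"mean={mean(durations_s):.0f}s, median={median(durations_s):.0f}s, "
            f"revision rate={revision_rate:.0%}"
        )

# Example output (made-up numbers):
#   ai_v1:     n=388, mean=190s, median=150s, revision rate=4%
#   manual_v1: n=412, mean=365s, median=350s, revision rate=7%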

The Takeaway

AI will not tell you whether AI helped. You have to set that up yourself, before you ship. Define what good looks like, measure the current state, and keep that record somewhere permanent. The baseline is not overhead. It is the only thing that turns intuition into evidence. Build it before you build the automation.

Reader Challenge

Think about the last AI or automation project your team shipped. Do you have a documented baseline for the process it replaced? If not, what would it take to reconstruct one from logs or observation? Reply and tell me what you find. I'm curious how many teams are sitting on data they could use and haven't looked at yet.

Until next time,

Bradley

Chief Advocate for Keeping It Simple