Most AI demos are lies.
Not intentional ones. The model genuinely did the thing. The output looked great. But the demo ran in a clean notebook, with hand-crafted inputs, on a task someone already knew the answer to. Ship that same agent into a real workflow and it falls apart by Tuesday.
A recent survey out of BeConfident Labs calls this the '98% problem.' The model itself — the raw intelligence — is maybe 2% of what makes an agent production-ready. The other 98% is the harness: the scaffolding, the guardrails, the retry logic, the context management, the output validation, the error handling. All the unglamorous infrastructure that nobody demos.
What 'Harness Engineering' Actually Means
Harness engineering is the work of making an AI agent behave predictably under real conditions — not just once, but every time.
Think of the model as an engine. The harness is everything else: the chassis, the brakes, the fuel system, the dashboard. You wouldn't drive a car that was just an engine bolted to four wheels. But that's essentially what most people deploy when they call something an 'AI agent.'
In practice, harness engineering covers things like:
- Context windows and memory management — deciding what the agent knows at each step, and what it forgets
- Tool call validation — checking that the agent is actually using tools correctly before you let it touch live data
- Retry and fallback logic — what happens when the model returns garbage or times out
- Output parsing and schema enforcement — making sure the response is actually usable downstream
- Observability — logging enough to debug failures without drowning in noise
None of this is exciting. All of it is load-bearing.
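To make one item on that list concrete, here is a minimal sketch of retry and fallback logic. It is an illustration, not a reference implementation: `ModelCallError`, `call_with_retry`, and the default attempt count and delay are assumptions, and `call` stands in for whatever function actually hits your model.

```python
import time
from typing import Callable


class ModelCallError(Exception):
    """Hypothetical error: the model call timed out or returned unusable output."""


def call_with_retry(
    call: Callable[[], str],
    fallback: Callable[[], str],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try `call` up to max_attempts times with exponential backoff, then fall back."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ModelCallError:
            # Back off before the next attempt, but not after the last one.
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # Every attempt failed: return a known-safe answer instead of crashing.
    return fallback()
```

The `sleep` parameter is injectable so the backoff can be skipped in tests. The fallback can be as dull as a canned "could not complete this step" message, as long as it is something you chose in advance.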
Why Solo Operators Feel This Pain Hardest
Large teams can absorb the harness problem. They hire ML engineers, platform engineers, QA. Someone owns reliability. Someone else owns tooling. The work gets distributed.
Solo operators don't have that luxury. You're the ML engineer, the platform engineer, the QA, and the customer. When an agent breaks at 11pm before a client deliverable, there's no escalation path. It's just you and a stack trace.
The survey's insight lands differently when you're operating alone: the 98% that isn't the model is 98% of your time. Every hour you spend debugging why an agent looped, why it hallucinated a file path, why it called the wrong tool — that's an hour you're not delivering work or building the business.
This is why 'just use GPT-4' is incomplete advice. The model is the easy part. The harness is the job.
The Patterns That Actually Work
The survey maps out several harness patterns that show up consistently in reliable agent deployments. A few worth knowing:
Structured output enforcement. Don't let the model free-form its response if you need to parse it. Define a schema upfront, validate against it, and reject anything that doesn't conform. Sounds obvious. Most people skip it until something breaks in production.
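A rough sketch of what schema enforcement can look like with nothing but the standard library. The `Ticket` schema, its fields, and the allowed categories are invented for illustration; in practice you might reach for a validation library instead.

```python
import json
from dataclasses import dataclass


class SchemaError(ValueError):
    """Raised when model output does not match the expected schema."""


@dataclass(frozen=True)
class Ticket:
    # Hypothetical schema: a support-ticket triage result.
    category: str
    priority: int


ALLOWED_CATEGORIES = {"billing", "bug", "other"}


def parse_ticket(raw: str) -> Ticket:
    """Parse raw model output as JSON and reject anything off-schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise SchemaError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or set(data) != {"category", "priority"}:
        raise SchemaError("expected exactly the keys 'category' and 'priority'")
    category, priority = data["category"], data["priority"]
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        raise SchemaError(f"unknown category: {category!r}")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(priority, bool) or not isinstance(priority, int):
        raise SchemaError("priority must be an integer")
    if not 1 <= priority <= 5:
        raise SchemaError("priority must be between 1 and 5")
    return Ticket(category, priority)
```

Anything that raises `SchemaError` never reaches downstream code. The caller decides whether to retry the model or give up.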
Step-level checkpointing. For multi-step agents, save state at each step. If something fails at step 7, you resume from the state saved after step 6 and rerun only step 7, not the whole run. This alone can cut debugging time significantly.
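Here is one way checkpointing could look, assuming each step is a plain function that takes and returns a JSON-serializable dict. `run_with_checkpoints`, the checkpoint file layout, and the two demo steps are all made up for this sketch.

```python
import json
import os
import tempfile
from typing import Callable

Step = Callable[[dict], dict]


def run_with_checkpoints(steps: list[Step], path: str) -> dict:
    """Run steps in order, saving state after each; resume after the last saved step."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            saved = json.load(f)
    else:
        saved = {"next_step": 0, "state": {}}
    state = saved["state"]
    for index in range(saved["next_step"], len(steps)):
        state = steps[index](state)
        # Write to a temp file, then rename, so a crash cannot leave a half-written checkpoint.
        tmp = path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump({"next_step": index + 1, "state": state}, f)
        os.replace(tmp, path)
    return state


# Demo: the second step fails once, then succeeds on the rerun.
calls: list[str] = []


def fetch(state: dict) -> dict:
    calls.append("fetch")
    return {**state, "rows": 3}


def flaky_summarize(state: dict) -> dict:
    calls.append("summarize")
    if calls.count("summarize") == 1:
        raise RuntimeError("simulated timeout")
    return {**state, "summary": f"{state['rows']} rows"}


checkpoint = os.path.join(tempfile.mkdtemp(), "run.json")
try:
    run_with_checkpoints([fetch, flaky_summarize], checkpoint)
except RuntimeError:
    pass  # the first run dies in step 2; step 1's result is already on disk
result = run_with_checkpoints([fetch, flaky_summarize], checkpoint)
```

On the second run `fetch` is not called again. The run picks up from the saved state and only repeats the step that failed.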
Narrow tool surfaces. The more tools an agent can call, the more ways it can go wrong. Give it exactly the tools it needs for the task. Nothing extra. Scope creep in tooling is a reliability killer.
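A small sketch of a per-task allowlist, so the agent can only reach the tools a task actually needs. The tool functions, `ALL_TOOLS`, and `make_dispatcher` are hypothetical names, not any framework's API.

```python
from typing import Any, Callable


class ToolNotAllowed(Exception):
    """Raised when the agent requests a tool outside the task's allowlist."""


# Hypothetical tools, stubbed out for illustration.
def read_file(path: str) -> str:
    return f"<contents of {path}>"


def delete_file(path: str) -> str:
    return f"deleted {path}"


ALL_TOOLS: dict[str, Callable[..., Any]] = {
    "read_file": read_file,
    "delete_file": delete_file,
}


def make_dispatcher(allowed: set[str]) -> Callable[[str, dict], Any]:
    """Return a dispatcher that can only reach the tools named in `allowed`."""
    unknown = allowed - set(ALL_TOOLS)
    if unknown:
        raise ValueError(f"unknown tools in allowlist: {sorted(unknown)}")
    tools = {name: ALL_TOOLS[name] for name in allowed}

    def dispatch(name: str, args: dict) -> Any:
        # Anything not on the allowlist is refused, even if it exists globally.
        if name not in tools:
            raise ToolNotAllowed(name)
        return tools[name](**args)

    return dispatch


# A summarization task gets read access and nothing else.
summarizer_dispatch = make_dispatcher({"read_file"})
```

With this shape, a summarization agent that tries to call `delete_file` gets a `ToolNotAllowed` error rather than a deleted file.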
Explicit failure modes. Define what 'failure' looks like before you deploy. What should the agent do when it's uncertain? When it hits a rate limit? When the input is malformed? Agents without explicit failure handling make up their own answers — and those answers are usually wrong.
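One way to write failure modes down before deployment is a plain lookup table from failure to action. The three failures mirror the questions above; the enum names and the chosen actions are assumptions for illustration, and your own policy may well differ.

```python
from enum import Enum


class Failure(Enum):
    MALFORMED_INPUT = "malformed_input"
    RATE_LIMITED = "rate_limited"
    LOW_CONFIDENCE = "low_confidence"


class Action(Enum):
    REJECT = "reject"            # refuse the request and tell the caller why
    RETRY_LATER = "retry_later"  # back off and try again
    ESCALATE = "escalate"        # hand off to a human instead of guessing


# Decided before deployment, not improvised by the model at runtime.
POLICY: dict[Failure, Action] = {
    Failure.MALFORMED_INPUT: Action.REJECT,
    Failure.RATE_LIMITED: Action.RETRY_LATER,
    Failure.LOW_CONFIDENCE: Action.ESCALATE,
}

# Fail at import time if a failure mode is added without a policy for it.
missing = set(Failure) - set(POLICY)
if missing:
    raise RuntimeError(f"no policy for: {sorted(f.name for f in missing)}")


def handle(failure: Failure) -> Action:
    """Look up the pre-agreed action for a failure mode."""
    return POLICY[failure]
```

The import-time check is the useful part: a new failure mode without a decided response stops the program at startup rather than surfacing in production.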
The Uncomfortable Implication
If the model is 2% of the problem, then chasing better models is mostly a distraction.
GPT-4o vs. Claude 3.5 vs. Gemini 1.5 — these debates matter at the margin. But if your harness is weak, upgrading the model is like putting a better engine in a car with no brakes. You just fail faster.
The operators who are actually getting value from AI agents right now aren't the ones who found the best model. They're the ones who built boring, reliable harnesses around whatever model they started with — and then iterated from there.
What To Do With This
If you're running AI agents in any part of your workflow, do one thing this week: audit your failure handling. Pick one agent, trace what happens when it gets a bad input or a timeout, and write down what actually occurs versus what you'd want to occur. That gap — between actual and intended failure behavior — is your harness debt.
The 98% problem isn't going away. But it's solvable, one unglamorous piece at a time.
