
Inside the AI Harness: Engineering Techniques for Predictable and Deterministic AI

Andrii Kycha

Practical lessons from building AI features that ship

AI is not magic. Behind every AI feature you see in a product, there is a harness - a structured set of components that consistently turns a model into something useful and reliable. If you only have a model and a prompt, you do not have a production AI feature (yet).

That is why AI products of all sizes do not rely on the model alone. They put the model inside a harness. Think of the AI harness as the engineering around the model that makes AI results repeatable, testable, explainable, and safe to use.

This article is for beginner AI developers and product people who want to learn what an AI harness is and explore the technical design choices that help build predictable and deterministic AI products.

If you understand web services, APIs, and basic AI fundamentals, you should be comfortable following along.


What Is an AI Harness

AI harness building blocks diagram

When I first started reading AI product blogs, I kept seeing people casually mention an AI harness like it was some well-known concept everyone should already understand. What exactly is this harness everyone keeps talking about? Is it a framework? A pattern? Some secret tool I somehow missed?

Only later did I realize the truth: the harness is not a single thing. It is the entire engineered system wrapped around the model. An AI harness is everything that keeps an LLM predictable, deterministic, and safe in production. It transforms the model from a mysterious black box into a controlled process you can trust, debug, and improve over time.

Building blocks of an AI harness:

  • Orchestrator - controls the conversation flow, tools, memory, and output format
  • Prompt - instructions you pass to the LLM so it knows what problems to solve and how to solve them
  • Context - information that isn't publicly available and that the model can't possibly know unless you provide it
  • Tool registry (MCP server) - defines what tools the model can use and how it calls them
  • Evaluator - measures model output quality to avoid regressions
  • Basic components - business logic, databases

None of these replace the model. They help the model behave like a reliable component.


Why Determinism Matters

Non-deterministic AI agent diagram

Imagine launching a new AI feature that demoed perfectly last week. Everyone loved it. The answers were crisp, the tool calls were correct, and the product team finally felt confident.

Now imagine showing the same feature today, in front of the same people, using the same inputs... and the AI suddenly decides to take a different route. The answer looks unfamiliar and inaccurate.

Nothing changed in your code. But something changed in the model.

This is the moment every team experiences eventually. It feels like trying to fix a radio that slightly retunes itself every morning. The signal is still there, but it keeps drifting.

That drifting is what happens when an AI system lacks determinism. When determinism is missing:

  • users get different results for the same question
  • bugs become harder to reproduce
  • small prompt changes break working scenarios

AI models are probabilistic by nature. They do not promise the same answer twice, even when everything seems identical.

👉 The role of the harness is to make the AI behave predictably enough that developers can build real product features on top of it.

The model provides intelligence.
The harness provides consistency.

Together, they create something you can trust in production.


AI Harness Basic Techniques

Before we dive into the more advanced techniques behind determinism and predictability, it helps to look at the basics. These are the foundational building blocks I used long before I ever heard the term AI harness. They naturally push your AI products toward more predictable and deterministic behavior, and you will see them used across many teams and products, both in existing systems and in new ones being built every day.

These basic techniques include:

1. Structured Outputs
Instead of letting the model improvise paragraphs, the harness asks for structured output such as JSON that follows a strict schema. This makes the model's responses easier to parse, validate, and compare. A model that returns JSON is far more predictable than a model that returns free text.

Check the OpenAI documentation for more details.
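
As a rough sketch, requesting schema-constrained JSON with the OpenAI Node SDK might look like this (the model name and the person schema are illustrative, not taken from the article):

```typescript
import OpenAI from "openai";

const client = new OpenAI();

// Ask the model for JSON that follows a strict schema instead of free text
const completion = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [
    { role: "system", content: "Extract the person's name and role." },
    { role: "user", content: "Andrii is a senior engineer on the payroll team." },
  ],
  response_format: {
    type: "json_schema",
    json_schema: {
      name: "person",
      strict: true,
      schema: {
        type: "object",
        properties: {
          name: { type: "string" },
          role: { type: "string" },
        },
        required: ["name", "role"],
        additionalProperties: false,
      },
    },
  },
});

// The response is machine-readable JSON, not an improvised paragraph
const person = JSON.parse(completion.choices[0].message.content ?? "{}");
```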

2. Validating Outputs
Once the model returns structured data, the harness validates the output before using it. This can mean checking the JSON schema, verifying required fields, or ensuring values fall inside expected ranges. If the output is invalid, the harness can retry, request a corrected response, or fall back to a safe default. Validation acts as the first safety net against unpredictable behavior.

It can be implemented as a simple JSON schema check using Zod, or as something more sophisticated, like a separate LLM layer with its own prompt and validation instructions.
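
As a minimal sketch, a Zod-based check might look like this (the schema fields and the fallback behavior are illustrative):

```typescript
import { z } from "zod";

// The schema encodes what a valid model response must look like
const PersonSchema = z.object({
  name: z.string().min(1),
  role: z.string().min(1),
  salary: z.number().positive().optional(),
});

type Person = z.infer<typeof PersonSchema>;

function parseModelOutput(raw: string): Person | null {
  try {
    const result = PersonSchema.safeParse(JSON.parse(raw));
    // Invalid output never reaches the rest of the system; the caller can
    // retry, ask the model for a corrected response, or fall back to a default
    return result.success ? result.data : null;
  } catch {
    return null; // not even valid JSON
  }
}
```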

3. Prompt Templates
In a harness, prompts rarely live as raw strings in code; they are templates. Templates let you inject data dynamically, enforce a consistent style, and avoid accidental prompt drift. They also make prompts versionable, testable, and reusable across flows. Instead of rewriting complex instructions, the harness composes them from smaller template parts.
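
Here is a small sketch of a template with explicit variable injection; the payroll-assistant wording and placeholder names are hypothetical:

```typescript
// Prompt template with explicit placeholders instead of ad-hoc string concatenation
const SYSTEM_TEMPLATE = `
You are a payroll assistant for {{companyName}}.
Only answer questions about the currently authenticated user.
Current user id: {{userId}}
Today's date: {{today}}
`.trim();

function renderPrompt(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_match, key: string) => {
    const value = vars[key];
    // Failing loudly prevents silently shipping a prompt with a hole in it
    if (value === undefined) throw new Error(`Missing prompt variable: ${key}`);
    return value;
  });
}

const systemPrompt = renderPrompt(SYSTEM_TEMPLATE, {
  companyName: "Acme Corp",
  userId: "user-uuid-from-session",
  today: new Date().toISOString().slice(0, 10),
});
```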

4. Few Shot Examples
Developers often underestimate how effective a few good examples can be. Showing the model how to respond in two or three carefully chosen scenarios often stabilizes the output more than adding more rules. Few shot examples act as a behavioral anchor for the model, guiding style, structure, tone, and reasoning.
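
For example, a few-shot classification prompt might be assembled like this (the labels and phrasing are made up for illustration):

```typescript
// Two or three well-chosen examples anchor the label set and the output format
const messages = [
  {
    role: "system" as const,
    content: "Classify the user's request as PAYCHECK, PROFILE, or OTHER. Reply with the label only.",
  },
  // Few-shot examples
  { role: "user" as const, content: "How much was I paid last month?" },
  { role: "assistant" as const, content: "PAYCHECK" },
  { role: "user" as const, content: "Please update my home address" },
  { role: "assistant" as const, content: "PROFILE" },
  // The real request always goes last
  { role: "user" as const, content: "Show me my most recent paycheck" },
];
```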

These battle-tested techniques will help you day to day and are easy to bring into your next project. But there are also a few newer techniques I've picked up recently, and they deserve a closer look. Let's dive in.


Technique 1: Preserve Tool Calls and Model Reasoning

Preserve Tool Calls High Level Diagram

LLMs today are not only text generators. They can call tools. Think of a user opening your app and typing "Show me details about Andrii." The model cannot know this answer because the real data lives in your company database. So instead of guessing, the LLM reaches for a tool - a deterministic function that knows exactly how to look up Andrii's record. A tool can be anything that runs predictable logic: a database query, an API call, a calculation, or a piece of business logic that always gives the same result for the same input.

When an LLM decides to call a tool, the orchestrator:

  • logs the reasoning path
  • stores the tool call arguments
  • stores the result from the tool

And that's it: now your users can easily look up people in the database using an AI agent.
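
A rough sketch of that orchestration step, assuming a hypothetical `db` persistence layer and a deterministic `searchPeople` lookup (the message shapes follow the OpenAI tool-call format):

```typescript
import { db } from "./storage";           // hypothetical persistence layer
import { searchPeople } from "./people";  // hypothetical deterministic database lookup

interface ToolCall {
  id: string;
  function: { name: string; arguments: string };
}

async function handleToolCall(conversationId: string, toolCall: ToolCall) {
  const args = JSON.parse(toolCall.function.arguments);
  const result = await searchPeople(args); // same input, same output, every time

  // Persist both the call and its result so follow-up turns can replay them
  await db.messages.insertMany([
    { conversationId, role: "assistant", tool_calls: [toolCall] },
    { conversationId, role: "tool", tool_call_id: toolCall.id, content: JSON.stringify(result) },
  ]);

  return result;
}
```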

Question:
Why store tool results at all? Technically, this flow could work without writing anything to the database.

👉 The real value appears when the user asks follow-up questions: the LLM needs to recall exactly what it returned earlier and why it made that decision.

Preserve Tool Calls High Level Diagram

The LLM calls the database tool and discovers there are three different people named Andrii in the database. Instead of guessing, it does the right thing: it asks the user to clarify which Andrii they meant by providing a last name or email.

When the user replies with "Andrii Kycha", the LLM must continue from the exact list of candidates it previously retrieved.

By storing the tool outputs, the system keeps a precise snapshot of the original query and the three Andriis it found, allowing the LLM to resume confidently from the same state and keep the conversation consistent, predictable, and grounded in real data.
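
Under the same assumptions as the previous sketch, the follow-up turn simply rebuilds the model's context from storage:

```typescript
import OpenAI from "openai";
import { db } from "./storage"; // hypothetical persistence layer from the previous sketch

const client = new OpenAI();

// When the user replies "Andrii Kycha", the model sees the exact tool call
// and the three candidates it retrieved earlier, not a fresh guess.
async function continueConversation(conversationId: string, userReply: string) {
  const history = await db.messages.find({ conversationId });

  return client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [...history, { role: "user", content: userReply }],
  });
}
```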

This is why storing tool results and passing them back to the LLM along with the user's follow-up questions makes your AI products more predictable and deterministic. Let's move on to the next technique!


Technique 2: Deterministic Configuration for LLM Tools

Picture this: a user opens your app and asks, "Can you show me my most recent paycheck?" Straightforward request. The LLM prepares to call the payroll tool, which needs a user UUID to fetch the correct record.

Your prompt already includes the real user UUID. Everything should go smoothly. Except it doesn't.

Erroneous paycheck AI agent flow example

For reasons only a probabilistic model understands, the LLM decides to "help" by generating its own user_id argument in the tool call. Maybe it subtly rewrites the UUID. Maybe it picks a different one entirely because the pattern looks similar. The tool runs deterministically, of course, but now it is deterministically fetching someone else's paycheck.

The output is clean, confident... and completely wrong.

This is the kind of bug that is nearly impossible to reproduce because the model does not make the same mistake every time. One run it uses the correct UUID, the next it hallucinates a new one, and you end up chasing ghosts in production logs.

👉 The solution is simple but powerful: configure your LLM tools deterministically whenever possible.

Do not let the model improvise critical parameters. Hardcode or inject them directly into the tool context. Pass system-level metadata explicitly so the model cannot override it. The more the harness handles for the model, the less room there is for hallucinated arguments sneaking into tool calls.
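
One way to sketch this is to keep sensitive parameters out of the tool schema the model sees and let the orchestrator inject them from the authenticated session (`payrollService` here is a hypothetical business-logic layer, not a real library):

```typescript
import { payrollService } from "./payroll"; // hypothetical business-logic layer

// The schema exposed to the model has no user_id parameter at all,
// so there is nothing for the model to hallucinate or rewrite.
const getLatestPaycheckTool = {
  type: "function" as const,
  function: {
    name: "get_latest_paycheck",
    description: "Fetch the most recent paycheck for the current user.",
    parameters: { type: "object", properties: {}, additionalProperties: false },
  },
};

// Executed by the orchestrator; the UUID comes from the authenticated
// session context, never from model output.
async function executeTool(name: string, _args: unknown, ctx: { userId: string }) {
  if (name === "get_latest_paycheck") {
    return payrollService.getLatestPaycheck(ctx.userId);
  }
  throw new Error(`Unknown tool: ${name}`);
}
```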

Corrected paycheck AI agent flow example

It is the difference between "please guess the UUID" and "here is the exact UUID, do not touch it."

This one change can eliminate entire classes of bugs and make your harness reliably predictable in real-world scenarios where correctness truly matters.


Technique 3: Model Evaluation

I have always loved writing unit tests. They are quick to add, fast to run, and great at catching all the tiny logic mistakes that hide in a codebase. If a function starts behaving differently than expected, the tests fail right away. It is one of my favorite feedback loops in software development.

But once an LLM enters the picture, that comfort disappears.

You can still test your API logic, data flow, and business rules... but you cannot test the model's output the same way. You cannot assert that "the model must return this exact sentence" because it won't. And even if it does today, it may not tomorrow. LLMs generate language, not exact strings.

So instead of testing the text, you evaluate the behavior. That is where model evaluation comes in.

The goal is to check whether the model still solves the tasks it is supposed to solve:

  • Does it identify the correct person status?
  • Does it extract the needed data from a conversation?
  • Does it follow the correct business flow without skipping steps?

However, evaluation tests have a cost. They are slower, harder to maintain, and sensitive to changes in prompts or model versions.

Testing Pyramid

This is where the testing pyramid becomes helpful. At the base, you rely on fast, stable unit tests for all the deterministic logic. At the top, you add only a small number of evaluation tests, focused on the most important business flows.

If you try to test everything with evaluation tests, the suite becomes fragile and time-consuming. A tiny prompt adjustment can break many tests even though the product still works.

So the balance looks like this:

  • Let unit tests protect the predictable parts of your system
  • Use evaluation tests only for the critical scenarios the model must handle correctly

This mix gives you confidence without overwhelming you with test maintenance, allowing both your code and your model to evolve safely.
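
As an illustration, an evaluation test in a Vitest-style runner might assert on behavior rather than exact wording (`runAgent` and the fixture are hypothetical harness helpers, not a real framework API):

```typescript
import { describe, expect, test } from "vitest";
import { runAgent } from "./harness";       // hypothetical: runs one full agent turn
import { threeAndriis } from "./fixtures";  // hypothetical: seed data with three matching people

describe("people lookup flow", () => {
  test("asks for clarification when multiple people match", async () => {
    const result = await runAgent("Show me details about Andrii", { seedData: threeAndriis });

    // Behavior, not exact strings: the right tool was called...
    expect(result.toolCalls.map((call) => call.name)).toContain("search_people");
    // ...and the agent asked the user to disambiguate instead of guessing
    expect(result.finalAnswer.toLowerCase()).toMatch(/which andrii|last name|email/);
  });
});
```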


How These Techniques Work Together

These techniques are not separate modules. They reinforce each other.

  • preserved tool calls and reasoning allow answering follow-up questions
  • deterministic tools reduce hallucinations
  • evaluation detects regressions early

Once you build this loop, AI development feels more like software development again. You can reason about changes, track behavior, and debug failures.

An example of a simplified but complete AI harness is shown in the diagram below.

AI harness high level design diagram
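
In code, a heavily simplified version of that loop might look like the sketch below. It reuses the hypothetical helpers from the earlier examples and skips details such as retries and looping back to the model with tool results:

```typescript
import OpenAI from "openai";
import { db } from "./storage";                               // hypothetical persistence layer
import { executeTool, getLatestPaycheckTool } from "./tools"; // deterministic tool layer
import { renderPrompt, SYSTEM_TEMPLATE } from "./prompts";    // prompt templates
import { parseModelOutput } from "./validation";              // Zod-based output validation

const client = new OpenAI();

async function runTurn(conversationId: string, userMessage: string, ctx: { userId: string }) {
  // 1. Rebuild context: prior messages, tool calls, and tool results
  const history = await db.messages.find({ conversationId });
  const system = renderPrompt(SYSTEM_TEMPLATE, {
    companyName: "Acme Corp",
    userId: ctx.userId,
    today: new Date().toISOString().slice(0, 10),
  });

  // 2. Call the model with a strict tool registry
  const completion = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [{ role: "system", content: system }, ...history, { role: "user", content: userMessage }],
    tools: [getLatestPaycheckTool],
  });
  const message = completion.choices[0].message;

  // 3. Execute tools deterministically and persist calls + results;
  //    a real harness would then call the model again with these results
  for (const toolCall of message.tool_calls ?? []) {
    const result = await executeTool(toolCall.function.name, JSON.parse(toolCall.function.arguments), ctx);
    await db.messages.insertMany([
      { conversationId, role: "assistant", tool_calls: [toolCall] },
      { conversationId, role: "tool", tool_call_id: toolCall.id, content: JSON.stringify(result) },
    ]);
  }

  // 4. Validate the structured output before it reaches the product
  return parseModelOutput(message.content ?? "");
}
```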

In a prototype, the model is the product.

In a production system, the model is just one piece.

The harness is what makes AI fit into product cycles, testing cycles, release cycles, and incident response.

Without a harness, teams depend on intuition and luck. With a harness, teams work with evidence.


Closing Thoughts

There is a lot of hype around AI today, but it's important to draw inspiration from well-defined and battle-tested engineering practices when delivering AI products. If you want predictable results from a probabilistic system, you add structured control around it.

The techniques we touched upon today:

  • Basic techniques: structured outputs, output validation, prompt templates, and few shot examples
  • Persisting tool results and model reasoning
  • Deterministic tool configuration
  • Model evaluation

If you want your AI system to be something your team trusts, focus on determinism and predictability first. The harness is how you get there.

If you liked this article, consider checking out my other AI-related articles.

Until next time and happy coding! 🧑‍💻