Why AI Evaluations Have Never Been Optional for AI Product Managers
From traditional ML to GenAI: how evaluation became the frontline of AI product success.
In AI products, it’s dangerously easy to pass every technical test — and still fail the user.
In this article:
Why evaluations are the hidden foundation of great AI products
How GenAI evaluations differ radically from traditional ML
A real-world example of evaluating a GenAI system
Practical mistakes to avoid — and how to do it right
#beyondAI - If you don’t deeply understand how to evaluate AI behavior, you’re not managing a product. You’re gambling.
When we build AI products, one mistake can easily slip into the foundation without us even noticing: assuming that if the model works technically, the product will work for users.
That’s not true.
And this is exactly why AI evaluations (AI evals) are so critical — especially for AI Product Managers.
Evals aren’t just a technical health check. They are a way to make sure that the AI system performs in a way that serves the user, under the real conditions and expectations the product must fulfill. If we don’t understand evaluations deeply — and I mean beyond just “accuracy” or “precision” percentages — we risk building AI solutions that pass technical tests but fail spectacularly when they meet the real world.
I’ll be honest: I’m still learning.
And to stay honest — this is the first time I’m going this deep into the new world of AI evaluations.
Other things had priority.
(Real PM is speaking here… we always have to pick our battles.)
AI evaluations — especially for GenAI and LLMs — are a rapidly evolving field, and every project teaches me something new.
With this writing, I’m sharing what I’ve learned so far, hoping it helps others who are navigating the same shift.
The evaluation landscape changes depending on the type of AI system we’re dealing with. A traditional machine learning model (say, a churn prediction algorithm) is evaluated very differently from a GenAI model like a chatbot powered by a Large Language Model (LLM).
As AI Product Managers, we have to know the difference — and we have to know how to lead teams to design, interpret, and act on evaluation results that actually matter for the product.
Let’s get into it.
AI Evals and the Core Mission of Any Product Manager: Creating Value
At the end of the day, Product Management is about one thing: creating value for the business by solving problems for users.
Whether you’re building a mobile app, an internal tool, or an AI system, this doesn’t change.
The twist with AI products — and especially GenAI products — is that you can’t separate the technical behavior of the system from the user experience it creates.
If the AI model behaves poorly, feels unreliable, or simply doesn’t align with user expectations, it directly kills the value you’re trying to generate.
In traditional software, this separation is possible.
A button might technically work — it sends a request to the backend, triggers the right workflow, and returns the expected result — even if the visual design or wording isn’t perfect.
In other words: technical functionality and user experience are distinct layers.
You can fix usability later without needing to change how the system itself computes or behaves internally.
But in AI, especially in GenAI, the system’s behavior is the user experience.
There’s no “under the hood” you can separate cleanly.
When an AI writes an email reply, generates a product description, or answers a customer question, the output is the product.
There’s no layer between the user and the system’s core behavior — no abstraction shield.
And this is why evaluations in AI Product Management are not optional or secondary.
They are essential to ensuring that the product fulfills its real-world purpose — and that the business actually captures the value it hopes to create.
Traditional AI Evaluations: Clear Metrics, Narrow Scenarios
In traditional machine learning, evaluations are mostly built around structured outputs.
You might predict a binary outcome (“Will this customer churn?”) or a continuous value (“What’s the estimated price of this house?”).
To evaluate such models, the industry has relied on metrics like:
Accuracy
Precision and Recall
F1 Score
ROC-AUC
Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE)
The key thing here is: the evaluation criteria are objective and quantifiable.
You know the ground truth (whether a customer actually churned or not).
You know the prediction.
You can run the math and get a score.
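To make this concrete, here is a minimal sketch of running that math with scikit-learn, assuming you already have the ground-truth churn labels, the model’s predicted labels, and its predicted probabilities (the arrays below are illustrative dummy data):

```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
    mean_absolute_error, mean_squared_error,
)

# Ground truth: did the customer actually churn? (1 = churned, 0 = stayed)
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
# Model output: predicted labels and the underlying churn probabilities
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]
y_prob = [0.91, 0.12, 0.58, 0.77, 0.33, 0.45, 0.08, 0.19]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_prob))

# For a regression model (e.g., house prices) you would use MAE / RMSE instead:
true_prices = [310_000, 455_000, 289_000]
pred_prices = [298_000, 470_000, 305_000]
print("MAE :", mean_absolute_error(true_prices, pred_prices))
print("RMSE:", mean_squared_error(true_prices, pred_prices) ** 0.5)
```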
The role of the AI PM in this setup is relatively straightforward:
Define which metric matters for the business use case (e.g., precision over recall if false positives are very costly).
Align the team on the target thresholds that mean “good enough” to ship (a minimal gate check is sketched after this list).
Monitor model drift or degradation over time.
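Because thresholds only help if they are explicit, many teams turn that agreement into a simple release gate the latest evaluation run has to clear. A hypothetical sketch, with placeholder metric names and numbers rather than recommended values:

```python
# Hypothetical "good enough to ship" thresholds agreed with the business.
# Precision is weighted over recall here because false positives are costly.
SHIP_THRESHOLDS = {"precision": 0.85, "recall": 0.60, "roc_auc": 0.80}

def ready_to_ship(eval_results: dict) -> bool:
    """Return True only if every agreed metric clears its threshold."""
    return all(eval_results.get(metric, 0.0) >= threshold
               for metric, threshold in SHIP_THRESHOLDS.items())

print(ready_to_ship({"precision": 0.88, "recall": 0.62, "roc_auc": 0.83}))  # True
print(ready_to_ship({"precision": 0.91, "recall": 0.41, "roc_auc": 0.83}))  # False
```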
In short: in traditional ML, evaluations are clean, comparable, and repeatable.
But then came GenAI — and the game changed.
GenAI and LLM Evaluations: Messy Outputs, Moving Targets
When you move into GenAI, especially with LLMs, the nature of the output changes completely.
Instead of predicting a simple label, the AI now generates free-form text, images, even code.
The output space is basically infinite.
And because there is often no single “ground truth” answer, traditional metrics break down.
You can’t easily calculate “accuracy” on a chatbot that gives slightly different but equally acceptable responses to the same question.
You can’t just look at a number and say “this model is ready.”
Evaluating GenAI models involves concepts like:
Human-likeness
Factual correctness
Relevance to the query
Completeness of answer
Bias, toxicity, and harmful content detection
Style and tone alignment
And to make it even more complicated: human judgment is often needed.
Human evaluators have to assess if an LLM’s response was helpful, respectful, in line with brand voice, or free of hallucinations.
In practice, evaluation setups for GenAI now involve a mix of:
Prompt-based testing (feeding in test prompts and evaluating outputs)
Rubrics for human raters (scoring outputs against subjective quality criteria)
Automated evals using smaller models (“critique models”) trained to assess the main model (see the sketch after this list)
Red teaming (actively trying to break the model by feeding adversarial prompts)
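To show how the “critique model” piece can fit together with prompt-based testing, here is a minimal sketch of an automated rubric eval. Everything in it is an assumption for illustration: `call_llm` and `generate` stand in for whatever client functions your stack provides, and the rubric dimensions are examples, not a standard.

```python
import json

RUBRIC = (
    "Rate the ASSISTANT ANSWER to the USER QUESTION from 1 to 5 on: "
    "relevance, factual_correctness, tone. Respond with JSON only, e.g. "
    '{"relevance": 4, "factual_correctness": 5, "tone": 3, "comment": "..."}'
)

def judge(question: str, answer: str, call_llm) -> dict:
    """Ask a (usually smaller, cheaper) critique model to score one output."""
    prompt = f"{RUBRIC}\n\nUSER QUESTION:\n{question}\n\nASSISTANT ANSWER:\n{answer}"
    return json.loads(call_llm(prompt))

def run_eval(test_prompts: list, generate, call_llm) -> list:
    """Generate an answer for every test prompt and collect judge scores."""
    results = []
    for question in test_prompts:
        answer = generate(question)                  # model under evaluation
        scores = judge(question, answer, call_llm)   # critique model
        results.append({"question": question, "answer": answer, **scores})
    return results
```

Judge scores like these are only trustworthy if you periodically compare them against human ratings on the same outputs; the critique model needs evaluating too.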
What AI Evaluation Looks Like in Practice
Let’s take a practical example:
Imagine you are shipping an internal GenAI tool for customer service agents. The AI suggests draft replies to customer emails. An evaluation setup might look like this:
You define 100 typical customer emails covering different topics: billing issues, product complaints, upgrade requests, cancellations, technical support.
You feed these emails into the GenAI system and collect the draft replies.
Human reviewers — preferably real customer service agents — score the AI replies on dimensions like:
Relevance: Does the reply actually address the customer’s issue?
Tone: Is it polite, professional, and empathetic?
Accuracy: Are factual statements (e.g., refund timelines, account policies) correct?
Actionability: Is the reply clear on the next steps for the customer?
Each of these dimensions might be scored on a scale from 1 to 5.
You might find, for example:
85% of replies are relevant (good)
78% are in the right tone (okay)
65% are factually correct (problem)
60% are actionable (problem)
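Turning the raw 1-to-5 reviewer scores into those headline percentages is straightforward; a minimal sketch, assuming (arbitrarily) that a score of 4 or 5 counts as acceptable:

```python
# Each row: one AI draft reply, scored 1-5 per dimension by a human reviewer.
reviews = [
    {"relevance": 5, "tone": 4, "accuracy": 3, "actionability": 2},
    {"relevance": 4, "tone": 5, "accuracy": 5, "actionability": 4},
    {"relevance": 2, "tone": 4, "accuracy": 3, "actionability": 3},
    # ... one entry per reviewed email
]

PASS_SCORE = 4  # assumption: 4 or 5 counts as acceptable

for dimension in ["relevance", "tone", "accuracy", "actionability"]:
    passed = sum(1 for review in reviews if review[dimension] >= PASS_SCORE)
    print(f"{dimension}: {passed / len(reviews):.0%} acceptable")
```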
As an AI PM, you now have concrete signals.
You don’t just “hope” the model is good — you know where it’s failing for the business and where fine-tuning, guardrails, or post-processing might be needed.
That’s the real life of AI evaluation: structured, messy, human-in-the-loop, and critically important.
How AI Evals Relate to UX: The Hidden Parallel
The more you work with GenAI evaluations, the more you realize: AI evals have more in common with user experience (UX) testing than traditional software testing.
When we build normal software (no AI involved), we know that:
The code compiles or it doesn’t.
The feature works or it doesn’t.
The button clicks through or it doesn’t.
But even if everything works technically, the user might still hate the experience.
Maybe the button is hidden.
Maybe the flow is confusing.
Maybe the error message feels rude.
The only way to find this out?
UX testing.
You need to observe how users interact with the product in real conditions. You need feedback that’s not about whether something works, but how well it fits into the user’s life.
With GenAI, it’s the same.
A model might respond to a prompt. Technically, it “works.”
But is it clear?
Is it respectful?
Is it helpful?
Is it concise enough, or too verbose?
Evaluations for GenAI are essentially a form of UX testing for AI behavior.
What This Means for AI Product Managers
As AI PMs, our role is to bring evaluation to the center of product thinking, not treat it like a final hurdle before release.
Here’s what this practically looks like:
Define evaluation goals early.
Before any data science starts, define what “good” looks like for the user — not just technical performance, but experience quality.
Mix automated and human evaluations.
Understand that GenAI models need both structured evals (where possible) and subjective assessments.
Create meaningful prompt sets.
Work with your team to design realistic, diverse prompts that represent the full range of user behavior — not just the happy paths.
Continuously test and monitor.
GenAI models can drift not only in technical performance but also in tone, helpfulness, and safety. Ongoing evaluations are not optional.
Translate eval results into product decisions.
Don’t just hand over evaluation reports to data scientists. Interpret them in the light of business goals and user experience expectations.
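One lightweight way to keep all of this visible is to treat the evaluation plan itself as a small, versioned artifact the whole team can read and challenge. A hypothetical sketch; every field name and value here is an assumption, not a standard schema:

```python
# A hypothetical evaluation plan, kept in the repo next to the product spec.
EVAL_PLAN = {
    "goals": ["relevance", "tone", "accuracy", "actionability", "safety"],
    "ship_thresholds": {"relevance": 0.85, "accuracy": 0.90, "safety": 0.99},
    "prompt_sets": {
        "happy_path": "prompts/typical_requests.jsonl",
        "edge_cases": "prompts/ambiguous_and_hostile.jsonl",
    },
    "methods": ["human_rubric", "llm_judge", "red_team"],
    "cadence": {"pre_release": "full run", "production": "weekly sample"},
}
```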
Common Mistakes in AI Evaluations
Even experienced teams can stumble when it comes to evaluating AI products.
Some of the most common mistakes I’ve seen:
Focusing only on technical accuracy:
High accuracy scores don’t mean users will trust, enjoy, or even accept the AI’s behavior.
Testing only the “happy paths”:
Evaluation sets often miss real-world edge cases, sarcasm, ambiguous queries, or hostile prompts.
Using unrealistic test data:
Clean, idealized prompts make the model look good. Real users don’t write like a textbook.
Skipping human evaluation steps:
Relying only on automated scores might be faster, but it often misses subtle but critical issues like tone, clarity, or user perception.
Avoiding these mistakes isn’t just about better testing — it’s about delivering an AI experience that feels trustworthy and valuable to real users.
First Steps Checklist for AI PMs
When you’re building your evaluation plan, keep it simple to start:
Define your key evaluation dimensions:
What does success look like to users — relevance, clarity, helpfulness, safety?
Design realistic prompt sets:
Use messy, real-world examples, not sanitized ones.
Mix human and automated evaluations:
Don’t rely on metrics alone — integrate human review cycles.
Evaluate continuously, not just at launch:
Plan for monitoring model performance after deployment.
Final Note
The more I work with AI products, the clearer it becomes: We’re not just building features. We’re shaping behaviors, expectations, and trust. Evaluation isn’t a side task. It’s how we stay honest with ourselves — and with the people we build for. It’s how we check if what we’re creating is truly helping — or just adding noise to an already noisy world.
And evaluation is how we stay close enough to see that truth — and strong enough to act when what we see isn’t good enough yet.
That’s the kind of AI Product Management I believe in.
JBK 🕊️