“There’s a reason that although predictive weather models have exploded in complexity, the user-facing visualizations remain 1950s simple.”

The problem that product managers face

As product managers doing our best to tackle real-world problems with today’s AI capabilities, and after comparing notes with numerous other product leaders facing the same challenges, we’ve found that product teams across many industries find it surprisingly hard to identify the best AI model for their products.

Selecting the right model has always been one of the most challenging tasks for data scientists, long before generative AI or LLMs arrived on the scene. The key is not only achieving the highest performance on specific metrics but also balancing multiple critical product attributes such as business licensing, data ownership, and system resource requirements.

This process requires a scientific mindset, one that prioritizes understanding the data and rigorously evaluating how well the model’s output supports the needs of the product as a whole. This is in contrast to the ‘get it out’ mindset of low-risk web apps, since the risks of getting AI wrong are still too great to ignore.

Current tools for AI models

The good news is that there are more data points than ever, with new AI models, evaluation methods, and leaderboards appearing almost daily. The challenge is that each of these tools is optimized for a different audience and purpose, and most are still closer to research-land than product-land. Even so, they are a good start, and they offer inspiration for taking things further (more on that later…).

Hugging Face

Hugging Face is a vibrant marketplace of ideas and metrics, but it is far too technical for most people, including engineers who are not steeped in the specifics of what each test aims to measure, which data the tests require, and how to interpret the results (especially important since results can often be gamed).

Chatbot Arena

Chatbot Arena sits at the other end of the spectrum: entirely qualitative, based on human head-to-head testing of competing LLMs. On the plus side, because the criteria are subjective and judged by people, it tends to favor more natural responses. But since any tester can enter any prompt they want, and each matchup compares only two models at a time, it is hard to compare results across tests.

Artificial Analysis

Artificial Analysis is a big step toward the kinds of metrics that product managers care about: speed, quality, and cost, and how these factors interact. But it doesn’t cover other critical factors, such as license type, that can make or break a product decision.

Galileo’s LLM Hallucination Index

The LLM Hallucination Index is a data-rich deep dive into hallucination among leading LLMs, which remains product leaders’ biggest concern about deploying AI in production. But as a hand-crafted research report, it will go stale without ongoing human-powered updates.

Hughes Hallucination Evaluation Model

Last, but not least, the Hughes Hallucination Evaluation Model leaderboard is a programmatic ranking of how much each of the top 25 LLMs hallucinates. Its evaluation model is open source, so technical teams can also do their own testing.
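For teams that want to run that testing themselves, here is a minimal sketch of what it might look like, assuming the evaluation model is published on Hugging Face under the ID vectara/hallucination_evaluation_model and can be loaded as a sentence-transformers cross-encoder; the model ID, scoring interface, and example pairs are assumptions to verify against the current model card, not a definitive recipe.

```python
# Sketch: scoring generated responses for hallucination against their source text.
# ASSUMPTION: the open-source evaluation model is available on Hugging Face as
# "vectara/hallucination_evaluation_model" and works as a cross-encoder; check
# the current model card for the exact ID and interface before relying on this.
from sentence_transformers import CrossEncoder

scorer = CrossEncoder("vectara/hallucination_evaluation_model")

# Each pair is (source document, model-generated response). Hypothetical examples.
pairs = [
    ("The meeting was moved from Tuesday to Thursday at 3pm.",
     "The meeting now takes place on Thursday at 3pm."),     # consistent with source
    ("The meeting was moved from Tuesday to Thursday at 3pm.",
     "The meeting was cancelled due to low attendance."),     # likely hallucination
]

# Scores near 1.0 suggest the response is supported by the source;
# scores near 0.0 suggest hallucination.
for (source, response), score in zip(pairs, scorer.predict(pairs)):
    print(f"{score:.2f}  {response}")
```

Running a handful of your own (source, response) pairs through this kind of check is a quick way to see whether a candidate model’s hallucination behavior on your data matches what the leaderboard suggests.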

While plenty of tools are available, what often goes unmentioned is the need for alignment between technical performance and business goals. The model that tops traditional benchmarks isn’t always the best fit for a specific product or context, as the sketch below illustrates.
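As a simple illustration of that alignment (our own sketch, not something prescribed by any of the tools above), a weighted scoring matrix can make the trade-offs explicit. All of the criteria, weights, and scores below are hypothetical placeholders that a team would replace with its own.

```python
# Hypothetical weighted scoring matrix for candidate models. The criteria,
# weights, and scores are illustrative placeholders, not benchmark data.
criteria_weights = {
    "benchmark_quality": 0.30,     # task-specific eval results
    "latency_fit": 0.20,           # how well p95 latency meets the product budget
    "cost_fit": 0.20,              # inference cost vs. unit economics
    "license_fit": 0.20,           # commercial use, redistribution, data terms
    "hallucination_safety": 0.10,  # e.g. from a hallucination evaluation model
}

# Scores are normalized to 0-1, where higher always means "better for this product".
candidates = {
    "model_a": {"benchmark_quality": 0.9, "latency_fit": 0.5, "cost_fit": 0.4,
                "license_fit": 0.6, "hallucination_safety": 0.8},
    "model_b": {"benchmark_quality": 0.7, "latency_fit": 0.9, "cost_fit": 0.9,
                "license_fit": 1.0, "hallucination_safety": 0.7},
}

def weighted_score(scores: dict) -> float:
    """Combine per-criterion scores using the product team's weights."""
    return sum(weight * scores[criterion] for criterion, weight in criteria_weights.items())

for name, scores in sorted(candidates.items(), key=lambda kv: weighted_score(kv[1]), reverse=True):
    print(f"{name}: {weighted_score(scores):.2f}")
```

In this made-up example, model_b wins despite a lower benchmark score, which is exactly the kind of outcome that raw leaderboards hide.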

What’s needed now when it comes to AI models

While multiple tools and data sources are available to evaluate AI model performance, the gaps between technical benchmarks, business goals, and product needs remain significant. What’s clear is that selecting the right AI model is no longer just about optimizing for metrics; it requires a systematic and iterative process that aligns decisions with user needs, evolving data contexts, and overarching business objectives.

The key is to strike a balance between complexity and clarity. Teams need tools and frameworks that are data-rich and actionable: clear enough to guide decision-making without oversimplifying or overwhelming. Much like a reliable weather forecast, these solutions should give teams the trust and confidence to navigate this challenging landscape.

For readers eager to explore these concepts further, here are some additional resources on AI strategy and AI decision-making, along with people we recommend following:

  1. “The Paradox of Choice” by Barry Schwartz
    A classic exploration of how too many options can hinder decision-making and lead to choice paralysis.
  2. AWS Bedrock (if you have access to it) provides practical insights into comparing and selecting fine-tuned, base, open-source, and proprietary AI models.
  3. A CIO and CTO technology guide to generative AI | McKinsey 
  4. Mixture of Experts, a podcast by IBM

People to Follow:

  1. Ben Simo – Tricentis | LinkedIn – Applying best-in-class QA methods to Generative AI
  2. Vin Vashista | LinkedIn – Always quick with the facts (and funny memes) about AI Product Strategy
  3. Alex Strick van Linschoten | LinkedIn – Talks about ML engineering and data science
  4. Hugo Bowne Anderson | Vanishing Gradients – Great podcast on the topics of GenAI & Evals
  5. Uday Kumar | Product Strategy and Management Leader/GDE
  6. And, of course, the instructors of the Udacity AI Product Manager Nanodegree

With the right tools, frameworks, and mindset, teams can move beyond being “blinded by science” and toward building AI-powered products that are as effective in practice as they are impressive in theory.

About us

Noble Ackerson and Brad Nemer met as Udacity AI Product Manager Nanodegree instructors. When not teaching, Noble focuses on AI risk management and product strategy, and Brad helps companies get from zero to one, from one to ten, and sometimes to eleven.

Noble Ackerson
Award-winning AI product leader and ML engineer, Director of Product at Ventera, and Chief Technology Officer at the American Board of Design & Research. With over 15 years of experience in product management, Noble has coached startups and established organizations in creating sustainable, repeatable, and successful products that achieve their purpose and vision.