Channel
Latent Space
Interviewed Person
Malte Ubl
Ankur Goyal and Malte Ubl, co-founders of Braintrust and Vercel respectively, join swyx for a spirited debate about the role of evaluations (evals) in building AI coding agents. Sparked by a viral clip of Anthropic's Boris Power stating that Claude's $500M+ coding agent business was built largely on "vibes" rather than traditional evals, this conversation digs into whether offline evals are essential infrastructure or premature optimization—and why the best teams are deliberate about investing in multiple feedback loops.

We discuss:

* Why *feedback loops matter more than the term "evals"*—from offline evals to A/B tests to pure vibe checks, and how the best teams deliberately invest in all three
* The evolution from *golden datasets to production-driven evals:* why manufacturing test cases upfront is a waste, and how top teams now pull real user failures from logs into their eval suites daily
* How *evals enable velocity:* knowing your "first derivative" (whether a change improves things) lets you ship aggressively without fear of regression, just like unit tests in traditional software (a minimal sketch of such an eval loop follows the timestamps below)
* Why *coding evals are uniquely verifiable* yet still underutilized: from "does it compile?" to "does it render without errors?" and how Vercel uses these signals in RL pipelines to fine-tune models that fix trivial errors 100x faster than agentic loops
* *Evals as product management:* how rubrics and LLM-as-judge scoring functions let product managers encode domain expertise (finance, healthcare) more precisely than 50-page PRDs, and why PMs are now deeply involved in eval design
* The *privilege of AI labs* that build evals in-house versus the rest of the world that must make do with frontier models, and why proprietary evals are competitive moats while public benchmarks are marketing
* *RL environments as the next frontier:* why they're powerful for computer-use agents and decoupling from expensive human labeling, but require specialized expertise to avoid reward hacking
* An *inversion of control* for evals: why companies like Vercel should publish Next.js evals so model labs can optimize for their frameworks, creating a new marketplace where eval creators aren't the same entities training models
* How *vibes are also evals*—just extraordinarily accurate, expensive scoring functions—and why the goal isn't choosing between evals and vibes but building complementary feedback loops at different speeds and costs
* Practical wins from the Braintrust + Vercel integration: one-click AI trace logging from Vercel apps to Braintrust, and how Vercel's composite model architecture (frontier draft + fine-tuned fix) cuts latency by orders of magnitude

Ankur Goyal
* X: https://x.com/ankrgyl
* LinkedIn: https://www.linkedin.com/in/ankurgoyal/

Malte Ubl
* X: https://x.com/cramforce
* LinkedIn: https://www.linkedin.com/in/malteubl/

Where to find Latent Space
* X: https://x.com/latentspacepod
* Substack: https://www.latent.space/

00:00:00 Introduction: The Great Evals Debate
00:00:59 Background and Context: From Google Search to AI Engineering
00:02:35 The Effort-Efficiency Tradeoff in Evals
00:03:45 Why Coding is Different: Verifiability and Privilege
00:05:19 Public Benchmarks vs Internal Evals
00:08:45 Open-Endedness and the Limits of Offline Evals
00:11:26 The Workflow: From Production Logs to Offline Testing
00:12:05 When Vibes and Data Disagree
00:18:08 Evals as Product Management Tool
00:22:45 RL Environments and the Future of Evals
00:26:46 Inversion of Control: Who Should Write Evals?
00:33:36 Wrap-up and Vercel-Braintrust Integration
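As context for the discussion below, here is a rough sketch of the kind of offline eval loop the episode keeps returning to: a set of cases, ideally pulled from production logs, a task under test, and a scoring function whose average tells you whether a change helped. This is a minimal TypeScript illustration under assumed names (`task`, `runEval`, and `exactMatch` are all hypothetical), not Braintrust's or Vercel's actual API.

```ts
// Minimal offline eval loop (illustrative only; not Braintrust's SDK).
// Each case pairs an input -- ideally pulled from production logs -- with an
// expected output, and a scorer turns the task's output into a number.

type EvalCase = { input: string; expected: string };
type Scorer = (output: string, expected: string) => number;

// Hypothetical task under test; in practice this would call a model or agent.
async function task(input: string): Promise<string> {
  return `echo: ${input}`; // stand-in for a model call
}

// A deterministic scorer. An LLM-as-judge scorer has the same shape but
// calls a model to grade the output against a rubric instead.
const exactMatch: Scorer = (output, expected) => (output === expected ? 1 : 0);

// Run every case through the task and return the average score -- the
// "first derivative" signal: did my change make things better or worse?
async function runEval(cases: EvalCase[], scorer: Scorer): Promise<number> {
  let total = 0;
  for (const c of cases) {
    const output = await task(c.input);
    total += scorer(output, c.expected);
  }
  return total / cases.length;
}

// Usage: compare this number before and after a prompt or model change.
runEval(
  [
    { input: "hello", expected: "echo: hello" },
    { input: "does it compile?", expected: "echo: does it compile?" },
  ],
  exactMatch,
).then((score) => console.log(`eval score: ${score}`));
```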
Hi, welcome to the Latent Space Lightning Pod. We are here for a very special episode, because this is not really about a company or some launch that someone's doing. This is actually just a debate that I accidentally set off about coding agents mostly not having evals, or, where they do have evals, it's not really standardized, and a lot of it is vibes. And I think a
lot of it is based on this clip that I had from Boris of Claude Code saying that they basically just vibe it, and that's a $500 million, probably higher than that, business that doesn't have evals. So I wanted to invite Ankur and Malte from Braintrust and Vercel to just talk about the state of evals here and the evals debate. You guys have your own perspectives. Ankur, you got into a bit of a blogging back-and-forth on evals. What are you guys feeling? What's maybe a position
statement from each of you? >> Yeah, I mean, I can start a little bit. I think one maybe important bit of background on me is that I worked on Google Search just before leaving for Vercel. That's almost four years ago, but I do vaguely remember it. And obviously that was also before the ChatGPT moment and before AI engineering became a mainstream thing. And so I left a world in which I was doing evals every single day and kind of thought that that would be it, right? Like I wasn't really seeing
that in my future as someone joining an infrastructure company. So it is both kind of fun to see it coming back, and I think it does give me a little bit of extra experience for how it helps and how it doesn't help. One thing I tell people for sure is that you have to put in the work, right? Even vibe checking as the eval is important: if you are a product manager on Google Search, you're probably doing like 500 to 1,000 searches a day yourself. But then you want
to do evals, because if you want to do 50,000 searches, you're not going to do that yourself. And I think there's something very similar here, where essentially you want to see: hey, I want to know if I'm doing well, and how fast can I find out? Yeah, the vibe check can tell you really fast, but if you want to know tomorrow, then the only way to do it is to have someone do the 50,000 searches or the 50,000 coding exercises, right? And if you want to know in, let's say, three weeks, you can run an A/B test. And if you don't care, then you do neither of those things. But it is, I think, the way I think about
evals is essentially that it's the thing that can tell me tomorrow whether my change is good. I can operate without that knowledge, but it's super, super helpful. >> Yeah, I have very little to add to that. I strongly agree. I think fundamentally, if you're building an AI product, you're dealing with this non-deterministic magic, and whether it's a model or a prompt or whatever, you can't really type something and know what's going to happen. So you need a feedback loop, and I think that fundamentally, evals
are both the most challenging feedback loop to build and also the most efficient once built. And then I think A/B tests are a little bit less challenging to build and a little bit less efficient. And then pure vibes are basically zero effort to build, but then also the least efficient. And you have to do a little bit of all of these things. And I think what I've seen, both firsthand from building Loop, which is our own agent at Braintrust, but much more secondhand from working
with Malte and a bunch of really great companies building AI products, is that the best people are extremely deliberate about the investment in each of the feedback loops, and they're constantly questioning whether they've put enough energy into each of the feedback loops and making them efficient. And then I think coding is such an interesting topic for a few reasons. The first is it is the use case that has the most product-market fit in AI, I think, other than ChatGPT, and so it is just an important and interesting use case. The