How AI broke serverless and what to do about it with Vercel’s Mariano Fernández Cocirio


Channel
Rootly

Interviewed Person
Mariano Cocirio

Description

Mariano, Staff Product Manager at Vercel, explains why serverless architectures are hitting unexpected limits: they're too fast. The industry has spent millions optimizing serverless for speed, but AI workloads are changing the game. In the AI realm, slower execution often leads to better results. The challenge? Paying for all that idle compute time while waiting for AI responses. Mariano explains how Vercel Fluid introduces a new execution model that blends the best of serverless and traditional servers, scaling efficiently while reducing costs. He breaks down Fluid's architecture, its built-in reliability features, and how it redefines cloud computing for LLM-powered applications. Tune in to learn how Fluid could reshape the industry and what it means for developers.

Transcript


Welcome. Today we are speaking with Mariano Fernández, Staff Product Manager for CI/CD and Compute at Vercel. For the last decade, backends have been optimized for speed, but in the era of LLMs the paradigm is shifting: slower is better. It's proven that the longer an LLM thinks, the better the results, and in the context of serverless it means that while your backend is waiting for an answer

from the LLM, you are paying for idle time, which can result in huge bills. That's why Vercel built Fluid. In this episode we'll discuss the technical implementation of Fluid, some of its built-in reliability mechanisms, what it means for developers, and the future of our industry. Let's jump in. So Mariano, can you share, you know, the problems that come with, I would say, traditional serverless when you are running

AI workloads? Yeah, sure. I think the main issue we are facing nowadays is that the workloads have changed. We are used to the kind of workloads you would run on your serverless functions: quick queries to a database, quick queries to a backend, something like that. We have optimized around that, around quick response times. But when we move into the AI world, into AI workloads,

now we have functions that will take a long time to respond, and sometimes you want them to take a long time. You want your agents to be able to plan, you want your agents to be able to reason, you have those long AI inference chains. So what's going on with serverless? While you are waiting for that response to happen, you still have to pay for the machine you are allocating, even though you're not using the CPU. The same happens when you start streaming a response back.
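As a rough sketch of the pattern Mariano is describing (the handler shape is the Web-standard Request/Response style; the endpoint and request body below are hypothetical, not Vercel's or any specific provider's API), a function spends a few milliseconds of real work and then sits waiting on the model, including while it streams the answer back:

```ts
// Hypothetical AI-backed function: a few milliseconds of CPU, then mostly waiting.
export default async function handler(req: Request): Promise<Response> {
  const { prompt } = await req.json(); // ~milliseconds of actual work

  // The instance now waits, possibly for tens of seconds, doing almost no CPU work,
  // but under a one-instance-per-invocation model it is billed the whole time.
  const llmResponse = await fetch("https://llm.example.com/v1/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt }),
  });

  // Streaming has the same shape: the function stays allocated while it forwards
  // chunks, even though it is mostly waiting on the model between chunks.
  return new Response(llmResponse.body, {
    headers: { "Content-Type": "text/event-stream" },
  });
}
```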

That was something that was pushing people away from serverless. Just imagine this: you start getting a lot of requests to your functions, right? You start executing them, and under the traditional serverless model you spawn one instance per invocation. Now let's say your invocation has, I don't know, an initial 3 milliseconds of execution that basically just starts the function and then makes a call

to your LLM, to OpenAI's API, to whatever you are hitting, and then you have to wait for, I don't know, five, six, seven seconds. Maybe you are doing some image generation and you have 30 seconds of waiting. All of that time you are paying for it, you are paying a flat rate for it, and you are not using the instance you are paying for.
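Putting rough numbers on that single invocation (illustrative only, not any provider's actual billing math):

```ts
// One invocation from the example: ~3 ms of real execution, ~30 s awaiting the model,
// billed for the full wall-clock time under one-instance-per-invocation.
const activeMs = 3;        // time the CPU is actually doing work
const waitingMs = 30_000;  // time spent awaiting the model
const billedMs = activeMs + waitingMs;

const utilization = activeMs / billedMs; // ~0.0001, i.e. ~0.01% of the paid time is used

console.log(`Utilization: ${(utilization * 100).toFixed(2)}%`);
console.log(`Idle time billed per invocation: ${(waitingMs / 1000).toFixed(0)}s`);
```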

Now multiply that by the number of invocations you have and it starts compounding; you start generating these massive bills. That was something we realized would not scale, that would not work, and we started seeing some companies move back to different approaches, like traditional servers, where you can just have your VPS running and smash it with requests, and while one request is waiting another request can take that machine and run, etc. So we came up with
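A back-of-the-envelope comparison of the two models contrasted here, one instance per invocation versus requests sharing a machine while they wait; the invocation count, billed time, and per-instance concurrency are hypothetical, not documented Vercel figures:

```ts
const invocations = 10_000;
const billedSecondsEach = 30;        // mostly idle waiting, as in the earlier example
const concurrencyPerInstance = 100;  // hypothetical: how many waiting requests one instance can hold

// One instance per invocation: every invocation is billed for its own idle time.
const perInvocationSeconds = invocations * billedSecondsEach; // 300,000 instance-seconds

// Shared instances: overlapping requests reuse the same machines while they wait.
// Simplifying assumption: the invocations arrive within the same 30-second window.
const instancesNeeded = Math.ceil(invocations / concurrencyPerInstance); // 100
const sharedSeconds = instancesNeeded * billedSecondsEach;               // 3,000 instance-seconds

console.log({ perInvocationSeconds, sharedSeconds });
```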

Fluid. Fluid ends up being the best of servers and the best of serverless. You have this reuse of instances: each instance of your traditional serverless function now works as a server itself that can handle multiple invocations, and at the same time you have the full flexibility and the scale to infinity and to zero that you have with serverless. If you don't have any requests going on, you are not paying for it. That's the


Video Details

Duration
13:53
Published
March 6, 2025
Channel
Rootly
Language
English
Views
70
Likes
1