Channel
Interviewed Person
Lars Grammel
[Music] Hey everyone, I'm presenting Storyteller, an app for generating short audio stories for preschool kids. Storyteller is implemented using TypeScript and ModelFusion, an AI orchestration library that I've been developing. It generates audio stories that are about two minutes long, and all it needs is a voice input. Here's an example of the kind of story it generates, to give you an idea:

"One day, while they were playing, Benny noticed something strange. The forest wasn't as vibrant as before. The leaves were turning brown, and the animals seemed less cheerful. Worried, Benny asked his friends what was wrong. 'Friends, why do the trees look so sad, and why are you all so quiet today?' 'Benny, the forest is in trouble. The trees are dying, and we don't know what to do.'"

How does this work? Let's dive into the details of the Storyteller application. Storyteller is a client-server application: the client is written using React, and the server is a custom Fastify implementation. The main challenges were responsiveness (getting results to the user as quickly as possible), quality, and consistency.

When you start Storyteller, it's just a small screen with a "record topic" button. Once you press it, it starts recording the audio. When you release, the audio gets sent to the server as a buffer, and there we transcribe it. For transcription I'm using OpenAI Whisper; it is really quick for a short topic, about 1.5 seconds. Once the transcription becomes available, an event goes back to the client, so the client-server communication works through an event stream with server-sent events.
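The event-stream plumbing described above can be sketched roughly like this. This is a minimal illustration, not the actual Storyteller code: the event types, the `EventStream` class, and the `formatServerSentEvent` helper are all assumed names. The server encodes each update in the server-sent-events wire format and pushes it to every connected client response stream; the client's `EventSource` handler would parse the JSON and update React state.

```typescript
// Hypothetical event shapes for the updates the server pushes to the client.
type StorytellerEvent =
  | { type: "transcription"; text: string }
  | { type: "title"; title: string }
  | { type: "image"; path: string };

// Encode one event in the SSE wire format: "data: <json>\n\n".
function formatServerSentEvent(event: StorytellerEvent): string {
  return `data: ${JSON.stringify(event)}\n\n`;
}

// Sketch of the server side: long-running tasks (transcription, title,
// image) publish events; every open SSE response gets the encoded chunk.
class EventStream {
  private listeners: Array<(chunk: string) => void> = [];

  // Called when a client opens the event-stream endpoint; `write` would
  // forward chunks into the HTTP response.
  subscribe(write: (chunk: string) => void): void {
    this.listeners.push(write);
  }

  // Called when a task (e.g. Whisper transcription) finishes.
  publish(event: StorytellerEvent): void {
    const chunk = formatServerSentEvent(event);
    for (const write of this.listeners) write(chunk);
  }
}
```

On the client, a standard `EventSource` `onmessage` handler would `JSON.parse(event.data)` and dispatch on `type` to update the corresponding piece of React state, which is what makes the screen update as each result arrives.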
The event arrives on the client, the React state updates, and the screen re-renders, so the user knows something is going on. In parallel, I start generating the story outline. For this I use GPT-3.5 Turbo Instruct, which I found to be very fast: it can generate a story outline in about 4 seconds. Once we have that, we can start a bunch of other tasks in parallel: generating the title, generating the image, and generating and narrating the audio story all happen at the same time. I'll go through those one by one now.

First, the title is generated. OpenAI GPT-3.5 Turbo Instruct is used here again, giving a really quick result. Once the title is available, it's sent to the client as an event and rendered there.

In parallel, the image generation runs. First, there needs to be a prompt to actually generate the image, and here consistency is important, so we pass the whole story into a GPT-4 prompt that extracts representative keywords from the story for an image prompt. That image prompt is passed into Stability AI's Stable Diffusion XL, where an image is generated. The generated image is stored as a virtual file on the server, and then an event is sent to the client with a path to that file. The client can then retrieve the image through a regular URL request as part of an image tag, and it shows up in the UI.

Generating the full audio story is the most time-consuming piece of the puzzle. Here we have a complex prompt that takes in the story and creates a structure with