Latent Space: The AI Engineer Podcast

swyx + Alessio
The podcast by and for AI Engineers! In 2023, over 1 million visitors came to Latent Space to hear about news, papers and interviews in Software 3.0. We cover ...

  • The Inventors of Deep Research
The free livestreams for AI Engineer Summit are now up! Please hit the bell to help us appease the algo gods. We're also announcing a special Online Track later today.

Today's Deep Research episode is our last in our series of AIE Summit preview podcasts - thanks for following along with our OpenAI, Portkey, Pydantic, Bee, and Bret Taylor episodes, and we hope you enjoy the Summit! Catch you on livestream.

Everybody's going deep now. Deep Work. Deep Learning. DeepMind. If 2025 is the Year of Agents, then the 2020s are the Decade of Deep.

While "LLM-powered Search" is as old as Perplexity and SearchGPT, and open source projects like GPTResearcher and clones like OpenDeepResearch exist, the difference with "Deep Research" products is that they are both "agentic" (loosely meaning that an LLM decides the next step in a workflow, usually involving tools) and bundle custom-tuned frontier models (a custom-tuned o3 and Gemini 1.5 Pro).

The reception to OpenAI's Deep Research agent has been nothing short of breathless:

* "Deep Research is the best public-facing AI product Google has ever released. It's like having a college-educated researcher in your pocket." - Jason Calacanis
* "I have had [Deep Research] write a number of ten-page papers for me, each of them outstanding. I think of the quality as comparable to having a good PhD-level research assistant, and sending that person away with a task for a week or two, or maybe more. Except Deep Research does the work in five or six minutes." - Tyler Cowen
* "Deep Research is one of the best bargains in technology." - Ben Thompson
* "my very approximate vibe is that it can do a single-digit percentage of all economically valuable tasks in the world, which is a wild milestone." - sama
* "Using Deep Research over the past few weeks has been my own personal AGI moment. It takes 10 mins to generate accurate and thorough competitive and market research (with sources) that previously used to take me at least 3 hours." - OAI employee
* "It's like a bazooka for the curious mind" - Dan Shipper
* "Deep research can be seen as a new interface for the internet, in addition to being an incredible agent… This paradigm will be so powerful that in the future, navigating the internet manually via a browser will be 'old-school', like performing arithmetic calculations by hand." - Jason Wei
* "One notable characteristic of Deep Research is its extreme patience. I think this is rapidly approaching 'superhuman patience'. One realization working on this project was that intelligence and patience go really well together." - HyungWon
* "I asked it to write a reference Interaction Calculus evaluator in Haskell. A few exchanges later, it gave me a complete file, including a parser, an evaluator, O(1) interactions and everything. The file compiled, and worked on my test inputs. There are some minor issues, but it is mostly correct. So, in about 30 minutes, o3 performed a job that would take me a day or so." - Victor Taelin
* "Can confirm OpenAI Deep Research is quite strong. In a few minutes it did what used to take a dozen hours. The implications to knowledge work is going to be quite profound when you just ask an AI Agent to perform full tasks for you and come back with a finished result." - Aaron Levie
* "Deep Research is genuinely useful" - Gary Marcus

With the advent of "Deep Research" agents, we are now routinely asking models to go through 100+ websites and generate in-depth reports on any topic.
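To make the "agentic" part concrete, here is a minimal sketch of a research loop in which the model, rather than a fixed pipeline, decides the next step (search, read a page, or finish) and then synthesizes a report. `call_llm`, `web_search`, and `fetch_page` are hypothetical stand-ins for a model API and a tool layer, not any vendor's actual implementation.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical model call; swap in any chat-completion API."""
    raise NotImplementedError

def web_search(query: str) -> list[dict]:
    """Hypothetical search tool returning [{'url': ..., 'snippet': ...}, ...]."""
    raise NotImplementedError

def fetch_page(url: str) -> str:
    """Hypothetical browse tool returning page text."""
    raise NotImplementedError

def deep_research(task: str, max_steps: int = 20) -> str:
    """Agentic loop: the LLM chooses each next action until it decides to stop."""
    notes: list[str] = []
    for _ in range(max_steps):
        decision = json.loads(call_llm(
            "You are a research agent. Task: " + task
            + "\nNotes so far:\n" + "\n".join(notes)
            + '\nReply as JSON: {"action": "search"|"read"|"finish", "arg": "..."}'
        ))
        if decision["action"] == "search":
            hits = web_search(decision["arg"])
            notes.append("search results: " + "; ".join(h["url"] for h in hits[:5]))
        elif decision["action"] == "read":
            notes.append("page content: " + fetch_page(decision["arg"])[:2000])
        else:  # "finish": the model thinks it has gathered enough evidence
            break
    # Final synthesis pass over everything collected during browsing.
    return call_llm("Write an in-depth, cited report for: " + task
                    + "\nEvidence:\n" + "\n".join(notes))
```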
The Deep Research revolution has hit the AI scene in the last 2 weeks:

* Dec 11th: Gemini Deep Research (today's guest!) rolls out with Gemini Advanced
* Feb 2nd: OpenAI releases Deep Research
* Feb 3rd: a dozen "Open Deep Research" clones launch
* Feb 5th: Gemini 2.0 Flash GA
* Feb 15th: Perplexity launches Deep Research
* Feb 17th: xAI launches Deep Search

In today's episode, we welcome Aarush Selvan and Mukund Sridhar, the lead PM and tech lead for Gemini Deep Research, the originators of the entire category. We asked detailed questions from inspiration to implementation, why they had to finetune a special model for it instead of using the standard Gemini model, how to run evals for them, and how to think about the distribution of use cases. (We also have an upcoming Gemini 2 episode with our returning first guest Logan Kilpatrick, so stay tuned 👀)

Two Kinds of Inference Time Compute

In just ~2 months since NeurIPS, we've moved from "scaling has hit a wall, LLMs might be over" to "is this AGI already?" thanks to the releases of o1, o3, and DeepSeek R1 (see our o3 post and R1 distillation lightning pod). This new jump in capabilities is now accelerating many other applications; you might remember how "needle in a haystack" was one of the benchmarks people often referenced when looking at models' capabilities over long context (see our 1M Llama context window ep for more). It seems that we have broken through the "wall" by scaling "inference time" in two meaningful ways — one with more time spent in the model, and the other with more tool calls.

Both help build better agents which are clearly more intelligent. But as we discuss on the podcast, we are currently in a "honeymoon" period of agent products where taking more time (or tool calls, or search results) is considered good, because 1) quality is hard to evaluate and 2) we don't know the realistic upper bound to quality. We know that they're correlated, but we don't know to what extent and whether the correlation breaks down over extended research periods (it may not).

It doesn't take a PhD to spot the perverse incentives here.

Agent UX: From Sync to Async to Hybrid

We also discussed the technical challenges in moving from a synchronous "chat" paradigm to the "async" world where every agent builder needs to handroll their own orchestration framework in the background.

For now, most simple, first-cut implementations, including Gemini and OpenAI and Bolt, tend to make "locking" async experiences — while the report is generating or the plan is being executed, you can't continue chatting with the model or editing the plan.
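As a rough illustration of the sync-versus-async point, the sketch below runs the research job as a background task while the chat loop stays responsive and the shared plan remains editable. `run_deep_research` and the plan-editing command are hypothetical stand-ins, not how Gemini, OpenAI, or Bolt actually implement this.

```python
import asyncio

async def run_deep_research(query: str, plan: list[str]) -> str:
    """Hypothetical long-running research job (stand-in for a real backend)."""
    for step in plan:                       # picks up steps added while running
        await asyncio.sleep(1)              # pretend to search/browse this step
    return f"Report for: {query} ({len(plan)} steps researched)"

async def chat_loop(plan: list[str]) -> None:
    """The chat is never locked; user messages can still edit the shared plan."""
    while True:
        msg = await asyncio.to_thread(input, "you> ")
        if msg == "done":
            return
        if msg.startswith("add step:"):
            plan.append(msg.removeprefix("add step:").strip())
        print(f"(noted; the plan now has {len(plan)} steps)")

async def main() -> None:
    plan = ["US milk rules", "EU milk rules", "compare meat regulations"]
    job = asyncio.create_task(run_deep_research("milk and meat regulation", plan))
    await chat_loop(plan)                   # chatting is not blocked on the report
    print(await job)

if __name__ == "__main__":
    asyncio.run(main())
```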
In this case we think the OG Agent here is Devin (now GA), which has gotten it right from the beginning.

Full Episode on YouTube (with demo!)

Show Notes

* Deep Research
* Aarush Selvan
* Mukund Sridhar
* NotebookLM episode (Raiza / Usama)
* Bolt
* Bret Taylor

Chapters

* [00:00:00] Introductions
* [00:00:22] Overview + Demo of Deep Research
* [00:04:31] Editable chain of thought
* [00:08:18] Search ranking for sources
* [00:09:31] Can you DIY Deep Research?
* [00:15:52] UX and research plan editing
* [00:16:21] Follow-up queries and context retention
* [00:21:06] Evaluating Deep Research
* [00:28:06] Ontology of use cases and research patterns
* [00:32:56] User perceptions of latency in Deep Research
* [00:40:59] Lessons from other AI products
* [00:42:12] Multimodal capabilities
* [00:45:02] Technical challenges in Deep Research
* [00:51:56] Can Deep Research discover new insights?
* [00:54:11] Open challenges in agents
* [00:57:04] Wrap up

Transcript

Alessio [00:00:04]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. Swyx [00:00:13]: Hey, and today we're very honored to have in our studio Aarush and Mukund from the Deep Research team, the OG Deep Research team. Welcome. Aarush [00:00:20]: Thanks for having us. Swyx [00:00:22]: Yeah, thanks for making the trip up. I was fortunate enough to be one of the early beta testers of Deep Research when it came out. I would say I was very keen on it; I think even at the end of last year, people were already saying it was one of the most exciting agents that was coming out of Google. You know that previously we had on Raiza and Usama from the NotebookLM team. And I think this is an increasing trend that Gemini and Google are shipping interesting user-facing products that use AI. So congrats on your success so far. Yeah, it's been great. Thanks so much for having us here. Yeah. Yeah, thanks for making the trip up. And I'm also excited for your talk that is happening next week. Obviously, we have to talk about what exactly it is, but I'll ask you towards the end. So basically, okay, you know, we have the screen up. Maybe we just start at a high level for people who don't yet know. Like, what is Deep Research? Sure. Aarush [00:01:10]: So Deep Research is a feature where Gemini can act as your personal research assistant to help you learn about any topic that you want more deeply. It's really helpful for those queries where you want to go from zero to 50 really fast on a new thing. And the way it works is it takes your query, browses the web for about five minutes, and then outputs a research report for you to review and ask follow-up questions. This is one of the first times, you know, something takes about five, six minutes trying to perform your research. So there's a few challenges that brings. Like, you want to make sure you're spending that time and compute doing what the user wants. So there are some aspects of the UX design that we can talk about as we go through an example, and then there's also challenges in the browsing; the web is super fragmented, and being able to plan iteratively as you pass through this noisy information is a challenge by itself. Swyx [00:02:11]: Yeah.
This is like the first time Google is sort of automating the searching itself. Like, you know, you're supposed to be the experts at search, but now you're like meta-searching and like determining the search strategy. Aarush [00:02:22]: Yeah, I think, at least we see it as two different use cases. There are things where, you know, you know exactly what you're looking for, and there search is still probably, you know, one of the best places to go. I think where deep research really shines is when there are multiple facets to your question and you spend like a weekend, you know, just opening like 50, 60 tabs, and many times I just give up. And we wanted to solve that problem and, and give a great starting point for those kinds of journeys. Alessio [00:02:53]: Do we want to start a query so that it runs in the meantime and then we can chat over it? Swyx [00:02:58]: Okay, here's one query that, that we like, we love to test like super niche, random things, like things where there's like no Wikipedia page already about this topic or something like that, right? Because that's where you'll see the most lift from, from a feature like this. So for this one, I've come up with this query. This is actually Mukund's query that he loves to test: help me understand how milk and meat regulations differ between the US and Europe. What's nice is the first step is actually where it puts together a research plan that you can review. And so this is sort of its guide for how it's going to go about and carry out the research, right? And so this was like a pretty decently well-specified query, but like, let's say you came to Gemini and were like, tell me about batteries, right? That query, you could mean so many different things. You might want to know about the like latest innovations in battery tech. You might want to know about like a specific type of battery chemistry. And if we're going to spend like five to even 10 minutes researching something, we want to, one, understand what exactly you are trying to accomplish here, and two, give you an opportunity to steer where the research goes, right? Because like, if you had an intern and you ask them this question, the first thing they do is ask you like a bunch of follow-up questions and be like, okay, so like, help me figure out exactly what you want me to do. And so the way we approached it is, we thought like, why don't we just have the model produce its first stab at the research query, at how it would break this down, and then invite the user to come and kind of engage with how they would want to steer this. Yeah.

Editable chain of thought

Aarush [00:04:31]: And many times when you try to use a product like this, you often don't know what questions to look for or the things to look for. So we kind of made this decision very deliberately that instead of asking the users just follow-up questions directly, we kind of lay out, hey, this is what I would do. Like, these are the different facets. For example, here it could be like what additives are allowed and how that differs, or labeling restrictions and so on in products. The aim of this is to kind of tell the user about the topic a little bit more and also get steer. At the same time, we elicit for like, uh, you know, a follow-up question and so on. So we kind of did that in a joint question. Swyx [00:05:09]: It's kind of like editable chain of thought. Right. Exactly. Exactly. Yeah.
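A small sketch of the "editable chain of thought" idea as a data structure: the model proposes a research plan, the user can add or rewrite steps (via the button or conversationally), and only the approved plan becomes the contract for the run. `propose_plan` and its example steps are illustrative assumptions, not the product's internals.

```python
from dataclasses import dataclass, field

@dataclass
class ResearchPlan:
    query: str
    steps: list[str] = field(default_factory=list)
    approved: bool = False

    def add_step(self, step: str) -> None:
        self.steps.append(step)            # user-added facet, e.g. labeling rules

    def revise(self, index: int, new_text: str) -> None:
        self.steps[index] = new_text       # direct edit of a proposed step

    def approve(self) -> "ResearchPlan":
        self.approved = True               # the accepted plan is the contract
        return self

def propose_plan(query: str) -> ResearchPlan:
    """Hypothetical: ask the model to break the query into research facets."""
    return ResearchPlan(query, steps=[
        "Identify US federal milk and meat regulations",
        "Identify EU milk and meat regulations",
        "Compare permitted additives and their approval philosophy",
    ])

plan = propose_plan("How do milk and meat regulations differ between the US and Europe?")
plan.add_step("Find milk and meat labeling requirements in the US and the EU")
contract = plan.approve()                  # only now would the agent start browsing
```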
I think that, you know, we were talking to you about like your top tips for using deep research. Yeah. Your number one tip is to edit the plan. Just edit it. Right. So like we actually, you can actually edit conversationally. We put in a button here just to like draw users' attention to the fact that you can edit. Oh, actually you don't need to click the button. You don't even need to click the button. Yeah. Actually, like in early rounds of testing, we saw no one was editing. And so we were just like, if we just put a button here, maybe people will like. I confess I just hit start a lot. I think like we see that too. Like most people hit start. Um, like it's like the, I'm feeling lucky. Yeah. Yeah. All right. So like I, I can just add a, add a step here and what you'll see is it should like refine the plan and show you a new thing to propose. Here we go. So it's added step seven, find information on milk and meat labeling requirements in the US and the EU, or you can just go ahead and hit start. I think it's still like a nice transparency mechanism. Even if users don't want to engage, like you still kind of know, okay, here's at least an understanding of why I'm getting the report I'm going to get, um, which is kind of nice. And then while it browses the web (and Mukund, you should maybe explain kind of how it, how it browses), we show kind of the, the websites it's reading in real time. Yeah. I'll preface this with, I haven't, I forgot to explain the roles. You're a PM and you're a tech lead. Yes. Okay. Yeah. Aarush [00:06:29]: Just for people who, who don't know, we maybe should have started with that, I suppose. Yeah. Yeah. We do each other's work sometimes as well, but more or less that's the boundary. Yeah. Yeah. Um, yeah. So, so what's happening behind the scenes actually is we kind of give this research plan that is a contract and that, uh, you know, has been accepted, but then if you look at the plan, there are things that are obviously parallelizable, so the model figures out which of the sub steps it can start exploring in parallel, and then it primarily uses like two tools. It has the ability to perform searches and it has abilities to go deeper within, you know, a particular webpage of interest, right? And oftentimes it'll start exploring things in parallel, but that's not sufficient. Many times it, it has to reason based on information found. So in this case, one of the searches could have found that the EU commission has these additives, and it wants to go and check if the FDA does the same thing, right? So, uh, this notion of being able to read outputs from the previous turn, uh, ground on that to decide what to do next, I think was, was key. Otherwise you have like incomplete information and your report becomes a little bit of a, like a high level, uh, bullet points. So we wanted to go beyond that blueprint and actually figure out, you know, what are the key aspects here. So, yeah. So the, this happens iteratively until the model thinks it's finished all its steps. And then we kind of enter this, uh, analysis mode, and here there can be inconsistencies across sources. You kind of come up with an outline for the report, start generating a draft. The model tries to revise that by self critiquing itself, uh, you know, to finalize the report. And that's probably what's happening behind the scenes.

Search ranking for sources

Alessio [00:08:18]: What's the initial ranking of the websites? So when you first started it, there were 36.
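Read back as code, the behind-the-scenes description above looks roughly like the sketch below: independent plan steps fan out in parallel, each step alternates between the two tools (search, and going deeper into a page) based on what it has found so far, and a final analysis pass drafts, self-critiques, and revises the report. `search`, `browse`, and `llm` are hypothetical stand-ins rather than the actual Gemini internals.

```python
from concurrent.futures import ThreadPoolExecutor

def search(query: str) -> list[str]:       # hypothetical search tool
    raise NotImplementedError

def browse(url: str) -> str:               # hypothetical "go deeper into one page" tool
    raise NotImplementedError

def llm(prompt: str) -> str:               # hypothetical model call
    raise NotImplementedError

def research_step(step: str) -> str:
    """Explore one plan step, reasoning over each result to decide what's next."""
    findings: list[str] = []
    pending = [step]
    while pending:
        findings.extend(search(pending.pop()))
        action = llm(f"Findings so far for '{step}': {findings!r}. "
                     "Reply FOLLOWUP:<query>, READ:<url>, or DONE.")
        if action.startswith("FOLLOWUP:"):
            pending.append(action.split(":", 1)[1])   # e.g. check the FDA equivalent
        elif action.startswith("READ:"):
            findings.append(browse(action.split(":", 1)[1]))
    return "\n".join(findings)

def run_plan(plan_steps: list[str]) -> str:
    # Obviously parallelizable sub-steps are explored concurrently.
    with ThreadPoolExecutor() as pool:
        notes = list(pool.map(research_step, plan_steps))
    draft = llm("Outline and draft a report from these notes:\n" + "\n\n".join(notes))
    critique = llm("Critique this draft for gaps and inconsistencies:\n" + draft)
    return llm("Revise the draft using the critique.\nDraft:\n" + draft
               + "\nCritique:\n" + critique)
```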
How do you decide where to start since it sounds like, you know, the initial websites kind of carry a lot of weight too, because then they inform the following. Yes.Aarush [00:08:32]: So what happens in the initial terms, again, this is not like a, it's not something we enforce. It's mostly the model making these choices. But typically we see the model exploring all the different aspects in the, in the research plan that was presented. So we kind of get like a breadth first idea of what are the different topics to explore. And in terms of which ones to double click. I think it really comes down to every time you search the model, get some idea of what the pages and then depending on what pieces of it, sometimes there's inconsistency. Sometimes there's just like partial information. Those are the ones that double clicks on and, uh, yeah, it can continually like iteratively search and browse until it feels like it's done. Yeah.Swyx [00:09:15]: I'm trying to think about how I would code this. Um, a simple question would be like, do you think that we could do this with the Gemini API? Or do you have some special access that we cannot replicate? You know, like is, if I model this with a so-called of like search, double click, whatever. Yeah.Aarush [00:09:31]: I don't think we have special access per se. It's pretty much the same model. We of course have our own, uh, post-training work that we do. And y'all can also like, you know, you can fine tune from the base model and so on. Uh, I don't know that we can do it.Swyx [00:09:45]: I don't know how to fine tuning.Aarush [00:09:47]: Well, if you use our Gemma open source models, uh, you could fine tune. Yeah. I don't think there's a special access per se, but a lot of the work for us is first defining these, oh, there needs to be a research plan and, and how do you go about presenting that? And then, uh, a bunch of post-training to make sure, you know, it's able to do this consistently well and, uh, with, with high reliability and power. Okay.Swyx [00:10:09]: So, so 1.5 pro with deep research is a special edition of 1.5 pro. Yes.Aarush [00:10:14]: Right.Swyx [00:10:14]: So it's not pure 1.5 pro. It's, it's, it's, it's a post-training version. This also explains why you haven't just, you can't just toggle on 2.0 flash and just, yeah. Right. Yeah. But I mean, I, I assume you have the data and you know, it's should be doable. Yup. There's still this like question of ranking. Yeah. Right. And like, oh, it looks like you're, you're already done. Yeah. Yeah. We're done. Okay. We can look at it. Yeah. So let's see. It's put together this report and what it's done is it's sort of broken, started with like milk regulation and then it looks like it goes into meat probably further down and then sort of covering how the U.S. approaches this problem of like how to regulate milk. Comparing and then, you know, covering the EU and then, yeah, like I said, like going into the meat production and then it'll also, what's nice is it kind of reasons over like why are there differences? And I think what's really cool here is like, it's, it's showing that there's like a difference in philosophy between how the U.S. and the EU regulate food. So the EU like adopts a precautionary approach. So even if there's inconclusive scientific evidence about something, it's still going to prefer to like ban it. Whereas the U.S. takes sort of the reactive approach where it's like allowing things until they can be proven to be harmful. Right. 
So like, this is kind of nice is that you, you also like get the second order insights from what it's being put, what it's putting together. So yeah, it's, it's kind of nice. It takes a few minutes to read and like understand everything, which makes for like a quiet period doing a podcast, I suppose. But yeah, this is, this is kind of how it, how it looks right now. Yeah.Alessio [00:11:47]: And then from here you can kind of keep the usual chat and iterate. So this is more, if you were to like, you know, compared to other platforms, it's kind of like a Anthropic Artifact or like a ChatGPT canvas where like you have the document on one side and like the chat on the other and you're working on it.Aarush: [00:12:04]: Yeah. This is something we thought a bit about. And one of the things we feel is like your learning journey shouldn't just stop after the first report. And so actually what you probably want to do is while reading, be able to ask follow-up questions without having to scroll back and forth. And there's like broadly. A few different kinds of follow-up questions. One type is like, maybe there's like a factoid that you want that isn't in here, but it's probably been already captured as part of the web browsing that it did. Right. So we actually keep everything in context, like all the sites that it's read remain in context. So if there's a piece of missing information, it can just fetch that. Then another kind is like, okay, this is nice, but you actually want to kick off more deep research. Or like, I also want to compare the EU and Asia. Let's say in how they regulate milk and meat for that. You'd actually want the model to be like, okay, this is sufficiently different that I want to go do more deep research to answer this question. I won't find this information in what I've already browsed. And the third is actually, maybe you just want to like change the report. Like maybe you want to like condense it, remove sections, add sections, and actually like iterate on the report that you got. So we broadly are basically trying to teach the model to be able to do all three and the kind of side-by-side format allows sort of for the user to do that more easily. Yeah.Alessio [00:13:24]: So as a PM, there's a open in docs button there, right? Yeah. How do you think about what you're supposed to build in here versus kind of sounds like the condensing and things should be a Google docs. Yeah.Aarush [00:13:35]: It's just like an amazing editor. Like sometimes you just want to direct edit things and now Google docs also has Gemini in the side panel. So the more we can kind of help this be part of your workflow throughout the rest of the Google ecosystem, the better, right? Like, and one thing that we've noticed is people really like that button and really like exporting it. It's also a nice way to just save it permanently. And when you do export all the citations, and in fact, I can just run it now, carry over, which is also really nice. Gemini extensions is a different feature. So that is really around Gemini being able to fetch content from other Google services in order to inform the answer. So that was actually the first feature that we both worked on on the team as well. It was actually building extensions in Gemini. And so I think right now we have a bunch of different Google apps as well as I think Spotify and a couple, I don't know if we have, and Samsung apps as well. Who wants Spotify? I have this whole thing about like who wants Spotify? Who wants that in their deep research? 
In deep research, I think less, but like the interesting thing is like we built extensions and we didn't, we weren't really sure how people were going to use it. And a ton of people are doing really creative things with them. And a ton of people are just doing things that they loved on the Google assistant. And Spotify is like a huge, like playing music on the go was like a huge, a huge value. Oh, it controls Spotify? Yeah. It's not deep research. For deep research. Yeah. Purely use. Yeah. But this is search. Otherwise, yeah. Like you can, you can have Gemini go. Yeah. You have YouTube maps and search for flash thinking experimental with apps. The newest. Yeah. Longest model name that has been launched. But like, yeah, I think Gmail is obvious one. Yeah. The calendar is obvious one. Exactly. Those I want. Yeah. Spotify. Yeah. Fair enough. Yeah. And obviously feel free to dive in on your other work. I know you're, you're not just doing deep research, right? But you know, we're just kind of focusing on, on deep research here. I actually have asked for modifications after this first run where I was like, oh, you, you stopped. Like, I actually want you to keep going. Like what about these other things? And then continue to modify it. So it really felt like a little bit of a co-pilot type experience, but more like an experience. Yeah, we're just that much more than an agent that would be research. I thought it was pretty cool.UX and research plan editingAarush [00:15:52]: Yeah. One of the challenges is currently we kind of let the model decide based on your query amongst the three categories. So some, there is, there is a boundary there. Like some of these things, depending on how deep you want to go, you might just want to quite g thermometer versus like kick off another deeper search. And even from a UX perspective, I think the, the panel allows for this notion of, you know, not every fall up is going to take you. Like five minutes. Right.Swyx [00:16:17]: Right now, it doesn't do any follow-up. Does it do follow-up search? It always does?Aarush [00:16:21]: It depends on your question. Since we have the liberty of really long context models, we actually hold all the research material across dance. So if it's able to find the answer in things that it's found, you're going to get a faster reply. Yeah. Otherwise, it's just going to go back to planning.Swyx [00:16:38]: Yeah, yeah. A bit of a follow-up on the, since you brought up context, I had two questions. One, do you have a HTML to markdown transform step? Or do you just consume raw HTML? There's no way you consume raw HTML, right?Aarush [00:16:50]: We have both versions, right? So there is, the models are getting, like every generation of models are getting much better at native understanding of these representations. I think the markdown step definitely helps in terms of, you know, there's a lot of noise, like as you can imagine with the pure HTML. JavaScript, WinCSS. Exactly. So yeah, when it makes sense to do it, we don't artificially try to make it hard for the model. But sometimes it depends on the kind of access of what we get as well. Like, for example, if there's an embedded snippet that's HTML, we want the model to, you know, to be able to work on that as well.Swyx [00:17:27]: And no vision yet, but. Currently no vision, yes. The reason I ask all these things is because I've done the same. Got it. Like I haven't done vision.Aarush [00:17:36]: Yeah. 
So the tricky thing about vision is I think the models are getting significantly better, especially if you look at the last six months, natively being able to do like VQA stuff, and so on. But the challenge is the trade-off between having to, you know, actually render it and so on. The gap, the trade-off between the added latency versus the value add you get.Swyx [00:17:57]: You have a latency budget of minutes. Yeah, yeah, yeah.Aarush [00:18:01]: It's true. In my opinion, the places you'll see a real difference is like, I don't know, a small part of the tail, especially in like this kind of an open domain setting. If you just look at what people ask, there's definitely some use cases where it makes a lot of sense. But I still feel it's not in the head cases. And we'll do it when we get there.Swyx [00:18:23]: The classic is like, it's a JPEG that has some important information and you can't touch it. Okay. And then the other technical follow-up was just, you have 1 million to 2 million token context. Has it ever exceeded 2 million? And what do you do there? Yeah.Aarush [00:18:39]: So we had this challenge sometime last year where we said, when we started like wiring up this multi-turn, where we said, hey, we're going to do this. Hey, let's see how long somebody in the team can take DR, you know? Yeah.Swyx [00:18:51]: What's the most challenging question you can ask that takes the longest? Yeah. No, we also keep asking follow-ups.Aarush [00:18:55]: Like for example, here you could say, hey, I also want to compare it with like how it's Okay.Swyx [00:19:00]: So you're guaranteed to bust it. Yeah.Aarush [00:19:02]: Yeah. We also have, we have retrieval mechanisms if required. So we natively try to use the context as much as it's available beyond which, you know, we have like a rack set up to figure. Okay.Alessio [00:19:16]: This is all in-house, in-house tech. Yes. Okay.Aarush [00:19:19]: Yes.Alessio [00:19:19]: What are some of the differences between putting things in context versus rag? And when I was in Singapore, I went to the Google cloud team and they talk about Gemini plus grounding is Gemini plus search kind of like Gemini plus grounding or like, how should people think about the different shades of like, I'm doing retrieval and data versus I'm using deep research versus I'm using grounding. Sometimes the labels can be different. Sometimes it can be hard too.Aarush [00:19:46]: Yeah. I can, let me try to answer the first part of the question. Uh, the, the second part, I'm not fully sure of, of the grounding offering. So, uh, uh, when I can at least, at least talk about the first part of the question. So I think, uh, you're asking like the difference between like being able to, when you, when would you do a rag versus rely on the long contact?Alessio [00:20:06]: I think we all, we all get that. I was more curious, like from a product perspective, when you decide to do a rag versus s**t like this, you didn't need to, you know? Yeah. Do you get better performance just putting everything in context or?Aarush [00:20:18]: So the tricky thing for rag, it really works well because a lot of these things are doing like cosine distance, like a dot product kind of a thing. And that kind of gets challenging when your query side has multiple different attributes. Uh, the dot product doesn't really work as well. I would say, at least for me, that's, that's my guiding principle on, uh, when to avoid rag. That's one. The second one is, I think every generation. 
Of these models are, uh, like the initial generations, even though they offered like long context, that performance as the context kept growing was, you would see some kind of a decline, but I think, uh, as the newer generation models came out, uh, they were really good. Even if you kept filling in the context in being able to piece out, uh, like these really fine-grained information.Evaluating Deep ResearchSwyx [00:21:06]: So I think these two, at least for me, are like guiding principles on when to. Just to add to that. I think like, just like a simple rule of thumb that we use. Is like, if it's the most recent set of research tasks where the user is likely to ask lots of follow-up questions that should be in context, but like as stuff gets 10 tasks ago, you know, it's fine. If that stuff is in rag, because it's less likely that the user needs to do, you need to do like very complex comparisons between what's currently being discussed and the stuff that you asked about, you know, 10 turns ago. Right. So that's just like a, a very, like the rule of thumb that we follow. Yeah.Alessio [00:21:44]: So from a user perspective, is it better to just start a new research instead of like extending the context? Yeah.Aarush [00:21:50]: I think that's a good question. I think if it's a related topic, I think there's benefit to continue with this thread, uh, because you could, the model, since it has this in memory could figure out, oh, I've found this niche thing, uh, about, uh, I don't know, milk regulation in this case in the U S let me check if you're in a follow-up country or place also has something like that. So these kinds of things you might have not caught up. But if you start a new thread. So I think it really depends on, on the use case, if there's a natural progression, uh, and you feel like this is like part of one cohesive kind of a project, you should just continue using it. My follow-up is going to be like, oh, I'm just going to look for summer camps or something then. Yeah. I don't think it should make a difference, but we haven't really, uh, you know, pushed that to, uh, and, and, and tested that, that aspect of it for us. Most of our tests are like more natural transitions. Yeah.Swyx [00:22:40]: How do you eval deep research? Oh boy.Aarush [00:22:43]: Uh, yeah. This is a hard one. I think the entropy of the output space is so high, like it's, uh, like people love auto raters, but it brings its own, own, own set of, uh, challenges. And so for us, we have some metrics that we can auto generate, right? So for example, as we move, uh, when we do post-training and have multiple, uh, models, we kind of want to make sure, uh, the distribution of like certain stats, like for example, how long is spent on planning? How many, how many iterative steps it does on like some dev set, if you see large changes in distribution, that's, that's kind of like a early, uh, signal of, of something has changed. It could be for better or worse. Uh, so we have some metrics like that, that we can auto compute. So every time you have a new version, you run it across a test suite of cases and you see how long it takes. Yeah. So we have like a dev set and we have like some kind of automatic metrics that we can detect in terms of like the behavior end to end. Like for example, how long is the research plan? 
Do we, do we have like a, do we have like a, do we have like a, do we have like a, do we have like a new model is like a new model, produce really longer, many more steps, number of characters, like number of steps in case of the plan in the plans, it could be like, like we spoke about how it iteratively plans based on like previous searches, how many steps does that go on an average or some dev set. So there are some things like this you can automate, but beyond that, there are all generators, but we definitely do a lot of human evals and that we have defined with product about certain things we care about. I've been super opinionated about, is it comprehensive, is it complete, like groundedness and these kind of things. So it's a mix of these two attributes. There's another challenge, but I'll...Swyx [00:24:26]: Is this where, the other challenge in that, sometimes you just have to have your PM review examples. Yeah, exactly.Aarush [00:24:34]: Yeah, and for latency... So you're the human reader. But broadly, what we tried to do is, for the eval question, is like, we tried to think about like, what are all the ways in which a person might use a feature like this? And we came up with what we call an ontology of use cases. Yes. And really what we tried to do is like, stay away from like verticals, like travel or shopping and things like that. But really try and go into like, what is the underlying research behavior type that a person is doing? So... Yeah. There's queries on one end that are just, you're going very broad, but shallow, right? Things like, shopping queries are an example of that, or like, I want to find the perfect summer camp, my kids love soccer and tennis. And really, you just want to find as many different options and explore all the different options that are available, and then synthesize, okay, what's the TLDR about each one? Kind of like those journeys where you open many, many Chrome tabs, but then like, need to take notes somewhere of the stuff that's appealing. On the other end of the spectrum... You know, you've got like, a specific topic, and you just want to go super deep on that and really, really understand that. And there's like, all sorts of points in the middle, right? Around like, okay, I have a few options, but I want to compare them, or like, yeah, I want to go not super deep on a topic, but I want to cover a slightly, slightly more topics. And so we sort of developed this ontology of different research patterns, and then for each one came up with queries that would fall within that, and then that's sort of the eval set, by way of saying, okay, what's the TLDR about each one? Which we then run human evals on, and make sure we're kind of doing well across the board on all of those. Yeah, you mentioned three things. Is it literally three, or is it three out of like, 20 things? How wide is the ontology? I basically just told the... The full set? Yeah, I told, no, no, no, I told you the like, extremes, right? Extremes, okay. Yeah, and then we had like, several midpoints. So basically, yeah, going from like, something super broad and shallow to something very specific and deep. We weren't actually sure which end of the spectrum users are going to really resonate with. And then on top of that, you have compounds of those, right? So you can have things where you want to make a plan, right? Like, a great one is like, I want to plan a wedding in, you know, Lisbon, and I, you know, I need you to help with like, these 10 things, right? And so... 
Oh, that becomes like a project with research enabled... Right. And so then it needs to research planners, and venues, and catering, right? And so there's sort of compounds of when you start combining these different underlying ontology types. And so that, we also thought about that when we... When we tried to put together our eval set. Swyx: What's the maximum conversation length that you allow or design for? Aarush: We don't have any hard limits on the... How many turns you can do. One thing I will say is most users don't go very deep right now. Yeah. It might just be that it takes a while to get comfortable. And then over time, you start pushing it further and further. But like, right now, we don't see a ton of users. I think the way that you visually present it suggests that you stop when the doc is created. Right. So you don't... You don't actually really encourage... The UI doesn't encourage ongoing chats as though it was like a project. Right. I think there's definitely some things we can do on the UX side to basically invite the user to be like, Hey, this is the starting point. Now let's keep going together. Like, where else would you like to explore? So I think there's definitely some explorations we could do there. I think the... In terms of sort of how deep... I don't know. We've seen people internally just really push this thing. Yeah. To quite...Ontology of use cases and research patternsAarush [00:28:06]: I think the other thing I think will change with time is people kind of uncovering different ways to use deep research as well. Like for the wedding planning thing, for example. It's not one of the, you know, first thing that comes to mind when we tell people about this product. So that's another thing I think as people explore and find that this can do these various different kinds of things. Some of this can naturally lead to longer conversations. And even for us, right? When we dogfooded this, we saw people use it in, like, ways we hadn't really thought of before. So that was because this was, like, a little new. Like, we didn't know, like, will users wait for five minutes? What kind of tasks will... Are they, you know, going to try for something like that takes five minutes? So our primary goal was not to specialize in a particular vertical or target one type of user. We just wanted to put this in the hands of, like... Like, we had, like... This busy parent persona and, like, various different user profiles and see, like, what people try to use it for and learn more from that.Alessio [00:29:11]: And how does the ontology of the DR use case tie back to, like, the Google main product use cases? So you mentioned shopping as one ontology, right? There's also Google Shopping. Yeah. To me, this sounds like a much better way to do shopping than going on Google Shopping and looking at the wall of items. How do you collaborate internally to figure out where AI goes?Swyx [00:29:32]: Yeah, that's a great question. So when I meant, like, shopping, I sort of tried to boil down underneath what exactly is the behavior. And that's really around, like, I called it, like, options exploration. Like, you just want to be able to see. And whether you're shopping for summer camps or shopping for a product or shopping for, like, scholarship opportunities, it's sort of the same action of just, like, I need to curate from a large... Like, I need to sift through a lot of information to curate a bunch of options for me. 
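One way to picture the eval setup being described: an ontology of research behaviors, each with representative queries, whose union forms the eval set that human raters score on dimensions like comprehensiveness and groundedness. The categories and queries below are illustrative guesses at that spectrum, not the team's internal taxonomy.

```python
# Illustrative ontology of research behaviors (not Google's internal taxonomy):
# a spectrum from broad-and-shallow to specific-and-deep, plus compound "project"
# queries, with the union of examples forming a human-eval set.
ONTOLOGY = {
    "broad_shallow": [           # curate many options, light detail on each
        "Find summer camps near me for kids who love soccer and tennis",
    ],
    "comparison": [              # a handful of options, compared in depth
        "Compare three home HVAC systems on efficiency, cost, and installation",
    ],
    "specific_deep": [           # one topic, maximum depth
        "How do milk and meat regulations differ between the US and Europe?",
    ],
    "compound_project": [        # planning tasks that chain several behaviors
        "Help me plan a wedding in Lisbon: venues, catering, and planners",
    ],
}

RUBRIC = ("comprehensive", "complete", "grounded")

def build_eval_set(ontology: dict[str, list[str]]) -> list[dict]:
    """Flatten the ontology into rateable items for human evaluation."""
    return [
        {"category": category, "query": query, "ratings": {dim: None for dim in RUBRIC}}
        for category, queries in ontology.items()
        for query in queries
    ]

eval_set = build_eval_set(ONTOLOGY)
```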
So that's kind of what we tried to distill down rather than, like, thinking about it. It was a vertical. But yeah, Google Search is, like, awesome if you want to have really fast answers. You've got high intent for, like, I know exactly what I want. And you want, like, super up-to-date information, right? And I still do kind of like Google Shop because it's, like, multimodal. You see the best prices and stuff like that. I think creating a good shopping experience is hard, especially, like, when you need to look at the thing. If I'm shopping for shoes and, like, I don't want to use deep research because I want... I don't want to look at how the shoes look. But if I'm shopping for, like, HVAC systems, great. Like, I don't care how it looks or I don't even know what it's supposed to look like. And I'm fine using deep research because I really want to understand the specs and, like, how exactly does this work and the voltage rating and stuff like that, right? So, like, and I need to also look at contractors who know how to install each HVAC system. So I would say, like, where we really shine when it comes to shopping is those... That kind of end of the spectrum of, like, it's more complex and it matters less what it... Like, it's maybe less on the consumery side of shopping. One thing I've also observed just about the, I guess, the metrics or, like, the communication of what value you provide. And also this goes into the latency budget, is that I think there's a perverse incentive for research agents to take longer and be perceived to be better. People are like, oh, you're searching, like, 70 websites for me, you know, but, like, 30 of them are irrelevant, you know? Like, I feel like right now we're in kind of a honeymoon phase where you get a pass for all this. But being inefficient is actually good for you because, you know, people just care about quantity and not quality, right? So they're like, oh, this thing took an hour for me, like, it's doing so much work, like, or it's slow. That was super counterintuitive for us. So actually, the first time I realized that, what you're saying is when I was talking to Jason Calacanis and he was like, do you actually just make the answer in 10 seconds and just make me wait for the balance? Yeah. Which we hadn't expected. That people would actually value the, like, work that it's putting in because... You were actually worried about it. We were really worried about it. We were like, I remember, we actually built two versions of deep research. We had, like, a hardcore mode that takes, like, 15 minutes. And then what we actually shipped is a thing that takes five minutes. And I even went to Eng and I was like, there has to be a hard stop, by the way. It can never take more than 10 minutes. Yep. Because I think at that point, like, users will just drop off. Nope. But what's been surprising is, like, that's not the case at all. And it's been going the other way. Because when we worked on Assistant, at least, and other Google products, the metric has always been, if you improve latency, like, all the other metrics go up. Like, satisfaction goes up, retention goes up, all of that, right? And so when we pitch this, it's like, hold on. In contrast to, like, all Google orthodoxy, we're actually going to slow everything right down. And we're going to hope that, like, users still stay... Not on purpose.User perceptions of latency in Deep ResearchAarush [00:32:56]: Not on purpose. Yeah, I think it comes down to the trade-off. Like, what are you getting in return? 
For the wait. And from an engineering-slash-modeling perspective, it's just trading off inference compute and time to do two things, right? Either to explore more, to be, like, more complete, or to verify more on things that you probably know already. And since it's like a spectrum, and we don't claim to have found the perfect spot, we had to start somewhere. And we're trying to see where... Like, there's probably some cases where you actually care about verifying more. More than the others. In an ideal world, based on the query and conversation history, you know what that is. So I think, yeah, it basically boils down to these three things. From a user perspective, am I getting the right value add? From an engineering-slash-modeling perspective, are we using the compute to either explore effectively and also verify and go in-depth for things that are vague or uncertain in the initial steps? The other point about the more number of websites, I think, again, it comes down to the number of websites. Sometimes you want to explore more early on before you kind of narrow down on either the sources or the topics you want to go deep. So that's one of the... If you look at, like, the way, at least for most queries, the way deep research works here is initially it'll go broad. If you look at the kinds of websites, it tries to explore all the different topics that we mentioned in the research plan. And then you would see choices of websites getting a little bit narrower on a particular topic. So that's roughly how the number kind of fluctuates. So we don't do anything deliberate to either keep it low or, you know, try to... Swyx [00:34:44]: Would it be interesting to have an explicit toggle for amount of verification versus amount of search? I think so. I think, like, users would always just hit that toggle. I worry that, like... Max everything. Yeah, if you, like, give a max power button, users will always... You're just going to hit that button, right? So then the question comes, like, why don't you just decide from the product POV, where's the right balance? OpenAI has a preview of this, like... I think it's either Anthropic or OpenAI, and there's a preview of this model routing feature where you can choose intelligence, cheapness, and speed. But then they're all zero to one values. So then you just choose one for everything. Obviously, they're going to, like, do a normalization thing. But users are always going to want one, right? Aarush [00:35:30]: We've discussed this a bit. Like, if I wear my pure user hat, I don't want to say anything. Like, I come with a query, you figure it out. Like, sometimes I feel like there will be, based on the query... Like, for example, right? If I'm asking about, hey, how do rising rates from the Fed affect household income for the middle class? And how has it traditionally happened? These kind of things, you want to be very accurate. And you want to be very precise on historical trends of this, and so on, and so on. Whereas there is... There's a little bit more leeway when you're saying, hey, I'm trying to find businesses near me to go celebrate my birthday or something like that. So in an ideal world, we kind of figure out that trade-off based on the conversation history and the topic. I don't think we're there yet as a research community. And it's an interesting challenge by itself. Swyx [00:36:20]: So this reminds me a little bit of the NotebookLM approach.
I also asked this thing to Raiza, and she was like, yeah, just people want to click a button and see magic. Yeah. Like you said, you just hit start every time, right? You don't, most people don't even want to edit the plan. So, okay. My feedback on this, if you want feedback, is that I am still kind of a champion for Devin. In a sense that Devin will show you the plan while it's working the plan. And you can say like, hey, the plan is wrong. And I can chat with it while it's still working. And you live update the plan and then pick off the next item on the plan. I think it's static, right? Like while you're working on a plan, I cannot chat. It's just normal. Bolt also has this, like, you know, that's the most default experience, but I think you should never lock the chat. You should always be able to chat with the plan and update the plan, and the plan scheduler, whatever orchestration system you have under the hood, should just pick off the next job on the list. That'll be my two cents. Especially if we spend more time researching, right? Cause like right now, if you watch that query we just did, it was done within a few minutes. So your chance, your opportunity to chime in was actually like, or it left the research phase after a few minutes. So your opportunity to chime in and steer was less, but especially imagine, you could imagine a world where these things take an hour, right? And you're doing something really complicated. Then yeah, like your intern would totally come check in with you. Be like, here's what I found. Here's like some hiccups I'm running into with the plan. Give me some steer on how to change that or how to change direction. And you would, you would do that with them. So I totally would see, especially as these tasks get longer, we actually want the user to come engage way more to like create a good output. I guess Devin had to do this because some of these jobs like take hours. Right. So, yeah. And it's perverse since they charge by the hour. Oh, so they make more money, the slower they are. Interesting. Have we thought about that before? Swyx [00:38:14]: I'm calling this out because everyone is like, oh my God, it takes hours for, it does hours of work autonomously for me. And then they are like, okay, it's good. But like, this is a honeymoon phase. Like at some point we're going to say like, okay, but you know, it's very slow. Swyx [00:38:29]: Yeah. Anything else? Anything else that like, I mean, obviously within Google, you have a lot of other initiatives; you, I'm sure you like sit close to the NotebookLM team. Any learnings that are coming from shipping AI products in general? They're really awesome people. Like they're really nice, friendly, thoughtful, just like as people. I'm sure you met them, you like realize this with Raiza and stuff. So like, they've actually been really, really cool collaborators or just like people to bounce ideas off. I think one thing I found really inspiring is they just picked a problem, and hindsight's 20/20, but like in advance, just like, hey, we just want to build like the perfect IDE for you to do work and like be able to upload documents and ask questions about it, and just make that really, really good. And I think we were definitely really inspired by their ability, their vision of just like, let's pick up a simple problem, really go after it, do it really, really well, and be opinionated about how it should work, and just hope that users also resonate with that.
And that's definitely something that we tried to learn from separately. They've also been really good at, you know, and maybe Mukund, if you want to chime in here, just extracting the most out of Gemini 1.5 Pro, and they were really friendly about just like sharing their ideas about how to do that. Aarush [00:39:38]: Yeah, I think, I think you, you, you learn a bit, like when you're trying to do the last, last mile off of these products and, and, and, and pitfalls of, of any, any given model and so on. So, yeah, we definitely have a healthy relationship and, and, and share notes, and like you're doing the same for other, other products. Swyx [00:39:54]: You'll never merge, right? It's just different teams. They are different teams. So they're in like Labs as an organization. So the mission of that is to really explore kind of different bets and, and explore what's possible. Even though I think there's a paid plan for NotebookLM now. Yeah. So I think, and it's the same plan as us actually. So it's like, it's more than just the Labs is what I'm saying. It's more than just Labs. Cause I mean, yeah, ideally you want things to graduate and into, and stick around, but hopefully one thing we've done is, uh, like not created different SKUs, but just being like, hey, if you pay for the AI Premium SKU, yeah, whatever, you get, you get everything. Alessio [00:40:30]: What about learning from others? Obviously, I mean, OpenAI's deep research literally has the same name. I'm sure. Yeah. I'm sure there's a lot of, you know, contention. Is there anything you've learned from other people trying to build similar tools? Like, do you have opinions on maybe what people are getting wrong that they should do differently? It seems like from the outside, a lot of these products look the same. Ask for a research, get back a research, but obviously when you're building them, you understand the nuances a lot more.

Lessons from other AI products

Aarush [00:40:59]: When we built deep research, I think there were a few things where we took a few different bets, uh, around how this, how it should work. And what's nice is some of that is actually where we feel like was the right way to go. So we felt like agents should be transparent around telling you upfront, especially if they're going to take some time, what they're going to do. So that's really where that research plan came from; we showed that in a card. We really wanted to be very publisher forward in this product. So while it's browsing, we wanted to show you like all the websites it's reading in real time, make it super easy for you to like double-click into those while it's browsing. And the third thing is, you know, putting it into a side-by-side artifact so that it's ideally easy for you to read and ask at the same time. And what's nice is, as other products come around, you see some of these ideas also appearing in, in other iterations of this product. So I definitely see this as a space where like everyone in the industry is learning from each other, good ideas get reproduced and built upon. And so, yeah, we'll, we'll definitely keep iterating. And, and kind of following our users and seeing, seeing how we can make, make our feature better. But yeah, I think, I think like it's, it's like, this is the way the industry works is like, everyone's going to kind of see good ideas and want to replicate and build off of it. Alessio [00:42:12]: And on the model side, OpenAI has the o3 model, which is not available through the API, the full one.
Have you tried already with the 2.0 model? Like, is it a big jump, or is a lot of the work on the post-training? Aarush [00:42:25]: Yeah, I would say stay tuned. Definitely. It currently is running on, on 1.5. The, the new generation models, especially with these thinking models, they unlock a few things. So I think one is obviously the better capability in like analytical thinking, like in math, coding, and these type of things, but also this notion of, you know, as they produce thoughts and think before taking actions, they kind of inherently have this notion of being able to critique the partial steps that they take and so on. So yeah, we definitely expect that. And then there is the interesting part: we're exploring multiple different options to make better value for the, for our users as we iterate. Swyx [00:43:03]: I feel like there's a little bit of a conflation of inference time compute here, in a sense of like, one, you can spend inference compute with the model, the thinking model. And then two, you can spend inference compute by searching and reasoning. I wonder if that gets in the way, like when you, presumably, you've tested thinking plus deep research, if the thinking actually does a little bit of verification. And then there's a little bit of thinking plus deep research. Maybe it saves you some time, or it like tries to draw too much from its internal knowledge and then therefore searches less, you know, like does it step on each other? Aarush [00:43:36]: Yeah, no, I think that's a, that's a really nice call out. And this also goes back to the kind of use case. The reason I bring that up is there are certain things that I can tell you from model memory: last year, the Fed did X number of updates and so on. But unless I sourced it, it's going to be hallucinated. Yeah, like one is the hallucination, or even if I got it right, as a user, I'd be very wary of that number unless I'm able to like source the .gov website for it and so on. Right. So that's another challenge. Like, there are things that you might not optimally spend time verifying, even though the models like, like, this is a very common fact the model already knows and it's able to like reason over, and balancing that out between trying to leverage the model memory versus being able to ground this in, you know, some kind of a source is the challenging part. And I think as, as like you rightly called out, with the thinking models, this is even more pronounced because the models know more, they're able to like draw second order insights more just by reasoning over. Swyx [00:44:44]: Technically, they don't know more, they just use their internal knowledge more. Right? Aarush [00:44:48]: Yes, but also like, for example, things like math. Swyx [00:44:52]: I see, they've been, they've been post trained to do better math. Aarush [00:44:55]: Yeah, I think they just, they probably do a way better job in, like in, in that, so in that sense, they.

Technical challenges in Deep Research

Swyx [00:45:02]: Yeah, I mean, obviously reasoning is a topic of huge interest and people want to know what the engineering best practices are. Like, we think we know, like, you know, how to prompt them better, but engineering with them, I think, is also very, very unknown. Again, you guys are going to be the first to figure it out. Aarush [00:45:19]: Yeah, definitely interesting times and yeah. No pressure, Mukund.
If you have tips, let us know.Swyx [00:45:25]: While we're on the sort of technical elements and technical bent, I'm interested in other parts of the deep research tech stack that might be worth calling out. Any hard problems that you solved, just more generally?Aarush [00:45:37]: Yeah, I think the iterative planning one — doing it in a generalizable way. That was the thing I was most wary about. You don't want to go down the route of having to teach it how to plan iteratively per domain or per type of problem. Like, even going back to the ontology — if you had to teach the model, for every single type of ontology, how to come up with these traces of planning, that would have been a nightmare. So it's about trying to do that in a super data-efficient way by leveraging a lot of things like model memory. And there's this very tricky balance when you work on the product side of any of these models, which is knowing how to post-train it just enough without losing things that it knows from pre-training — basically not overfitting, in the most trivial sense, I guess. But yeah, so there are techniques — data augmentations and multiple experiments to tune this trade-off. I think that's one of the challenges. Yeah.Swyx [00:46:37]: On the orchestration side, this is basically you're spinning up a job. I'm an orchestration nerd. So how do you do that?Aarush [00:46:43]: Is it like some internal tool? Yeah, so we built this asynchronous platform for deep research. Most of our interactions before this were sync in nature. Like, yeah. Yeah.Swyx [00:46:56]: All the chat things are sync, right? Exactly. And now you can leave the chat and come back. Exactly.Aarush [00:47:01]: And close your computer. And now it's on Android and rolling out on iOS.Mukund [00:47:06]: So I saw you say that.Swyx [00:47:10]: I told you we switch it on sometimes. Okay.Mukund [00:47:13]: Like you're reminding him, right?Swyx [00:47:14]: Yeah, we ramped on all Android phones and then iOS is this week. But yeah, what's neat, though, is you can close your computer, get a notification on your phone, and so on. So it's some kind of async engine that you made.Aarush [00:47:29]: Yes, yes. So the other piece is this notion of asynchronicity and the user being able to leave. But also, if you build five-, six-minute jobs, there are bound to be failures, and you don't want to lose your progress and so on. So there's this notion of keeping state, knowing what to retry, and kind of keeping the journey going. Is there a public name for this or just some internal thing?Swyx [00:47:52]: No, I don't think there's a public name for this.Aarush [00:47:54]: Yeah.Swyx [00:47:54]: All right. Data scientists would be like, this is a Spark job, or, you know, it's like a Ray thing, or whatever — in the old Google days it might be MapReduce or whatever — but it's a different scale and nature of work than those things. So I'm trying to find a name for this. And right now, this is our opportunity to name it. We can name it now. The classic name is — I used to work in this area, this is why I'm asking — it's workflows. Nice. Yeah. Sort of durable workflows.Aarush [00:48:24]: Like back when you were in AWS. Temporal.Swyx [00:48:26]: So Apache Airflow, Temporal. You guys were both at Amazon, by the way. Yeah.
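The "durable workflows" framing above maps onto a simple pattern: checkpoint every completed step so a five-minute (or multi-day) research job can fail, retry, and resume without redoing hundreds of LLM calls. Here is a rough sketch of that idea — not Google's internal platform; `plan_step`, `browse_step`, and `write_report` in the usage comment are hypothetical.

```python
import json
import time
from pathlib import Path
from typing import Callable

class DurableRun:
    """Checkpoint each completed step to disk so a long-running research job
    can resume after a failure instead of starting over from scratch."""

    def __init__(self, run_id: str, state_dir: str = "runs"):
        self.path = Path(state_dir) / f"{run_id}.json"
        self.state = json.loads(self.path.read_text()) if self.path.exists() else {}

    def step(self, name: str, fn: Callable[[], object], retries: int = 3):
        if name in self.state:                # already completed in a previous attempt
            return self.state[name]
        for attempt in range(retries):
            try:
                result = fn()                 # result should be JSON-serializable
                self.state[name] = result
                self.path.parent.mkdir(parents=True, exist_ok=True)
                self.path.write_text(json.dumps(self.state))  # durable checkpoint
                return result
            except Exception:
                time.sleep(2 ** attempt)      # simple exponential backoff before retrying
        raise RuntimeError(f"step {name!r} failed after {retries} attempts")

# Hypothetical usage: each step wraps an LLM or tool call we never want to repeat.
# run = DurableRun("deep-research-42")
# plan = run.step("plan", lambda: plan_step("query"))
# pages = run.step("browse", lambda: browse_step(plan))
# report = run.step("write", lambda: write_report(pages))
```

This is the same shape that Temporal, Airflow, or Step Functions formalize; the difference called out above is that an agentic job decides its own next steps, so the "graph" has to be modeled at runtime rather than declared statically.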
AWS Step Functions would be one of those where you define a graph of execution, but Step Functions are more static and would not be able to accommodate deep-research-style backends as well. What's neat, though, is we built this to be quite flexible. So you can imagine, once you start doing hour-long or multi-day jobs — yeah — you have to model what the agent wants to do. Exactly. But also ensure it's stable, you know, across hundreds of LLM calls. Yeah. It's boring, but this is the thing that makes it run autonomously, you know? Right. Yeah. Anyway, I'm excited about it. Just to close up the OpenAI thing: I would say OpenAI easily beat you on marketing. And I think it's because you don't launch your benchmarks. And my question to you is, should you care about benchmarks? Should you care about Humanity's Last Exam — or not MMLU, but whatever. I think benchmarks are great. Yeah. The thing we wanted to avoid is, like, "the day Kobe Bryant entered the league, who was the president's nephew" — these weird questions. He's a big Kobe fan. Okay. Just these weird things that nobody actually asks that way. So why would we over-solve for some sort of a benchmark that doesn't necessarily represent the product experience we want to build? Nevertheless, benchmarks are great for the industry; they rally a community and help us understand where we're at. I don't know. Do you have any?Aarush [00:49:51]: No, I think you kind of hit the points. For us, our primary goal is solving the deep research user value, the user use case. The benchmarks, at least the ones that we are seeing, don't directly translate to the product. There are definitely some technical challenges that you can benchmark against, but if I do great on HLE, that doesn't really mean I'm a great deep researcher. So we want to avoid going into that rabbit hole a bit. But we also feel like, yeah, benchmarks are great, especially in the whole gen AI space with models coming out every other day and everybody claiming to be SOTA. So it's tricky. The other big challenge with benchmarks, especially when it comes to models these days, is the entropy of the output space — everything is just text. So there's a notion of verifying whether you even got the right answer, and different labs do it in different ways. But we all come back to it; we all compare numbers. So there's a lot of, you know, art slash figuring out how you verify this or how you run this on a level playing field. But yeah, trade-offs aside, there's definitely value in doing benchmarks.Swyx [00:51:05]: But at the same time, from a selfish PM perspective, benchmarks are a really great way to motivate researchers. Like, make number go up. Exactly. Or just prove you're the best. It's a really good way of rallying the researchers within your company. Like, I used to work on the MLPerf benchmarks, and you'd put a bunch of engineers in a room and in a few days they'd do amazing performance improvements on our TPU stack and things like that. Right. So having a competitive nature and a pressure really motivates people. There's one benchmark that is impossible to benchmark, but I just want to leave you with it, which is that deep research — most people are chasing this idea of discovering new ideas.
And deep research right now will summarize the web in a way that — yeah — is much more readable, but it won't discover anything new. You know, what will it take to discover new things from the things that you've searched?Can Deep Research discover new insights?Aarush [00:51:56]: First, I think the thinking-style models definitely help here, because they are significantly better at how they reason natively and at being able to draw these second-order insights, which is kind of the premise — if you can't do that, you can't think of doing what you mentioned. So that's one step. The other thing is, I think it also depends on the domain. So sometimes you can riff with a model on new hypotheses, but depending on the domain, you might not be able to verify those hypotheses. Right. So for coding and math, there are reasonably good tools that the model already knows how to interact with, and you can run a verifier, test the hypothesis, and so on — even if you think about it from a purely agent perspective, saying, hey, I have this hypothesis in this area, go figure it out and come back to me. Right. But let's say you're a chemist. Right. How are you going to do that? We don't have synthetic environments yet where the model is able to verify these hypotheses by playing in a playground, with a very accurate verifier or reward signal. Computer use is another one where, both in open source and in research, there are nice playgrounds coming up. So if you're talking about truly coming up with new things, my personal opinion is the model not only has to do the second-order thinking that we're seeing now with these new models, but also has to be able to play and test that out in an environment where you can verify and give it feedback so that it can continue iterating. Yeah.Swyx [00:53:28]: So basically like code sandboxes for now.Aarush [00:53:32]: Yeah. Yeah. So in those kinds of cases it's a little bit easier to envision this end to end, but not for all domains. Physics engines. Yeah.Alessio [00:53:42]: So if you think about agents more broadly, there's a lot of things, right, that go into it. What do you think are the most valuable pieces that people should be spending time on? Things that come to mind that I'm seeing a lot of early-stage companies work on are, like, memory; we already touched on evals; we touched a little bit on tool calling. There's kind of the auth piece — should this agent be able to access this? If yes, how do you verify that? What are things that you want more people to work on that would be helpful to you?Open challenges in agentsMukund [00:54:11]: I can take a stab at this from the lens of deep research. Right. I think some of the things that we're really interested in, in how we can push this agent, are — one, similar to memory — personalization. Right. If I'm giving you a research report, the way I would give it to you if you're a 15-year-old in high school should be totally different to the way I give it to you if you're a PhD or postdoc. Right. You can prompt it. You can prompt it. Right. But the second thing, though, is it should ideally know where you're at and everything, you know, up to that point. Right. And kind of further customize. Right. Have this understanding of where you are in your learning journey. I think modality will also be really interesting. Right now we're text in, text out. We should go multimodal in. Right.
But also multimodal out. Right. Like I would love if my reports are not just text, but like charts, maps, images, like make it super interactive and multimodal. Right. And optimized for the type of consumption. Right. So the way in which I might put together an academic paper should be totally different to the way I'm trying to do like a learning program for a kid. Right. And just the way it's structured. Ideally, like you want to do things with generative UI and things like that to really customize reports. I think those are definitely things that I'm personally interested when it comes to like a research agent. I think the other part that's super important is just like we will reach the limits of the open web and you want to be able to like a lot of the things that people care about are things that are in their own documents. Their own corpuses, things that are within subscriptions that they personally really care about. Like especially as you go more niche into specific industries. And ideally, you want ways for people to be able to complement their deep research experience with that content in order to further customize their answers.Aarush [00:55:56]: There's two answers to this. So one is I feel in terms of like the approach for us, at least for me, rather trying to figure out the core mission for like an agent building that. I feel like it's still early days for us. Like to try to platformatize or like try to build these. Oh, there are these five horizontal pieces and you can plug and play and build your own agent. My personal opinion is we are not there yet. In order to build a super engaging agent, I would if I were to start thinking of a new idea, I would I would start from the idea and try to just just do that one thing really well. Yes, at some point there will be a time where like these common pieces can be pulled out. And then. Yeah. And, you know, platformatized. I know there's a lot of work across companies and in the open source community about providing these tools to really build agents very easily. I think those are super useful to start building agents. But at some point, once those tools enable you to build the basic layers, I think me as an individual would would, you know, try to focus on really curating one experience before going super broad. Yeah.Alessio [00:57:04]: We have Bret Taylor from Sierra and he said they mostly built everything.Swyx [00:57:08]: Which is very sad for VCs.Aarush [00:57:10]: I want to find the next great framework and tooling and all that. But the space is moving so fast. Like, like the problem I described might be obsolete six months from now. And I don't know. Like, we'll fix it with one more LLM ops platform.Mukund [00:57:25]: Yes. Yes.Swyx [00:57:26]: Okay. So just just a final final point on just plugging your talk. People will be hearing this before your talk. What are you going to talk about? What are you looking forward to in New York? I would love to, like, actually learn from you guys. Like, what would you like us to do? Talk about now that we've had this conversation with you? Yeah. Yeah. What would what do you think people would find most interesting? I think a little bit of implementation and a little bit of vision, like kind of 50 50. And I think both of you can can sort of fill those roles very well. Everyone, you know, looks at you. You're very polished Google products. And I think Google always does does polish very well. But everyone will have to want to want like deep research for their industry. 
He's invested in deep research for finance. Yeah. And they focus on their their thing. And there will be deep researches for everything. Right. Like you have created a category here that OpenAI has cloned. And so, like, OK, let's let's talk about, like, what are the hard problems in this brand of agent that is probably the first real product market fit agent? I would say more so than the computer use ones. This is the one where, like, yeah, people are like easily pays for $200 worth a month worth of stuff, probably 2000 once you get it really good. So I'm like, OK, let's talk about like how to do this right from the people who did it. And then where is this going? So, yeah. Yeah. Yeah. It's very simple.Aarush [00:58:37]: Happy to talk about that.Swyx [00:58:39]: Yeah. Thank Yeah. For me as well. You know, I'm also curious to see you interact with the other speakers because then, you know, there will be other sort of agent problems. And I'm very interested in personalization. Very interested in memory. I think those are related problems. Planning, orchestration, all those things. Often security, something that we haven't talked about. There's a lot of the web that's behind off walls. Can I how do I delegate to you my credentials so that you can go and search the things that I have access to? I don't think it's that hard. You know, it's just, you know, people have to get their protocols together. And that's what conferences like that is hopefully meant to achieve. Yeah. Aarush: No, I'm super excited. I think for us, like it's we often like live and breathe within Google and which is like a really big place. But it's really nice to like take a step back. Meet people like approaching this problem at other companies or totally different industries. Right. Like inevitably, at least where we work, we're very consumer focused space. I see. Right. Yeah.Swyx: I'm more B2B. It's also really great to understand, like, OK, what's going on within the B2B space and like within different verticals. Yeah. The first thing they want to do is do research for my own docs. Right. My company docs. Yeah. So, yeah, obviously, you're going to get asked for that. Yeah. I mean, there'll be there'll be more to discuss. I'm really looking forward to your talk. And yeah. Thanks for joining us. Get full access to Latent.Space at www.latent.space/subscribe
    --------  
    1:01:58
  • Bee AI: The Wearable Ambient Agent
    Bundle tickets for AIE Summit NYC have now sold out. You can now sign up for the livestream — where we will be making a big announcement soon. NYC-based readers and Summit attendees should check out the meetups happening around the Summit.2024 was a very challenging year for AI Hardware. After the buzz of CES last January, 2024 was marked by the meteoric rise and even harder fall of AI Wearables companies like Rabbit and Humane, with an assist from a pre-wallpaper-app MKBHD. Even Friend.com, the first to launch in the AI pendant category, and which spurred Rewind AI to rebrand to Limitless and follow in their footsteps, ended up delaying their wearable ship date and launching an experimental website chatbot version. We have been cautiously excited about this category, keeping tabs on most of the top entrants, including Omi and Compass. However, to date the biggest winner still standing from the AI Wearable wars is Bee AI, founded by today's guests Maria and Ethan. Bee is an always on hardware device with beamforming microphones, 7 day battery life and a mute button, that can be worn as a wristwatch or a clip-on pin, backed by an incredible transcription, diarization and very long context memory processing pipeline that helps you to remember your day, your todos, and even perform actions by operating a virtual cloud phone. This is one of the most advanced, production ready, personal AI agents we've ever seen, so we were excited to be their first podcast appearance. We met Bee when we ran the world's first Personal AI meetup in April last year.As a user of Bee (and not an investor! just a friend!) it’s genuinely been a joy to use, and we were glad to take advantage of the opportunity to ask hard questions about the privacy and legal/ethical side of things as much as the AI and Hardware engineering side of Bee. We hope you enjoy the episode and tune in next Friday for Bee’s first conference talk: Building Perfect Memory.Full YouTube Video VersionWatch this for the live demo!Show Notes* Bee Website* Ethan Sutin, Maria de Lourdes Zollo* Bee @ Personal AI Meetup* Buy Bee with Listener Discount Code!Timestamps* 00:00:00 Introductions and overview of Bee Computer* 00:01:58 Personal context and use cases for Bee* 00:03:02 Origin story of Bee and the founders' background* 00:06:56 Evolution from app to hardware device* 00:09:54 Short-term value proposition for users* 00:12:17 Demo of Bee's functionality* 00:17:54 Hardware form factor considerations* 00:22:22 Privacy concerns and legal considerations* 00:30:57 User adoption and reactions to wearing Bee* 00:35:56 CES experience and hardware manufacturing challenges* 00:41:40 Software pipeline and inference costs* 00:53:38 Technical challenges in real-time processing* 00:57:46 Memory and personal context modeling* 01:02:45 Social aspects and agent-to-agent interactions* 01:04:34 Location sharing and personal data exchange* 01:05:11 Personality analysis capabilities* 01:06:29 Hiring and future of always-on AITranscriptAlessio [00:00:04]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of SmallAI.swyx [00:00:12]: Hey, and today we are very honored to have in the studio Maria and Ethan from Bee.Maria [00:00:16]: Hi, thank you for having us.swyx [00:00:20]: And you are, I think, the first hardware founders we've had on the podcast. I've been looking to have had a hardware founder, like a wearable hardware, like a wearable hardware founder for a while. 
I think we're going to have two or three of them this year. And you're the ones that I wear every day. So thank you for making Bee. Thank you for all the feedback and the usage. Yeah, you know, I've been a big fan. You are the speaker gift for the Engineering World's Fair. And let's start from the beginning. What is Bee Computer?Ethan [00:00:52]: Bee Computer is a personal AI system. So you can think of it as AI living alongside you in first person. So it can kind of capture your in real life. So with that understanding can help you in significant ways. You know, the obvious one is memory, but that's that's really just the base kind of use case. So recalling and reflective. I know, Swyx, that you you like the idea of journaling, but you don't but still have some some kind of reflective summary of what you experienced in real life. But it's also about just having like the whole context of a human being and understanding, you know, giving the machine the ability to understand, like, what's going on in your life. Your attitudes, your desires, specifics about your preferences, so that not only can it help you with recall, but then anything that you need it to do, it already knows, like, if you think about like somebody who you've worked with or lived with for a long time, they just know kind of without having to ask you what you would want, it's clear that like, that is the future that personal AI, like, it's just going to be very, you know, the AI is just so much more valuable with personal context.Maria [00:01:58]: I will say that one of the things that we are really passionate is really understanding this. Personal context, because we'll make the AI more useful. Think about like a best friend that know you so well. That's one of the things that we are seeing from the user. They're using from a companion standpoint or professional use cases. There are many ways to use B, but companionship and professional are the ones that we are seeing now more.swyx [00:02:22]: Yeah. It feels so dry to talk about use cases. Yeah. Yeah.Maria [00:02:26]: It's like really like investor question. Like, what kind of use case?Ethan [00:02:28]: We're just like, we've been so broken and trained. But I mean, on the base case, it's just like, don't you want your AI to know everything you've said and like everywhere you've been, like, wouldn't you want that?Maria [00:02:40]: Yeah. And don't stay there and repeat every time, like, oh, this is what I like. You already know that. And you do things for me based on that. That's I think is really cool.swyx [00:02:50]: Great. Do you want to jump into a demo? Do you have any other questions?Alessio [00:02:54]: I want to maybe just cover the origin story. Just how did you two meet? What was the was this the first idea you started working on? Was there something else before?Maria [00:03:02]: I can start. So Ethan and I, we know each other from six years now. He had a company called Squad. And before that was called Olabot and was a personal AI. Yeah, I should. So maybe you should start this one. But yeah, that's how I know Ethan. Like he was pivoting from personal AI to Squad. And there was a co-watching with friends product. I had experience working with TikTok and video content. So I had the pivoting and we launched Squad and was really successful. And at the end. The founders decided to sell that to Twitter, now X. So both of us, we joined X. We launched Twitter Spaces. We launched many other products. 
And yeah, till then, we basically continue to work together to the start of B.Ethan [00:03:46]: The interesting thing is like this isn't the first attempt at personal AI. In 2016, when I started my first company, it started out as a personal AI company. This is before Transformers, no BERT even like just RNNs. You couldn't really do any convincing dialogue at all. I met Esther, who was my previous co-founder. We both really interested in the idea of like having a machine kind of model or understand a dynamic human. We wanted to make personal AI. This was like more geared towards because we had obviously much limited tools, more geared towards like younger people. So I don't know if you remember in 2016, there was like a brief chatbot boom. It was way premature, but it was when Zuckerberg went up on F8 and yeah, M and like. Yeah. The messenger platform, people like, oh, bots are going to replace apps. It was like for about six months. And then everybody realized, man, these things are terrible and like they're not replacing apps. But it was at that time that we got excited and we're like, we tried to make this like, oh, teach the AI about you. So it was just an app that you kind of chatted with and it would ask you questions and then like give you some feedback.Maria [00:04:53]: But Hugging Face first version was launched at the same time. Yeah, we started it.Ethan [00:04:56]: We started out the same office as Hugging Face because Betaworks was our investor. So they had to think. They had a thing called Bot Camp. Betaworks is like a really cool VC because they invest in out there things. They're like way ahead of everybody else. And like back then it was they had something called Bot Camp. They took six companies and it was us and Hugging Face. And then I think the other four, I'm pretty sure, are dead. But and Hugging Face was the one that really got, you know, I mean, 30% success rate is pretty good. Yeah. But yeah, when we it was, it was like it was just the two founders. Yeah, they were kind of like an AI company in the beginning. It was a chat app for teenagers. A lot of people don't know that Hugging Face was like, hey, friend, how was school? Let's trade selfies. But then, you know, they built the Transformers library, I believe, to help them make their chat app better. And then they open sourced and it was like it blew up. And like they're like, oh, maybe this is the opportunity. And now they're Hugging Face. But anyway, like we were obsessed with it at that time. But then it was clear that there's some people who really love chatting and like answering questions. But it's like a lot of work, like just to kind of manually.Maria [00:06:00]: Yeah.Ethan [00:06:01]: Teach like all these things about you to an AI.Maria [00:06:04]: Yeah, there were some people that were super passionate, for example, teenagers. They really like, for example, to speak about themselves a lot. So they will reply to a lot of questions and speak about them. But most of the people, they don't really want to spend time.Ethan [00:06:18]: And, you know, it's hard to like really bring the value with it. We had like sentence similarity and stuff and could try and do, but it was like it was premature with the technology at the time. And so we pivoted. We went to YC and the long story, but like we pivoted to consumer video and that kind of went really viral and got a lot of usage quickly. 
And then we ended up selling it to Twitter, worked there and left before Elon, not related to Elon, but left Twitter.swyx [00:06:46]: And then I should mention this is the famous time when well, when when Elon was just came in, this was like Esther was the famous product manager who slept there.Ethan [00:06:56]: My co-founder, my former co-founder, she sleeping bag. She was the sleep where you were. Yeah, yeah, she stayed. We had left by that point.swyx [00:07:03]: She very stayed, she's famous for staying.Ethan [00:07:06]: Yeah, but later, later left or got, I think, laid off, laid off. Yeah, I think the whole product team got laid off. She was a product manager, director. But yeah, like we left before that. And then we're like, oh, my God, things are different now. You know, I think this is we really started working on again right before ChatGPT came out. But we had an app version and we kind of were trying different things around it. And then, you know, ultimately, it was clear that, like, there were some limitations we can go on, like a good question to ask any wearable company is like, why isn't this an app? Yes. Yeah. Because like.Maria [00:07:40]: Because we tried the app at the beginning.Ethan [00:07:43]: Yeah. Like the idea that it could be more of a and B comes from ambient. So like if it was more kind of just around you all the time and less about you having to go open the app and do the effort to, like, enter in data that led us down the path of hardware. Yeah. Because the sensors on this are microphones. So it's capturing and understanding audio. We started actually our first hardware with a vision component, too. And we can talk about why we're not doing that right now. But if you wanted to, like, have a continuous understanding of audio with your phone, it would monopolize your microphone. It would get interrupted by calls and you'd have to remember to turn it on. And like that little bit of friction is actually like a substantial barrier to, like, get your phone. It's like the experience of it just being with you all the time and like living alongside you. And so I think that that's like the key reason it's not an app. And in fact, we do have Apple Watch support. So anybody who has a watch, Apple Watch can use it right away without buying any hardware. Because we worked really hard to make a version for the watch that can run in the background, not super drain your battery. But even with the watch, there's still friction because you have to remember to turn it on and it still gets interrupted if somebody calls you. And you have to remember to. We send a notification, but you still have to go back and turn it on because it's just the way watchOS works.Maria [00:09:04]: One of the things that we are seeing from our Apple Watch users, like I love the Apple Watch integration. One of the things that we are seeing is that people, they start using it from Apple Watch and after a couple of days they buy the B because they just like to wear it.Ethan [00:09:17]: Yeah, we're seeing.Maria [00:09:18]: That's something that like they're learning and it's really cool. Yeah.Ethan [00:09:21]: I mean, I think like fundamentally we like to think that like a personal AI is like the mission. And it's more about like the understanding. Connecting the dots, making use of the data to provide some value. And the hardware is like the ears of the AI. It's not like integrating like the incoming sensor data. And that's really what we focus on. 
And like the hardware is, you know, if we can do it well and have a great experience on the Apple Watch like that, that's just great. I mean, but there's just some platform restrictions that like existing hardware makes it hard to provide that experience. Yeah.Alessio [00:09:54]: What do people do in like two or three days that then convinces them to buy it? They buy the product. This feels like a product where like after you use it for a while, you have enough data to start to get a lot of insights. But it sounds like maybe there's also like a short term.Maria [00:10:07]: From the Apple Watch users, I believe that because every time that you receive a call after, they need to go back to B and open it again. Or for example, every day they need to charge Apple Watch and reminds them to open the app every day. They feel like, okay, maybe this is too much work. I just want to wear the B and just keep it open and that's it. And I don't need to think about it.Ethan [00:10:27]: I think they see the kind of potential of it just from the watch. Because even if you wear it a day, like we send a summary notification at the end of the day about like just key things that happened to you in your day. And like I didn't even think like I'm not like a journaling type person or like because like, oh, I just live the day. Why do I need to like think about it? But like it's actually pretty sometimes I'm surprised how interesting it is to me just to kind of be like, oh, yeah, that and how it kind of fits together. And I think that's like just something people get immediately with the watch. But they're like, oh, I'd like an easier watch. I'd like a better way to do this.swyx [00:10:58]: It's surprising because I only know about the hardware. But I use the watch as like a backup for when I don't have the hardware. I feel like because now you're beamforming and all that, this is significantly better. Yeah, that's the other thing.Ethan [00:11:11]: We have way more control over like the Apple Watch. You're limited in like you can't set the gain. You can't change the sample rate. There's just very limited framework support for doing anything with audio. Whereas if you control it. Then you can kind of optimize it for your use case. The Apple Watch isn't meant to be kind of recording this. And we can talk when we get to the part about audio, why it's so hard. This is like audio on the hardest level because you don't know it has to work in all environments or you try and make it work as best as it can. Like this environment is very great. We're in a studio. But, you know, afterwards at dinner in a restaurant, it's totally different audio environment. And there's a lot of challenges with that. And having really good source audio helps. But then there's a lot more. But with the machine learning that still is, you know, has to be done to try and account because like you can tune something for one environment or another. But it'll make one good and one bad. And like making something that's flexible enough is really challenging.Alessio [00:12:10]: Do we want to do a demo just to set the stage? And then we kind of talk about.Maria [00:12:14]: Yeah, I think we can go like a walkthrough and the prod.Alessio [00:12:17]: Yeah, sure.swyx [00:12:17]: So I think we said I should. So for listeners, we'll be switching to video. That was superimposed on. And to this video, if you want to see it, go to our YouTube, like and subscribe as always. Yeah.Maria [00:12:31]: And by the bee. Yes.swyx [00:12:33]: And by the bee. 
While you wait. While you wait. Exactly. It doesn't take long.Maria [00:12:39]: Maybe you should have a discount code just for the listeners. Sure.swyx [00:12:43]: If you want to offer it, I'll take it. All right. Yeah. Well, discount code Swyx. Oh s**t. Okay. Yeah. There you go.Ethan [00:12:49]: An important thing to mention also is that the hardware is meant to work with the phone. And like, I think, you know, if you, if you look at rabbit or, or humane, they're trying to create like a new hardware platform. We think that the phone's just so dominant and it will be until we have the next generation, which is not going to be for five, you know, maybe some Orion type glasses that are cheap enough and like light enough. Like that's going to take a long time before with the phone rather than trying to just like replace it. So in the app, we have a summary of your days, but at the top, it's kind of what's going on now. And that's updating your phone. It's updating continuously. So right now it's saying, I'm discussing, you know, the development of, you know, personal AI, and that's just kind of the ongoing conversation. And then we give you a readable form. That's like little kind of segments of what's the important parts of the conversations. We do speaker identification, which is really important because you don't want your personal AI thinking you said something and attributing it to you when it was just somebody else in the conversation. So you can also teach it other people's voices. So like if some, you know, somebody close to you, so it can start to understand your relationships a little better. And then we do conversation end pointing, which is kind of like a task that didn't even exist before, like, cause nobody needed to do this. But like if you had somebody's whole day, how do you like break it into logical pieces? And so we use like not just voice activity, but other signals to try and split up because conversations are a little fuzzy. They can like lead into one, can start to the next. So also like the semantic content of it. When a conversation ends, we run it through larger models to try and get a better, you know, sense of the actual, what was said and then summarize it, provide key points. What was the general atmosphere and tone of the conversation and potential action items that might've come of that. But then at the end of the day, we give you like a summary of all your day and where you were and just kind of like a step-by-step walkthrough of what happened and what were the key points. That's kind of just like the base capture layer. So like if you just want to get a kind of glimpse or recall or reflect that's there. But really the key is like all of this is now like being influenced on to generate personal context about you. So we generate key items known to be true about you and that you can, you know, there's a human in the loop aspect is like you can, you have visibility. Right. Into that. And you can, you know, I have a lot of facts about technology because that's basically what I talk about all the time. Right. But I do have some hobbies that show up and then like, how do you put use to this context? So I kind of like measure my day now and just like, what is my token output of the day? You know, like, like as a human, how much information do I produce? And it's kind of measured in tokens and it turns out it's like around 200,000 or so a day. But so in the recall case, we have, um. A chat interface, but the key here is on the recall of it. 
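Stepping back to the capture pipeline for a second: the conversation end-pointing Ethan describes — splitting a continuous day into discrete conversations using more than just silence — can be approximated by combining a time-gap rule with a semantic-shift check over transcript segments. Below is a toy sketch under those assumptions; the `embed` helper and both thresholds are made up for illustration, and this is not Bee's actual pipeline.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical sentence-embedding call (any off-the-shelf encoder would do)."""
    raise NotImplementedError

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def endpoint_conversations(segments, gap_s=120.0, sim_threshold=0.45):
    """segments: time-ordered dicts like {'start': s, 'end': s, 'text': str}.
    Returns a list of conversations (each a list of segments). A boundary is
    declared on a long silence OR a sharp topic change between segments."""
    conversations, current = [], []
    prev_vec, prev_end = None, None
    for seg in segments:
        vec = embed(seg["text"])
        long_gap = prev_end is not None and seg["start"] - prev_end > gap_s
        topic_shift = prev_vec is not None and cosine(vec, prev_vec) < sim_threshold
        if current and (long_gap or topic_shift):
            conversations.append(current)   # close the previous conversation
            current = []
        current.append(seg)
        prev_vec, prev_end = vec, seg["end"]
    if current:
        conversations.append(current)
    return conversations
```

Real conversations are fuzzier than this, as Ethan notes, which is why voice activity alone isn't enough and semantic signals get pulled in.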
Like, you know, how do you, you know, I probably have 50 million tokens of personal context and like how to make sense of that, make it useful. So I can ask simple, like, uh, recall questions, like details about the trip I was on to Taiwan, where recently we're with our manufacturer and, um, in real time, like it will, you know, it has various capabilities such as searching through your, your memories, but then also being able to search the web or look at my calendar, we have integrations with Gmail and calendars. So like connecting the dots between the in real life and the digital life. And, you know, I just asked it about my Taiwan trip and it kind of gives me the, the breakdown of the details, what happened, the issues we had around, you know, certain manufacturing problems and it, and it goes back and references the conversation so I can, I can go back to the source. Yeah.Maria [00:16:46]: Not just the conversation as well, the integrations. So we have as well Gmail and Google calendar. So if there is something there that was useful to have more context, we can see that.Ethan [00:16:56]: So like, and it can, I never use the word agentic cause it's, it's cringe, but like it can search through, you know, if I, if I'm brainstorming about something that spans across, like search through my conversation, search the email, look at the calendar and then depending on what's needed. Then synthesize, you know, something with all that context.Maria [00:17:18]: I love that you did the Spotify wrapped. That was pretty cool. Yeah.Ethan [00:17:22]: Like one thing I did was just like make a Spotify wrap for my 2024, like of my life. You can do that. Yeah, you can.Maria [00:17:28]: Wait. Yeah. I like those crazy.Ethan [00:17:31]: Make a Spotify wrapped for my life in 2024. Yeah. So it's like surprisingly good. Um, it like kind of like game metrics. So it was like you visited three countries, you shipped, you know, XMini, beta. Devices.Maria [00:17:46]: And that's kind of more personal insights and reflection points. Yeah.swyx [00:17:51]: That's fascinating. So that's the demo.Ethan [00:17:54]: Well, we have, we can show something that's in beta. I don't know if we want to do it. I don't know.Maria [00:17:58]: We want to show something. Do it.Ethan [00:18:00]: And then we can kind of fit. Yeah.Maria [00:18:01]: Yeah.Ethan [00:18:02]: So like the, the, the, the vision is also like, not just about like AI being with you in like just passively understanding you through living your experience, but also then like it proactively suggesting things to you. Yeah. Like at the appropriate time. So like not just pool, but, but kind of, it can step in and suggest things to you. So, you know, one integration we have that, uh, is in beta is with WhatsApp. Maria is asking for a recommendation for an Italian restaurant. Would you like me to look up some highly rated Italian restaurants nearby and send her a suggestion?Maria [00:18:34]: So what I did, I just sent to Ethan a message through WhatsApp in his own personal phone. Yeah.Ethan [00:18:41]: So, so basically. B is like watching all my incoming notifications. And if it meets two criteria, like, is it important enough for me to raise a suggestion to the user? And then is there something I could potentially help with? So this is where the actions come into place. So because Maria is my co-founder and because it was like a restaurant recommendation, something that it could probably help with, it proposed that to me. 
And then I can respond, either through the chat, or we have another kind of push-to-talk, walkie-talkie-style button. It's actually a multi-purpose button to toggle it on or off, but also, if you push and hold, you can talk. So I can say, yes, uh, find one and send it to her on WhatsApp. That runs on, uh, an Android cloud phone. So it's going to be able to, you know — that has access to all my accounts. So we're going to abstract this away, and the execution environment is not really important, but we can go into, technically, why Android is actually a pretty good one right now. But, you know, it's searching for Italian restaurants, and we don't have to watch this. I could have my AirPods in and the phone in my pocket, you know — it's going to go to WhatsApp, find Maria's thread, send her the response, and then let us know. Oh my God.Alessio [00:19:56]: But what's the, I mean, an Italian restaurant. Yeah. What did it choose? What did it choose? It's easy to say. Real Italian is hard to play. Exactly.Ethan [00:20:04]: It's easy to say. So I doubt it. I don't know.swyx [00:20:06]: For the record, since you have the Italians here, uh, best Italian restaurant in SF?Maria [00:20:09]: Oh my God. I still don't have one. What? No.Ethan [00:20:14]: I don't know. Successfully found and shared.Alessio [00:20:16]: Let's see. Let's see what the AI says. Bottega. Bottega? I think it's Bottega.Maria [00:20:21]: Have you been to Bottega? How is it?Alessio [00:20:24]: It's fine.Maria [00:20:25]: I've been to one called like Norcina, I think it was good.Alessio [00:20:29]: Bottega is on Valencia Street. It's fine. The pizza is not good.Maria [00:20:32]: It's not good.Alessio [00:20:33]: Some of the pastas are good.Maria [00:20:34]: You know, the people — I'm sorry to interrupt. Sorry. But there is like this Delfina. Yeah. That here everybody's like, oh, Pizzeria Delfina is amazing. It's overrated. It's not. I don't know. That's great. That's great.swyx [00:20:46]: The North Beach Cafe. That place you took us with Michele last time. Vega. Oh.Alessio [00:20:52]: The guy at Vega, Giuseppe, he's Italian. Which one is that? It's in Bernal Heights. Ugh. He's nice. He's not nice. I don't know that one. What's the name of the place? Vega. Vega. Vega. Cool. We got the name. Vega. But it's not Vega.Maria [00:21:02]: It's Italian. What?swyx [00:21:10]: Vega. Vega.Ethan [00:21:40]: We're going to see a lot of innovation around hardware and stuff, but I think the real core is being able to do something useful with the personal context. You always had the ability to capture everything, right? We've always had recorders, camcorders, body cameras, stuff like that. But what's different now is we can actually make sense and find the important parts in all of that context.swyx [00:22:04]: Yeah. So, and then one last thing, I'm just doing this for you, is you also have an API, which I think I'm the first developer against. Because I had to build my own. We need to hire a developer advocate. Or just hire AI engineers. The point is that you should be able to program your own assistant. And I tried Omi, the former Friend, the knockoff Friend, and then the real Friend doesn't have an API. And then Limitless also doesn't have an API. So I think it's very important to own your data. To be able to reprocess your audio, maybe. Although, by default, you do not store audio.
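The proactive WhatsApp suggestion in that demo rests on a simple two-part gate over incoming notifications, as Ethan describes earlier: is this important enough to surface, and is it something the agent could plausibly help with? Here is a hedged sketch of that gating logic — `llm_yes_no` is a hypothetical helper, and the prompts and criteria are ours, not Bee's.

```python
from dataclasses import dataclass

@dataclass
class Notification:
    app: str      # e.g. "WhatsApp"
    sender: str   # e.g. "Maria"
    text: str     # e.g. "Any good Italian place nearby?"

def llm_yes_no(question: str) -> bool:
    """Hypothetical call that asks a model a yes/no question and parses the answer."""
    raise NotImplementedError

def should_suggest(n: Notification, personal_context: str) -> bool:
    """Only interrupt the wearer when both gates pass."""
    important = llm_yes_no(
        f"Given this context about the user:\n{personal_context}\n"
        f"Is this {n.app} message from {n.sender} important enough to surface now? "
        f"Message: {n.text}"
    )
    actionable = llm_yes_no(
        f"Could an assistant plausibly help with this message "
        f"(e.g. look something up, draft a reply)? Message: {n.text}"
    )
    return important and actionable
```

In practice, as the discussion below suggests, the hard part is the reasoning quality behind those two judgments, not the gate itself: too eager and it interrupts constantly, too conservative and it misses the moments where it could actually help.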
And then also just to do any corrections. There's no way that my needs can be fully met by you. So I think the API is very important.Ethan [00:22:47]: Yeah. And I mean, I've always been a consumer of APIs in all my products.swyx [00:22:53]: We are API enjoyers in this house.Ethan [00:22:55]: Yeah. It's very frustrating when you have to go build a scraper. But yeah, it's for sure. Yeah.swyx [00:23:03]: So this whole combination of you have my location, my calendar, my inbox. It really is, for me, the sort of personal API.Alessio [00:23:10]: And is the API just to write into it or to have it take action on external systems?Ethan [00:23:16]: Yeah, we're expanding it. It's right now read-only. In the future, very soon, when the actions are more generally available, it'll be fully supported in the API.Alessio [00:23:27]: Nice. I'll buy one after the episode.Ethan [00:23:30]: The API thing, to me, is the most interesting. Yeah. We do have real-time APIs, so you can even connect a socket and connect it to whatever you want it to take actions with. Yeah. It's too smart for me.Alessio [00:23:43]: Yeah. I think when I look at these apps, and I mean, there's so many of these products, we launch, it's great that I can go on this app and do things. But most of my work and personal life is managed somewhere else. Yeah. So being able to plug into it. Integrate that. It's nice. I have a bunch of more, maybe, human questions. Sure. I think maybe people might have. One, is it good to have instant replay for any argument that you have? I can imagine arguing with my wife about something. And, you know, there's these commercials now where it's basically like two people arguing, and they're like, they can throw a flag, like in football, and have an instant replay of the conversation. I feel like this is similar, where it's almost like people cannot really argue anymore or, like, lie to each other. Because in a world in which everybody adopts this, I don't know if you thought about it. And also, like, how the lies. You know, all of us tell lies, right? How do you distinguish between when I'm, there's going to be sometimes things that contradict each other, because I might say something publicly, and I might think something, really, that I tell someone else. How do you handle that when you think about building a product like this?Maria [00:24:48]: I would say that I like the fact that B is an objective point of view. So I don't care too much about the lies, but I care more about the fact that can help me to understand what happened. Mm-hmm. And the emotions in a really objective way, like, really, like, critical and objective way. And if you think about humans, they have so many emotions. And sometimes something that happened to me, like, I don't know, I would feel, like, really upset about it or really angry or really emotional. But the AI doesn't have those emotions. It can read the conversation, understand what happened, and be objective. And I think the level of support is the one that I really like more. Instead of, like, oh, did this guy tell me a lie? I feel like that's not exactly, like, what I feel. I find it curious for me in terms of opportunity.Alessio [00:25:35]: Is the B going to interject in real time? Say I'm arguing with somebody. The B is like, hey, look, no, you're wrong. What? That person actually said.Ethan [00:25:43]: The proactivity is something we're very interested in. 
Maybe not for, like, specifically for, like, selling arguments, but more for, like, and I think that a lot of the challenge here is, you know, you need really good reasoning to kind of pull that off. Because you don't want it just constantly interjecting, because that would be super annoying. And you don't want it to miss things that it should be interjecting. So, like, it would be kind of a hard task even for a human to be, like, just come in at the right times when it's appropriate. Like, it would take the, you know, with the personal context, it's going to be a lot better. Because, like, if somebody knows about you, but even still, it requires really good reasoning to, like, not be too much or too little and just right.Maria [00:26:20]: And the second part about, well, like, some things, you know, you say something to somebody else, but after I change my mind, I send something. Like, it's every time I have, like, different type of conversation. And I'm like, oh, I want to know more about you. And I'm like, oh, I want to know more about you. I think that's something that I found really fascinating. One of the things that we are learning is that, indeed, humans, they evolve over time. So, for us, one of the challenges is actually understand, like, is this a real fact? Right. And so far, what we do is we give, you know, to the, we have the human in the loop that can say, like, yes, this is true, this is not. Or they can edit their own fact. For sure, in the future, we want to have all of that automatized inside of the product.Ethan [00:26:57]: But, I mean, I think your question kind of hits on, and I know that we'll talk about privacy, but also just, like, if you have some memory and you want to confirm it with somebody else, that's one thing. But it's for sure going to be true that in the future, like, not even that far into the future, that it's just going to be kind of normalized. And we're kind of in a transitional period now. And I think it's, like, one of the key things that is for us to kind of navigate that and make sure we're, like, thinking of all the consequences. And how to, you know, make the right choices in the way that everything's designed. And so, like, it's more beneficial than it could be harmful. But it's just too valuable for your AI to understand you. And so if it's, like, MetaRay bands or the Google Astra, I think it's just people are going to be more used to it. So people's behaviors and expectations will change. Whether that's, like, you know, something that is going to happen now or in five years, it's probably in that range. And so, like, I think we... We kind of adapt to new technologies all the time. Like, when the Ring cameras came out, that was kind of quite controversial. It's like... But now it's kind of... People just understand that a lot of people have cameras on their doors. And so I think that...Maria [00:28:09]: Yeah, we're in a transitional period for sure.swyx [00:28:12]: I will press on the privacy thing because that is the number one thing that everyone talks about. Obviously, I think in Silicon Valley, people are a little bit more tech-forward, experimental, whatever. But you want to go mainstream. You want to sell to consumers. And we have to worry about this stuff. Baseline question. The hardest version of this is law. There are one-party consent states where this is perfectly legal. Then there are two-party consent states where they're not. 
What have you come around to this on?Ethan [00:28:38]: Yeah, so the EU is a totally different regulatory environment. But in the U.S., it's basically on a state-by-state level. Like, in Nevada, it's single-party. In California, it's two-party. But it's kind of untested. You know, it's different laws, whether it's a phone call, whether it's in person. In a state like California, it's two-party. Like, anytime you're in public, there's no consent comes into play because the expectation of privacy is that you're in public. But we process the audio and nothing is persisted. And then it's summarized with the speaker identification focusing on the user. Now, it's kind of untested on a legal, and I'm not a lawyer, but does that constitute the same as, like, a recording? So, you know, it's kind of a gray area and untested in law right now. I think that the bigger question is, you know, because, like, if you had your Ray-Ban on and were recording, then you have a video of something that happened. And that's different than kind of having, like, an AI give you a summary that's focused on you that's not really capturing anybody's voice. You know, I think the bigger question is, regardless of the legal status, like, what is the ethical kind of situation with that? Because even in Nevada that we're—or many other U.S. states where you can record. Everything. And you don't have to have consent. Is it still, like, the right thing to do? The way we think about it is, is that, you know, we take a lot of precautions to kind of not capture personal information of people around. Both through the speaker identification, through the pipeline, and then the prompts, and the way we store the information to be kind of really focused on the user. Now, we know that's not going to, like, satisfy a lot of people. But I think if you do try it and wear it again. It's very hard for me to see anything, like, if somebody was wearing a bee around me that I would ever object that it captured about me as, like, a third party to it. And like I said, like, we're in this transitional period where the expectation will just be more normalized. That it's, like, an AI. It's not capturing, you know, a full audio recording of what you said. And it's—everything is fully geared towards helping the person kind of understand their state and providing valuable information to them. Not about, like, logging details about people they encounter.Alessio [00:30:57]: You know, I've had the same question also with the Zoom meeting transcribers thing. I think there's kind of, like, the personal impact that there's a Firefly's AI recorder. Yeah. I just know that it's being recorded. It's not like a—I don't know if I'm going to say anything different. But, like, intrinsically, you kind of feel—because it's not pervasive. And I'm curious, especially, like, in your investor meetings. Do people feel differently? Like, have you had people ask you to, like, turn it off? Like, in a business meeting, to not record? I'm curious if you've run into any of these behaviors.Maria [00:31:29]: You know what's funny? On my end, I wear it all the time. I take my coffee, a blue bottle with it. Or I work with it. Like, obviously, I work on it. So, I wear it all the time. And so far, I don't think anybody asked me to turn it off. I'm not sure if because they were really friendly with me that they know that I'm working on it. But nobody really cared.swyx [00:31:48]: It's because you live in SF.Maria [00:31:49]: Actually, I've been in Italy as well. Uh-huh. 
And in Italy, it's a super privacy concern. Like, Europe is a super privacy concern. And again, they're nothing. Like, it's—I don't know. Yeah. That, for me, was interesting.Ethan [00:32:01]: I think—yeah, nobody's ever asked me to turn it off, even after giving them full demos and disclosing. I think that some people have said, well, my—you know, in a personal relationship, my partner initially was, like, kind of uncomfortable about it. We heard that from a few users. And that was, like, more in just, like— It's not like a personal relationship situation. And the other big one is people are like, I do like it, but I cannot wear this at work. I guess. Yeah. Yeah. Because, like, I think I will get in trouble based on policies or, like, you know, if you're wearing it inside a research lab or something where you're working on things that are kind of sensitive that, like—you know, so we're adding certain features like geofencing, just, like, at this location. It's just never active.swyx [00:32:50]: I mean, I've often actually explained to it the other way, where maybe you only want it at work, so you never take it from work. And it's just a work device, just like your Zoom meeting recorder is a work device.Ethan [00:33:09]: Yeah, professionals have been a big early adopter segment. And you say in San Francisco, but we have out there our daily shipment of over 100. If you go look at the addresses, Texas, I think, is our biggest state, and Florida, just the biggest states. A lot of professionals who talk for, and we didn't go out to build it for that use case, but I think there is a lot of demand for white-collar people who talk for a living. And I think we're just starting to talk with them. I think they just want to be able to improve their performance around, understand what they were doing.Alessio [00:33:47]: How do you think about Gong.io? Some of these, for example, sales training thing, where you put on a sales call and then it coaches you. They're more verticalized versus having more horizontal platform.Ethan [00:33:58]: I am not super familiar with those things, because like I said, it was kind of a surprise to us. But I think that those are interesting. I've seen there's a bunch of them now, right? Yeah. It kind of makes sense. I'm terrible at sales, so I could probably use one. But it's not my job, fundamentally. But yeah, I think maybe it's, you know, we heard also people with restaurants, if they're able to understand, if they're doing well.Maria [00:34:26]: Yeah, but in general, I think a lot of people, they like to have the double check of, did I do this well? Or can you suggest me how I can do better? We had a user that was saying to us that he used for interviews. Yeah, he used job interviews. So he used B and after asked to the B, oh, actually, how do you think my interview went? What I should do better? And I like that. And like, oh, that's actually like a personal coach in a way.Alessio [00:34:50]: Yeah. But I guess the question is like, do you want to build all of those use cases? Or do you see B as more like a platform where somebody is going to build like, you know, the sales coach that connects to B so that you're kind of the data feed into it?Ethan [00:35:02]: I don't think this is like a data feed, more like an understanding kind of engine and like definitely. In the future, having third parties to the API and building out for all the different use cases is something that we want to do. 
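Going back to the geofencing feature Ethan mentions above ("at this location, it's just never active"): at its simplest, that is a distance check against a list of opt-out zones before any capture starts. A minimal sketch follows — the haversine math is standard, but the zone coordinates and radii are purely illustrative, not Bee's.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in meters."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 6_371_000 * 2 * asin(sqrt(a))

# Illustrative opt-out zones: (latitude, longitude, radius in meters)
MUTED_ZONES = [
    (37.4220, -122.0841, 300.0),   # e.g. "work campus"
    (37.7793, -122.4193, 150.0),   # e.g. "research lab"
]

def capture_allowed(lat: float, lon: float) -> bool:
    """Return False if the wearer is inside any muted geofence."""
    return all(haversine_m(lat, lon, zlat, zlon) > radius
               for zlat, zlon, radius in MUTED_ZONES)

# Example: check before starting the microphone pipeline.
# if capture_allowed(current_lat, current_lon):
#     start_capture()
```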
But the, like, initial case we're trying to do is, like, build that layer for all that to work. And, you know, we're not trying to build all those verticals because no startup could do that well. But I think that it's really been quite fascinating to see. Like, you know, I've done consumer for a long time. Consumer is very hard to predict, like, what's going to be the thing that's the killer feature. And so, I mean, we really believe that it's the future, but we don't know exactly what process it will take to really gain mass adoption.swyx [00:35:50]: The killer consumer feature is whatever Nikita Bier does. Yeah. Social app for teens.Ethan [00:35:56]: Yeah, well, I like Nikita, but, you know, he's good at building bootstrapped companies and getting them very viral. And then selling them and then they shut down.swyx [00:36:05]: Okay, so you just came back from CES.Maria [00:36:07]: Yeah, crazy. Yeah, tell us. It was my first time in Vegas and first time at CES, and both of them were overwhelming.swyx [00:36:15]: First of all, did you feel like you had to do it because you're in consumer hardware?Maria [00:36:19]: Then we decided to be there and to have a lot of partner and media meetings, but we didn't have our own booth. So we decided to just keep it like that. But we decided to be there and have a presence, even just us, and speak with people. It's very hard to stand out. Yeah, I think, you know, it depends what type of booth you have. I think if you can prepare, like, a really cool booth.Ethan [00:36:41]: Have you been to CES?Maria [00:36:42]: I think it can be pretty cool.Ethan [00:36:43]: It's massive. It's huge. It's like 80,000, 90,000 people across the Venetian and the convention center. And it's, to me, I always wanted to go just like...Maria [00:36:53]: Yeah, you were the one who was like...swyx [00:36:55]: I thought it was your idea.Ethan [00:36:57]: I always wanted to go just as a, like, just as a fan of...Maria [00:37:01]: Yeah, you wanted to go anyways.Ethan [00:37:02]: Because, like, growing up, I think CES kind of peaked for a while and it was like, oh, I want to go. That's where all the cool, like... gadgets, everything. Yeah, now it's like smart everything and, like, you know, vacuums that pick up socks. Exactly.Maria [00:37:13]: There are a lot of cool vacuums. Oh, they love it.swyx [00:37:15]: They love the Roombas, the ones that pick up socks.Maria [00:37:16]: And pet tech. Yeah, yeah. And dog stuff.swyx [00:37:20]: Yeah, there's a lot of, like, robot stuff. New TVs, new cars that never ship. Yeah. Yeah. I'm thinking, like, this time last year was when Rabbit and Humane launched at CES, and Rabbit kind of won CES. And now this year, no wearables except for you guys.Ethan [00:37:32]: It's funny because obviously it's AI everything. Yeah. Like, every single product. Yeah.Maria [00:37:37]: Toothbrushes with AI, vacuums with AI. Yeah. Yeah.Ethan [00:37:41]: We saw, like, a hair blower, literally a hairdryer with AI.Maria [00:37:45]: Yeah, that was cool.Ethan [00:37:46]: But I think that, like, yeah, another kind of difference around ours is we didn't want to do a big, overhyped, overpromised kind of Rabbit launch. Because, I mean, hats off to them on the presentation and everything, obviously. But, like, you know, we want to let the product kind of speak for itself and get it out there. And I think we were really happy. We got some very good interest from media and some of the partners there.
So like it was, I think it was definitely worth going. I would say like if you're in hardware, it's just kind of how you make use of it. Like I think to do it like a big Rabbit style or to have a huge show on there, like you need to plan that six months in advance. And it's very expensive. But like if you, you know, go there, there's everybody's there. All the media is there. There's a lot of some pre-show events that it's just great to talk to people. And the industry also, all the manufacturers, suppliers are there. So we learned about some really cool stuff that we might like. We met with somebody. They have like thermal energy capture. And it's like, oh, could you maybe not need to charge it? Because they have like a thermal that can capture your body heat. And what? Yeah, they're here. They're actually here. And in Palo Alto, they have like a Fitbit thing that you don't have to charge.swyx [00:39:01]: Like on paper, that's the power you can get from that. What's the power draw for this thing?Ethan [00:39:05]: It's more than you could get from the body heat, it turns out. But it's quite small. I don't want to disclose technically. But I think that solar is still, they also have one where it's like this thing could be like the face of it. It's just a solar cell. And like that is more realistic. Or kinetic. Kinetic, apparently, I'm not an expert in this, but they seem to think it wouldn't be enough. Kinetic is quite small, I guess, on the capture.swyx [00:39:33]: Well, I mean, watch. Watchmakers have been powering with kinetic for a long time. Yeah. We don't have to talk about that. I just want to get a sense of CES. Would you do it again? I definitely would not. Okay. You're just a fan of CES. Business point of view doesn't make sense. I happen to be in the conference business, right? So I'm kind of just curious. Yeah.Maria [00:39:49]: So I would say as we did, so without the booth and really like straightforward conversations that were already planned. Three days. That's okay. I think it was okay. Okay. But if you need to invest for a booth that is not. Okay. A good one. Which is how much? I think.Ethan [00:40:06]: 10 by 10 is 5,000. But on top of that, you need to. And then they go like 10 by 10 is like super small. Yeah. And like some companies have, I think would probably be more in like the six figure range to get. And I mean, I think that, yeah, it's very noisy. We heard this, that it's very, very noisy. Like obviously if you're, everything is being launched there and like everything from cars to cell phones are being launched. Yeah. So it's hard to stand out. But like, I think going in with a plan of who you want to talk to, I feel like.Maria [00:40:36]: That was worth it.Ethan [00:40:37]: Worth it. We had a lot of really positive media coverage from it and we got the word out and like, so I think we accomplished what we wanted to do.swyx [00:40:46]: I mean, there's some world in which my conference is kind of the CES of whatever AI becomes. Yeah. I think that.Maria [00:40:52]: Don't do it in Vegas. Don't do it in Vegas. Yeah. Don't do it in Vegas. That's the only thing. I didn't really like Vegas. That's great. Amazing. Those are my favorite ones.Alessio [00:41:02]: You can not fit 90,000 people in SF. That's really duh.Ethan [00:41:05]: You need to do like multiple locations so you can do Moscone and then have one in.swyx [00:41:09]: I mean, that's what Salesforce conferences. Well, GDC is how many? That might be 50,000, right? Okay. Form factor, right? 
Like, my way to introduce this idea was that I was at the launch at Solaris, what was the old name of it? Newton. Newton. Of Tab, when Avi first launched it. He was like, I thought through everything. Every form factor, pendant is the thing. And then we got the pendants for this originally. The first one was just a pendant, and I took it off and I forgot to put it back on. So you went through pendant, pin, bracelet now, and maybe there are sort of earphones in the future, but what were your iterations?Maria [00:41:49]: So we had, I believe, now three or four iterations. And one of the things that we learned is indeed that people don't like the pendant. In particular, women: you don't want to have, like, anything here on the chest, because maybe you have, like, another necklace or any other stuff.Ethan [00:42:03]: You just ship a premium one that's gold. Yeah. We're talking... some fashion brands reached out to us.Maria [00:42:11]: Some big fashion. There is something there.swyx [00:42:13]: This is where it helps to have an Italian on the team.Maria [00:42:15]: There is, like, some big Italian luxury. I can't say anything. So yeah, the bracelet actually came from the community, because they were like, oh, I don't want to wear anything like a necklace or a pendant. And also, like, the one that we had, I don't know if you remember, it was, like, a circle, like it was like this, and it was really bulky. Like, people didn't like it. And also, I mean, I actually don't dislike it; we were running fast when we did that. Like, our thing was, we wanted to ship them as soon as possible. So we were not overthinking the form factor or the material. We just wanted to be out. But after, the community organically, basically all of them were like, well, why don't you just do the bracelet? Like, it's way better. I will just wear it. And that's it. So that's how we ended up with the bracelet, but it's still modular. So I still want to play around with the fact that it is modular, and you can, you know, take it off and wear it as a clip, or in the future, maybe we will bring back the pendant. But I like the fact that there is some personalization, and right now we have two colors, yellow and black. Soon we will have other ones. So yeah, we can play a lot around that.Ethan [00:43:25]: I think the form factor, like, the goal is for it to be not super invasive, right? And something that's easy. So I think in the future, smaller, thinner, not like an Apple-type obsession with thinness, but it does matter, like, the size and weight. And we would love to have more context, because that will help, but to make it work, I think it really needs to have good power consumption, good battery life. And, you know, like with the Humane and swapping the batteries. I have one, and I think some of the engineering they did is pretty incredible, but it wasn't really geared towards solving the problem. It was just: it's too heavy. The swappable batteries are too much to manage, the heat, the thermals are too much, the light interface thing... Yeah. Like, that's cool. It's cool. It's cool. But it's like, if you have your hand out here, you want to use your phone, it's not really solving a problem, because you know how to use your phone. It's got a brilliant display. You have to kind of learn how to gesture with this low-range... Yeah.
It's like a low-resolution laser, but it is cool that they got it working in that thing, even though it did overheat. But it's too heavy, too cumbersome, too complicated with the multiple batteries. So something that's power efficient, kind of thin, both in the physical sense and also in the edge-compute kind of way, so that it can be as unobtrusive as possible. Yeah.Maria [00:44:47]: Users really like... I like when they say, yes, I like to wear it and forget about it, because I don't need to charge it every single day. On the other version, I believe we had like 35 hours or something, which was okay. But people, they just prefer the seven-day battery life and-swyx [00:45:03]: Oh, this is seven days? Yeah. Oh, I've been charging every three days.Maria [00:45:07]: Oh, no, you can keep it, like, yeah, it's almost seven days.swyx [00:45:11]: The other thing that occurs to me, maybe there's an Apple Watch strap so that I don't have to wear two watches. Yeah.Maria [00:45:17]: That's the other one that, yeah, I thought about it. I saw as well the ones that, like, you can put on the back of the phone. Like, you know- Plaud. There are a lot.swyx [00:45:27]: So yeah, there's a competitor called Plaud. Yeah. It's not really a competitor. They only transcribe, right? Yeah, they only transcribe. But they're very good at it. Yeah.Ethan [00:45:33]: No, they're great. Their hardware is really good too.swyx [00:45:36]: And they just launched the pin too. Yeah.Ethan [00:45:38]: I think that the MagSafe kind of form factor has a lot of advantages, but some disadvantages. You can definitely put a very huge battery on that, you know? And so the battery life's not, the power consumption's not so much of a concern, but, you know, the downside is the phone's in your pocket. And so I think that, you know, form factors will continue to evolve, but, and you know, more sensors, less obtrusive and-Maria [00:46:02]: Yeah. We have a new version.Ethan [00:46:04]: Easier to use.Maria [00:46:05]: Okay.swyx [00:46:05]: Looking forward to that. Yeah. I mean, whenever we launch this, we'll try to show whatever, but I'm sure you're going to keep iterating. Last thing on hardware, and then we'll go on to the software side, because I think that's where you guys are also really, really strong. Vision. You wanted to talk about why no vision? Yeah.Ethan [00:46:20]: I think it comes down to, like, when you're a startup, especially in hardware, you just work within the constraints, right? And so vision is super useful and super interesting. But there are two issues with vision that made it not the place we decided to start. One is power consumption. So, you know, you kind of have to trade off your power budget: capturing even at a low frame rate and transmitting over the radio is actually the thing that takes up the majority of the power. So, yeah, you would really have to have quite a, like, unacceptably large and heavy battery to do it continuously all day. We have, I think, novel kind of alternative ways that might allow us to do that. And we have some prototypes. The other issue is form factor. So even with a wide field of view, if you're wearing something on your chest... you know, obviously the wrist is not really that much of an option, and if you're wearing it on your chest, it's often going to miss things. You're probably not going to be capturing the field of view of what's interesting to you.
So that leaves you kind of with your head and face. And then anything that goes on the face has to look cool. Like, I don't know if you remember the Spectacles, it was kind of like the first one, yeah, but they were not very successful. And I think one of the reasons is they were so weird looking. Yeah. The camera was so big on the side. And if you look at the Meta Ray-Bans, where they're way more successful, they look almost indistinguishable from regular Ray-Bans. And they invested a lot into that, and they have a partnership with Qualcomm to develop custom silicon. They have a stake in Luxottica now. So they're coming at it from all the angles to make glasses. I think, like, you know, I don't know if you know Brilliant Labs, they're a cool company, they make Frame, which is kind of like cool hackable glasses, and they're really good on hardware. But even if you look at Frame, which I would say is like the most advanced kind of startup. Yeah. Yeah. There was one that launched at CES, but it's not shipping yet. Like, of the ones you can buy now, it's still not something you'd wear every day, and the battery life is super short. So I think just the challenge of doing vision right, like, off the bat, would require quite a bit more resources. And so audio is such a good entry point, and it's also the privacy around audio. If you had images, that's like another huge challenge to overcome. So I think that, ideally, the personal AI would have, you know, all the senses, and, you know, we'll get there. Yeah. Okay.swyx [00:48:57]: One last hardware thing. I have to ask this because then we'll move to the software. Were either of you electrical engineering?Ethan [00:49:04]: No, I'm CS. And so I have, I've taken some EE courses, but prior to working on the hardware here, I had done a little bit of, like, embedded systems, very little firmware. But we luckily have somebody on the team with deep experience. Yeah.swyx [00:49:21]: I'm just like, you know, like you have to become hardware people. Yeah.Ethan [00:49:25]: Yeah. I mean, I learned to worry about supply chain, power, things like radio.Maria [00:49:30]: There's so many things to learn.Ethan [00:49:32]: I would tell this about hardware, and I know it's been said before, but building a prototype and learning how the electronics work and learning about firmware and developing, this is, I think, fun for a lot of engineers, and it's all totally achievable, especially now with the tools we have. Like, stuff you might've been intimidated about, like, how do I write this firmware? Now, with Sonnet, you can get going and actually see results quickly. But I think going from prototype to actually making something manufactured is an enormous jump. And it's not all about technology: the supply chain, the procurement, the regulations, the cost, the tooling. The thing about software that I'm used to is, it's funny, you can make changes all along the way and ship it. But, like, when you have to buy tooling for an enclosure, that's expensive.swyx [00:50:24]: Do you buy your own tooling? You have to.Ethan [00:50:25]: Don't you just subcontract out to someone in China? Oh, no. Do we make the tooling? No, no.
You have to have CNC and like a bunch of machines.Maria [00:50:31]: Like nobody makes their own tooling, but like you have to design this design and you submitEthan [00:50:36]: it and then they go four to six weeks later. Yeah. And then if there's a problem with it, well, then you're not, you're not making any, any of your enclosures. And so you have to really plan ahead. And like.swyx [00:50:48]: I just want to leave tips for other hardware founders. Like what resources or websites are most helpful in your sort of manufacturing journey?Ethan [00:50:55]: You know, I think it's different depending on like it's hardware so specialized in different ways.Maria [00:51:00]: I will say that, for example, I should choose a manufacturer company. I speak with other founders and like we can give you like some, you know, some tips of who is good and who is not, or like who's specialized in something versus somebody else. Yeah.Ethan [00:51:15]: Like some people are good in plastics. Some people are good.Maria [00:51:18]: I think like for us, it really helped at the beginning to speak with others and understand. Okay. Like who is around. I work in Shenzhen. I lived almost two years in China. I have an idea about like different hardware manufacturer and all of that. Soon I will go back to Shenzhen to check out. So I think it's good also to go in place and check.Ethan [00:51:40]: Yeah, you have to like once you, if you, so we did some stuff domestically and like if you have that ability. The reason I say ability is very expensive, but like to build out some proof of concepts and do field testing before you take it to a manufacturer, despite what people say, there's really good domestic manufacturing for small quantities at extremely high prices. So we got our first PCB and the assembly done in LA. So there's a lot of good because of the defense industry that can do quick churn. So it's like, we need this board. We need to find out if it's working. We have this deadline we want to start, but you need to go through this. And like if you want to have it done and fabricated in a week, they can do it for a price. But I think, you know, everybody's kind of trending even for prototyping now moving that offshore because in China you can do prototyping and get it within almost the same timeline. But the thing is with manufacturing, like it really helps to go there and kind of establish the relationship. Yeah.Alessio [00:52:38]: My first company was a hardware company and we did our PCBs in China and took a long time. Now things are better. But this was, yeah, I don't know, 10 years ago, something like that. Yeah.Ethan [00:52:47]: I think that like the, and I've heard this too, we didn't run into this problem, but like, you know, if it's something where you don't have the relationship, they don't see you, they don't know you, you know, you might get subcontracted out or like they're not paying attention. But like if you're, you know, you have the relationship and a priority, like, yeah, it's really good. We ended up doing the fabrication assembly in Taiwan for various reasons.Maria [00:53:11]: And I think it really helped the fact that you went there at some point. Yeah.Ethan [00:53:15]: We're really happy with the process and, but I mean the whole process of just Choosing the right people. Choosing the right people, but also just sourcing the bill materials and all of that stuff. 
Like, I guess if you have time, it's not that bad, but if you're trying to really push the speed, it's incredibly stressful. Okay. We've got to move to the software. Yeah.Alessio [00:53:38]: Yeah. So the hardware, maybe it's hard for people to understand, but what software people can understand is that running transcription and summarization, all of these things, in real time, every day, 24 hours a day, is not easy. So you mentioned 200,000 tokens for a day. Yeah. How do you make it basically free to run all of this for the consumer?Ethan [00:53:59]: Well, I think that the pipeline and the inference... like, people think about all of these tokens, but as you know, the price of tokens is dramatically dropping. You guys probably have some charts somewhere that you've posted. We do. And, like, if you see that trend, like, 250,000 input tokens is not really that much, right? Like, the output.swyx [00:54:21]: You do several layers. You do live. Yeah.Ethan [00:54:23]: Yeah. So the speech-to-text is the most challenging part, actually, because, you know, it requires real-time processing and then later processing with a larger model. And one thing that is fairly obvious is that you don't need to transcribe things that don't have any voice in them, right? So good voice activity detection is key, right? Because the majority of most people's day is not spent with voice activity, right? So that is the first step to cutting down the amount of compute you have to do. And voice activity detection is a fairly cheap thing to do. Very, very cheap thing to do. The models that need to summarize: you don't need a Sonnet-level kind of model to summarize. You do need a Sonnet-level model to execute things like the agent. And we will be having a subscription for features like that, because, you know... although now with R1, we'll see, we haven't evaluated it. DeepSeek? Yeah. I mean, not that one in particular, but, you know, there are already models that can kind of perform at that level. Let's see where it's going to be in six months, but, like, yeah. So self-hosted models help in the things where you can. So you are self-hosting models. Yes. You are fine-tuning your own ASR. Yes. I will say that I see in the future that everything's trending down. Although, like, I think there might be an intermediary step where things become expensive, which is, we're really interested, because the pipeline is very tedious and a lot of tuning, right? Which is brutal, because it's just a lot of trial and error. Whereas, like, well, wouldn't it be nice if an end-to-end model could just do all of this and learn it? If we could do transcription with an LLM, there are so many advantages to that, but it's going to be a larger model and hence more compute. You know, we're optimistic. Maybe we could distill something down, and, rather than focus on reducing the cost of the existing pipeline, we kind of try to get to the next generation. Because it's very clear that all ASR, all speech-to-text, is going to be pretty obsolete pretty soon. So investing into that is probably kind of a dead end. Because it's just going to be... it's going to be obsolete.swyx [00:56:39]: It's interesting. Like, I think when I initially invested in Tab, and this shows you how wrong I was, I was like, oh, this is a sort of razors-and-blades model where you sell cheap hardware and you make it up on a subscription, like a monthly subscription.
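To make the voice-activity gating Ethan describes above concrete, here is a minimal sketch of a VAD-first pipeline in Python. It is illustrative only, not Bee's actual stack: webrtcvad is a real, commonly used voice-activity-detection package, but the transcribe and summarize callables and the cost figures in the comments are assumptions.

```python
# Minimal sketch of a VAD-gated audio pipeline (illustrative, not Bee's implementation).
# webrtcvad is a real PyPI package; `transcribe` and `summarize` are hypothetical
# stand-ins for a self-hosted ASR model and a small summarization LLM.
import webrtcvad

SAMPLE_RATE = 16_000    # 16 kHz, 16-bit mono PCM
vad = webrtcvad.Vad(2)  # aggressiveness 0-3; cheap enough to run on every frame

def voiced_only(frames):
    """Keep only the 10/20/30 ms frames that contain speech, so the expensive
    ASR model never sees the silent majority of the day."""
    return [f for f in frames if vad.is_speech(f, SAMPLE_RATE)]

def process_day(frames, transcribe, summarize):
    speech = voiced_only(frames)
    if not speech:
        return None                    # no voice activity -> no ASR or LLM cost at all
    transcript = transcribe(speech)    # real-time ASR, then a later pass with a larger model
    return summarize(transcript)       # a small model is enough for summaries

# Rough cost intuition (made-up but realistic prices): 250k input tokens/day at
# ~$0.15 per 1M tokens is under $0.04/day, so the LLM side is nearly free;
# the hard engineering sits in the streaming speech-to-text.
```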
And now I just checked friend is a one-time sale, $99 limitless one-time sale, $99. These guys one-time sale, $49 and inference is free. What? Wow. It's crazy.Ethan [00:57:09]: I think when you probably invested, like how much was a million input tokens at that time and what is it now?swyx [00:57:15]: It's a fascinating business and like, you know, there's a lot to dig into there, but just getting that perspective out there is, I think it's not something that people think about a lot.Alessio [00:57:24]: And you obviously have thought a lot about. What about memory? I think this is something we go back and forth on about memory as in you're just memorizing facts and then understanding implicit preference and adjusting facts that you think are important. Have you ever done something about a person? Any learnings from that? I know there's a lot of open source frameworks now that do it that you build all of your own infrastructure internally.Ethan [00:57:46]: Yeah, we did. I mean I evaluated used a lot in other projects. I think that there's a few different tasks or things that revolve around memory. Like one is like retrieval obviously. And like when you need to find like even if you have a large corpus of how do you find? And so like I think existing kind of rag pipelines also will probably be the most helpful. The frameworks, I have not found one, like, there's no general way to do RAG that works, like, it's really highly dependent on the data. So, like, if you're going to be customizing something that much, it's just, you get kind of more bang from the buck from designing it all yourself. You know, a lot of those frameworks are great for getting going quickly. But I think it's really interesting memory when you're trying to do, for a person, because memory is decay, right? Like, I'm going to London, you know, then I come back, I'm not going to London anymore. What we've learned is, like, doing the traditional, like, embedding and RAG is suboptimal. We kind of built our own using small models to do really massively parallel retrieval. Which I think is going to be maybe more common in the future. And then, like, how to represent a person. We still require some human loop. And I mean, this is an ongoing project. And, you know, we're learning every day. Like, how do you correct the model when it gets something wrong about you? Right now, we have, like, things that are, like, super confirmed that are, like, ground truth about you because the human accepted it. But ideally, like, that step wouldn't be necessary. And then we have things that are fuzzier. And, like, the more... Stuff that we know is true, the more accurate we are when we're trying to decide, is this fuzzy stuff? Because it's probably, like, if you have the context, it's probably not true. So I think it's one of the most core challenges is how to handle both retrieval and then modeling and, like, especially when you're dealing with noisy source data. Because, like, even if, in an ideal world, even if you just had perfect transcription and you're going off that, that's still not enough information, right? And even if you had visual, it's still not enough. Like, there's still going to be...Alessio [00:59:55]: Yeah, one way I think about it is I usually like to order the same thing from the same restaurant if I like it. But I'm not saying that out loud. And it's kind of like, are these type of behaviors? Like, when you ask about a favorite restaurant, I would just want it to give me restaurants that I've already been to that I like. 
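One way to picture the decay-plus-confirmation idea Ethan describes above is to score candidate memories by relevance, recency, and whether the user confirmed them. This is a generic sketch, not Bee's system; the half-life, tier weights, and data model are invented for illustration.

```python
# Generic sketch of memory scoring with recency decay and a confirmation tier.
import time
from dataclasses import dataclass

HALF_LIFE_DAYS = 30.0  # assumed: an unrefreshed fact loses half its weight per month

@dataclass
class Memory:
    text: str
    timestamp: float   # unix seconds when the fact was captured
    confirmed: bool    # True if the user accepted it as ground truth
    relevance: float   # similarity to the current query, from any retriever

def score(m: Memory, now: float | None = None) -> float:
    now = now or time.time()
    age_days = (now - m.timestamp) / 86_400
    decay = 0.5 ** (age_days / HALF_LIFE_DAYS)   # exponential recency decay
    tier = 1.0 if m.confirmed else 0.6           # fuzzy, unconfirmed facts count for less
    return m.relevance * decay * tier

def top_memories(memories: list[Memory], k: int = 5) -> list[Memory]:
    # "I'm going to London" from last week should outrank the same fact from last year.
    return sorted(memories, key=score, reverse=True)[:k]
```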
Or, like, if I'm like, hey, just order something from this place, I should just reorder the same thing. Because it knows that I like to reorder the same thing. But I feel like today, most agent memory things that I see people publish, it's like, you know, just the write-down-the-facts thing.Ethan [01:00:39]: Yeah, I mean, I think that's why the reasoning, like, in our case, giving it time to consider all of the sources it has. So, like, look at the email, see the receipts, and then look at the conversations to see what I've mentioned. And then being able to take enough time to search through all the contexts and connect the dots is, I think, really important. And, like, I don't know, some of the agent memory stuff is, like, a key-value store with RAG on top. And the results there are just not complete enough when you have a growing corpus and you're managing decay and hallucinations that might be in the source material. So, this is where people usually bring in knowledge graphs. Yes. And do you do it? We don't extensively use knowledge graphs. It's something... you know, we didn't talk also about the kind of potential future social aspects.Maria [01:01:33]: Yeah, I wanted to speak about it.Ethan [01:01:35]: But the problem with knowledge graphs that we found is, and I don't know if you can tell me what your experience has been, but they're great for representing the data, but then using it at inference time is kind of challenging.swyx [01:01:49]: For speed or what other issues?Ethan [01:01:51]: Just, like, the LLM understanding the graph. Yeah. The input. Yeah, it's not in the training data, for sure. I think that the graph is the right kind of way to store the data, but then you need to have the right retrieval and then format it in a way that doesn't overwhelm or confuse what you're trying to do. Should we ask about social? Yeah, I thought you were going to go into it. Yeah. Like, not directly related. We did some experimentation. Not directly related to, like, graph retrieval or graph knowledge bases. Yeah. Yeah. The idea is that you have, like, your personal context, but then other people can query it, you know, it can divulge some things, and you would have full control over that. So if Maria and I are trying to negotiate, like, where we're going to dinner, there can be an exchange. We did exactly this experiment. Yeah. There can be an exchange between the agents and, like, oh.Maria [01:02:45]: So, like, my agent can speak with Ethan's agent. Both of them, they know our location, what we like, where we went in the past. Yeah. And even, you know, if we have our calendar integrated, they know when we're free. So they can interact with each other and have a conversation and decide a place to go for us. Wow. And we did that. And it was, for me, really cool, because they suggested to us a nice French restaurant that we went to in the end.swyx [01:03:11]: That you've never been to?Maria [01:03:12]: That we've never been to. Okay. But both of us, they said that we like French food. Both of us, we were in Pacific Heights. And, yeah, this was really trivial. Yeah.Ethan [01:03:23]: It's a trivial, like, toy use case. But I guess, like, in terms of, you've been using it for a while, like, if I wanted to buy you a gift.Maria [01:03:30]: Oh, my God. You bought me a bunch of candles, now that I think about it.Ethan [01:03:35]: This is another use case. I was like, yeah.
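Stepping back to the knowledge-graph point above, one common workaround for the "great for storage, awkward at inference time" problem is to retrieve only a small neighborhood of the graph and flatten it into plain-language lines before it goes into the prompt. This is a generic sketch, not Bee's or Sierra's approach; the triples are made up for illustration.

```python
# Generic sketch: flatten a retrieved knowledge-graph neighborhood into short
# natural-language lines an LLM can read, instead of pasting raw graph structure.
TRIPLES = [
    ("Maria", "likes", "French food"),
    ("Ethan", "likes", "French food"),
    ("Maria", "located_in", "Pacific Heights"),
    ("Ethan", "located_in", "Pacific Heights"),
]

def neighborhood(entity, triples=TRIPLES, max_facts=10):
    """Pull only the facts touching one entity so the prompt stays small."""
    return [t for t in triples if entity in (t[0], t[2])][:max_facts]

def to_prompt_lines(triples):
    """Render triples as plain sentences; models handle these better than raw edges."""
    return "\n".join(f"- {s} {r.replace('_', ' ')} {o}." for s, r, o in triples)

context = to_prompt_lines(neighborhood("Maria") + neighborhood("Ethan"))
# `context` can be prepended to a question like
# "Suggest a dinner spot both Maria and Ethan would enjoy."
```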
When we were testing the agent, like, a bunch of candles from Amazon showed up at her door.Maria [01:03:43]: Yeah, because I love candles, but I didn't expect 20. Yeah.Ethan [01:03:47]: It was a lot of experimenting. But, like, how to manage that where it's like, what's okay for your B to divulge to him? Who? Yeah. Like, shouldn't you get an authorization request every time? Yeah, yeah, yeah.swyx [01:03:58]: For personal context. Yeah, yeah, yeah.Ethan [01:04:00]: So, like, you know, you would have to, human would have to sign off on it. But I think then, like, then I wouldn't have to guess. I could just.swyx [01:04:10]: Yeah, yeah. You know, there's this culture that, like, is very alien to everyone else outside of SF and outside the Gen Z bubble in SF, which is sharing, location sharing. Yeah. I can tell my close friends where they are exactly right now in the city. Yeah. And it's opt-in. And, like, it's. Dude. Dude. You know, and, like, it's normal and, like, it freaks out everyone who's not here. Yeah. Yeah. And so maybe we can share preference, like, who we like. Absolutely.Maria [01:04:34]: I really believe in it, for sure. We will.Ethan [01:04:36]: Or even, like, small updates about your day. My parents would love that because I don't do that. Yeah. now there's no friction. It can just be more or less automatic. Yeah. Dating? I was trained always to avoid dating. Really? As a startup founder. Yeah, you can hate that. Yeah. Everyone hates it?Maria [01:04:55]: We thought about it. Like, sometimes some people, they ask to us because it's like, oh, you know so much about me. Like, can you measure compatibility with somebody else or something like that? Yeah. Probably there is a future. Maybe somebody should build that. I think on our end, we were like, no, this is. We don't want to.Ethan [01:05:11]: I will build on your API. My sister is actually a personality psychology professor and she studies personality. And we were at Thanksgiving because my parents wear one. And I was like, ask it. Like, give me my big five. Yeah. Which is like the personality type. And it's like. Does it know my big five? Just ask it to consider everything and give your big five. And my sister said it was pretty. I didn't agree with it because it said I was disagreeable. I agree with that. But she seemed to think it was agreeable. And so.swyx [01:05:41]: You disagree that you're disagreeable? Yeah. Yeah. What other proof do we need then?Ethan [01:05:47]: Yeah. I think I'm very agreeable.Ethan [01:05:51]: But I think that we do. I did get some users are like, oh, if like we're a couple. Yeah.Maria [01:05:56]: We had like couples. Actually. They bought the product together. Yeah. Like both. Like couple. They bought the hardware. So there is something there. Another test is like the Myers-Briggs. I know that you don't like that one. No. No.swyx [01:06:08]: Ocean is cooler than Myers-Briggs. Yeah. Everyone stop using my MBTI. Use my. Use Ocean. Yeah.Maria [01:06:12]: Yeah. For me, like it was on point. Like every time. Like it. Awesome.Alessio [01:06:16]: Anything else that we didn't cover? Any cool underrated things?Maria [01:06:21]: Go to b.computer. Forty nine. Ninety nine. And you buy the device. That's the. That's the call to action.swyx [01:06:28]: And you're hiring?Maria [01:06:29]: We are hiring. For sure.Ethan [01:06:32]: AI engineers.Maria [01:06:33]: AI engineers. Nice. What is an AI engineer?Ethan [01:06:35]: Yeah. But did you study? Somebody who's scrappy and willing to.Maria [01:06:42]: Work with us. 
Yeah.Ethan [01:06:43]: I think. I think you coined the term, right? So you can tell us.Maria [01:06:48]: Somebody that can adapt. That has resistance. Yeah. Yeah.swyx [01:06:51]: People have different perspectives and what is useful for you is different from what is useful for me. Yeah. So anyway, it's so useful.Ethan [01:06:57]: I mean, I think that always on AI is really going to explode and it's going to be a lot from both a lot of startups, but incumbents and there's going to be all kinds of new things that we're going to learn about how it's going to change all of our lives. I think that's the thing I'm most certain about. So. And being AI.swyx [01:07:15]: Well, thanks very much. Thank you guys. This is a pleasure. Thank you. Yeah. We'll see you launch whenever. Thank you. I'm sure that launch is happening. Yeah. Thanks. Thank you. Get full access to Latent.Space at www.latent.space/subscribe
    --------  
    1:08:52
  • The AI Architect — Bret Taylor
    If you’re in SF, join us tomorrow for a fun meetup at CodeGen Night!If you’re in NYC, join us for AI Engineer Summit! The Agent Engineering track is now sold out, but 25 tickets remain for AI Leadership and 5 tickets for the workshops. You can see the full schedule of speakers and workshops at https://ai.engineer!It’s exceedingly hard to introduce someone like Bret Taylor. We could recite his Wikipedia page, or his extensive work history through Silicon Valley’s greatest companies, but everyone else already does that.As a podcast by AI engineers for AI engineers, we had the opportunity to do something a little different. We wanted to dig into what Bret sees from his vantage point at the top of our industry for the last 2 decades, and how that explains the rise of the AI Architect at Sierra, the leading conversational AI/CX platform.“Across our customer base, we are seeing a new role emerge - the role of the AI architect. These leaders are responsible for helping define, manage and evolve their company's AI agent over time. They come from a variety of both technical and business backgrounds, and we think that every company will have one or many AI architects managing their AI agent and related experience.”In our conversation, Bret Taylor confirms the Paul Buchheit legend that he rewrote Google Maps in a weekend, armed with only the help of a then-nascent Google Closure Compiler and no other modern tooling. But what we find remarkable is that he was the PM of Maps, not an engineer, though of course he still identifies as one. We find this theme recurring throughout Bret’s career and worldview. We think it is plain as day that AI leadership will have to be hands-on and technical, especially when the ground is shifting as quickly as it is today:“There's a lot of power in combining product and engineering into as few people as possible… few great things have been created by committee.”“If engineering is an order taking organization for product you can sometimes make meaningful things, but rarely will you create extremely well crafted breakthrough products. Those tend to be small teams who deeply understand the customer need that they're solving, who have a maniacal focus on outcomes.”“And I think the reason why is if you look at like software as a service five years ago, maybe you can have a separation of product and engineering because most software as a service created five years ago. I wouldn't say there's like a lot of technological breakthroughs required for most business applications. And if you're making expense reporting software or whatever, it's useful… You kind of know how databases work, how to build auto scaling with your AWS cluster, whatever, you know, it's just, you're just applying best practices to yet another problem. "When you have areas like the early days of mobile development or the early days of interactive web applications, which I think Google Maps and Gmail represent, or now AI agents, you're in this constant conversation with what the requirements of your customers and stakeholders are and all the different people interacting with it and the capabilities of the technology. And it's almost impossible to specify the requirements of a product when you're not sure of the limitations of the technology itself.”This is the first time the difference between technical leadership for “normal” software and for “AI” software was articulated this clearly for us, and we’ll be thinking a lot about this going forward. 
We left a lot of nuggets in the conversation, so we hope you’ll just dive in with us (and thank Bret for joining the pod!)Full YouTubePlease Like and Subscribe :)Timestamps* 00:00:02 Introductions and Bret Taylor's background* 00:01:23 Bret's experience at Stanford and the dot-com era* 00:04:04 The story of rewriting Google Maps backend* 00:11:06 Early days of interactive web applications at Google* 00:15:26 Discussion on product management and engineering roles* 00:21:00 AI and the future of software development* 00:26:42 Bret's approach to identifying customer needs and building AI companies* 00:32:09 The evolution of business models in the AI era* 00:41:00 The future of programming languages and software development* 00:49:38 Challenges in precisely communicating human intent to machines* 00:56:44 Discussion on Artificial General Intelligence (AGI) and its impact* 01:08:51 The future of agent-to-agent communication* 01:14:03 Bret's involvement in the OpenAI leadership crisis* 01:22:11 OpenAI's relationship with Microsoft* 01:23:23 OpenAI's mission and priorities* 01:27:40 Bret's guiding principles for career choices* 01:29:12 Brief discussion on pasta-making* 01:30:47 How Bret keeps up with AI developments* 01:32:15 Exciting research directions in AI* 01:35:19 Closing remarks and hiring at Sierra Transcript[00:02:05] Introduction and Guest Welcome[00:02:05] Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co host swyx, founder of smol.ai.[00:02:17] swyx: Hey, and today we're super excited to have Bret Taylor join us. Welcome. Thanks for having me. It's a little unreal to have you in the studio.[00:02:25] swyx: I've read about you so much over the years, like even before. Open AI effectively. I mean, I use Google Maps to get here. So like, thank you for everything that you've done. Like, like your story history, like, you know, I think people can find out what your greatest hits have been.[00:02:40] Bret Taylor's Early Career and Education[00:02:40] swyx: How do you usually like to introduce yourself when, you know, you talk about, you summarize your career, like, how do you look at yourself?[00:02:47] Bret: Yeah, it's a great question. You know, we, before we went on the mics here, we're talking about the audience for this podcast being more engineering. And I do think depending on the audience, I'll introduce myself differently because I've had a lot of [00:03:00] corporate and board roles. I probably self identify as an engineer more than anything else though.[00:03:04] Bret: So even when I was. Salesforce, I was coding on the weekends. So I think of myself as an engineer and then all the roles that I do in my career sort of start with that just because I do feel like engineering is sort of a mindset and how I approach most of my life. So I'm an engineer first and that's how I describe myself.[00:03:24] Bret: You majored in computer[00:03:25] swyx: science, like 1998. And, and I was high[00:03:28] Bret: school, actually my, my college degree was Oh, two undergrad. Oh, three masters. Right. That old.[00:03:33] swyx: Yeah. I mean, no, I was going, I was going like 1998 to 2003, but like engineering wasn't as, wasn't a thing back then. Like we didn't have the title of senior engineer, you know, kind of like, it was just.[00:03:44] swyx: You were a programmer, you were a developer, maybe. What was it like in Stanford? Like, what was that feeling like? 
You know, was it, were you feeling like on the cusp of a great computer revolution? Or was it just like a niche, you know, interest at the time?[00:03:57] Stanford and the Dot-Com Bubble[00:03:57] Bret: Well, I was at Stanford, as you said, from 1998 to [00:04:00] 2002.[00:04:02] Bret: 1998 was near the peak of the dot com bubble. So. This is back in the day where most people that they're coding in the computer lab, just because there was these sun microsystems, Unix boxes there that most of us had to do our assignments on. And every single day there was a. com like buying pizza for everybody.[00:04:20] Bret: I didn't have to like, I got. Free food, like my first two years of university and then the dot com bubble burst in the middle of my college career. And so by the end there was like tumbleweed going to the job fair, you know, it was like, cause it was hard to describe unless you were there at the time, the like level of hype and being a computer science major at Stanford was like, A thousand opportunities.[00:04:45] Bret: And then, and then when I left, it was like Microsoft, IBM.[00:04:49] Joining Google and Early Projects[00:04:49] Bret: And then the two startups that I applied to were VMware and Google. And I ended up going to Google in large part because a woman named Marissa Meyer, who had been a teaching [00:05:00] assistant when I was, what was called a section leader, which was like a junior teaching assistant kind of for one of the big interest.[00:05:05] Bret: Yes. Classes. She had gone there. And she was recruiting me and I knew her and it was sort of felt safe, you know, like, I don't know. I thought about it much, but it turned out to be a real blessing. I realized like, you know, you always want to think you'd pick Google if given the option, but no one knew at the time.[00:05:20] Bret: And I wonder if I'd graduated in like 1999 where I've been like, mom, I just got a job at pets. com. It's good. But you know, at the end I just didn't have any options. So I was like, do I want to go like make kernel software at VMware? Do I want to go build search at Google? And I chose Google. 50, 50 ball.[00:05:36] Bret: I'm not really a 50, 50 ball. So I feel very fortunate in retrospect that the economy collapsed because in some ways it forced me into like one of the greatest companies of all time, but I kind of lucked into it, I think.[00:05:47] The Google Maps Rewrite Story[00:05:47] Alessio: So the famous story about Google is that you rewrote the Google maps back in, in one week after the map quest quest maps acquisition, what was the story there?[00:05:57] Alessio: Is it. Actually true. Is it [00:06:00] being glorified? Like how, how did that come to be? And is there any detail that maybe Paul hasn't shared before?[00:06:06] Bret: It's largely true, but I'll give the color commentary. So it was actually the front end, not the back end, but it turns out for Google maps, the front end was sort of the hard part just because Google maps was.[00:06:17] Bret: Largely the first ish kind of really interactive web application, say first ish. I think Gmail certainly was though Gmail, probably a lot of people then who weren't engineers probably didn't appreciate its level of interactivity. It was just fast, but. Google maps, because you could drag the map and it was sort of graphical.[00:06:38] Bret: My, it really in the mainstream, I think, was it a map[00:06:41] swyx: quest back then that was, you had the arrows up and down, it[00:06:44] Bret: was up and down arrows. 
Each map was a single image and you just click left and then wait for a few seconds to the new map to let it was really small too, because generating a big image was kind of expensive on computers that day.[00:06:57] Bret: So Google maps was truly innovative in that [00:07:00] regard. The story on it. There was a small company called where two technologies started by two Danish brothers, Lars and Jens Rasmussen, who are two of my closest friends now. They had made a windows app called expedition, which had beautiful maps. Even in 2000.[00:07:18] Bret: For whenever we acquired or sort of acquired their company, Windows software was not particularly fashionable, but they were really passionate about mapping and we had made a local search product that was kind of middling in terms of popularity, sort of like a yellow page of search product. So we wanted to really go into mapping.[00:07:36] Bret: We'd started working on it. Their small team seemed passionate about it. So we're like, come join us. We can build this together.[00:07:42] Technical Challenges and Innovations[00:07:42] Bret: It turned out to be a great blessing that they had built a windows app because you're less technically constrained when you're doing native code than you are building a web browser, particularly back then when there weren't really interactive web apps and it ended up.[00:07:56] Bret: Changing the level of quality that we [00:08:00] wanted to hit with the app because we were shooting for something that felt like a native windows application. So it was a really good fortune that we sort of, you know, their unusual technical choices turned out to be the greatest blessing. So we spent a lot of time basically saying, how can you make a interactive draggable map in a web browser?[00:08:18] Bret: How do you progressively load, you know, new map tiles, you know, as you're dragging even things like down in the weeds of the browser at the time, most browsers like Internet Explorer, which was dominant at the time would only load two images at a time from the same domain. So we ended up making our map tile servers have like.[00:08:37] Bret: Forty different subdomains so we could load maps and parallels like lots of hacks. I'm happy to go into as much as like[00:08:44] swyx: HTTP connections and stuff.[00:08:46] Bret: They just like, there was just maximum parallelism of two. And so if you had a map, set of map tiles, like eight of them, so So we just, we were down in the weeds of the browser anyway.[00:08:56] Bret: So it was lots of plumbing. I can, I know a lot more about browsers than [00:09:00] most people, but then by the end of it, it was fairly, it was a lot of duct tape on that code. If you've ever done an engineering project where you're not really sure the path from point A to point B, it's almost like. Building a house by building one room at a time.[00:09:14] Bret: The, there's not a lot of architectural cohesion at the end. And then we acquired a company called Keyhole, which became Google earth, which was like that three, it was a native windows app as well, separate app, great app, but with that, we got licenses to all this satellite imagery. And so in August of 2005, we added.[00:09:33] Bret: Satellite imagery to Google Maps, which added even more complexity in the code base. And then we decided we wanted to support Safari. There was no mobile phones yet. So Safari was this like nascent browser on, on the Mac. 
And it turns out there were a lot of decisions behind the scenes, sort of inspired by this Windows app, like heavy use of XML and XSLT and all these[00:09:54] Bret: technologies that were briefly fashionable in the early two thousands and everyone hates now for good [00:10:00] reason. And it turns out that all of the XML functionality in Internet Explorer wasn't supported in Safari. So people were, like, re-implementing XML parsers. And it was just, like, this pile of s**t.[00:10:11] Bret: Am I allowed to say s**t on your pod? Yeah, of[00:10:12] Alessio: course.[00:10:13] Bret: So it went from this beautifully elegant application that everyone was proud of to something that probably had hundreds of K of JavaScript, which sounds like nothing now. We're talking, like, people had modems, you know, not all modems, but it was a big deal.[00:10:29] Bret: So it was slow. It took a while to load, and it just wasn't a great code base. Like, everything was fragile. So I just got super frustrated by it. And then one weekend I did rewrite all of it. And at the time the word JSON hadn't been coined yet, just to give you a sense. So it's all XML.[00:10:47] swyx: Yeah.[00:10:47] Bret: So we used what you would now call JSON, but I just said, like, let's use eval so that we can parse the data fast. And, again, it was literally JSON, but at the time there was no name for it. So we [00:11:00] just said, let's pass down JavaScript from the server and eval it. And then somebody just refactored the whole thing.[00:11:05] Bret: And it wasn't like I was some genius. It was just, like, you know, if you knew everything you wished you had known at the beginning, and I knew all the functionality, because I was one of the primary authors of the JavaScript. And I just drank a lot of coffee and stayed up all weekend.[00:11:22] Bret: And then I guess I developed a bit of a reputation, and no one knew about this for a long time. And then Paul, who created Gmail (and I ended up starting a company with him too, after all of this), told this on a podcast, and now it's lore, but it's largely true. I did rewrite it, and it's my proudest thing.[00:11:38] Bret: And I think JavaScript people will appreciate this: the gzipped bundle size for all of Google Maps, when I rewrote it, was 20K gzipped. It was much smaller for the entire application. It went down by, like, 10x. So what happened? Google is a pretty mainstream company, and so our usage shot up, because it turns out it's faster.[00:11:57] Bret: Just being faster is worth a lot of [00:12:00] percentage points of growth at the scale of Google. So how[00:12:03] swyx: much modern tooling did you have? Like test suites? No compilers?[00:12:07] Bret: Actually, that's not true. We did have one thing. So, actually, you can download it. Google has a Closure Compiler, a Closure Compiler.[00:12:15] Bret: I don't know if anyone still uses it. It's gone. Yeah. Yeah. It's sort of gone out of favor. Yeah. Well, even until recently it was better than most JavaScript minifiers, because it did a lot more renaming of variables and things. Most people use esbuild now just because it's fast, and Closure Compiler is built on Java and super slow and stuff like that.[00:12:37] Bret: But, so we did have that. That was it.
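As an aside for readers, the "forty subdomains" trick Bret mentioned above is easy to sketch: pick the subdomain deterministically from the tile coordinates so the same tile always maps to the same host (keeping browser caching intact) while different tiles spread across hosts. This is a generic reconstruction in Python, not the original Google Maps JavaScript; the hostname and shard count are illustrative.

```python
# Generic reconstruction of the domain-sharding trick: IE-era browsers allowed only
# two parallel downloads per hostname, so tile URLs were spread across many
# subdomains. Hostname and shard count here are illustrative, not Google's.

NUM_SHARDS = 40  # "forty different subdomains"

def tile_url(x: int, y: int, zoom: int) -> str:
    shard = (x + y) % NUM_SHARDS   # deterministic: same tile -> same host,
                                   # so the browser cache still works
    return f"https://mt{shard}.example-tiles.com/tiles?x={x}&y={y}&z={zoom}"

# Eight visible tiles now resolve to up to eight different hostnames, sidestepping
# the two-connections-per-host limit.
print([tile_url(x, 0, 12) for x in range(8)])
```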
Okay.[00:12:39] The Evolution of Web Applications[00:12:39] Bret: So, and that was created internally, you know, it was a really interesting time at Google, because there were a lot of teams working on fairly advanced JavaScript when no one else was. So Google Suggest, which Kevin Gibbs was the tech lead for, was the first kind of type-ahead autocomplete, I believe, in a web browser, and now it's just pervasive in search boxes, where you sort of [00:13:00] see a type-ahead there.[00:13:01] Bret: I mean, ChatGPT[00:13:01] swyx: just added it. It's kind of like a round trip.[00:13:03] Bret: Totally. No, it's now pervasive as a UI affordance, but that was like Kevin's 20 percent project. And then Gmail: Paul, you know, he tells the story better than anyone, but he basically was scratching his own itch. What was really neat about it is email, because it's such a productivity tool, just needed to be faster.[00:13:21] Bret: So, you know, he was scratching his own itch of just making more stuff work on the client side. And then we, because of Lars and Jens sort of setting the bar with this Windows app, like, we need our maps to be draggable, we ended up not only innovating in terms of having what would be called a single-page application today, but also all the graphical stuff. You know, we were crashing Firefox like it was going out of style, because, you know, when you make a document object model with the idea that it's a document, and then you layer on some JavaScript, and then we're essentially abusing all of this, it just was running into code paths that were not
But the JSON came out of that, you know, and then a lot of the best practices around building JavaScript applications is pre React.[00:15:44] Bret: I think React was probably the big conceptual step forward that we needed. Even my first social network after Google, we used a lot of like HTML injection and. Making real time updates was still very hand coded and it's really neat when you [00:16:00] see conceptual breakthroughs like react because it's, I just love those things where it's like obvious once you see it, but it's so not obvious until you do.[00:16:07] Bret: And actually, well, I'm sure we'll get into AI, but I, I sort of feel like we'll go through that evolution with AI agents as well that I feel like we're missing a lot of the core abstractions that I think in 10 years we'll be like, gosh, how'd you make agents? Before that, you know, but it was kind of that early days of web applications.[00:16:22] swyx: There's a lot of contenders for the reactive jobs of of AI, but no clear winner yet. I would say one thing I was there for, I mean, there's so much we can go into there. You just covered so much.[00:16:32] Product Management and Engineering Synergy[00:16:32] swyx: One thing I just, I just observe is that I think the early Google days had this interesting mix of PM and engineer, which I think you are, you didn't, you didn't wait for PM to tell you these are my, this is my PRD.[00:16:42] swyx: This is my requirements.[00:16:44] mix: Oh,[00:16:44] Bret: okay.[00:16:45] swyx: I wasn't technically a software engineer. I mean,[00:16:48] Bret: by title, obviously. Right, right, right.[00:16:51] swyx: It's like a blend. And I feel like these days, product is its own discipline and its own lore and own industry and engineering is its own thing. And there's this process [00:17:00] that happens and they're kind of separated, but you don't produce as good of a product as if they were the same person.[00:17:06] swyx: And I'm curious, you know, if, if that, if that sort of resonates in, in, in terms of like comparing early Google versus modern startups that you see out there,[00:17:16] Bret: I certainly like wear a lot of hats. So, you know, sort of biased in this, but I really agree that there's a lot of power and combining product design engineering into as few people as possible because, you know few great things have been created by committee, you know, and so.[00:17:33] Bret: If engineering is an order taking organization for product you can sometimes make meaningful things, but rarely will you create extremely well crafted breakthrough products. Those tend to be small teams who deeply understand the customer need that they're solving, who have a. Maniacal focus on outcomes.[00:17:53] Bret: And I think the reason why it's, I think for some areas, if you look at like software as a service five years ago, maybe you can have a [00:18:00] separation of product and engineering because most software as a service created five years ago. I wouldn't say there's like a lot of like. Technological breakthroughs required for most, you know, business applications.[00:18:11] Bret: And if you're making expense reporting software or whatever, it's useful. I don't mean to be dismissive of expense reporting software, but you probably just want to understand like, what are the requirements of the finance department? What are the requirements of an individual file expense report? Okay.[00:18:25] Bret: Go implement that. And you kind of know how web applications are implemented. You kind of know how to. 
How databases work, how to build auto scaling with your AWS cluster, whatever, you know, it's just, you're just applying best practices to yet another problem when you have areas like the early days of mobile development or the early days of interactive web applications, which I think Google Maps and Gmail represent, or now AI agents, you're in this constant conversation with what the requirements of your customers and stakeholders are and all the different people interacting with it.[00:18:58] Bret: And the capabilities of the [00:19:00] technology. And it's almost impossible to specify the requirements of a product when you're not sure of the limitations of the technology itself. And that's why I use the word conversation. It's not literal. That's sort of funny to use that word in the age of conversational AI.[00:19:15] Bret: You're constantly sort of saying, like, ideally, you could sprinkle some magic AI pixie dust and solve all the world's problems, but it's not the way it works. And it turns out that actually, I'll just give an interesting example.[00:19:26] AI Agents and Modern Tooling[00:19:26] Bret: I think most people listening probably use co pilots to code like Cursor or Devon or Microsoft Copilot or whatever.[00:19:34] Bret: Most of those tools are, they're remarkable. I'm, I couldn't, you know, imagine development without them now, but they're not autonomous yet. Like I wouldn't let it just write most code without my interactively inspecting it. We just are somewhere between it's an amazing co pilot and it's an autonomous software engineer.[00:19:53] Bret: As a product manager, like your aspirations for what the product is are like kind of meaningful. But [00:20:00] if you're a product person, yeah, of course you'd say it should be autonomous. You should click a button and program should come out the other side. The requirements meaningless. Like what matters is like, what is based on the like very nuanced limitations of the technology.[00:20:14] Bret: What is it capable of? And then how do you maximize the leverage? It gives a software engineering team, given those very nuanced trade offs. Coupled with the fact that those nuanced trade offs are changing more rapidly than any technology in my memory, meaning every few months you'll have new models with new capabilities.[00:20:34] Bret: So how do you construct a product that can absorb those new capabilities as rapidly as possible as well? That requires such a combination of technical depth and understanding the customer that you really need more integration. Of product design and engineering. And so I think it's why with these big technology waves, I think startups have a bit of a leg up relative to incumbents because they [00:21:00] tend to be sort of more self actualized in terms of just like bringing those disciplines closer together.[00:21:06] Bret: And in particular, I think entrepreneurs, the proverbial full stack engineers, you know, have a leg up as well because. I think most breakthroughs happen when you have someone who can understand those extremely nuanced technical trade offs, have a vision for a product. And then in the process of building it, have that, as I said, like metaphorical conversation with the technology, right?[00:21:30] Bret: Gosh, I ran into a technical limit that I didn't expect. It's not just like changing that feature. You might need to refactor the whole product based on that. And I think that's, that it's particularly important right now. 
So I don't, you know, if you, if you're building a big ERP system, probably there's a great reason to have product and engineering.[00:21:51] Bret: I think in general, the disciplines are there for a reason. I think when you're dealing with something as nuanced as, like, the technologies, like large language models today, there's a ton of [00:22:00] advantage of having individuals or organizations that integrate the disciplines more formally.[00:22:05] Alessio: That makes a lot of sense.[00:22:06] Alessio: I've run a lot of engineering teams in the past, and I think the product versus engineering tension has always been more about effort than like whether or not the feature is buildable. But I think, yeah, today you see a lot more of like, models actually cannot do that. And I think the most interesting thing is on the startup side, people don't yet know where a lot of the AI value is going to accrue.[00:22:26] Alessio: So you have this rush of people building frameworks, building infrastructure, layered things, but we don't really know the shape of the compute. I'm curious, at Sierra, like how you thought about building in house a lot of the tooling for evals or like just, you know, building the agents and all of that.[00:22:41] Alessio: Versus how you see some of the startup opportunities that are maybe still out there.[00:22:46] Bret: We build most of our tooling in house at Sierra, not all. It's, we don't, it's not like not invented here syndrome necessarily, though, maybe slightly guilty of that in some ways, but because we're trying to build a platform [00:23:00] that's enduring, you know, we really want to have control over our own destiny.[00:23:03] Bret: And you had made a comment earlier that, like, we're still trying to figure out what the React of agents is and the jury is still out. I would argue it hasn't been created yet. I don't think the jury is still out, to go use that metaphor. We're sort of in the jQuery era of agents, not the React era.[00:23:19] Bret: And, and that's like a throwback for people listening,[00:23:22] swyx: we shouldn't rush it. You know?[00:23:23] Bret: No, yeah, that's my point is. And so, because we're trying to create an enduring company at Sierra that outlives us, you know, I'm not sure we want to like attach our cart to a horse where it's not clear that, like, we've figured it out. And I actually want, as a company, we're trying to enable, just at a high level, and I'll, I'll quickly go back to tech: at Sierra, we help consumer brands build customer facing AI agents.[00:23:48] Bret: So everyone from Sonos to ADT home security to SiriusXM, you know, if you call them on the phone, an AI will pick up with you, you know, chat with them on the SiriusXM homepage. It's an AI agent called Harmony [00:24:00] that they've built on our platform. What are the contours of what it means for someone to build an end to end complete customer experience with AI, with conversational AI?[00:24:09] Bret: You know, we really want to dive into the deep end of, of all the trade offs to do it. You know, where do you use fine tuning? Where do you string models together? You know, where do you use reasoning? Where do you use generation? How do you use reasoning? How do you express the guardrails of an agentic process?[00:24:25] Bret: How do you impose determinism on a fundamentally non deterministic technology? There's just a lot of really important design space. And I could sit here and tell you, we have the best approach.
Every entrepreneur will, you know. But I hope that in two years, we look back at our platform and laugh at how naive we were, because that's the pace of change broadly.[00:24:45] Bret: If you talk about like the startup opportunities, I'm not wholly skeptical of tools companies, but I'm fairly skeptical. There's always an exception for every role, but I believe that certainly there's a big market for [00:25:00] frontier models, but largely for companies with huge CapEx budgets. So. Open AI and Microsoft's Anthropic and Amazon Web Services, Google Cloud XAI, which is very well capitalized now, but I think the, the idea that a company can make money sort of pre training a foundation model is probably not true.[00:25:20] Bret: It's hard to, you're competing with just, you know, unreasonably large CapEx budgets. And I just like the cloud infrastructure market, I think will be largely there. I also really believe in the applications of AI. And I define that not as like building agents or things like that. I define it much more as like, you're actually solving a problem for a business.[00:25:40] Bret: So it's what Harvey is doing in legal profession or what cursor is doing for software engineering or what we're doing for customer experience and customer service. The reason I believe in that is I do think that in the age of AI, what's really interesting about software is it can actually complete a task.[00:25:56] Bret: It can actually do a job, which is very different than the value proposition of [00:26:00] software was to ancient history two years ago. And as a consequence, I think the way you build a solution and For a domain is very different than you would have before, which means that it's not obvious, like the incumbent incumbents have like a leg up, you know, necessarily, they certainly have some advantages, but there's just such a different form factor, you know, for providing a solution and it's just really valuable.[00:26:23] Bret: You know, it's. Like just think of how much money cursor is saving software engineering teams or the alternative, how much revenue it can produce tool making is really challenging. If you look at the cloud market, just as a analog, there are a lot of like interesting tools, companies, you know, Confluent, Monetized Kafka, Snowflake, Hortonworks, you know, there's a, there's a bunch of them.[00:26:48] Bret: A lot of them, you know, have that mix of sort of like like confluence or have the open source or open core or whatever you call it. I, I, I'm not an expert in this area. You know, I do think [00:27:00] that developers are fickle. I think that in the tool space, I probably like. Default towards open source being like the area that will win.[00:27:09] Bret: It's hard to build a company around this and then you end up with companies sort of built around open source to that can work. Don't get me wrong, but I just think that it's nowadays the tools are changing so rapidly that I'm like, not totally skeptical of tool makers, but I just think that open source will broadly win, but I think that the CapEx required for building frontier models is such that it will go to a handful of big companies.[00:27:33] Bret: And then I really believe in agents for specific domains which I think will, it's sort of the analog to software as a service in this new era. You know, it's like, if you just think of the cloud. You can lease a server. 
It's just a low level primitive, or you can buy an app like you know, Shopify or whatever.[00:27:51] Bret: And most people building a storefront would prefer Shopify over hand rolling their e commerce storefront. I think the same thing will be true of AI. So [00:28:00] I've. I tend to like, if I have a, like an entrepreneur asked me for advice, I'm like, you know, move up the stack as far as you can towards a customer need.[00:28:09] Bret: Broadly, but I, but it doesn't reduce my excitement about what is the reactive building agents kind of thing, just because it is, it is the right question to ask, but I think we'll probably play out probably an open source space more than anything else.[00:28:21] swyx: Yeah, and it's not a priority for you. There's a lot in there.[00:28:24] swyx: I'm kind of curious about your idea maze towards, there are many customer needs. You happen to identify customer experience as yours, but it could equally have been coding assistance or whatever. I think for some, I'm just kind of curious at the top down, how do you look at the world in terms of the potential problem space?[00:28:44] swyx: Because there are many people out there who are very smart and pick the wrong problem.[00:28:47] Bret: Yeah, that's a great question.[00:28:48] Future of Software Development[00:28:48] Bret: By the way, I would love to talk about the future of software, too, because despite the fact it didn't pick coding, I have a lot of that, but I can talk to I can answer your question, though, you know I think when a technology is as [00:29:00] cool as large language models.[00:29:02] Bret: You just see a lot of people starting from the technology and searching for a problem to solve. And I think it's why you see a lot of tools companies, because as a software engineer, you start building an app or a demo and you, you encounter some pain points. You're like,[00:29:17] swyx: a lot of[00:29:17] Bret: people are experiencing the same pain point.[00:29:19] Bret: What if I make it? That it's just very incremental. And you know, I always like to use the metaphor, like you can sell coffee beans, roasted coffee beans. You can add some value. You took coffee beans and you roasted them and roasted coffee beans largely, you know, are priced relative to the cost of the beans.[00:29:39] Bret: Or you can sell a latte and a latte. Is rarely priced directly like as a percentage of coffee bean prices. In fact, if you buy a latte at the airport, it's a captive audience. So it's a really expensive latte. And there's just a lot that goes into like. How much does a latte cost? And I bring it up because there's a supply chain from growing [00:30:00] coffee beans to roasting coffee beans to like, you know, you could make one at home or you could be in the airport and buy one and the margins of the company selling lattes in the airport is a lot higher than the, you know, people roasting the coffee beans and it's because you've actually solved a much more acute human problem in the airport.[00:30:19] Bret: And, and it's just worth a lot more to that person in that moment. It's kind of the way I think about technology too. 
It sounds funny to liken it to coffee beans, but you're selling tools on top of a large language model, yet in some ways your market is big, but you're probably going to, like, be price compressed just because you're sort of a piece of infrastructure, and then you have open source and all these other things competing with you naturally.[00:30:43] Bret: If you go and solve a really big business problem for somebody, that's actually like a meaningful business problem that AI facilitates, they will value it according to the value of that business problem. And so I actually feel like people should just stop. You're like, no, that's, that's [00:31:00] unfair. If you're searching for an idea, people, I, I love people trying things, even if, I mean, most of the, a lot of the greatest ideas have been things no one believed in.[00:31:07] Bret: So I like, if you're passionate about something, go do it. Like who am I to say? Yeah, a hundred percent. Or Gmail, like Paul, I mean, some of it's lore at this point, but like Gmail was Paul's own email for a long time, and then, amusingly, and Paul can correct me, I'm pretty sure he sent around a link and like the first comment was like, this is really neat. It would be great if it was not your email, but my own. I don't know if it's a true story. I'm pretty sure it's, yeah, I've read that before. So scratch your own itch. Fine. Like it depends on what your goal is. If you wanna do like a venture backed company, if it's a passion project, f*****g passion, do it, like don't listen to anybody.[00:31:41] Bret: In fact, but if you're trying to start, you know, an enduring company, solve an important business problem. And I, and I do think that in the world of agents, the software industry has shifted where you're not just helping people be more productive, but you're actually accomplishing tasks autonomously.[00:31:58] Bret: And as a consequence, I think the [00:32:00] addressable market has just greatly expanded just because software can actually do things now and actually accomplish tasks. And how much is coding autocomplete worth? A fair amount. How much is the eventual, I'm certain we'll have it, the software agent that actually writes the code and delivers it to you? That's worth a lot.[00:32:20] Bret: And so, you know, I would just maybe look up from the large language models and start thinking about the economy and, you know, think from first principles. I don't wanna get too far afield, but just think about which parts of the economy will benefit most from this intelligence and which parts can absorb it most easily.[00:32:38] Bret: And what would an agent in this space look like? Who's the customer of it? Is the technology feasible? And I would just start with these business problems more. And I think, you know, the best companies tend to have great engineers who happen to have great insight into a market. And it's that last part that I think some people,[00:32:56] Bret: whether or not they have it, it's like people start so much in the technology, they [00:33:00] lose the forest for the trees a little bit.[00:33:02] Alessio: How do you think about the model of still selling some sort of software versus selling more packaged labor?
I feel like when people are selling the package labor, it's almost more stateless, you know, like it's easier to swap out if you're just putting an input and getting an output.[00:33:16] Alessio: If you think about coding, if there's no ID, you're just putting a prompt and getting back an app. It doesn't really matter. Who generates the app, you know, you have less of a buy in versus the platform you're building, I'm sure on the backend customers have to like put on their documentation and they have, you know, different workflows that they can tie in what's kind of like the line to draw there versus like going full where you're managed customer support team as a service outsource versus.[00:33:40] Alessio: This is the Sierra platform that you can build on. What was that decision? I'll sort of[00:33:44] Bret: like decouple the question in some ways, which is when you have something that's an agent, who is the person using it and what do they want to do with it? So let's just take your coding agent for a second. I will talk about Sierra as well.[00:33:59] Bret: Who's the [00:34:00] customer of a, an agent that actually produces software? Is it a software engineering manager? Is it a software engineer? And it's there, you know, intern so to speak. I don't know. I mean, we'll figure this out over the next few years. Like what is that? And is it generating code that you then review?[00:34:16] Bret: Is it generating code with a set of unit tests that pass, what is the actual. For lack of a better word contract, like, how do you know that it did what you wanted it to do? And then I would say like the product and the pricing, the packaging model sort of emerged from that. And I don't think the world's figured out.[00:34:33] Bret: I think it'll be different for every agent. You know, in our customer base, we do what's called outcome based pricing. So essentially every time the AI agent. Solves the problem or saves a customer or whatever it might be. There's a pre negotiated rate for that. We do that. Cause it's, we think that that's sort of the correct way agents, you know, should be packaged.[00:34:53] Bret: I look back at the history of like cloud software and notably the introduction of the browser, which led to [00:35:00] software being delivered in a browser, like Salesforce to. Famously invented sort of software as a service, which is both a technical delivery model through the browser, but also a business model, which is you subscribe to it rather than pay for a perpetual license.[00:35:13] Bret: Those two things are somewhat orthogonal, but not really. If you think about the idea of software running in a browser, that's hosted. Data center that you don't own, you sort of needed to change the business model because you don't, you can't really buy a perpetual license or something otherwise like, how do you afford making changes to it?[00:35:31] Bret: So it only worked when you were buying like a new version every year or whatever. So to some degree, but then the business model shift actually changed business as we know it, because now like. Things like Adobe Photoshop. Now you subscribe to rather than purchase. 
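Bret's outcome-based pricing point a moment earlier is easy to make concrete. The sketch below is a toy illustration, not Sierra's actual billing logic; the struct, rates, and field names are all invented. The only idea it encodes is that the bill is a function of resolved outcomes rather than of seats, licenses, or tokens consumed.

```rust
// Toy illustration of outcome-based pricing vs. usage-based pricing.
// All names and rates are made up; the point is only that the bill is a
// function of resolved outcomes, not of how much software was used.

struct Conversation {
    tokens_used: u64,
    resolved_autonomously: bool, // did the agent actually complete the job?
}

fn usage_based_bill(convs: &[Conversation], price_per_1k_tokens: f64) -> f64 {
    let tokens: u64 = convs.iter().map(|c| c.tokens_used).sum();
    (tokens as f64 / 1000.0) * price_per_1k_tokens
}

fn outcome_based_bill(convs: &[Conversation], price_per_resolution: f64) -> f64 {
    let resolutions = convs.iter().filter(|c| c.resolved_autonomously).count();
    resolutions as f64 * price_per_resolution
}

fn main() {
    let convs = vec![
        Conversation { tokens_used: 12_000, resolved_autonomously: true },
        Conversation { tokens_used: 30_000, resolved_autonomously: false }, // escalated: no charge
        Conversation { tokens_used: 8_000, resolved_autonomously: true },
    ];
    println!("usage-based:   ${:.2}", usage_based_bill(&convs, 0.50));
    println!("outcome-based: ${:.2}", outcome_based_bill(&convs, 2.00));
}
```

In this toy run the escalated conversation contributes nothing to the outcome-based bill even though it burned the most tokens, which is exactly the incentive alignment Bret argues for below.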
So it ended up where you had a technical shift and a business model shift that were very logically intertwined that actually the business model shift was turned out to be as significant as the technical as the shift.[00:35:59] Bret: And I think with [00:36:00] agents, because they actually accomplish a job, I do think that it doesn't make sense to me that you'd pay for the privilege of like. Using the software like that coding agent, like if it writes really bad code, like fire it, you know, I don't know what the right metaphor is like you should pay for a job.[00:36:17] Bret: Well done in my opinion. I mean, that's how you pay your software engineers, right? And[00:36:20] swyx: and well, not really. We paid to put them on salary and give them options and they vest over time. That's fair.[00:36:26] Bret: But my point is that you don't pay them for how many characters they write, which is sort of the token based, you know, whatever, like, There's a, that famous Apple story where we're like asking for a report of how many lines of code you wrote.[00:36:40] Bret: And one of the engineers showed up with like a negative number cause he had just like done a big refactoring. There was like a big F you to management who didn't understand how software is written. You know, my sense is like the traditional usage based or seat based thing. It's just going to look really antiquated.[00:36:55] Bret: Cause it's like asking your software engineer, how many lines of code did you write today? Like who cares? Like, cause [00:37:00] absolutely no correlation. So my old view is I don't think it's be different in every category, but I do think that that is the, if an agent is doing a job, you should, I think it properly incentivizes the maker of that agent and the customer of, of your pain for the job well done.[00:37:16] Bret: It's not always perfect to measure. It's hard to measure engineering productivity, but you can, you should do something other than how many keys you typed, you know Talk about perverse incentives for AI, right? Like I can write really long functions to do the same thing, right? So broadly speaking, you know, I do think that we're going to see a change in business models of software towards outcomes.[00:37:36] Bret: And I think you'll see a change in delivery models too. And, and, you know, in our customer base you know, we empower our customers to really have their hands on the steering wheel of what the agent does they, they want and need that. But the role is different. You know, at a lot of our customers, the customer experience operations folks have renamed themselves the AI architects, which I think is really cool.[00:37:55] Bret: And, you know, it's like in the early days of the Internet, there's the role of the webmaster. [00:38:00] And I don't know whether your webmaster is not a fashionable, you know, Term, nor is it a job anymore? I just, I don't know. Will they, our tech stand the test of time? Maybe, maybe not. But I do think that again, I like, you know, because everyone listening right now is a software engineer.[00:38:14] Bret: Like what is the form factor of a coding agent? And actually I'll, I'll take a breath. Cause actually I have a bunch of pins on them. Like I wrote a blog post right before Christmas, just on the future of software development. And one of the things that's interesting is like, if you look at the way I use cursor today, as an example, it's inside of.[00:38:31] Bret: A repackaged visual studio code environment. 
I sometimes use the sort of agentic parts of it, but it's largely, you know, I've sort of gotten a good routine of making it auto complete code in the way I want through tuning it properly when it actually can write. I do wonder what like the future of development environments will look like.[00:38:55] Bret: And to your point on what is a software product, I think it's going to change a lot in [00:39:00] ways that will surprise us. But I always use, I use the metaphor in my blog post of, have you all driven around in a way, Mo around here? Yeah, everyone has. And there are these Jaguars, the really nice cars, but it's funny because it still has a steering wheel, even though there's no one sitting there and the steering wheels like turning and stuff clearly in the future.[00:39:16] Bret: If once we get to that, be more ubiquitous, like why have the steering wheel and also why have all the seats facing forward? Maybe just for car sickness. I don't know, but you could totally rearrange the car. I mean, so much of the car is oriented around the driver, so. It stands to reason to me that like, well, autonomous agents for software engineering run through visual studio code.[00:39:37] Bret: That seems a little bit silly because having a single source code file open one at a time is kind of a goofy form factor for when like the code isn't being written primarily by you, but it begs the question of what's your relationship with that agent. And I think the same is true in our industry of customer experience, which is like.[00:39:55] Bret: Who are the people managing this agent? What are the tools do they need? And they definitely need [00:40:00] tools, but it's probably pretty different than the tools we had before. It's certainly different than training a contact center team. And as software engineers, I think that I would like to see particularly like on the passion project side or research side.[00:40:14] Bret: More innovation in programming languages. I think that we're bringing the cost of writing code down to zero. So the fact that we're still writing Python with AI cracks me up just cause it's like literally was designed to be ergonomic to write, not safe to run or fast to run. I would love to see more innovation and how we verify program correctness.[00:40:37] Bret: I studied for formal verification in college a little bit and. It's not very fashionable because it's really like tedious and slow and doesn't work very well. If a lot of code is being written by a machine, you know, one of the primary values we can provide is verifying that it actually does what we intend that it does.[00:40:56] Bret: I think there should be lots of interesting things in the software development life cycle, like how [00:41:00] we think of testing and everything else, because. If you think about if we have to manually read every line of code that's coming out as machines, it will just rate limit how much the machines can do. The alternative is totally unsafe.[00:41:13] Bret: So I wouldn't want to put code in production that didn't go through proper code review and inspection. So my whole view is like, I actually think there's like an AI native I don't think the coding agents don't work well enough to do this yet, but once they do, what is sort of an AI native software development life cycle and how do you actually.[00:41:31] Bret: Enable the creators of software to produce the highest quality, most robust, fastest software and know that it's correct. And I think that's an incredible opportunity. 
I mean, how much C code can we rewrite in Rust and make it safe so that there are fewer security vulnerabilities? Can we, like, have more efficient, safer code than ever before?[00:41:53] Bret: And can you have someone who's like that guy in The Matrix, you know, like staring at the little green things, like where could you have an operator [00:42:00] of a code generating machine be, like, superhuman? I think that's a cool vision. And I think too many people are focused on, like, autocomplete, you know, right now. I'm not, I'm not even, I'm guilty as charged,[00:42:10] Bret: I guess, in some ways, but I just like, I'd like to see some bolder ideas. And that's why when you were joking, you know, talking about what's the React of whatever, I think we're clearly in a local maximum, you know, metaphor, like sort of conceptual local maximum. Obviously it's moving really fast. I think we're moving out of it.[00:42:26] Alessio: Yeah. At the end of '23, I read this blog post, from syntax to semantics. Like if you think about Python, it's taking C and making it more semantic, and LLMs are like the ultimate semantic program, right? You can just talk to them and they can generate any type of syntax from your language. But again, the languages that they have to use were made for us, not for them.[00:42:46] Alessio: But the problem is like, as long as you will ever need a human to intervene, you cannot change the language under it. You know what I mean? So I'm curious at what point of automation we'll need to get to before we're going to be okay making changes to the underlying languages, [00:43:00] like the programming languages, versus just saying, hey, you just got to write Python because I understand Python and I'm more important at the end of the day than the model.[00:43:08] Alessio: But I think that will change, but I don't know if it's like two years or five years. I think it's more nuanced actually.[00:43:13] Bret: So I think there's a, some of the more interesting programming languages bring semantics into syntax. So let me, that's a little reductive, but like Rust as an example: Rust is memory safe,[00:43:25] Bret: statically, and that was a really interesting conceptual step, but it's why it's hard to write Rust. It's why most people write Python instead of Rust. I think Rust programs are safer and faster than Python, probably slower to compile. But like broadly speaking, like given the option, if you didn't have to care about the labor that went into it,[00:43:45] Bret: you should prefer a program written in Rust over a program written in Python, just because it will run more efficiently. It's almost certainly safer, et cetera, et cetera, depending on how you define safe, but most people don't write Rust because it's kind of a pain in the ass. And [00:44:00] the audience of people who can is smaller, but it's sort of better in most, most ways.[00:44:05] Bret: And again, let's say you're making a web service and you didn't have to care about how hard it was to write. If you just got the output of the web service, the Rust one would be cheaper to operate. It's certainly cheaper and probably more correct, just because there's so much in the static analysis implied by the Rust programming language that it probably will have fewer runtime errors and things like that as well.[00:44:25] Bret: So I just give that as an example, because Rust, at least my understanding is that it came out of the Mozilla team, because there's lots of security vulnerabilities in the browser and it needs to be really fast.
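A minimal sketch of the static-analysis point, assuming nothing beyond stock rustc: the commented-out lines below are the Rust equivalent of a use-after-free, and the compiler rejects them outright, so the memory-safety question is settled by the build step rather than by a human (or an AI) reading every path. The program and its strings are invented for illustration.

```rust
// A minimal sketch of "fewer runtime errors because of static analysis".
// The commented-out lines are, in C terms, a use-after-free: rustc refuses
// to build them, so the bug never reaches runtime.

fn longest_word(text: &str) -> &str {
    // Returns a slice borrowed from `text`; the borrow checker tracks that.
    text.split_whitespace()
        .max_by_key(|w| w.len())
        .unwrap_or("")
}

fn main() {
    let owned = String::from("agents are in the jQuery era not the React era");
    let word = longest_word(&owned);
    println!("longest word: {word}"); // fine: `owned` is still alive here

    // drop(owned);
    // println!("{word}");
    // ^ uncommenting these two lines frees `owned` while `word` still borrows it;
    //   rustc rejects it with "cannot move out of `owned` because it is borrowed",
    //   so the memory-safety question is answered by the compiler, not a reviewer.
}
```

The same two lines in C would compile and fail, if at all, at runtime, which is the trade Bret is describing.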
They said, okay, we want to put more of a burden at the authorship time to have fewer issues at runtime.[00:44:43] Bret: And we need the constraint that it has to be done statically because browsers need to be really fast. My sense is, if you just think about like the, the needs of a programming language today, where the role of a software engineer is [00:45:00] to use an AI to generate functionality and audit that it does in fact work as intended, maybe functionally, maybe from like a correctness standpoint, some combination thereof, how would you create a programming system that facilitated that?[00:45:15] Bret: And, you know, I bring up Rust because I think it's a good example of, like, I think given a choice of writing in C or Rust, you should choose Rust today. I think most people would say that, even C aficionados, just because C is largely less safe for very similar, you know, trade offs, you know, for the, the system. And now with AI, it's like, okay, well, that just changes the game on writing these things.[00:45:36] Bret: And so, like, I just wonder if a combination of programming languages that are more structurally oriented towards the values that we need from an AI generated program, verifiable correctness and all of that. If it's tedious to produce for a person, that maybe doesn't matter. But one thing, like if I asked you, is this Rust program memory safe?[00:45:58] Bret: You wouldn't have to read it, you just have [00:46:00] to compile it. So that's interesting. I mean, that's like an, that's one example of a very modest form of formal verification. So I bring that up because I do think you can have AI inspect AI, you can have AI review, do AI code reviews. It would disappoint me if the best we could get was AI reviewing Python. And having scaled a few very large [00:46:21] Bret: websites that were written in Python, it's just, like, you know, expensive, and it's like, trust me, every team who's written a big web service in Python has experimented with like PyPy and all these things just to make it slightly more efficient than it naturally is. You don't really have true multi threading anyway.[00:46:36] Bret: It's just like, clearly you do it just because it's convenient to write. And I just feel like we're, I don't want to say it's insane, I just mean, I do think we're at a local maximum. And I would hope that we create a programming system, a combination of programming languages, formal verification, testing, automated code reviews, where you can use AI to generate software in a high scale way and trust it.[00:46:59] Bret: And you're [00:47:00] not limited by your ability to read it necessarily. I don't know exactly what form that would take, but I feel like that would be a pretty cool world to live in.[00:47:08] Alessio: Yeah. We had Chris Lattner on the podcast. He's doing great work with Modular. I mean, I love LLVM. Yeah. Basically merging Rust and Python.[00:47:15] Alessio: That's kind of the idea. Should be, but I'm curious is like, for them a big use case was like making it compatible with Python, same APIs so that Python developers could use it. Yeah. And so I, I wonder at what point, well, yeah.[00:47:26] Bret: At least my understanding is they're targeting the data science, yeah, machine learning crowd, which is all written in Python, so it still feels like a local maximum.[00:47:34] Bret: Yeah.[00:47:34] swyx: Yeah, exactly. I'll force you to make a prediction. You know, Python's roughly 30 years old.
In 30 years from now, is Rust going to be bigger than Python?[00:47:42] Bret: I don't know this, but just, I don't even know this is a prediction. I just am sort of like saying stuff I hope is true. I would like to see an AI native programming language and programming system, and I use language because I'm not sure language is even the right thing, but I hope in 30 years, there's an AI native way we make [00:48:00] software that is wholly uncorrelated with the current set of programming languages.[00:48:04] Bret: or not uncorrelated, but I think most programming languages today were designed to be efficiently authored by people and some have different trade offs.[00:48:15] Evolution of Programming Languages[00:48:15] Bret: You know, you have Haskell and others that were designed for abstractions for parallelism and things like that. You have programming languages like Python, which are designed to be very easily written, sort of like Perl and Python lineage, which is why data scientists use it.[00:48:31] Bret: It's it can, it has a. Interactive mode, things like that. And I love, I'm a huge Python fan. So despite all my Python trash talk, a huge Python fan wrote at least two of my three companies were exclusively written in Python and then C came out of the birth of Unix and it wasn't the first, but certainly the most prominent first step after assembly language, right?[00:48:54] Bret: Where you had higher level abstractions rather than and going beyond go to, to like abstractions, [00:49:00] like the for loop and the while loop.[00:49:01] The Future of Software Engineering[00:49:01] Bret: So I just think that if the act of writing code is no longer a meaningful human exercise, maybe it will be, I don't know. I'm just saying it sort of feels like maybe it's one of those parts of history that just will sort of like go away, but there's still the role of this offer engineer, like the person actually building the system.[00:49:20] Bret: Right. And. What does a programming system for that form factor look like?[00:49:25] React and Front-End Development[00:49:25] Bret: And I, I just have a, I hope to be just like I mentioned, I remember I was at Facebook in the very early days when, when, what is now react was being created. And I remember when the, it was like released open source I had left by that time and I was just like, this is so f*****g cool.[00:49:42] Bret: Like, you know, to basically model your app independent of the data flowing through it, just made everything easier. And then now. You know, I can create, like there's a lot of the front end software gym play is like a little chaotic for me, to be honest with you. It is like, it's sort of like [00:50:00] abstraction soup right now for me, but like some of those core ideas felt really ergonomic.[00:50:04] Bret: I just wanna, I'm just looking forward to the day when someone comes up with a programming system that feels both really like an aha moment, but completely foreign to me at the same time. Because they created it with sort of like from first principles recognizing that like. Authoring code in an editor is maybe not like the primary like reason why a programming system exists anymore.[00:50:26] Bret: And I think that's like, that would be a very exciting day for me.[00:50:28] The Role of AI in Programming[00:50:28] swyx: Yeah, I would say like the various versions of this discussion have happened at the end of the day, you still need to precisely communicate what you want. 
As a manager of people, as someone who has done many, many legal contracts, you know how hard that is.[00:50:42] swyx: And then now we have to talk to machines doing that and AIs interpreting what we mean and reading our minds effectively. I don't know how to get across that barrier of translating human intent to instructions. And yes, it can be more declarative, but I don't know if it'll ever Crossover from being [00:51:00] a programming language to something more than that.[00:51:02] Bret: I agree with you. And I actually do think if you look at like a legal contract, you know, the imprecision of the English language, it's like a flaw in the system. How many[00:51:12] swyx: holes there are.[00:51:13] Bret: And I do think that when you're making a mission critical software system, I don't think it should be English language prompts.[00:51:19] Bret: I think that is silly because you want the precision of a a programming language. My point was less about that and more about if the actual act of authoring it, like if you.[00:51:32] Formal Verification in Software[00:51:32] Bret: I'll think of some embedded systems do use formal verification. I know it's very common in like security protocols now so that you can, because the importance of correctness is so great.[00:51:41] Bret: My intellectual exercise is like, why not do that for all software? I mean, probably that's silly just literally to do what we literally do for. These low level security protocols, but the only reason we don't is because it's hard and tedious and hard and tedious are no longer factors. So, like, if I could, I mean, [00:52:00] just think of, like, the silliest app on your phone right now, the idea that that app should be, like, formally verified for its correctness feels laughable right now because, like, God, why would you spend the time on it?[00:52:10] Bret: But if it's zero costs, like, yeah, I guess so. I mean, it never crashed. That's probably good. You know, why not? I just want to, like, set our bars really high. Like. We should make, software has been amazing. Like there's a Mark Andreessen blog post, software is eating the world. And you know, our whole life is, is mediated digitally.[00:52:26] Bret: And that's just increasing with AI. And now we'll have our personal agents talking to the agents on the CRO platform and it's agents all the way down, you know, our core infrastructure is running on these digital systems. We now have like, and we've had a shortage of software developers for my entire life.[00:52:45] Bret: And as a consequence, you know if you look, remember like health care, got healthcare. gov that fiasco security vulnerabilities leading to state actors getting access to critical infrastructure. I'm like. We now have like created this like amazing system that can [00:53:00] like, we can fix this, you know, and I, I just want to, I'm both excited about the productivity gains in the economy, but I just think as software engineers, we should be bolder.[00:53:08] Bret: Like we should have aspirations to fix these systems so that like in general, as you said, as precise as we want to be in the specification of the system. We can make it work correctly now, and I'm being a little bit hand wavy, and I think we need some systems. I think that's where we should set the bar, especially when so much of our life depends on this critical digital infrastructure.[00:53:28] Bret: So I'm I'm just like super optimistic about it. 
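One very modest version of the "verification at near-zero cost" idea Bret is gesturing at is pushing invariants into types, sometimes called the newtype pattern. The example below is hypothetical, not Sierra's or anyone's production code; the ValidatedOrderId name and the ten-digit rule are invented. What it shows is the shape: the check is written once, and the compiler then proves at every call site that only validated values can reach the sensitive function.

```rust
// A tiny, hypothetical example of verification you get for free at compile time:
// encode the invariant "this order ID has been validated" in a type, so every
// function that demands a ValidatedOrderId is statically guaranteed to get one.

#[derive(Debug, Clone)]
pub struct ValidatedOrderId(String);

impl ValidatedOrderId {
    /// The only way to obtain a ValidatedOrderId is to pass validation here.
    pub fn parse(raw: &str) -> Result<Self, String> {
        let trimmed = raw.trim();
        if trimmed.len() == 10 && trimmed.chars().all(|c| c.is_ascii_digit()) {
            Ok(ValidatedOrderId(trimmed.to_string()))
        } else {
            Err(format!("invalid order id: {raw:?}"))
        }
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}

// This function cannot be handed an unvalidated string; the type system has
// already "proved" the invariant, so no re-checking or re-reading is needed.
fn issue_refund(order: &ValidatedOrderId) {
    println!("refunding order {}", order.as_str());
}

fn main() {
    match ValidatedOrderId::parse(" 1234567890 ") {
        Ok(order) => issue_refund(&order),
        Err(e) => eprintln!("{e}"),
    }
    // issue_refund("1234567890"); // does not compile: expected &ValidatedOrderId
}
```

It is nowhere near full formal verification, but it is the kind of guarantee that costs nothing at runtime and nothing to re-review, which is the bar being set here.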
But actually, let's go to what you said for a second, which is correct.[00:53:33] The Importance of Specifications[00:53:33] Bret: Specifications. I think this is the most interesting part of AI agents broadly, which is that most specifications are incomplete. So let's go back to our product engineering discussions.[00:53:45] Bret: You're like, okay, here's a PRD, product requirements document, and there's, it's really detailed mockups, and this, like, when you click this button, it does this. And it's like 100 percent you can think of a missing requirement in that [00:54:00] document. Let's say you click this button and the internet goes out, what do you do?[00:54:04] Bret: I don't know if that's in the PRD. It probably isn't, you know, there's, there's always going to be something, because, like, humans are complicated. Right. So what ends up happening is like, I don't know if you can measure it, like what percentage of a product's actual functionality is determined by its code versus the specification? Like for a traditional product, oh, 95%.[00:54:24] Bret: I mean, a little bit, but a lot of it. So, like, code is the specification.[00:54:29] Open Source and Implicit Standards[00:54:29] Bret: It's actually why, if you just look at the history of technology, why open source has won out over specifications. Like, you know, for a long time, there was a W3C working group on the HTML specification, and then, you know, once WebKit became prevalent,[00:54:46] Bret: the internet evolved a lot faster, and it's no offense to the standards organizations. It just turns out having a committee of people argue is like a lot less efficient than someone checking in code, and then all of a sudden you had vector graphics and you had like [00:55:00] all this really cool stuff that, you know, someone who, in the Google Maps days, would be like, God, that would have made my life easier.[00:55:05] Bret: You know, it's like, SVG support, life would have been a breeze. Try drawing a driving directions line without vector graphics. And so, you know, in general, I think we've gone from these protocols defined in a document to basically open source code that becomes an implicit standard, like system calls in Linux.[00:55:26] Bret: There is a specification. There is POSIX as a standard, but, like, the kernel is, like, that's what people write against, and it's both the documented behavior and all of the undocumented behaviors as well, for better or for worse. And it's why, you know, Linus and others are so adamant about things like binary compatibility and all that, like this stuff matters.[00:55:48] Bret: So one of the things that I really think about, like, working with agents broadly is how do you, it's, I don't want to say it's easy to specify the guardrails, you know, [00:56:00] but what about all those unspecified behaviors? So so much of, like, being a software engineer is like, you come to the point where you're like, the internet's out and you get back the error code from the call and you got to do something with it.[00:56:12] Bret: And you know, what percent of the time do you just be like, yeah, I'm going to do this because it seems reasonable? And what percentage of time do you, like, write a Slack to your PM and be like, what do I do in this case? It's probably more the former than the latter. Otherwise it'd be really fricking inefficient to write software.
You didn't say anything about that case. The AI agent, the word agent comes from the word agency, right? So it's demonstrating its agency and it's making a decision. Does it document it? That would probably be tedious to like, because there's so many implicit decisions.[00:56:44] Bret: What happens when you click the button and the internet's out? It does something you don't like. How do you fix it? I actually think that we are like entering this new world where like the, how we express to an AI agent, what we want [00:57:00] is always going to be an incomplete specification, and that's why agents are useful because they can fill in the gaps with some decent amount of reasoning.[00:57:07] Bret: How you actually tune these over time. And imagine like building an app with an AI agent as your software engineering companion, there's like an infinitely long tail. Infinite is probably over exaggerating a bit, but there's a fairly long tail of functionality that I guarantee is not specified how you actually tune that.[00:57:25] Bret: And this is what I mean about creating a programming system. I don't think we know what that system is yet. And then similarly, I actually think for every single agentic domain, whether it's customer service or legal or software engineering, that's essentially what the company building those agents is building is like the system through which you express the behaviors you want, esoteric and small as it might be anyway, I think that's a really exciting area though, just because I think that's where the magic or that's where the product insights will be in the space is like, how do you encounter that those moments?[00:57:56] Bret: It's kind of built into the UX[00:57:58] swyx: and it can't just be, [00:58:00] the answer can't just be prompt better, you know? No, no, it's impossible.[00:58:04] Bret: The prompt would be too long. Like, imagine getting a PRD that literally specified the behavior of everything that was represented by code. The answer would just be code. Like at that point.[00:58:14] Bret: So here's my point, like prompts are great, but it's not actually a complete specification for anything. It never can be. And so, and I think that's. How you do interactivity, like the sort of human in a loop thing, when and how you do it. And that's why I really believe in, in domain specific agents, because I think answering that in the abstract is like a interesting intellectual exercise.[00:58:39] Bret: But I, that's why I like talking about agents in the abstract kind of, I'm actively disinterested in it because I don't think it actually means anything. All it means is software is making decisions. That's what, you know, at least in a reductive way. But in the context of software engineering, it does make sense.[00:58:53] Bret: Cause you know, like what is the process of first you specify what you want in a product, then you use it, then you give [00:59:00] feedback. You can imagine building a product that actually facilitated that closed loop system. And then how is that represented that complete specification of both what you knew you wanted, what you discovered through usage, the union of all of that is what you care about, and the rest is less to the AI.[00:59:16] Bret: In the legal context, I'm certain there's a way to know, like, when should the AI ask questions? When shouldn't it? How do you actually intervene when it's wrong? 
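One way to picture the unspecified-behavior problem is to make the gap itself a first-class value. The sketch below is hypothetical and is not Sierra's platform; the situations, actions, and defaults are invented. When the operator's rules cover a situation the agent follows them, and when they don't, it falls back to a conservative default and records that the decision was inferred, so a reviewer can promote it into an explicit rule later.

```rust
// A hypothetical sketch of handling unspecified behavior in an agent:
// record whether each decision came from an explicit rule or was inferred,
// so the gaps in the spec become visible and reviewable over time.

use std::collections::HashMap;

#[derive(Debug, Clone)]
enum Action {
    Retry { attempts: u32 },
    EscalateToHuman,
    ApologizeAndEndSession,
}

#[derive(Debug)]
struct Decision {
    situation: String,
    action: Action,
    specified: bool, // false = the agent filled a gap in the spec
}

struct Policy {
    rules: HashMap<String, Action>, // behaviors the operator has explicitly specified
}

impl Policy {
    fn decide(&self, situation: &str) -> Decision {
        match self.rules.get(situation) {
            Some(action) => Decision {
                situation: situation.to_string(),
                action: action.clone(),
                specified: true,
            },
            // The spec is silent: fall back to a conservative default and flag
            // it for review, so "unspecified" becomes "specified" next time.
            None => Decision {
                situation: situation.to_string(),
                action: Action::EscalateToHuman,
                specified: false,
            },
        }
    }
}

fn main() {
    let mut rules = HashMap::new();
    rules.insert("payment_declined".to_string(), Action::Retry { attempts: 1 });

    let policy = Policy { rules };
    for situation in ["payment_declined", "network_down_mid_refund"] {
        let d = policy.decide(situation);
        println!("{situation}: {:?} (specified: {})", d.action, d.specified);
    }
}
```

The interesting product question, which Bret turns to next, is where and how that review loop surfaces to the people operating the agent.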
And certainly in the customer service case, it's very clear, you know, and how, like how we, our customers review every conversation, how we help them find the conversations they should review when they're having millions, so they can find the few that are interesting, how when something is wrong in one of those conversations, how they can give feedback.[00:59:42] Bret: So it's fixed the next time, in a way where we know the context of why it made that decision. But it's not up to us what's right, right? It's up to our customers. So that's why I actually think, right, you know, right now when you think about building an agent in a domain, to some degree, how you actually interact with the [01:00:00] people specifying behavior is actually where a lot of the magic is.[01:00:03] swyx: Stop me if this is a little bit annoying to you, but I have a bit of trouble squaring domain specific agents with the belief that AGI is real, or AGI is coming, because the point is general intelligence. And some part, some way, one way to view the bitter lesson is we can always make progress on being more domain specific. Take whatever SOTA is, and you make progress being more domain specific, and then you will be wiped out when the next advance happens. Clearly, you don't believe in that, but how do you personally square those things?[01:00:34] Bret: Yeah, it's a really heavy question.[01:00:36] The Impact of AGI on Industries[01:00:36] Bret: And you know, I think a lot about AGI given my role at OpenAI, but it's even hard for me to really conceptualize. And I love spending time with OpenAI researchers and actually just like people in the community broadly, just talking about the implications, because there's the first order effects of something that is super intelligent in some domains. And then there's the second and third order effects, which are harder to predict. So first, I think that [01:01:00] it seems likely to me that, you know, at first, something that is AGI will be good in digital domains. You know, because it's software. So if you think about something like AI discovering a new, say, like, pharmaceutical therapy, the barrier to that is probably less the discovery than the clinical trial.[01:01:23] Bret: And, and AI doesn't necessarily help with a clinical trial, right? That's a process that's independent of intelligence and it's, it's a physical process. Similarly, if you think about the problem of climate change or like carbon removal, there's probably a lot of that domain that requires great ideas, but like whatever great idea you came up with, if you wanted to sequester that much carbon, there's probably a big physical component to that.[01:01:47] Bret: So it's not really limited by intelligence. It might be, I'm sure it could be accelerated somewhat by intelligence. There's a really interesting conversation with an economist named Tyler Cowen. And recently, I just watched a video [01:02:00] of him, and he was just talking about how there's parts of the economy where intelligence is sort of the limited resource, that will take on AI slash AGI really rapidly and will drive incredible productivity gains.[01:02:13] Bret: But there are other parts of the economy that aren't, and those will interact.
It goes back to these complex second order effects, like prices will go up in the domains that can absorb intelligence rapidly, which will actually then slow down, you know, so it's going to, I don't think it'll be evenly spread. I don't think it would be perhaps as rapidly felt in all parts of the economy as people think. I might be wrong, but I just think it can generalize in terms of its ability to reason about different domains, which I think is what AGI means to most people, but it may not actually generalize in the world, because intelligence is not the limiting factor in a lot of the economy.[01:02:54] Bret: So going back to your, your more practical question, it's like, why make software at all if, you know, AGI is coming, to [01:03:00] say it that way. Should we learn to[01:03:01] swyx: code?[01:03:01] Bret: There's all variations of this. You know, my view is that I really do view AI as a tool and AGI as a tool for humanity. And so my view is, when we were talking about, like,[01:03:14] Bret: is your job as a maker of software to author code in an editor? I would argue no. Just like a generation ago, your job wasn't to punch holes in a punch card. That is not what your job is. Your job is to produce digital something, whatever it is. What is the purpose of the software that you're making?[01:03:34] Bret: Your job is to produce that. And so I think that, like, our jobs will change rapidly and meaningfully, but I think the idea that, like, our job is to type in an editor is, is an artifact of the tools that we have, not actually what we're hired to do, which is to produce a digital experience, to, you know, make firmware for a toaster or whatever, whatever it is we're [01:04:00] doing.[01:04:00] Bret: Right. Like that's our job. Right. And as a consequence, I think with things like AGI, I think certainly software engineering will be one of the disciplines most impacted. And I think that it's very, so like, I think if you're in this industry and you define yourself by the tools that you use, like how many characters you can type into them every day, that's probably not like a long term stable place to be, because that's something that certainly AI can do better than you.[01:04:33] Bret: But your judgment about what to build and how to build it still applies. And that will always be true. And one way to think about it, it's a little bit reductive, is like, you know, look at startups versus larger companies. Like companies like Google and Amazon have so many more engineers than a startup, but then some startups still win.[01:04:51] Bret: Like, why was that? Well, they made better decisions, right? They didn't type faster or produce more code. They did the right thing in the right market at the right time. [01:05:00] And, and similarly, if you look at some of the great companies, it wasn't that they had some unique idea. Sometimes that's a reason why a company succeeds, but it's often a lot of other things and a lot of other forms of execution.[01:05:12] Bret: So like broadly, like the existence of a lot of intelligence will change a lot and it'll change our jobs more than any other industry, or maybe not, maybe it's exaggerated, but certainly as much as any other industry. But I don't think it, like, changes why the economy around digital technology exists.[01:05:29] Bret: And as a consequence, I think I'm really bullish on like the future of, of the software industry.
I just think that, like, some things that are really expensive today will become almost free. But I think that, I mean, let's be honest, the half life of technology companies is not particularly long as it is.[01:05:46] Bret: Yeah, I, I brought this anecdote in a recent conversation, but when I started at Google, we were in one building in Mountain View, and then eventually moved into a campus, which was previously the Silicon Graphics campus. That was Google's first campus, and I'm pretty sure it [01:06:00] still has that campus. I think it's got a billion now. SGI was a company that was like really, really big, big enough to have a campus, and then went out of business. And it wasn't that old of a company, by the way, it's not like IBM, you know, it was like, big enough to get a campus and go out of business in my lifetime, you know, that type of thing. And then at Facebook, we had an office in Palo Alto. I moved, I didn't go into the original office when I joined. It was the second office, this old HP building near Stanford. And then we got big enough to want a campus, and we bought Sun Microsystems' campus. Sun Microsystems famously came out of Stanford, went high flying, was one of the dot com darlings, and then eventually sort of like bought for pennies on the dollar by Oracle.[01:06:39] Bret: And you know, like all those companies, like in my lifetime, were big enough to like go public, have a campus, and then go out of business. So I think a lot will change. I don't mean to say this is going to be easy or like no one's business model is under threat. But will digital technology remain important?[01:06:56] Bret: Will entrepreneurs having good judgment about where to [01:07:00] apply this technology to create something of economic value still apply? Like a hundred percent. And I've always used the metaphor, like if you went back to 1980 and described many of the jobs that we have, it would be hard for people to conceptualize. Like, imagine: I'm a podcaster. You're like, what the hell does that mean? Imagine going back to like 1776 and describing to Ben Franklin our economy today, like let alone the technology industry, just the services economy. It would be probably hard for him to conceptualize just like who grows the food, just because the idea that so few people in this country are necessary to produce the food for so many people would defy so much of his conception of just like how food is grown, that it would just be like, it would probably take a couple hours of explaining. It's kind of like the same thing. It's like we, we have a view of like how this world works right now that's based on just the constraints that exist, but there's gonna be a lot of other opportunities and other things like that. So I don't know. I mean, certainly [01:08:00] writing code is really valuable right now, and it probably will change rapidly. I think people just need a lot of agility. I always use the metaphor where, like, a bunch of accountants, and Microsoft Excel was just invented.
Are you going to be the first person who sets down your HP calculator and says, I'm going to learn how to use this tool because it's just a better way of doing what I'm already doing?[01:08:19] Bret: Or are you going to be the one begrudgingly pulling out their slide rule and HP calculator and saying, these kids these days with their Excel, they don't understand. It's a little bit reductive, but I just feel like probably the best thing all of us can do, not just in the software industry, but I do think it's a really[01:08:38] Bret: interesting reflection that we're disrupting our own industry as much as anything else with this technology, is to lean into the change, try the tools, install the latest coding assistants, you know, when o3-mini comes out, write some code with it. You don't want to be the last accountant to embrace Excel.[01:08:57] Bret: You might not have your job anymore, so.[01:08:59] swyx: [01:09:00] We have some personal questions on how you keep up with AI and all the other stuff. But I also want to, and I'll let you get to your question, I just wanted to say that the analogy that you made on food was really interesting and resonated with me.[01:09:12] swyx: I feel like we are kind of in an agrarian economy, like a barter economy for intelligence, and now we're sort of industrializing intelligence. And that really just was an aha[01:09:21] Alessio: moment for me. I just wanted to reflect that. Yeah. How do you think about the person being replaced by an agent, and how agents talk to each other?[01:09:29] Alessio: So even at Sierra today, right, you're building agents that people talk to, but in the future you're going to have agents that are going to complain about the order they placed to the customer support agents, all the way down. Exactly. And you know, you were the CTO of Facebook, you built Open Graph there.[01:09:44] Alessio: And I think there were a lot of pros, things that were being enabled, and then maybe a lot of cons that came out of that. How do you think about how the agent protocols should be built, thinking about all the implications of it, you know, privacy, data, discoverability and all that?[01:09:57] Bret: Yeah, I think it's a little early for a [01:10:00] protocol to emerge.[01:10:00] Bret: I've read about a few of the attempts and maybe some of them will catch on. One of the things that's really interesting about large language models is that because they're trained on language, they are very capable of using the interfaces built for us. And so my intuition right now is that because we can make an interface that works for us and also works for the AI, maybe that's good enough.[01:10:23] Bret: You know, I'm being a little bit hand-wavy here, but making a machine protocol for agents that's inaccessible to people, there are some upsides to it, but there's also quite a bit of downside to it as well. I think it was Andrej Karpathy, but I can't remember, but one of the more well-known AI researchers wrote something like, I spend half my day writing English, you know, in my software engineering. I have an intuition that agents will speak to agents using language for a while.[01:10:53] Bret: I don't know if that's true. But there are a lot of reasons why that may be true. And so, you know, [01:11:00] when your personal agent speaks to a Sierra agent to help figure out why your Sonos speaker has the flashing orange light,
my intuition is it will be in English for a while. And I think there are a lot of, like, benefits to that.[01:11:13] Bret: I do think that we are still in the early days of, like, long-running agents. I don't know if you tried the deep research agent that just came out,[01:11:22] swyx: we have one for you. Oh, that's great.[01:11:25] Bret: It was interesting because it was probably the first time I really got notified by OpenAI when something was done, and, as I brought up before, the interactive parts of it.[01:11:34] Bret: That's the area that I'm most interested in right now. It just is, like, most agentic workflows are relatively short-running, and the workflows that are multi-stakeholder, long-running and multi-system, we deal with a lot of those at Sierra. But broadly speaking, I think that those are interesting just because I always use the metaphor that prior to the mobile phone, every time you got [01:12:00] a notification from some internet service, you got an email, not because email was the best way to notify you, but because it was the only way.[01:12:08] Bret: And so, you know, you used to get tagged in a photo on Facebook and you'd get an email about it. Then once this was in everyone's pocket, every app had equal access to buzzing your pocket. And now, you know, for most of the apps I use, I don't get email notifications. I just get it directly from the app.[01:12:25] Bret: I sort of wonder what the form factors will be for agents. How do you address and reach out to other agents? And then how does it bring you, the operator of the agent, into the loop at the right time? You know, I certainly think there are companies like, you know, with ChatGPT, that will be one of the major consumer surfaces.[01:12:42] Bret: So there's a lot of gravity to those services. But then if I think about sort of domain-specific workflows as well, I think there's just a lot to figure out there. So I'm less focused on the agent-to-agent protocols. I actually think, and I could be wrong, I just haven't thought about it a lot, it's sort of interesting, but actually just how it engages with all [01:13:00] the people in it is one of the things I'm most interested to see how it plays out as well.[01:13:04] Alessio: Yeah. I think to me, the thing that's at the core of it is kind of like RBAC, you know, it's like, can this agent access this thing? I think in the customer support use cases it's maybe less prominent, but in the enterprise it's more interesting. And also language, like you can compress the language.[01:13:20] Alessio: If the human didn't have to read it, you can kind of save tokens, make things faster. So yeah, you mentioned being notified about deep research. Is there an "OpenAI deep research has been achieved internally" notification that goes out to everybody, and the board gets summoned and you get to see it? Can you give any backstory on that process?[01:13:40] Bret: OpenAI is a mission-driven nonprofit that I think of primarily as a research lab. It's obviously more than that, you know, in some ways ChatGPT is a culture-defining product. But at the end of the day, the mission is to ensure that artificial general intelligence benefits all of humanity. So a lot [01:14:00] of our board discussions are about[01:14:02] Bret: research and its implications for humanity, which is primarily safety.
Obviously, I think one cannot achieve AGI and not think about safety as a primary responsibility for that mission, but it's also access and other things. So things like deep research we definitely talk about, because it's a big part of, if you think about it, what it means to build AGI. But we talk about a lot of different things, you know, so sometimes we hear about things super early.[01:14:26] Bret: Sometimes, if it's not really related, if it's sort of far afield from the core of the mission, you know, it's more casual. So it's pretty fun to be a part of that, just because my favorite part of every board discussion is hearing from the researchers about how they're thinking about the future and the next milestone in creating AGI.[01:14:44] swyx: Well, lots of milestones. Maybe we'll just start at the beginning. Like, you know, there are very few people that have been in the rooms that you've been in. How do these conversations start? How did you get brought into OpenAI? Obviously there's a bit of drama that you can go into if you want.[01:14:56] swyx: Just take us into the room. Like what happens? What is it [01:15:00] like?[01:15:00] Bret: It was a Thursday or Friday when Sam was fired. Yeah. So I heard about it like everyone else, you know, just saw it on social media. And I remember[01:15:12] swyx: where I was walking here and I was[01:15:14] Bret: totally shocked and messaged my co-founder Clay.[01:15:17] Bret: And I was like, gosh, I wonder what happened. And then on Saturday, and I'm trying to protect people's privacy on this, but I ended up talking to both Adam D'Angelo and Sam Altman and basically getting a synopsis of what was going on. And my understanding, and you'd have to ask them for their perspective on this, was basically that both the board and Sam felt some trust in me.[01:15:44] Bret: And it was a very complicated situation, because the company reacted pretty negatively, understandably, to Sam being fired. I don't think they really understood what was going on. And so the board was, you know, in a situation where they needed to figure [01:16:00] out a path forward, and they reached out to me, and then I talked to Sam, and I basically ended up kind of the mediator, for lack of a better word, not really formally that, but fundamentally that.[01:16:10] Bret: And as the board was trying to figure out a path forward, we ended up with a lot of discussions about how to reinstate Sam as CEO of the company, but also do a review of what happened so that the board's concerns could be fully adjudicated, you know, because they obviously did have concerns going into it.[01:16:29] Bret: So it ended up there. So I think, broadly speaking, a lot of the stakeholders in it knew of me, and I'd like to think I have some integrity, so it was just sort of, you know, they were trying to find a way out of a very complex situation. So I ended up kind of mediating that, and I've formed[01:16:48] Bret: a really great relationship with Sam and Greg through a pretty challenging time for the company. I didn't plan to be, you know, on the board. I got pulled in because of the crisis that happened. [01:17:00] And I don't think I'll be on the board forever either. I posted when I joined that I was going to do it temporarily.[01:17:05] Bret: That was like a year ago.
You know, I really like to focus on Sierra, but I also really care about it, it's just an amazing mission. So[01:17:15] Navigating High-Stakes Situations[01:17:15] swyx: I've maybe been in high-stakes situations like that, like, twice, but obviously not as high-stakes. What principles do you have when, you know, this is the highest egos, highest amount of stakes possible, highest amount of money, whatever?[01:17:31] swyx: What principles do you have going into something like this? Like, obviously you have a great reputation, you have a great network. What are your must-dos and what are your must-not-dos?[01:17:39] Bret: I'm not sure there's a playbook. If there were a playbook for these situations, it'd be a lot simpler. You know, I just probably go back to the way I operate in general.[01:17:49] Bret: One is first-principles thinking. So I do think that there are crisis playbooks, but there was nothing quite like this, and you really need to [01:18:00] understand what's going on and why. I think a lot of moments of crisis are fundamentally human problems. You can strategize about people's incentives and this and that and the other thing, but I think it's really important to understand all the people involved and what motivates them and why, which is fundamentally an exercise in empathy,[01:18:18] Bret: actually. Like, do you really understand why people are doing what they're doing? And then getting good advice, you know. What's interesting about a high-profile crisis is everyone wants to give you advice, so there's no shortage of advice, but the good advice is the advice that really involves judgment, which is: who, based on a first-principles analysis of the situation, based on your assessment[01:18:41] Bret: of all the people involved, would have true expertise and good judgment in these situations, so that you can either validate your judgment if you have an intuition, or, if it's an area of, say, legal expertise where you're not the expert, [01:19:00] you want the best in the world to give you advice.[01:19:02] Bret: And I actually find people often seek out the wrong people for advice, and that's really important in those circumstances.[01:19:08] swyx: Well, I mean, it was super well navigated. I've got one more and then we can sort of move on from this topic. The Microsoft offer was real, right? For Sam and team to move over at one point in that weekend?[01:19:19] Bret: I'm not sure. I was sort of in it from one vantage point, and actually, it's interesting, I didn't really have particular skin in the game. Coming into this, I still don't own any equity in OpenAI. I was just a meaningful bystander in the process. And the reason I got involved, and I will get to your question, but the reason I got involved was just because I cared about OpenAI.[01:19:44] Bret: So, you know, I had left my job at Salesforce, and by coincidence, the next month ChatGPT comes out, and, you know, I got nerd-sniped like everyone else. I'm like, I want to spend my life on this. This is so amazing.
And I don't know, I'm not [01:20:00] sure I would have started another company if not for OpenAI kind of inspiring the world with ChatGPT. Maybe I would have, I don't know, but it had a very significant impact on all of us, I think.[01:20:11] Bret: So the idea that it would dissolve in a weekend just bothered me a lot. And I'm very grateful for OpenAI's existence. And my guess is that's probably shared by a lot of the competing research labs to different degrees too. It's like that rising tide lifted all boats.[01:20:27] Bret: I think it created the proverbial iPhone moment for AI and changed the world. So there were lots of interests: Microsoft is an investor in OpenAI and has a vested interest in it. Sam and Greg had their interests. The employees had their interests. And there was lots of wheeling and dealing.[01:20:49] Bret: And, you know, you can't A/B test decision making. So I don't know what would have happened if things had fallen apart. I don't actually know. And you also don't know what's real, what's not. I [01:21:00] mean, you'd have to talk to them to know if it was really real. So.[01:21:03] swyx: Mentioning advisors, I heard it seems like Brian Armstrong was[01:21:07] swyx: a surprisingly strong advisor during the whole journey, which is[01:21:10] Bret: My understanding was both Brian Armstrong and Ron Conway were really close to Sam through it. And I ended up talking to him, but I also tried to talk a lot to the board, you know, trying to be the mediator. Everyone obviously had a position on it.[01:21:25] Bret: And I felt that, from the outside looking in, I just really wanted to understand why this happened. And the process seemed, well, you know, to say the least. But I was trying to remain sort of dispassionate, because one of the principles was, if you want to put Humpty Dumpty back together again, you can't be a single-issue voter, right?[01:21:45] Bret: It was a pretty sensitive moment. But yeah, I think Brian's one of the great entrepreneurs and a true friend and ally to Sam through that. He's[01:21:55] swyx: been through a lot as well. The reason I bring up Microsoft is because, [01:22:00] I mean, obviously a huge backer.[01:22:01] swyx: We actually talked to David Luan, who pitched, I think it was Satya at the time, on the first billion-dollar investment in OpenAI. The understanding I had was that the best situation for Microsoft was OpenAI as-is; second best was Microsoft acqui-hires Sam and Greg and whoever else.[01:22:19] swyx: And that was the relationship at the time: super close, exclusive relationship and all that. I think now things have evolved a little bit, and, you know, with the evolution of Stargate, there's some uncertainty or FUD about the relationship between Microsoft and OpenAI. And I just wanted to kind of bring that up.[01:22:38] swyx: Because, one, we're fortunate to have Satya as a subscriber to Latent Space, and we're working on an interview with him. And we're trying to figure out
how this has evolved now. Like, how would you characterize the relationship between Microsoft and OpenAI?[01:22:52] Bret: Microsoft is, you know, the most important partner of OpenAI, so we have a really deep relationship with them on many [01:23:00] fronts.[01:23:00] Bret: I think it's always evolving, just because the scale of this market is evolving, and in particular the capital requirements for infrastructure are well beyond what anyone would have predicted two years ago, let alone whenever the Microsoft relationship started. What was that, six years ago? I should know off the top of my head, but it was a long time ago, and in the world of AI an even longer time ago.[01:23:24] Bret: I don't really think there's anything to share. I mean, I think the relationship's evolved because the market's evolved, but the core tenets of the partnership have remained the same. And it's, you know, by far OpenAI's most important partner.[01:23:36] swyx: Just double-clicking a little bit more: obviously a lot of our listeners care a lot about the priorities of OpenAI.[01:23:43] swyx: I've had it phrased to me that OpenAI had sort of five top-level priorities, like always have frontier models, always be on the frontier of efficiency as well, be the first in multimodality, whether it's video generation or real-time voice, anything like that. How would you characterize the top priorities of [01:24:00] OpenAI?[01:24:00] swyx: Apart from just the highest-level AGI thing.[01:24:02] Bret: I always come back to the highest-level AGI thing, as you put it. It is a mission-driven organization, and I think a lot of companies talk about their mission, but at OpenAI the mission literally defines everything that we do. And I think that is important to understand if you're trying to[01:24:20] Bret: predict where OpenAI is going to go, because if it doesn't serve the mission, it's very unlikely that it will be a priority for OpenAI. You know, it's a big organization, so occasionally you might have side projects, and you're like, you know what, I'm not sure that's going to really serve the mission as much as we thought, let's not do it anymore.[01:24:36] Bret: But at the end of the day, people work at OpenAI because they believe in the benefits that AGI can have for humanity. Some people are there because they want to build it, and the actual act of building is incredibly intellectually rewarding. Some people are there because they want to ensure that AGI is safe.[01:24:55] Bret: I think we have the best AGI safety team in the world. And there are just [01:25:00] so many interesting research problems to tackle there as these models become increasingly capable, as they have access to the internet, access to tools. It's just really interesting stuff, but everyone is there because they're interested in the mission.[01:25:13] Bret: And as a consequence, if you look at something like deep research through that lens, it's pretty logical, right? Of course, if you're going to think about what it means to create AGI, enabling AI to help further the cause of research is meaningful. You can see why a lot of the AGI labs are working on[01:25:34] Bret: software engineering and code generation, because that seems pretty useful if you're trying to make AGI, right? Just because a huge part of doing it is code.
Similarly, as you look at tool use and agents, they're right down the middle of what you need to do AGI; that is the part of the company.[01:25:51] Bret: I don't think there is like a top, I mean, sure, there's maybe an operational top-ten list, but it is fundamentally about building AGI and [01:26:00] ensuring AGI benefits all of humanity. That's all we exist for. And the rest of it is, like, not a distraction necessarily, but that's the only reason the organization exists.[01:26:09] Bret: The thing that I think is remarkable is if I had described that mission to the two of you four years ago, like, you know, one of the interesting things is, how do you think society would use AI? We'd probably think of maybe industrial applications, robots, all these other things. I think ChatGPT has been the most[01:26:26] Bret: delightful, and it doesn't feel counterintuitive now, but counterintuitive, way to serve that mission, because the idea that you can go to chatgpt.com and access the most advanced intelligence in the world, and there's a free tier, is pretty amazing. So actually one of the neat things, I think, is that ChatGPT, you know, famously was a research preview that turned into this brand, you know, an industry-defining brand.[01:26:54] Bret: I think it is one of the more key parts of the mission in a lot of ways, because it is the [01:27:00] way many people will use this intelligence in their everyday lives. It's not limited to the few. It's not limited to, you know, a form factor that's inaccessible. So it's been really neat to see how much that has led to. There are lots of different contours of the mission of AGI, but benefiting humanity means everyone can use it.[01:27:21] Bret: And so, to your point on whether cost is important: oh yeah, cost is really important. How can we have all of humanity access AI if it's incredibly expensive and you need the $200 subscription, which I pay for, because I think, you know, o1 pro mode is mind-blowing, you know. But you want both, because you need the advanced research,[01:27:41] Bret: and you also want everyone in the world to benefit. So that's the way, I mean, if you're trying to predict where we're going to go, just think: what would I do if I were running a company to, you know, go build AGI and ensure it benefits humanity? That's how we prioritize everything.[01:27:57] Alessio: I know we're going to wrap up soon.[01:27:58] Alessio: I would love to ask some personal [01:28:00] questions. One, what have maybe been the guiding principles for you in choosing what to do? So, you know, you were at Salesforce, you were CTO of Facebook. I'm sure you've done a lot more things, but those were the choices that you made. Do you have frameworks that you use for that?[01:28:15] Alessio: Yeah, let's start there.[01:28:16] Bret: I try to remain sort of present and grounded in the moment. No, I wish I did it more, but I really try to focus on impact, I guess, in what I work on, but also, do I enjoy it? And sometimes I think, yeah, we talked a little bit about, you know, what should an entrepreneur work on if they want to start a business?[01:28:38] Bret: And I was sort of joking around about how sometimes the best businesses are passion projects. I definitely take into account both. Like, I want to have an impact on the world and I also want to enjoy building what I'm building.
And I wouldn't work on something that was impactful if I didn't enjoy doing it every day.[01:28:55] Bret: And then I try to have some balance in my life. I've got a [01:29:00] family, and one of Sierra's values is competitive intensity, but we also have a value called family. And we always like to say intensity and balance are compatible. You can be a really intense person, and I don't have a lot of hobbies.[01:29:18] Bret: I basically just work and spend time with my family. But I have balance there, and I do try to have that balance, just because, you know, if you're proverbially on your deathbed, what do you want? I want to be surrounded by people I love and to be proud of the impact that I had.[01:29:35] Alessio: I know you also love to make handmade pasta. I'm Italian, so I would love to hear favorite pasta shapes, maybe sauces. Oh,[01:29:43] Bret: that's good. I don't know where you found that. Was that deep research or whatever? It was deep research. That's a deep[01:29:48] swyx: cut. Sorry, where is this from?[01:29:50] Alessio: It was from,[01:29:51] swyx: from,[01:29:51] Alessio: I[01:29:51] Bret: forget,[01:29:52] Alessio: it was, it was,[01:29:52] Bret: the source was Ling.[01:29:55] Bret: I do love to cook. So I started making pasta when my [01:30:00] kids were little, because I found getting them involved in the kitchen made them eat their meals better. Participating in the act of making the food made them appreciate the food more. And so we do a lot of just spaghetti and linguine, just because it's pretty easy to do.[01:30:15] Bret: And the crank is turning, and part of the pasta making for me was that they could operate the crank and I could put it through, and it was very interactive. Sauces, I do a bunch, probably, I mean, the really simple marinara with really good tomatoes is just a classic, especially if you have really good pasta, but I like them all.[01:30:36] Bret: But I mean, that's probably the go-to just because it's easy. So[01:30:40] Alessio: I just said, when I saw it come up in the research, I was like, I mean, I have to weigh in as the Italian here. Yeah, I would say so. There's one type of spaghetti, I forget what you call it, that's kind of like, they're almost square.[01:30:51] Alessio: Those are really good, where you do a cherry tomato sauce with oil. You can put 'nduja in there. Yeah, we could do a whole different podcast on [01:31:00] that.[01:31:00] swyx: Head of the Italian Tech Mafia. Very, very good restaurants. I highly recommend going to Italian restaurants with him. Yeah. Okay. So my question would be, how do you keep up on AI?[01:31:10] swyx: There's so much going on. Do you have some special news resource that you use that no one else has?[01:31:17] Bret: No, but most mornings I'll try to read and kind of check out what's going on on social media, just like any buzz around papers. But the thing I really like: we have a small research team at Sierra, and we'll do sessions on interesting papers.[01:31:36] Bret: I think that's really nice. And, you know, usually it's someone who really went deep on a paper and kind of does a, you know, you bring your lunch and they just kind of do a readout.
And I found that to be the most rewarding, just because, you know, I love research, but sometimes, you know, some simple concepts are surrounded by a lot of ornate language, and you're like, let's get a few more Greek letters in there to make it [01:32:00] seem like we did something smart, you know?[01:32:02] Bret: Sometimes just talking it through conceptually, I can grok the so-what more easily. And so that's also been interesting as well. And then just conversations, you know. When someone says something I'm not familiar with, I've gotten over the feeling-dumb thing. I'm like, I don't know what that is.[01:32:20] Bret: Explain it to me. And yes, you can sometimes just find neat techniques, new papers, things like that. It's impossible to keep up, to be honest with you.[01:32:29] swyx: For sure. I mean, if you're struggling, imagine the rest of us. But, you know, you have really privileged and special conversations.[01:32:36] swyx: What research directions do you think people should pay attention to, just based on the buzz you're hearing internally, or, you know,[01:32:42] Bret: This isn't surprising to you or anyone, but I think, in general, the reasoning models. But it's interesting, because two years ago, you know, the chain-of-thought reasoning paper was pretty important, and in general chain of thought has always been a meaningful thing from the [01:33:00] time, I think it was a Google paper, right?[01:33:01] Bret: If I'm remembering correctly, Google authors. Yeah. And I think it has always been a way to get more robust results from models. What's just really interesting is the combination of distillation and reasoning is making the relative performance, and actually performance is an ambiguous word, basically the latency, of these reasoning models more reasonable. Because if you think about, say, GPT-4, which was, I think, a huge step change in intelligence, it was[01:33:33] Bret: quite slow and quite expensive for a long time, so it limited the applications. Once you got to 4o and 4o mini, you know, it opened the door to a lot of different applications, both for cost and latency. When o1 came out, it was really interesting quality-wise, but it was quite slow, quite expensive, so that limited the applications.[01:33:52] Bret: Now I just saw someone post that they distilled one of the DeepSeek models and just made it really [01:34:00] small. And, you know, it's doing these chains of thought so fast, it's achieving latency numbers, I think, sort of similar to GPT-4 back in the day. And now all of a sudden you're like, wow, this is really interesting.[01:34:11] Bret: And I just think, especially if there are lots of people listening who are applied AI people, it's basically price, performance, quality. And for a long time, the market's so young, you really had to pick which quadrant you wanted for the use case. The idea that we'll be able to get relatively sophisticated reasoning at, like, o3-mini speed and cost has been amazing.[01:34:34] Bret: If you haven't tried it, the speed of it makes me use it so much more than o1, just because with o1 I'd actually often craft my prompts using 4o and then put them into o1, just because it was so slow, you know, I just didn't want the turnaround time.
So I'm just really excited about them.[01:34:50] Bret: I think we're in the early days, in the same way as with the rapid change from GPT-3 to 3.5 to 4. And with these reasoning [01:35:00] models, just how we're using inference-time compute and the techniques around it, the use cases for it, it feels like we're in that kind of Cambrian explosion of ideas and possibilities.[01:35:11] Bret: So I just think it's really exciting. And certainly if you look at some of the use cases we're talking about, like coding, these are the exact types of domains where these reasoning models do and should have better results. And certainly in our domain, there are just some problems that benefit from thinking through more robustly, which we've always done, but these models are just coming out of the box with a lot more batteries included.[01:35:35] Bret: So I'm super excited about them.[01:35:37] Alessio: Any final call to action? Are you hiring, growing the team? More people should use Sierra, obviously.[01:35:42] Bret: We are growing the team and we're hiring software engineers, agent engineers, so send me a note, bret@sierra.ai, we're growing like a weed. Our engineering team is exclusively in person in San Francisco, though we do have some forward-deployed engineers in other offices like [01:36:00] London, so[01:36:00] Alessio: awesome.[01:36:01] Alessio: Thank you so much for the time, Bret.[01:36:03] Bret: Thanks for having me. Get full access to Latent.Space at www.latent.space/subscribe
    --------  
    1:36:19
  • Agent Engineering with Pydantic + Graphs — with Samuel Colvin
    Did you know that adding a simple Code Interpreter took o3 from 9.2% to 32% on FrontierMath? The Latent Space crew is hosting a hack night Feb 11th in San Francisco focused on CodeGen use cases, co-hosted with E2B and Edge AGI; watch E2B’s new workshop and RSVP here!We’re happy to announce that today’s guest Samuel Colvin will be teaching his very first Pydantic AI workshop at the newly announced AI Engineer NYC Workshops day on Feb 22! 25 tickets left.If you’re a Python developer, it’s very likely that you’ve heard of Pydantic. Every month, it’s downloaded >300,000,000 times, making it one of the top 25 PyPi packages. OpenAI uses it in its SDK for structured outputs, it’s at the core of FastAPI, and if you’ve followed our AI Engineer Summit conference, Jason Liu of Instructor has given two great talks about it: “Pydantic is all you need” and “Pydantic is STILL all you need”. Now, Samuel Colvin has raised $17M from Sequoia to turn Pydantic from an open source project to a full stack AI engineer platform with Logfire, their observability platform, and PydanticAI, their new agent framework.Logfire: bringing OTEL to AIOpenTelemetry recently merged Semantic Conventions for LLM workloads which provides standard definitions to track performance like gen_ai.server.time_per_output_token. In Sam’s view at least 80% of new apps being built today have some sort of LLM usage in them, and just like web observability platform got replaced by cloud-first ones in the 2010s, Logfire wants to do the same for AI-first apps. If you’re interested in the technical details, Logfire migrated away from Clickhouse to Datafusion for their backend. We spent some time on the importance of picking open source tools you understand and that you can actually contribute to upstream, rather than the more popular ones; listen in ~43:19 for that part.Agents are the killer app for graphsPydantic AI is their attempt at taking a lot of the learnings that LangChain and the other early LLM frameworks had, and putting Python best practices into it. At an API level, it’s very similar to the other libraries: you can call LLMs, create agents, do function calling, do evals, etc.They define an “Agent” as a container with a system prompt, tools, structured result, and an LLM. Under the hood, each Agent is now a graph of function calls that can orchestrate multi-step LLM interactions. You can start simple, then move toward fully dynamic graph-based control flow if needed.“We were compelled enough by graphs once we got them right that our agent implementation [...] 
is now actually a graph under the hood.”Why Graphs?* More natural for complex or multi-step AI workflows.* Easy to visualize and debug with mermaid diagrams.* Potential for distributed runs, or “waiting days” between steps in certain flows.In parallel, you see folks like Emil Eifrem of Neo4j talk about GraphRAG as another place where graphs fit really well in the AI stack, so it might be time for more people to take them seriously.Full Video EpisodeLike and subscribe!Chapters* 00:00:00 Introductions* 00:00:24 Origins of Pydantic* 00:05:28 Pydantic's AI moment * 00:08:05 Why build a new agents framework?* 00:10:17 Overview of Pydantic AI* 00:12:33 Becoming a believer in graphs* 00:24:02 God Model vs Compound AI Systems* 00:28:13 Why not build an LLM gateway?* 00:31:39 Programmatic testing vs live evals* 00:35:51 Using OpenTelemetry for AI traces* 00:43:19 Why they don't use Clickhouse* 00:48:34 Competing in the observability space* 00:50:41 Licensing decisions for Pydantic and LogFire* 00:51:48 Building Pydantic.run* 00:55:24 Marimo and the future of Jupyter notebooks* 00:57:44 London's AI sceneShow Notes* Sam Colvin* Pydantic* Pydantic AI* Logfire* Pydantic.run* Zod* E2B* Arize* Langsmith* Marimo* Prefect* GLA (Google Generative Language API)* OpenTelemetry* Jason Liu* Sebastian Ramirez* Bogomil Balkansky* Hood Chatham* Jeremy Howard* Andrew LambTranscriptAlessio [00:00:03]: Hey, everyone. Welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI.Swyx [00:00:12]: Good morning. And today we're very excited to have Sam Colvin join us from Pydantic AI. Welcome. Sam, I heard that Pydantic is all we need. Is that true?Samuel [00:00:24]: I would say you might need Pydantic AI and Logfire as well, but it gets you a long way, that's for sure.Swyx [00:00:29]: Pydantic almost basically needs no introduction. It's almost 300 million downloads in December. And obviously, in the previous podcasts and discussions we've had with Jason Liu, he's been a big fan and promoter of Pydantic and AI.Samuel [00:00:45]: Yeah, it's weird because obviously I didn't create Pydantic originally for uses in AI, it predates LLMs. But it's like we've been lucky that it's been picked up by that community and used so widely.Swyx [00:00:58]: Actually, maybe we'll hear it. Right from you, what is Pydantic and maybe a little bit of the origin story?Samuel [00:01:04]: The best name for it, which is not quite right, is a validation library. And we get some tension around that name because it doesn't just do validation, it will do coercion by default. We now have strict mode, so you can disable that coercion. But by default, if you say you want an integer field and you get in a string of 1, 2, 3, it will convert it to 123 and a bunch of other sensible conversions. And as you can imagine, the semantics around it. Exactly when you convert and when you don't, it's complicated, but because of that, it's more than just validation. Back in 2017, when I first started it, the different thing it was doing was using type hints to define your schema. That was controversial at the time. It was genuinely disapproved of by some people. I think the success of Pydantic and libraries like FastAPI that build on top of it means that today that's no longer controversial in Python. And indeed, lots of other people have copied that route, but yeah, it's a data validation library. 
It uses type hints for the most part and obviously does all the other stuff you want, like serialization, on top of that. But yeah, that's the core.Alessio [00:02:06]: Do you have any fun stories on how JSON schemas ended up being kind of like the structured output standard for LLMs? And were you involved in any of these discussions? Because I know OpenAI was, you know, one of the early adopters. So did they reach out to you? Was there kind of like a structured output council in open source that people were talking about, or was it just random?Samuel [00:02:26]: No, very much not. So I originally didn't implement JSON schema inside Pydantic, and then Sebastian, Sebastian Ramirez of FastAPI, came along, and the first I ever heard of him was over a weekend I got like 50 emails from him as he was committing to Pydantic, adding JSON schema, long pre version one. So the reason it was added was for OpenAPI, which is obviously closely akin to JSON schema. And then, yeah, I don't know why it was JSON schema that got picked up and used by OpenAI. It was obviously very convenient for us, because it meant that not only can you do the validation, but because Pydantic will generate you the JSON schema, it can be one source of truth for structured outputs and tools.Swyx [00:03:09]: Before we dive in further on the AI side of things, something I'm mildly curious about: obviously, there's Zod in JavaScript land. Every now and then there is a new sort of in-vogue validation library that takes over for quite a few years, and then maybe something else comes along. Is Pydantic done, like the core Pydantic?Samuel [00:03:30]: I've just come off a call where we were redesigning some of the internal bits. There will be a v3 at some point, which will not break people's code half as much as v2, as in v2 was the massive rewrite into Rust, but also fixing all the stuff that was broken back from like version zero point something that we didn't fix in v1 because it was a side project. We have plans to basically store the data in Rust types after validation. Not completely. So we're still working to design the Pythonic version of it, in order for it to be able to convert into Python types. So then if you were doing validation and then serialization, you would never have to go via a Python type. We reckon that can give us another three to five times speedup. That's probably the biggest thing. Also changing how easy it is to basically extend Pydantic and define how particular types, like for example NumPy arrays, are validated and serialized. But there's also stuff going on. For example, jiter, the JSON library in Rust that does the JSON parsing, has a SIMD implementation at the moment only for AMD64, so we need to go and add SIMD for other instruction sets. So there's a bunch more we can do on performance. I don't think we're going to go and revolutionize Pydantic, but it's going to continue to get faster and continue, hopefully, to allow people to do more advanced things. We might add a binary format like CBOR for serialization, for when you just want to put the data into a database and probably load it again with Pydantic. So there are some things that will come along, but for the most part, it should just get faster and cleaner.Alessio [00:05:04]: From a focus perspective, I guess, as a founder too, how did you think about the AI interest rising?
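A minimal sketch of the behavior Samuel describes above, using the Pydantic v2 API: type hints define the schema, the default lenient mode coerces the string "123" into an integer, strict mode rejects it, and the same model emits the JSON Schema that structured-output and tool-calling APIs consume. The model and field names here are illustrative, not from the episode.

```python
# Minimal sketch of Pydantic's coercion vs. strict mode and JSON Schema output.
from pydantic import BaseModel, ConfigDict, ValidationError


class Order(BaseModel):
    item: str
    quantity: int  # type hints define the schema


# Lenient mode (the default): "123" is coerced to the int 123.
print(Order.model_validate({"item": "pasta", "quantity": "123"}).quantity)  # 123


class StrictOrder(BaseModel):
    model_config = ConfigDict(strict=True)  # strict mode disables coercion
    item: str
    quantity: int


try:
    StrictOrder.model_validate({"item": "pasta", "quantity": "123"})
except ValidationError as e:
    print(e.errors()[0]["type"])  # "int_type": strings are rejected in strict mode

# The same model doubles as the JSON Schema consumed by structured-output and
# tool-calling APIs, which is the "one source of truth" point above.
print(Order.model_json_schema())
```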
And then how do you kind of prioritize, okay, this is worth going into more, and we'll talk about Pydantic AI and all of that. What was maybe your early experience with LLMs, and when did you figure out, okay, this is something we should take seriously and focus more resources on?Samuel [00:05:28]: I'll answer that, but I'll answer what I think is a kind of parallel question, which is: Pydantic is weird, because Pydantic existed, obviously, before I was starting a company. I was working on it in my spare time, and then at the beginning of '22 I started working on the rewrite in Rust. And I worked on it full-time for a year and a half, and then once we started the company, people came and joined. And it was a weird project, because it would never get signed off inside a startup. Like, we're going to go off and three engineers are going to work full-on for a year in Python and Rust, writing like 30,000 lines of Rust, just to release a free open-source Python library. The result of that has been excellent for us as a company, right? As in, it's made us remain entirely relevant. Pydantic is not just used in the SDKs of all of the AI libraries, but, I can't say which one, but one of the big foundational model companies, when they upgraded from Pydantic v1 to v2, their number one internal performance metric, time to first token, went down by 20%. So you think about all of the actual AI going on inside, and yet at least 20% of the CPU, or at least the latency inside requests, was actually Pydantic, which shows how widely it's used. So we've benefited from doing that work, although it would never have made financial sense in most companies. In answer to your question about how we prioritize AI, I mean, the honest truth is we've spent a lot of the last year and a half building good general-purpose observability inside Logfire and making Pydantic good for general-purpose use cases. And the AI has kind of come to us. Not that we want to get away from it, but the appetite, both in Pydantic and in Logfire, to go and build with AI is enormous, because it kind of makes sense, right? If you're starting a new greenfield project in Python today, what's the chance that you're using GenAI? 80%, let's say, globally; obviously it's like a hundred percent in California, but even worldwide it's probably 80%. Yeah. And so everyone needs that stuff. And there's so much yet to be figured out, so much space to do things better in the ecosystem, in a way that, like, to go and implement a database that's better than Postgres is a Sisyphean task, whereas building tools that are better for GenAI than some of the stuff that's about now is not very difficult, putting the actual models themselves to one side.Alessio [00:07:40]: And then at the same time, you released Pydantic AI recently, which is, you know, an agent framework, and early on, I would say everybody, you know, Langchain and a lot of these frameworks, gave Pydantic kind of first-class support and were trying to use you to be better. What was the decision behind "we should do our own framework"? Were there any design decisions that you disagreed with, any workloads that you think people didn't support well?Samuel [00:08:05]: It wasn't so much like design and workflow, although I think there were some things we've done differently. Yeah.
I think, looking in general at the ecosystem of agent frameworks, the engineering quality is far below that of the rest of the Python ecosystem. There's a bunch of stuff that we have learned how to do over the last 20 years of building Python libraries and writing Python code that seems to be abandoned by people when they build agent frameworks. Now I can kind of respect that, particularly in the very first agent frameworks, like Langchain, where they were literally figuring out how to go and do this stuff. It's completely understandable that you would basically skip some stuff.Samuel [00:08:42]: I'm shocked by the quality of some of the agent frameworks that have come out recently from well-respected names, which just seems to be opportunism, and I have little time for that. But the early ones, I think they were just figuring out how to do stuff, and just as lots of people have learned from Pydantic, we were able to learn a bit from them. I think the gap we saw and the thing we were frustrated by was the production readiness. And that means things like type checking, even if type checking makes it hard. Like Pydantic AI, I will put my hand up now and say it has a lot of generics, and it's probably easier to use it if you've written a bit of Rust and you really understand generics. We're not claiming that makes it the easiest thing to use in all cases; we think it makes it good for production applications in big systems where type checking is a no-brainer in Python. But there's also a bunch of stuff we've learned from maintaining Pydantic over the years that we've gone and done. So every single example in Pydantic AI's documentation is run as part of the tests, and every single print output within an example is checked during tests, so it will always be up to date. And then a bunch of things that, like I say, are standard best practice within the rest of the Python ecosystem, but are, surprisingly, not followed by some AI libraries: coverage, linting, type checking, et cetera, et cetera, where I think these are no-brainers, but weirdly they're not followed by some of the other libraries.Alessio [00:10:04]: And can you just give an overview of the framework itself? I think there are kind of like the LLM-calling frameworks, there are the multi-agent frameworks, there are the workflow frameworks; what does Pydantic AI do?Samuel [00:10:17]: I glaze over a bit when I hear all of the different sorts of frameworks, but I will tell you, when I built Pydantic, when I built Logfire and when I built Pydantic AI, my methodology was not to go and research and review all of the other things. I kind of work out what I want and I go and build it, and then feedback comes and we adjust. So the fundamental building block of Pydantic AI is agents. The exact definition of agents and how you want to define them is obviously ambiguous, and our things are probably sort of agent-lite, not that we would want to go and rename them to agent-lite, but the point is you probably put them together to build something that most people would call an agent. So an agent in our case has, you know, things like a prompt, like a system prompt, and some tools and a structured return type if you want it; that covers the vast majority of cases. There are situations where you want to go further, in the most complex workflows, where you want graphs, and I resisted graphs for quite a while.
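A rough sketch of the "agent as a container" idea Samuel describes: a model, a system prompt, a tool, and a structured result type. The API names follow pydantic-ai roughly as it stood around the time of this recording (Agent, result_type, run_sync, .data) and may differ in later releases; the support tool, the output model, and the model string are invented for illustration.

```python
# Hedged sketch: an agent bundling a system prompt, one tool, and a typed result.
from pydantic import BaseModel
from pydantic_ai import Agent, RunContext


class SupportAnswer(BaseModel):
    reply: str
    escalate: bool  # validated, typed output instead of free-form text


agent = Agent(
    "openai:gpt-4o",
    system_prompt="You are a customer support agent for a speaker company.",
    result_type=SupportAnswer,
)


@agent.tool
def lookup_device_status(ctx: RunContext[None], serial: str) -> str:
    """Hypothetical tool the model can call while answering."""
    return f"Device {serial}: flashing orange light, firmware update pending."


result = agent.run_sync("My speaker has a flashing orange light, what do I do?")
print(result.data.reply, result.data.escalate)
```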
I was sort of of the opinion that you didn't need them and you could use standard Python flow control to do all of that stuff. I had a few arguments with people, but I basically came around to, yeah, I can totally see why graphs are useful. But then we have the problem that by default they're not type safe, because if you have an add-edge method where you give the names of two different nodes, there's no type checking, right? Even if you go and do some, and not all the graph libraries are AI-specific, so there's a graph library that does like basic runtime type checking, ironically using Pydantic to try and make up for the fact that fundamentally graphs are not type safe. Well, I like Pydantic, but that's not a real solution, to have to go and run the code to see if it's safe. There's a reason that static type checking is so powerful. And so, from a lot of iteration, we eventually came up with a system of using, normally, data classes to define nodes, where you return the next node you want to call, and where we're able to go and introspect the return type of a node to basically build the graph. And so the graph is, yeah, inherently type safe. And once we got that right, I'm incredibly excited about graphs. I think there are masses of use cases for them, both in gen AI and other development, but also software is all going to have to interact with gen AI, right? It's going to be like the web. There will no longer be a web department in a company; all the developers are building for the web, building with databases. The same is going to be true for gen AI.Alessio [00:12:33]: Yeah. I see on your docs, you call an agent a container that contains system prompts, function tools, a structured result type, a dependency type, a model, and then model settings. Are the graphs, in your mind, different agents? Are they different prompts for the same agent? What are the structures in your mind?Samuel [00:12:52]: So we were compelled enough by graphs, once we got them right, that we actually merged the PR this morning. That means our agent implementation, without changing its API at all, is now actually a graph under the hood, as it is built using our graph library. So graphs are basically a lower-level tool that allows you to build these complex workflows. Our agents are technically one of the many graphs you could go and build, and we just happened to build that one for you because it's a very commonplace one. But obviously there are cases where you need more complex workflows where the current agent assumptions don't work, and that's where you can then go and use graphs to build more complex things.Swyx [00:13:29]: You said you were cynical about graphs. What changed your mind specifically?Samuel [00:13:33]: I guess people kept giving me examples of things that they wanted to use graphs for. And my "yeah, but you could do that in standard flow control in Python" became a less and less compelling argument to me, because I've maintained those systems that end up with spaghetti code. And I could see the appeal of this structured way of defining the workflow of my code. And it's really neat that just from your code, just from your type hints, you can get out a mermaid diagram that defines exactly what can go and happen.Swyx [00:14:00]: Right. Yeah. You do have a very neat implementation of sort of inferring the graph from type hints, I guess, is what I would call it. Yeah.
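A hedged sketch of the type-hint-driven graph idea: each node is a dataclass whose run() return annotation names the possible next nodes, and the library introspects those annotations to build the graph and render a mermaid diagram. The names follow pydantic-graph as of early 2025 (BaseNode, End, Graph, GraphRunContext, mermaid_code); exact signatures and the shape of the run result may differ by version, and the returns workflow itself is made up.

```python
# Hedged sketch: nodes as dataclasses, next steps declared via return type hints.
from __future__ import annotations
from dataclasses import dataclass

from pydantic_graph import BaseNode, End, Graph, GraphRunContext


@dataclass
class CheckStock(BaseNode):
    sku: str

    async def run(self, ctx: GraphRunContext) -> ShipOrder | Refund:
        in_stock = self.sku != "discontinued"  # stand-in for a real lookup
        return ShipOrder(self.sku) if in_stock else Refund(self.sku)


@dataclass
class ShipOrder(BaseNode[None, None, str]):
    sku: str

    async def run(self, ctx: GraphRunContext) -> End[str]:
        return End(f"shipped {self.sku}")


@dataclass
class Refund(BaseNode[None, None, str]):
    sku: str

    async def run(self, ctx: GraphRunContext) -> End[str]:
        return End(f"refunded {self.sku}")


returns_graph = Graph(nodes=[CheckStock, ShipOrder, Refund])
print(returns_graph.mermaid_code(start_node=CheckStock))  # diagram from type hints
result = returns_graph.run_sync(CheckStock("discontinued"))
print(result)  # the End value (plus run history, depending on the library version)
```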
I think the question always is... I have gone back and forth. I used to work at Temporal, where we would actually spend a lot of time complaining about graph-based workflow solutions like AWS Step Functions. And we would actually say that we were better because you could use normal control flow that you already knew and worked with. Yours, I guess, is a little bit of a nice compromise. It looks like normal Pythonic code, but you just have to keep in mind what the type hints actually mean, and that's what we do with the quote-unquote magic that the graph construction does.Samuel [00:14:42]: Yeah, exactly. And if you look at the internal logic of actually running a graph, it's incredibly simple. It's basically: call a node, get a node back, call that node, get a node back, call that node. If you get an End, you're done. We will soon add support for, well, basically storage, so that you can store the state between each node that's run. And then the idea is you can then distribute the graph and run it across computers. And also, I mean, the other bit that's really valuable is across time. Because it's all very well if you look at lots of the graph examples that, like, Claude will give you. If it gives you an example, it gives you this lovely enormous mermaid chart of the workflow for, for example, managing returns if you're an e-commerce company. But what you realize is some of those lines are literally one function calls another function, and some of those lines are wait six days for the customer to print their piece of paper and put it in the post. And if you're writing your demo project or your proof of concept, that's fine, because you can just say, and now we call this function. But when you're building in real life, that doesn't work. And now how do we manage that concept, to basically be able to start somewhere else in our code? Well, this graph implementation makes it incredibly easy, because you just pass the node that is the start point for carrying on the graph, and it continues to run. So it's things like that where I was like, yeah, I can just imagine how things I've done in the past would be fundamentally easier to understand if we had done them with graphs.Swyx [00:16:07]: You say imagine, but right now, can Pydantic AI actually resume, you know, six days later, like you said, or is this just a theoretical thing we can get to someday?Samuel [00:16:16]: I think it's basically Q&A. So there's an AI that's asking the user a question, and effectively you then call the CLI again to continue the conversation. And it basically instantiates the node and calls the graph with that node again. Now, we don't have the logic yet for effectively storing state in the database between individual nodes; that we're going to add soon. But the rest of it is basically there.Swyx [00:16:37]: It does make me think that not only are you competing with Langchain now and obviously Instructor, but now you're going into sort of the more orchestrated things, like Airflow, Prefect, Dagster, those guys.Samuel [00:16:52]: Yeah, I mean, we're good friends with the Prefect guys, and Temporal have the same investors as us. And I'm sure that my investor Bogomil would not be too happy if I was like, oh yeah, by the way, as well as trying to take on Datadog, we're also going off and trying to take on Temporal and everyone else doing that. Obviously, we're not doing all of the infrastructure of deploying that yet, at least.
We're, you know, we're just building a Python library. And what's crazy about our graph implementation is, sure, there's a bit of magic in introspecting the return type, you know, extracting things from unions, stuff like that. But the actual calls, as I say, are literally call a function and get back a thing and call that. It's incredibly simple and therefore easy to maintain. The question is, how useful is it? Well, I don't know yet. I think we have to go and find out. We've had a slew of people joining our Slack over the last few days and saying, tell me how good Pydantic AI is. How good is Pydantic AI versus Langchain? And I refuse to answer. That's your job to go and find out, not mine. We built a thing. I'm compelled by it, but I'm obviously biased. The ecosystem will work out what the useful tools are.Swyx [00:17:52]: Bogomil was my board member when I was at Temporal. And I think, just generally, also having been a workflow engine investor and participant in this space, it's a big space. Everyone needs different functions. The one thing that I would say is, yours, you know, as a library, doesn't have that much control over the infrastructure. I do like the idea that each new agent, or unit of work, whatever you call that, should spin up within sort of isolated boundaries. Whereas in yours, I think, everything runs in the same process, but you ideally want to sort of spin out its own little container of things.Samuel [00:18:30]: I agree with you a hundred percent. And we will. It would work now, right? As in, in theory, as long as you can serialize the calls to the next node, all of the different containers basically just have to have the same code. I mean, I'm super excited about Cloudflare Workers running Python and being able to install dependencies. And if Cloudflare could only give me my invitation to the private beta of that, we would be exploring that right now, because I'm super excited about that as a compute level for some of this stuff, where, exactly what you're saying, basically, you can run everything as an individual worker function and distribute it. And it's resilient to failure, et cetera, et cetera.Swyx [00:19:08]: And it spins up like a thousand instances simultaneously. You know, you want it to be sort of truly serverless at once. Actually, I know we have some Cloudflare friends who are listening, so hopefully they'll get you to the front of the line.Samuel [00:19:19]: I was in Cloudflare's office last week shouting at them about other things that frustrate me. I have a love-hate relationship with Cloudflare. Their tech is awesome. But because I use it the whole time, I then get frustrated. So, yeah, I'm sure I will get there soon.Swyx [00:19:32]: There's a side tangent on Cloudflare. Is Python fully supported? I actually wasn't fully aware of what the status of that thing is.Samuel [00:19:39]: Yeah. So Pyodide, which is Python running inside the browser on WebAssembly, is supported now by Cloudflare. They're having some struggles working out how to manage, ironically, dependencies that have binaries, in particular Pydantic. Because with these workers, where you can have thousands of them on a given metal machine, you don't want to have a difference; you basically want to be able to have shared memory for all the different Pydantic installations, effectively. That's the thing they're working out.
They're working out. But Hood, who's my friend, who is the primary maintainer of Pyodide, works for Cloudflare. And that's basically what he's doing, is working out how to get Python running on Cloudflare's network.Swyx [00:20:19]: I mean, the nice thing is that your binary is really written in Rust, right? Yeah. Which also compiles the WebAssembly. Yeah. So maybe there's a way that you'd build... You have just a different build of Pydantic and that ships with whatever your distro for Cloudflare workers is.Samuel [00:20:36]: Yes, that's exactly what... So Pyodide has builds for Pydantic Core and for things like NumPy and basically all of the popular binary libraries. Yeah. It's just basic. And you're doing exactly that, right? You're using Rust to compile the WebAssembly and then you're calling that shared library from Python. And it's unbelievably complicated, but it works. Okay.Swyx [00:20:57]: Staying on graphs a little bit more, and then I wanted to go to some of the other features that you have in Pydantic AI. I see in your docs, there are sort of four levels of agents. There's single agents, there's agent delegation, programmatic agent handoff. That seems to be what OpenAI swarms would be like. And then the last one, graph-based control flow. Would you say that those are sort of the mental hierarchy of how these things go?Samuel [00:21:21]: Yeah, roughly. Okay.Swyx [00:21:22]: You had some expression around OpenAI swarms. Well.Samuel [00:21:25]: And indeed, OpenAI have got in touch with me and basically, maybe I'm not supposed to say this, but basically said that Pydantic AI looks like what swarms would become if it was production ready. So, yeah. I mean, like, yeah, which makes sense. Awesome. Yeah. I mean, in fact, it was specifically saying, how can we give people the same feeling that they were getting from swarms that led us to go and implement graphs? Because my, like, just call the next agent with Python code was not a satisfactory answer to people. So it was like, okay, we've got to go and have a better answer for that. It's not like, let us to get to graphs. Yeah.Swyx [00:21:56]: I mean, it's a minimal viable graph in some sense. What are the shapes of graphs that people should know? So the way that I would phrase this is I think Anthropic did a very good public service and also kind of surprisingly influential blog post, I would say, when they wrote Building Effective Agents. We actually have the authors coming to speak at my conference in New York, which I think you're giving a workshop at. Yeah.Samuel [00:22:24]: I'm trying to work it out. But yes, I think so.Swyx [00:22:26]: Tell me if you're not. yeah, I mean, like, that was the first, I think, authoritative view of, like, what kinds of graphs exist in agents and let's give each of them a name so that everyone is on the same page. So I'm just kind of curious if you have community names or top five patterns of graphs.Samuel [00:22:44]: I don't have top five patterns of graphs. I would love to see what people are building with them. But like, it's been it's only been a couple of weeks. And of course, there's a point is that. Because they're relatively unopinionated about what you can go and do with them. They don't suit them. Like, you can go and do lots of lots of things with them, but they don't have the structure to go and have like specific names as much as perhaps like some other systems do. 
I think what our agents are, which have a name and I can't remember what it is, but this basically system of like, decide what tool to call, go back to the center, decide what tool to call, go back to the center and then exit. One form of graph, which, as I say, like our agents are effectively one implementation of a graph, which is why under the hood they are now using graphs. And it'll be interesting to see over the next few years whether we end up with these like predefined graph names or graph structures or whether it's just like, yep, I built a graph or whether graphs just turn out not to match people's mental image of what they want and die away. We'll see.Swyx [00:23:38]: I think there is always appeal. Every developer eventually gets graph religion and goes, oh, yeah, everything's a graph. And then they probably over rotate and go go too far into graphs. And then they have to learn a whole bunch of DSLs. And then they're like, actually, I didn't need that. I need this. And they scale back a little bit.Samuel [00:23:55]: I'm at the beginning of that process. I'm currently a graph maximalist, although I haven't actually put any into production yet. But yeah.Swyx [00:24:02]: This has a lot of philosophical connections with other work coming out of UC Berkeley on compounding AI systems. I don't know if you know of or care. This is the Gartner world of things where they need some kind of industry terminology to sell it to enterprises. I don't know if you know about any of that.Samuel [00:24:24]: I haven't. I probably should. I should probably do it because I should probably get better at selling to enterprises. But no, no, I don't. Not right now.Swyx [00:24:29]: This is really the argument is that instead of putting everything in one model, you have more control and more maybe observability to if you break everything out into composing little models and changing them together. And obviously, then you need an orchestration framework to do that. Yeah.Samuel [00:24:47]: And it makes complete sense. And one of the things we've seen with agents is they work well when they work well. But when they. Even if you have the observability through log five that you can see what was going on, if you don't have a nice hook point to say, hang on, this is all gone wrong. You have a relatively blunt instrument of basically erroring when you exceed some kind of limit. But like what you need to be able to do is effectively iterate through these runs so that you can have your own control flow where you're like, OK, we've gone too far. And that's where one of the neat things about our graph implementation is you can basically call next in a loop rather than just running the full graph. And therefore, you have this opportunity to to break out of it. But yeah, basically, it's the same point, which is like if you have two bigger unit of work to some extent, whether or not it involves gen AI. But obviously, it's particularly problematic in gen AI. You only find out afterwards when you've spent quite a lot of time and or money when it's gone off and done done the wrong thing.Swyx [00:25:39]: Oh, drop on this. We're not going to resolve this here, but I'll drop this and then we can move on to the next thing. This is the common way that we we developers talk about this. And then the machine learning researchers look at us. And laugh and say, that's cute. And then they just train a bigger model and they wipe us out in the next training run. 
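A quick illustration of the "call next in a loop" point Samuel made a moment ago. This is a deliberately framework-free sketch of the pattern, not pydantic-graph's actual stepping API (which exposes similar step-wise iteration, but whose method names have shifted between releases): the caller drives one node at a time and can bail out when a step or cost budget is exceeded, instead of finding out only after the run has gone off the rails.

```python
import asyncio
from dataclasses import dataclass
from typing import Any


@dataclass
class Done:
    """Sentinel a node returns when the run has finished (stands in for an End node)."""
    value: Any


async def drive(node: Any, *, max_steps: int = 25) -> Any:
    """Run nodes one at a time so the caller, not the framework, decides when to stop."""
    for _ in range(max_steps):
        node = await node.run()  # each node returns the next node, or Done
        if isinstance(node, Done):
            return node.value
    raise RuntimeError(f"gave up after {max_steps} steps, likely a runaway loop")


@dataclass
class CountDown:
    n: int

    async def run(self) -> Any:
        return Done(0) if self.n == 0 else CountDown(self.n - 1)


print(asyncio.run(drive(CountDown(5))))  # prints 0 after six node calls
```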
So I think there's a certain amount of we are fighting the bitter lesson here. We're fighting AGI. And, you know, when AGI arrives, this will all go away. Obviously, on Latent Space, we don't really discuss that because I think AGI is kind of this hand-wavy concept that isn't super relevant. But I think we have to respect that. For example, you could do a chain of thought with graphs, and you could manually orchestrate a nice little graph that does like, reflect, think about if you need more inference-time compute, you know, that's the hot term now, and then think again and, you know, scale that up. Or you could train Strawberry and DeepSeek R1. Right.Samuel [00:26:32]: I saw someone saying recently, oh, they were really optimistic about agents because models are getting faster exponentially. And I like took a certain amount of self-control not to point out that it wasn't exponential. But my main point was, if models are getting faster as quickly as you say they are, then we don't need agents and we don't really need any of these abstraction layers. We can just give our model, you know, access to the Internet, cross our fingers and hope for the best. Agents, agent frameworks, graphs, all of this stuff is basically making up for the fact that right now the models are not that clever. In the same way that if you're running a customer service business and you have loads of people sitting answering telephones, the less well-trained they are, the less that you trust them, the more that you need to give them a script to go through. Whereas, you know, so if you're running a bank and you have lots of customer service people who you don't trust that much, then you tell them exactly what to say. If you're doing high net worth banking, you just employ people who you think are going to be charming to other rich people and set them off to go and have coffee with people. Right. And the same is true of models. The more intelligent they are, the less we need to, like, structure what they go and do and constrain the routes that they take.Swyx [00:27:42]: Yeah. Yeah. Agree with that. So I'm happy to move on. So the other parts of Pydantic AI that are worth commenting on, and this is like my last rant, I promise. So obviously, every framework needs to do its sort of model adapter layer, which is, oh, you can easily swap from OpenAI to Claude to Grok. You also have, which I didn't know about, Google GLA, which I didn't really know about until I saw this in your docs, which is the Generative Language API. I assume that's AI Studio? Yes.Samuel [00:28:13]: Google don't have good names for it. So Vertex is very clear. That seems to be the API that like some of the things use, although it returns 503 about 20% of the time. So... Vertex? No. Vertex, fine. But the... Oh, oh. GLA. Yeah. Yeah.Swyx [00:28:28]: I agree with that.Samuel [00:28:29]: So we have, again, another example of where I think we go the extra mile in terms of engineering: we run, on every commit, at least every commit to main, tests against the live models. Not lots of tests, but like a handful of them. Oh, okay. And we had a point last week where, yeah, GLA was failing every single run. One of their tests would fail. And I think we might even have commented out that one at the moment. So like all of the models fail more often than you might expect, but like that one seems to be particularly likely to fail.
But Vertex is the same API, but much more reliable.Swyx [00:29:01]: My rant here is that, you know, versions of this appear in Langchain and every single framework has to have its own little thing, a version of that. I would put to you, and then, you know, this is, this can be agree to disagree. This is not needed in Pydantic AI. I would much rather you adopt a layer like Lite LLM or what's the other one in JavaScript port key. And that's their job. They focus on that one thing and they, they normalize APIs for you. All new models are automatically added and you don't have to duplicate this inside of your framework. So for example, if I wanted to use deep seek, I'm out of luck because Pydantic AI doesn't have deep seek yet.Samuel [00:29:38]: Yeah, it does.Swyx [00:29:39]: Oh, it does. Okay. I'm sorry. But you know what I mean? Should this live in your code or should it live in a layer that's kind of your API gateway that's a defined piece of infrastructure that people have?Samuel [00:29:49]: And I think if a company who are well known, who are respected by everyone had come along and done this at the right time, maybe we should have done it a year and a half ago and said, we're going to be the universal AI layer. That would have been a credible thing to do. I've heard varying reports of Lite LLM is the truth. And it didn't seem to have exactly the type safety that we needed. Also, as I understand it, and again, I haven't looked into it in great detail. Part of their business model is proxying the request through their, through their own system to do the generalization. That would be an enormous put off to an awful lot of people. Honestly, the truth is I don't think it is that much work unifying the model. I get where you're coming from. I kind of see your point. I think the truth is that everyone is centralizing around open AIs. Open AI's API is the one to do. So DeepSeq support that. Grok with OK support that. Ollama also does it. I mean, if there is that library right now, it's more or less the open AI SDK. And it's very high quality. It's well type checked. It uses Pydantic. So I'm biased. But I mean, I think it's pretty well respected anyway.Swyx [00:30:57]: There's different ways to do this. Because also, it's not just about normalizing the APIs. You have to do secret management and all that stuff.Samuel [00:31:05]: Yeah. And there's also. There's Vertex and Bedrock, which to one extent or another, effectively, they host multiple models, but they don't unify the API. But they do unify the auth, as I understand it. Although we're halfway through doing Bedrock. So I don't know about it that well. But they're kind of weird hybrids because they support multiple models. But like I say, the auth is centralized.Swyx [00:31:28]: Yeah, I'm surprised they don't unify the API. That seems like something that I would do. You know, we can discuss all this all day. There's a lot of APIs. I agree.Samuel [00:31:36]: It would be nice if there was a universal one that we didn't have to go and build.Alessio [00:31:39]: And I guess the other side of, you know, routing model and picking models like evals. How do you actually figure out which one you should be using? I know you have one. First of all, you have very good support for mocking in unit tests, which is something that a lot of other frameworks don't do. So, you know, my favorite Ruby library is VCR because it just, you know, it just lets me store the HTTP requests and replay them. That part I'll kind of skip. 
I think you also have this TestModel, where like just through Python you try and figure out what the model might respond without actually calling the model. And then you have the FunctionModel where people can kind of customize outputs. Any other fun stories maybe from there? Or is it just what you see is what you get, so to speak?Samuel [00:32:18]: On those two, I think what you see is what you get. On the evals, I think watch this space. I think it's something that, again, I was somewhat cynical about for some time, and I still have my cynicism about some of it. Well, it's unfortunate that so many different things are called evals. It would be nice if we could agree what they are and what they're not. But look, I think it's a really important space. I think it's something that we're going to be working on soon, both in Pydantic AI and in LogFire, to try and support better because it's an unsolved problem.Alessio [00:32:45]: Yeah, you do say in your doc that anyone who claims to know for sure exactly how your eval should be defined can safely be ignored.Samuel [00:32:52]: We'll delete that sentence when we tell people how to do their evals.Alessio [00:32:56]: Exactly. I was like, we need a snapshot of this today. And so let's talk about evals. So there's kind of like the vibe. Yeah. So you have evals, which is what you do when you're building, right? Because you cannot really test it that many times to get statistical significance. And then there's the production eval. So you also have LogFire, which is kind of like your observability product, which I tried before. It's very nice. What are some of the learnings you've had from building an observability tool for LLMs? And yeah, as people think about evals, even like what are the right things to measure? What are like the right number of samples that you need to actually start making decisions?Samuel [00:33:33]: I'm not the best person to answer that is the truth. So I'm not going to come in here and tell you that I think I know the answer on the exact number. I mean, we can do some back-of-the-envelope statistics calculations to work out that like having 30 probably gets you most of the statistical value of having 200 for, you know, by definition, 15% of the work. But the exact like how many examples do you need, for example, that's a much harder question to answer because it's, you know, deep within how the models operate. In terms of LogFire, one of the reasons we built LogFire the way we have, where we allow you to write SQL directly against your data and we're trying to build the like powerful fundamentals of observability, is precisely because we know we don't know the answers. And so allowing people to go and innovate on how they're going to consume that stuff and how they're going to process it is, we think, valuable. Because even if we come along and offer you an evals framework on top of LogFire, it won't be right in all regards. And we want people to be able to go and innovate. And being able to write their own SQL connected to the API, and effectively query the data like it's a database with SQL, allows people to innovate on that stuff. And that's what allows us to do it as well. I mean, we do a bunch of like testing what's possible by basically writing SQL directly against LogFire as any user could. I think the other really interesting bit that's going on in observability is that OpenTelemetry is centralizing around semantic attributes for GenAI. So it's a relatively new project.
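Before the OpenTelemetry thread continues, here is a minimal sketch of the mocking pattern Alessio just described. TestModel is named in the conversation; the import path, the override context manager, and the result attribute are our assumptions from the Pydantic AI docs and may differ between versions:

```python
from pydantic_ai import Agent
from pydantic_ai.models.test import TestModel  # import path assumed from the docs

agent = Agent("openai:gpt-4o", system_prompt="Reply in one short sentence.")


def test_agent_without_a_network_call():
    # TestModel fabricates a structurally valid response locally; no API key,
    # no HTTP, so this is safe to run on every commit.
    with agent.override(model=TestModel()):
        result = agent.run_sync("What is the capital of France?")
    assert result.output  # attribute is `.data` on older releases
```

The FunctionModel mentioned above is the same idea, except you supply your own function to craft the fake response, which is useful when a test needs a specific tool call or output shape.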
A lot of it's still being added at the moment. But basically the idea is that they unify how both SDKs and agent frameworks send observability data to any OpenTelemetry endpoint. And so, again, having that unification allows us to go and like basically compare different libraries, compare different models much better. That stuff's in a very early stage of development. One of the things we're going to be working on pretty soon is basically, I suspect, Pydantic AI will be the first agent framework that implements those semantic attributes properly. Because, again, we control it and we can say this is important for observability, whereas most of the other agent frameworks are not maintained by people who are trying to do observability. With the exception of Langchain, where they have the observability platform, but they chose not to go down the OpenTelemetry route. So they're like plowing their own furrow. And, you know, they're even further away from standardization.Alessio [00:35:51]: Can you maybe just give a quick overview of how OTel ties into AI workflows? There's kind of like the question of, is a trace or a span like an LLM call? Is it the agent? It's kind of like the broader thing you're tracking. How should people think about it?Samuel [00:36:06]: Yeah, so they have a PR that I think may have now been merged from someone at IBM talking about remote agents and trying to support this concept of remote agents within GenAI. I'm not particularly compelled by that because I don't think that's actually by any means the common use case. But like, I suppose it's fine for it to be there. The majority of the stuff in OTel is basically defining how you would instrument a given call to an LLM. So basically the actual LLM call, what data you would send to your telemetry provider, how you would structure that. Apart from this slightly odd stuff on remote agents, most of the agent-level consideration is not yet implemented, is not yet decided, effectively. And so there's a bit of ambiguity. Obviously, what's good about OTel is you can in the end send whatever attributes you like. But yeah, there's quite a lot of churn in that space and exactly how we store the data. I think that one of the most interesting things, though, is that if you think about observability traditionally, sure, everyone would say our observability data is very important, we must keep it safe. But actually, companies work very hard to basically not have anything that sensitive in their observability data. So if you're a doctor in a hospital and you search for a drug for an STI, the SQL might be sent to the observability provider, but none of the parameters would. It wouldn't have the patient number or their name or the drug. With GenAI, that distinction doesn't exist because it's all just mixed up in the text. If you have that same patient asking an LLM what drug they should take or how to stop smoking, you can't extract the PII and not send it to the observability platform. So the sensitivity of the data that's going to end up in observability platforms is going to be basically a different order of magnitude to what you would normally send to Datadog. Of course, you can make a mistake and send someone's password or their card number to Datadog. But that would be seen as a, like, mistake. Whereas in GenAI, a lot of that data is going to be sent.
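Concretely, hand-instrumenting one LLM call under the draft GenAI semantic conventions looks roughly like the following. The gen_ai.* attribute names are still churning upstream (older drafts used prompt_tokens and completion_tokens), so treat them as illustrative rather than canonical:

```python
from opentelemetry import trace

tracer = trace.get_tracer("my-app")  # tracer name is arbitrary

# One span per model call, with (approximately) the GenAI semantic attributes.
with tracer.start_as_current_span("chat gpt-4o") as span:
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    # ... the actual model call would happen here ...
    span.set_attribute("gen_ai.usage.input_tokens", 57)    # a.k.a. prompt tokens
    span.set_attribute("gen_ai.usage.output_tokens", 312)  # a.k.a. completion tokens
    # Once prompts and completions are recorded too (as attributes or OTel
    # events), the telemetry itself contains the user's raw text, which is
    # exactly the sensitivity problem described above.
```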
And I think that's why companies like LangSmith are trying hard to offer observability on-prem, because there's a bunch of companies who are happy for Datadog to be cloud hosted, but want self-hosting for this observability stuff with GenAI.Alessio [00:38:09]: And are you doing any of that today? Because I know in each of the spans you have like the number of tokens, you have the context, you're just storing everything. And then you're going to offer kind of like self-hosting for the platform, basically. Yeah. Yeah.Samuel [00:38:23]: So we have scrubbing roughly equivalent to what the other observability platforms have. So if we see password as the key, we won't send the value. But like I said, that doesn't really work in GenAI. So we're accepting we're going to have to store a lot of data, and then we'll offer self-hosting for those people who can afford it and who need it.Alessio [00:38:42]: And then this is, I think, the first time that most of the workload's performance is depending on a third party. You know, like if you're looking at Datadog data, usually it's your app that is driving the latency and like the memory usage and all of that. Here you're going to have spans that maybe take a long time to perform because the GLA API is not working or because OpenAI is kind of like overwhelmed. Do you do anything there, since like the provider is almost like the same across customers? You know, like, are you trying to surface these things for people and say, hey, this was like a very slow span, but actually all customers using OpenAI right now are seeing the same thing, so maybe don't worry about it?Samuel [00:39:20]: Not yet. We do a few things that people don't generally do in OTel. So we send information at the beginning of a trace, or sorry, at the beginning of a span, as well as when it finishes. By default, OTel only sends you data when the span finishes. So if you think about a request which might take like 20 seconds, even if some of the intermediate spans finished earlier, you can't basically place them on the page until you get the top-level span. And so if you're using standard OTel, you can't show anything until those requests are finished. When those requests are taking a few hundred milliseconds, it doesn't really matter. But when you're doing GenAI calls, or when you're like running a batch job that might take 30 minutes, that latency of not being able to see the span is like crippling to understanding your application. And so we do a bunch of slightly complex stuff to basically send data about a span as it starts, which is closely related. Yeah.Alessio [00:40:09]: Any thoughts on all the other people trying to build on top of OpenTelemetry in different languages, too? There's like the OpenLLMetry project, which doesn't really roll off the tongue. But how do you see the future of these kinds of tools? Is everybody going to have to build their own? Why does everybody want to build their own open source observability thing to then sell?Samuel [00:40:29]: I mean, we are not going off and trying to instrument the likes of the OpenAI SDK with the new semantic attributes, because at some point that's going to happen and it's going to live inside OTel and we might help with it. But we're a tiny team. We don't have time to go and do all of that work. So OpenLLMetry, like, interesting project.
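The key-based scrubbing Samuel describes is easy to sketch, and the sketch also shows why it breaks down for GenAI payloads. This is not LogFire's implementation, just the general shape of the technique:

```python
import re

# Redact values whose *key* looks sensitive before exporting attributes.
SENSITIVE_KEY = re.compile(r"password|secret|auth|api[_-]?key|token|card", re.IGNORECASE)


def scrub(attributes: dict[str, object]) -> dict[str, object]:
    return {
        key: "[REDACTED]" if SENSITIVE_KEY.search(key) else value
        for key, value in attributes.items()
    }


print(scrub({
    "db.statement": "SELECT * FROM prescriptions WHERE id = $1",
    "http.request.header.authorization": "Bearer abc123",
}))
# The auth header gets caught, but a prompt attribute such as
# {"gen_ai.prompt": "my card number is 4111..."} has an innocent-looking key,
# so key-based rules let the sensitive text straight through.
```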
But I suspect eventually most of those semantic like that instrumentation of the big of the SDKs will live, like I say, inside the main OpenTelemetry report. I suppose. What happens to the agent frameworks? What data you basically need at the framework level to get the context is kind of unclear. I don't think we know the answer yet. But I mean, I was on the, I guess this is kind of semi-public, because I was on the call with the OpenTelemetry call last week talking about GenAI. And there was someone from Arize talking about the challenges they have trying to get OpenTelemetry data out of Langchain, where it's not like natively implemented. And obviously they're having quite a tough time. And I was realizing, hadn't really realized this before, but how lucky we are to primarily be talking about our own agent framework, where we have the control rather than trying to go and instrument other people's.Swyx [00:41:36]: Sorry, I actually didn't know about this semantic conventions thing. It looks like, yeah, it's merged into main OTel. What should people know about this? I had never heard of it before.Samuel [00:41:45]: Yeah, I think it looks like a great start. I think there's some unknowns around how you send the messages that go back and forth, which is kind of the most important part. It's the most important thing of all. And that is moved out of attributes and into OTel events. OTel events in turn are moving from being on a span to being their own top-level API where you send data. So there's a bunch of churn still going on. I'm impressed by how fast the OTel community is moving on this project. I guess they, like everyone else, get that this is important, and it's something that people are crying out to get instrumentation off. So I'm kind of pleasantly surprised at how fast they're moving, but it makes sense.Swyx [00:42:25]: I'm just kind of browsing through the specification. I can already see that this basically bakes in whatever the previous paradigm was. So now they have genai.usage.prompt tokens and genai.usage.completion tokens. And obviously now we have reasoning tokens as well. And then only one form of sampling, which is top-p. You're basically baking in or sort of reifying things that you think are important today, but it's not a super foolproof way of doing this for the future. Yeah.Samuel [00:42:54]: I mean, that's what's neat about OTel is you can always go and send another attribute and that's fine. It's just there are a bunch that are agreed on. But I would say, you know, to come back to your previous point about whether or not we should be relying on one centralized abstraction layer, this stuff is moving so fast that if you start relying on someone else's standard, you risk basically falling behind because you're relying on someone else to keep things up to date.Swyx [00:43:14]: Or you fall behind because you've got other things going on.Samuel [00:43:17]: Yeah, yeah. That's fair. That's fair.Swyx [00:43:19]: Any other observations just about building LogFire, actually? Let's just talk about this. So you announced LogFire. I was kind of only familiar with LogFire because of your Series A announcement. I actually thought you were making a separate company. I remember some amount of confusion with you when that came out. So to be clear, it's Pydantic LogFire and the company is one company that has kind of two products, an open source thing and an observability thing, correct? Yeah. I was just kind of curious, like any learnings building LogFire? 
So classic question is, do you use ClickHouse? Is this like the standard persistence layer? Any learnings doing that?Samuel [00:43:54]: We don't use ClickHouse. We started building our database with ClickHouse, moved off ClickHouse onto Timescale, which is a Postgres extension to do analytical databases. Wow. And then moved off Timescale onto DataFusion. And we're basically now building, it's DataFusion, but it's kind of our own database. Bogomil is not entirely happy that we went through three databases before we chose one. I'll say that. But like, we've got to the right one in the end. I think we could have realized that Timescale wasn't right. I think ClickHouse. They both taught us a lot and we're in a great place now. But like, yeah, it's been a real journey on the database in particular.Swyx [00:44:28]: Okay. So, you know, as a database nerd, I have to like double click on this, right? So ClickHouse is supposed to be the ideal backend for anything like this. And then moving from ClickHouse to Timescale is another counterintuitive move that I didn't expect because, you know, Timescale is like an extension on top of Postgres. Not super meant for like high volume logging. But like, yeah, tell us those decisions.Samuel [00:44:50]: So at the time, ClickHouse did not have good support for JSON. I was speaking to someone yesterday and said ClickHouse doesn't have good support for JSON and got roundly stepped on because apparently it does now. So they've obviously gone and built their proper JSON support. But like back when we were trying to use it, I guess a year ago or a bit more than a year ago, everything happened to be a map and maps are a pain to try and do like looking up JSON type data. And obviously all these attributes, everything you're talking about there in terms of the GenAI stuff. You can choose to make them top level columns if you want. But the simplest thing is just to put them all into a big JSON pile. And that was a problem with ClickHouse. Also, ClickHouse had some really ugly edge cases like by default, or at least until I complained about it a lot, ClickHouse thought that two nanoseconds was longer than one second because they compared intervals just by the number, not the unit. And I complained about that a lot. And then they caused it to raise an error and just say you have to have the same unit. Then I complained a bit more. And I think as I understand it now, they have some. They convert between units. But like stuff like that, when all you're looking at is when a lot of what you're doing is comparing the duration of spans was really painful. Also things like you can't subtract two date times to get an interval. You have to use the date sub function. But like the fundamental thing is because we want our end users to write SQL, the like quality of the SQL, how easy it is to write, matters way more to us than if you're building like a platform on top where your developers are going to write the SQL. And once it's written and it's working, you don't mind too much. So I think that's like one of the fundamental differences. The other problem that I have with the ClickHouse and Impact Timescale is that like the ultimate architecture, the like snowflake architecture of binary data in object store queried with some kind of cache from nearby. They both have it, but it's closed sourced and you only get it if you go and use their hosted versions. 
And so even if we had got through all the problems with Timescale or ClickHouse, we would end up like, you know, they would want to be taking their 80% margin. And then we would be wanting to take that would basically leave us less space for margin. Whereas data fusion. Properly open source, all of that same tooling is open source. And for us as a team of people with a lot of Rust expertise, data fusion, which is implemented in Rust, we can literally dive into it and go and change it. So, for example, I found that there were some slowdowns in data fusion's string comparison kernel for doing like string contains. And it's just Rust code. And I could go and rewrite the string comparison kernel to be faster. Or, for example, data fusion, when we started using it, didn't have JSON support. Obviously, as I've said, it's something we can do. It's something we needed. I was able to go and implement that in a weekend using our JSON parser that we built for Pydantic Core. So it's the fact that like data fusion is like for us the perfect mixture of a toolbox to build a database with, not a database. And we can go and implement stuff on top of it in a way that like if you were trying to do that in Postgres or in ClickHouse. I mean, ClickHouse would be easier because it's C++, relatively modern C++. But like as a team of people who are not C++ experts, that's much scarier than data fusion for us.Swyx [00:47:47]: Yeah, that's a beautiful rant.Alessio [00:47:49]: That's funny. Most people don't think they have agency on these projects. They're kind of like, oh, I should use this or I should use that. They're not really like, what should I pick so that I contribute the most back to it? You know, so but I think you obviously have an open source first mindset. So that makes a lot of sense.Samuel [00:48:05]: I think if we were probably better as a startup, a better startup and faster moving and just like headlong determined to get in front of customers as fast as possible, we should have just started with ClickHouse. I hope that long term we're in a better place for having worked with data fusion. We like we're quite engaged now with the data fusion community. Andrew Lam, who maintains data fusion, is an advisor to us. We're in a really good place now. But yeah, it's definitely slowed us down relative to just like building on ClickHouse and moving as fast as we can.Swyx [00:48:34]: OK, we're about to zoom out and do Pydantic run and all the other stuff. But, you know, my last question on LogFire is really, you know, at some point you run out sort of community goodwill just because like, oh, I use Pydantic. I love Pydantic. I'm going to use LogFire. OK, then you start entering the territory of the Datadogs, the Sentrys and the honeycombs. Yeah. So where are you going to really spike here? What differentiator here?Samuel [00:48:59]: I wasn't writing code in 2001, but I'm assuming that there were people talking about like web observability and then web observability stopped being a thing, not because the web stopped being a thing, but because all observability had to do web. If you were talking to people in 2010 or 2012, they would have talked about cloud observability. Now that's not a term because all observability is cloud first. The same is going to happen to gen AI. And so whether or not you're trying to compete with Datadog or with Arise and Langsmith, you've got to do first class. You've got to do general purpose observability with first class support for AI. 
And as far as I know, we're the only people really trying to do that. I mean, I think Datadog is starting in that direction. And to be honest, I think Datadog is a much scarier company to compete with than the AI-specific observability platforms. Because in my opinion, and I've also heard this from lots of customers, AI-specific observability where you don't see everything else going on in your app is not actually that useful. Our hope is that we can build the first general purpose observability platform with first class support for AI, and that we have this open source heritage of putting developer experience first that other companies haven't done. For all that I'm a fan of Datadog and what they've done, if you search Datadog logging Python, and you just try as a non-observability expert to get something up and running with Datadog and Python, it's not trivial, right? That's something Sentry have done amazingly well. But like there's enormous space in most of observability to do DX better.Alessio [00:50:27]: Since you mentioned Sentry, I'm curious how you thought about licensing and all of that. Obviously, you're MIT licensed. You don't have any rolling license like Sentry has, where you can only use as open source, like, the one-year-old version of it. Was that a hard decision?Samuel [00:50:41]: So to be clear, LogFire is closed source. So Pydantic and Pydantic AI are MIT licensed and like properly open source, and then LogFire for now is completely closed source. And in fact, the struggles that Sentry have had with licensing, and the like weird pushback the community gives when they take something that's closed source and make it source available, just meant that we avoided that whole subject matter. I think the other way to look at it is, in terms of either headcount or revenue or dollars in the bank, the amount of open source we do as a company, we're up there with the most prolific open source companies, like I say, per head. And so we didn't feel like we were morally obligated to make LogFire open source. We have Pydantic. Pydantic is a foundational library in Python. That and now Pydantic AI are our contribution to open source. And then LogFire is like openly for profit, right? As in, we're not claiming otherwise. We're not sort of trying to walk a line of, it's open source, but really we want to make it hard to deploy so you probably want to pay us. We're trying to be straight that it's something you pay for. We could change that at some point in the future, but it's not an immediate plan.Alessio [00:51:48]: All right. So then I saw this new, I don't know if it's like a product you're building, Pydantic.run, which is a Python browser sandbox. What was the inspiration behind that? We talk a lot about code interpreters for LLMs. I'm an investor in a company called E2B, which is a code sandbox as a service for remote execution. Yeah. What's the Pydantic.run story?Samuel [00:52:09]: So Pydantic.run is again completely open source. I have no interest in making it into a product. We just needed a sandbox to be able to demo LogFire in particular, but also Pydantic AI. So it doesn't have it yet, but I'm going to add basically a proxy to OpenAI and the other models so that you can run Pydantic AI in the browser, see how it works, tweak the prompt, et cetera, et cetera. And we'll have some kind of limit per day of what you can spend on it, or like what the spend is.
The other thing we wanted to be able to do was for when you log into LogFire. We have quite a lot of drop-off: a lot of people sign up, find it interesting, and then don't go and create a project. And my intuition is that they're like, oh, OK, cool. But now I have to go and open up my development environment, create a new project, do something with the right token. I can't be bothered. And then they drop off and they forget to come back. And so we wanted a really nice way of being able to click here and you can run it in the browser and see what it does. As I think happens to all of us, I sort of started seeing if I could do it a week and a half ago, got something to run, and then ended up, you know, improving it. And suddenly I spent a week on it. But I think it's useful. Yeah.Alessio [00:53:15]: I remember maybe two, three years ago, there were a couple of companies trying to build in-browser terminals exactly for this. It's like, you know, you go on GitHub, you see a project that is interesting, but now you got to like clone it and run it on your machine. Sometimes it can be sketchy. This is cool, especially since you already make all the examples in your docs runnable. Like you said, you kind of test them. It sounds like you might just have that.Samuel [00:53:39]: So, yeah. The thing is that on every example in Pydantic AI, there's a button that basically says run, which takes you into Pydantic.run and has that code there. And depending on how hard we want to push, we can also have it like hooked up to LogFire automatically. So there's a like, hey, just come and join the project, and you can see what that looks like in LogFire.Swyx [00:53:58]: That's super cool.Alessio [00:53:59]: Personally for me, I think that's one of the biggest drop-offs from open source projects. It's kind of like, do this and this. And then as soon as something doesn't work, I just drop off.Swyx [00:54:09]: So it takes some discipline. You know, like there's been very many versions of this that I've been through in my career where you had to extract this code and run it. And it always falls out of date. Often we would have this concept of transclusion, where we have a separate code examples repo that we pulled into our docs. And it never really works. It takes a lot of discipline. So kudos to you on this.Samuel [00:54:31]: And it was years of maintaining Pydantic and people complaining, hey, that example is out of date now. But eventually we went and built pytest-examples, which is another of the hardest-to-search-for open source projects we ever built. Because obviously, as you can imagine, if you search pytest examples, you get examples of how to use pytest. But pytest-examples will basically go through your code, inside your docstrings, to look for Python code, and through the markdown in your docs, and extract that code and then run it for you, and run linting over it, and soon run type checking over it. And that's how we keep our examples up to date. But now we have these like hundreds of examples, all of which are runnable and self-contained. Or if they refer to the previous example, it's already structured so that they have to be able to import the code from the previous example. So why don't we give someone a nice place to just be able to actually run that using OpenAI and see what the output is. Lovely.Alessio [00:55:24]: All right. So that's kind of Pydantic.
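For anyone who wants the same docs-stay-tested setup, this is roughly what wiring up pytest-examples looks like. The imports and fixture names are taken from our reading of the project's README and may lag its current API:

```python
import pytest
from pytest_examples import CodeExample, EvalExample, find_examples

# Collect every Python snippet found in docstrings and markdown docs, then
# lint and execute each one as its own test case.
@pytest.mark.parametrize("example", find_examples("docs", "my_package"), ids=str)
def test_docs_examples(example: CodeExample, eval_example: EvalExample):
    eval_example.lint(example)
    eval_example.run(example)
```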
And the notes here, I just like going through people's X account, not Twitter. So for four years, you've been saying we need a plain text accessor to Jupyter notebooks. Yeah. I think people maybe have gone the other way, which may get even more opinionated, like with X and like all these kind of like notebook companies.Samuel [00:55:46]: Well, yes. So in reply to that, someone replied and said Marimo is that. And sure enough, Marimo is really impressive. And I've subsequently spoken to spoken to the Marimo guys and got to angel invest in their account. I think it's SeedGround. So like Marimo is very cool. It's doing that. And Marimo also notebooks also run in the browser again using Pyodide. In fact, I nearly got there. We didn't build Pydantic.run because we were just going to use Marimo. But my concern was that people would think LogFire was only to be used in notebooks. And I wanted something that like ironically felt more basic, felt more like a terminal so that no one thought it was like just for notebooks. Yeah.Swyx [00:56:22]: There's a lot of notebook haters out there.Samuel [00:56:24]: And indeed, I have very strong opinions about, you know, proper like Jupyter notebooks. This idea that like you have to run the cells in the right order. I mean, a whole bunch of things. It's basically like worse than Excel or similar. Similarly bad to Excel. Oh, so you are a notebook hater that invested in a notebook. I have this rant called notebook, which was like my attempt to build an alternative that is mostly just a rant about the 10 reasons why notebooks are just as bad as Excel. But Marimo et al, the new ones that are text-based, at least solve a whole bunch of those problems.Swyx [00:56:58]: Agree with that. Yes. I was kind of wishing for something like a better notebook. And then I saw Marimo. I was like, oh, yeah, these guys have are ahead of me on this. Yeah. I don't know if I would do the sort of annotation-based thing. Like, you know, a lot of people love the, oh, annotate this function. And it just adds magic. I think similarly to what Jeremy Howard does with his stuff. It seems a little bit too magical still. But hey, it's a big improvement from notebooks. Yeah.Samuel [00:57:23]: Yeah. Great.Alessio [00:57:24]: Just as on the LLM usage, like the IPyMB file, it's just not good to put in LLMs. So just that alone, I think should be okay.Swyx [00:57:36]: It's just not good to put in LLMs.Alessio [00:57:38]: It's really not. They freak out.Samuel [00:57:41]: It's not good to put in Git either. I mean, I freak out.Swyx [00:57:44]: Okay. Well, we will kill IPyMB at some point. Yeah. Any other takes? I was going to ask you just like, broaden out just about the London scene. You know, what's it like building out there, you know, over the pond?Samuel [00:57:56]: I'm an evening person. And the good thing is that I can get up late and then work late because I'm speaking to people in the U.S. a lot of the time. So I got invited just earlier today to some drinks reception.Samuel [00:58:09]: So I'm feeling positive about the U.K. right now on AI. But I think, look, like everywhere that isn't the U.S. and China knows that we're like way behind on AI. I think it's good that the U.K. is like beginning to say, this is an opportunity, not just a risk. I keep being told you should be at more events. You should be like, you know, hanging out with AI people more. My instinct is like, I'd rather sit at my computer and write code. 
I think that like, is probably a more effective way of getting people's attention. I'm like, I don't know. I mean, like a bit of me thinks I should be sitting on Twitter, not in San Francisco chatting to people. I think it's probably a bit of a mixture and I could probably do with being in the States a bit more. I think I'm going to be over there a bit more this year. But like, there's definitely the risk if you're in somewhere where everyone wants to chat to you about code where you don't write any code. And that's a failure mode.Swyx [00:58:58]: I would say, yeah, definitely for sure. There's a scene and, you know, one way to really fail at this is to just be involved in that scene. And have that eat up your time, but be at the right events and the ones that I'm running are good events, hopefully.Swyx [00:59:16]: What I say is like, use those things to produce high quality content that travels in a different medium than you normally would be able to. Because there's some selectivity, because there's a broad, there's a focused community on that thing. They will discover your work more. It will be highly produced, you know, that's the pitch over there on why at least I do conferences. And then in terms of talking to people, I always think about this, a three strikes rule. So after a while it gets repetitive, but maybe like the first 10, 20 conversations you have about people, if the same stuff is coming up, that is an indication to you that people like want a thing and it helps you prioritize in a more long form way than you can get in shallow interactions online, right? So that in person, eye to eye, like this is my pain at work and you see the pain and you're like, oh, okay. Like if I do this for you. You will love our tool and like, you can't really replace that. It's customer interviews. Really. Yeah.Samuel [01:00:11]: I agree entirely with that. I think that I think there's a, you're, you're right on a lot of that. And I think that like, it's very easy to get distracted by what people are saying on Twitter and LinkedIn.Swyx [01:00:19]: That's another thing.Samuel [01:00:20]: It's pretty hard to correct for which of those people are actually building this stuff in production in like serious companies and which of them are on day four of learning to code. Cause they have equally strident opinions and in like few characters, they, they seem equally valid. But which one's real and which one's not, or which one is from someone who really knows their stuff is, is hard to know.Alessio [01:00:40]: Anything else, Sam? What do you want to get off your chest?Samuel [01:00:43]: Nothing in particular. I think we, I've really enjoyed our conversation. I would say, I think if anyone who is like looked at, at Pydance AI, we know it's not complete yet. We know there's a bunch of things that are missing embeddings, like storage, MCP and tool sets and stuff like that. We're trying to be deliberate and do stuff well. And that involves not being feature complete yet. Like keep coming back and looking in a few months because we're, we're pretty determined to get that. We know that this stuff is like, whether or not you think that AI is going to be the next Excel, the next internet or the next industrial revolution is going to affect all of us enormously. And so as a company, we get that like making Pydantic AI the best agent framework is existential for us.Alessio [01:01:22]: You're also the first series A company I see that has no open roles for now. 
Every founder that comes in our podcast, the call to action is like, please come work with us.Samuel [01:01:31]: We are not hiring right now. I want to, I would love, uh, bluntly for Logfire to have a bit more commercial traction and a bit more revenue before I, before I hire some more people. It's quite nice having a few years of runway, not a few months of runway. So I'm not in any, any great appetite to go and like destroy that runway overnight by hiring another, another 10 people. Even if like we, the whole team is like rushed off their feet, kind of doing, as you said, like three to four startups at the same time.Alessio [01:01:58]: Awesome, man. Thank you for joining us.Samuel [01:01:59]: Thank you very much. Get full access to Latent.Space at www.latent.space/subscribe
    --------  
    1:04:04
  • The Agent Reasoning Interface: o1/o3, Claude 3, ChatGPT Canvas, Tasks, and Operator — with Karina Nguyen of OpenAI
    Sponsorships and tickets for the AI Engineer Summit are selling fast! See the new website with speakers and schedules live! If you are building AI agents or leading teams of AI Engineers, this will be the single highest-signal conference of the year for you, this Feb 20-22nd in NYC.We’re pleased to share that Karina will be presenting OpenAI’s closing keynote at the AI Engineer Summit. We were fortunate to get some time with her today to introduce some of her work, and hope this serves as nice background for her talk!There are very few early AI careers that have been as impactful as Karina Nguyen’s. After stints at Notion, Square, Dropbox, Primer, the New York Times, and UC Berkeley, She joined Anthropic as employee ~60 and worked on a wide range of research/product roles for Claude 1, 2, and 3. We’ll just let her LinkedIn speak for itself:Now, as Research manager and Post-training lead in Model Behavior at OpenAI, she creates new interaction paradigms for reasoning interfaces and capabilities, like ChatGPT Canvas, Tasks, SimpleQA, streaming chain-of-thought for o1 models, and more via novel synthetic model training. Ideal AI Research+Product ProcessIn the podcast we got a sense of what Karina has found works for her and her team to be as productive as they have been:* Write PRD (Define what you want)* Funding (Get resources)* Prototype Prompted Baseline (See what’s possible)* Write and Run Evals (Get failures to hillclimb)* Model training (Exceed baseline without overfitting)* Bugbash (Find bugs and solve them)* Ship (Get users!)We could turn this into a snazzy viral graphic but really this is all it is. Simple to say, difficult to do well. Hopefully it helps you define your process if you do similar product-research work. Show Notes* Our Reasoning Price War post * Karina LinkedIn, Website, Twitter* OSINT visualization work* Ukraine 3D storytelling* Karina on Claude Artifacts* Karina on Claude 3 Benchmarks* Inspiration for Artifacts / Canvas from early UX work she did on GPT-3* “i really believe that things like canvas and tasks should and could have happened like 2 yrs ago, idk why we are lagging in the form factors” (tweet)* Our article on prompting o1 vs Karina’s Claude prompting principles* Canvas: https://openai.com/index/introducing-canvas/ * We trained GPT-4o to collaborate as a creative partner. The model knows when to open a canvas, make targeted edits, and fully rewrite. It also understands broader context to provide precise feedback and suggestions.To support this, our research team developed the following core behaviors:* Triggering the canvas for writing and coding* Generating diverse content types* Making targeted edits* Rewriting documents* Providing inline critiqueWe measured progress with over 20 automated internal evaluations. We used novel synthetic data generation techniques, such as distilling outputs from OpenAI o1-preview, to post-train the model for its core behaviors. This approach allowed us to rapidly address writing quality and new user interactions, all without relying on human-generated data.* Tasks: https://www.theverge.com/2025/1/14/24343528/openai-chatgpt-repeating-tasks-agent-ai* * Agents and Operator* What are agents? “Agents are a gradual progression of tasks: starting with one-off actions, moving to collaboration, and ultimately fully trustworthy long-horizon delegation in complex envs like multi-player/multiagents.” (tweet)* tasks and canvas fall within the first two, and we are def. 
marching towards the third—though the form factor for 3 will take time to develop * Operator/Computer Use Agents* https://openai.com/index/introducing-operator/* Misc:* Andrew Ng* Prediction: Personal AI Consumer playbook* ChatGPT as generative OSTimestamps* 00:00 Welcome to the Latent Space Podcast* 00:11 Introducing Karina Nguyen* 02:21 Karina's Journey to OpenAI* 04:45 Early Prototypes and Projects* 05:25 Joining Anthropic and Early Work* 07:16 Challenges and Innovations at Anthropic* 11:30 Launching Claude 3* 21:57 Behavioral Design and Model Personality* 27:37 The Making of ChatGPT Canvas* 34:34 Canvas Update and Initial Impressions* 34:46 Differences Between Canvas and API Outputs* 35:50 Core Use Cases of Canvas* 36:35 Canvas as a Writing Partner* 36:55 Canvas vs. Google Docs and Future Improvements* 37:35 Canvas for Coding and Executing Code* 38:50 Challenges in Developing Canvas* 41:45 Introduction to Tasks* 41:53 Developing and Iterating on Tasks* 46:27 Future Vision for Tasks and Proactive Models* 52:23 Computer Use Agents and Their Potential* 01:00:21 Cultural Differences Between OpenAI and Anthropic* 01:03:46 Call to Action and Final ThoughtsTranscriptAlessio [00:00:04]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel, and I'm joined by my usual co-host, Swyx.swyx [00:00:11]: Hey, and today we're very, very blessed to have Karina Nguyen in the studio. Welcome.Karina [00:00:15]: Nice to meet you.swyx [00:00:16]: We finally made it happen. We finally made it happen. First time we tried this, you were working at a different company, and now we're here. Fortunately, you had some time, so thank you so much for joining us. Yeah, thank you for inviting me. Karina, your website says you lead a research team in OpenAI, creating new interaction paradigms for reasoning interfaces and capabilities like ChatGPT Canvas, and most recently, ChatGPT TAS. I don't know, is that what we're calling it? Streaming chain of thought for O1 models and more via novel synthetic model training. What is this research team?Karina [00:00:45]: Yeah, I need to clarify this a little bit more. I think it changed a lot since the last time we launched. So we launched Canvas, and it was the first project. I was a tech lead, basically, and then I think over time I was trying to refine what my team is, and I feel like it's at the intersection of human-computer interaction, defining what the next interaction paradigms might look like with some of the most recent reasoning models, as well as actually trying to come up with novel methods, how to improve those models for certain tasks if you want to. So for Canvas, for example, one of the most common use cases is basically writing and coding. And we're continually working on, okay, how do we make Canvas coding to go beyond what is possible right now? And that requires us to actually do our own training and coming up with new methods of synthetic data generation. The way I'm thinking about it is that my team is going from very full stack, from training models all the way up to deployment and making sure that we create novel product features that is coherent to what you're doing. So we're really working on that.swyx [00:02:08]: So it's, it's a lot of work to do right now. And I think that's why I think it's such a great opportunity. You know, how could something this big work in like an industrial space and in the things that we're doing, you know, it's a really exciting time for us. 
And it's just, you know, it's a lot of work, but what I really like about working in digital space is the, you know, the visual space is always the best place to stay. It's not just the skill sets that need to be done.Alessio [00:02:17]: Like we have, like, a lot of things to be done, but like, we've got a lot of different, you know, things to come up with. I know you have some early UX prototypes with GPT-3 as well, and kind of like maybe how that is informed, the way you build products.Karina [00:02:32]: I think my background was mostly like working on computer vision applications for like investigative journalism. Back when I was like at school at Berkeley, and I was working a lot with like Human Rights Center and like investigative journalists from various media. And that's how I learned more about like AI, like with vision transformers. And at that time, I was working with some of the professors at Berkeley AI Research.swyx [00:03:00]: There are some Pulitzer Prize winning professors, right, that teach there?Karina [00:03:04]: No, so it's mostly like was reporting for like teams like the New York Times, like the AP Associated Press. So it was like all in the context of like Human Rights Center. Got it. Yeah. So that was like in computer vision. And then I saw... I saw Crisolo's work around, you know, like interpretability from Google. And that's how I found out about like Anthropic. And at that time, I was just like, I think it was like the year when like Ukraine's war happened. And I was like trying to find a full-time job. And it was kind of like all got distracted. It was like kind of like spring. And I was like very focused on like figuring out like what to do. And then my best option at that time was just like continue my internship. At the New York Times and convert to like full-time. At the New York Times, it was just like working on like mostly like product engineering work around like R&D prototypes, kind of like storytelling features on the mobile experience. So it kind of like storytelling experiences. And like at that time, we were like thinking about like how do we employ like NLP techniques to like scrape some of the archives from the New York Times or something. But then I always wanted to like get into like AI. And like I knew OpenAI for a while, like since I was like, and I was like, I don't know, I don't know. So I kind of like applied to Anthropic just on the website. And I was rejected the first time. But then at that time, they were not hiring for like anything like product engineering or front-end engineering, which was something I was like, at that time, I was like interested in. And then there was like a new opening at Anthropic was like kind of like you are front-end engineer. And so I applied. And that's how my journey began. But like the earlier prototypes was mostly like I used like Clip.swyx [00:05:13]: We'll briefly mention that the Ukrainian crisis actually hit home more for you than most people because you're from the Ukraine and you moved here like for school, I guess. Yeah.Karina [00:05:23]: Yeah.swyx [00:05:23]: We'll come back to that if it comes up. But then you joined Anthropic, not just as a front-end engineer. You were the first. Is that true? Designer? Yeah.Karina [00:05:32]: Yes. I think like I did both product design and front-end engineering together. And like at that time it was like pre-CHPT. It was like, I think August 2022. And that was a time when Anthropic really decided to like do more product-y related things. 
And the vision was, we need to fund research, and building product is the best way to fund safety research, which I find quite admirable. So the first product that Anthropic built was Claude in Slack. It was sunsetted not long after, but it was one of the first, and I still come back to that idea of Claude operating inside an organizational workplace like Slack; there's something magical in there. I remember we built ideas like summarize the thread, but you can imagine having automated ways of, say, Claude summarizing multiple channels every week, custom for what you want. We built some really cool features, like being able to tag Claude and ask it to summarize what happened in the thread. So there were new ideas, but we didn't quite double down, because you could imagine Claude having access to the files or Google Drive that you can upload, and connectors, connections in Slack. Also the UX was kind of constraining at that time. I would think, oh, we wanted to do this feature, but the Slack interface constrained us from doing that, and we didn't want to be dependent on a platform like Slack. And then after ChatGPT came out, I remember in the first two weeks my manager gave me this challenge: can I reproduce a similar interface in two weeks? One of the early mistakes of being early in engineering is that I said yes; instead I should have said, you know, double it, two X the time. And this is how Claude.ai was born.
swyx [00:07:39]: Oh, so you actually wrote Claude.ai? As your first job?
Karina [00:07:43]: Yeah, I think the first 50,000 lines of code, without any reviews at that time, because there was no one. It was a very small team, six or seven people, who we called the deployment team.
swyx [00:07:59]: I actually interviewed at Anthropic around that time. I was given Claude in Sheets, and that was my other form factor. I was like, oh yeah, this needs to be in a table so we can just copy paste and span it out, which is kind of cool. The other rumor that we might as well mention: Raza Habib from HumanLoop often says that there was some version of ChatGPT inside Anthropic. You had the chat interface already, you had Slack, so why not launch a web UI? Basically, how did OpenAI beat Anthropic to ChatGPT? It seems kind of obvious to have it.
Karina [00:08:35]: I think the ChatGPT model itself came out way before we decided to launch Claude 2. And at that time, Claude 1.3 had a lot of hallucinations, actually. So one of the concerns was, I don't think the leadership was convinced, had the conviction, that this was the model you want to deploy or something. There were a lot of discussions around that time. But Claude 1.3, I don't know if you played with it, was extremely creative and really cool.
swyx [00:09:07]: Nice.
Alessio [00:09:08]: It's still creative. And you had a tweet recently where you said things like Canvas and Tasks could have happened two years ago, but they did not. Do you know why not?
Was it too many researchers at the labs not focused on UX? Was it just not a priority for the labs?
Karina [00:09:24]: Yeah, I come back to that question a lot. I was working on something similar to Canvas, but for Claude, at that time in 2023. It was the same idea of a Claude workspace, where a human and a Claude could have a shared workspace. Yeah. And that's Artifacts, which is like a document, right?
swyx [00:09:44]: No, no, no. This is Claude Projects.
Karina [00:09:46]: I don't know. I think it kind of evolved. At that time I was in the product engineering team and then I switched to the research team, and the product engineering team grew so much. They had their own ideas for Artifacts and Projects. So not necessarily, maybe they looked at my previous explorations, when I was exploring Claude documents or the Claude workspace. I don't think anybody was thinking about UX as much, or not many researchers understood that. And the inspiration, actually, I still have all the sketches, but the inspiration was from Harry Potter, the Tom Riddle diary. That was the inspiration: having Claude write into the document and communicate back.
swyx [00:10:34]: So like in the movie, you write a little bit and then it answers you.
Karina [00:10:37]: Yeah.
swyx [00:10:38]: Interesting.
Karina [00:10:39]: But that was only in the context of writing. I think Canvas also serves coding, one of the most common use cases. But yeah, those ideas could have happened two years ago. I just don't think it was a priority at that time. It was very unclear. The AI landscape at that time was very nascent, if that makes sense. Even when I would talk to product designers at that time, they were not thinking about it at all. They did not have AI in mind. And it's kind of interesting, except for one of my designer friends, Jason Yuan, who was thinking about that.
swyx [00:11:19]: And Jason is now at New Computer. We'll have them on at some point. I had them speak at my first summit and you're speaking at the second one, which will be really fun. We'll stay on Anthropic for a bit and then we'll move on to more recent things. I think the other big project that you were involved with was Claude 3. Just tell us the story. What was it like to launch one of the biggest launches of the year?
Karina [00:11:39]: I think I was, so Claude 3.
swyx [00:11:43]: This is Haiku, Sonnet, Opus all at once, right?
Karina [00:11:46]: Yes, it was the Claude 3 family. I was part of the post-training fine-tuning team. We only had, what, 10 or 12 people involved, and it was really, really fun to work together as friends. I was mostly involved on the Claude 3 Haiku post-training side, and then evaluations, developing new evaluations, and literally writing the entire model card. And I had a lot of fun. The way you train each model is very different, obviously. But what I've learned is that you will end up with, I don't know, 70 models, and every model will have its own brain damage.
So it's just, kind of, bugs.
swyx [00:12:28]: Like personality-wise, or performance benchmarks?
Karina [00:12:31]: I think every model is very different. One of the interesting research questions is, how do you understand the data interface? How do you understand the interactions as you train the model? If you train the model on contradictory data sets, how can you make sure that there won't be any weird side effects? And sometimes you get side effects. The learning is that you have to iterate very rapidly, detect and debug it, and address it with interventions. And actually some of the techniques from software engineering are very useful here.
swyx [00:13:09]: Yeah, exactly. I really empathize with this, because with data sets, if you put in the wrong one, you can basically screw up the past month of training. The problem with this for me is the existence of YOLO runs. I cannot square this with YOLO runs. If you're telling me you're taking such care about data sets, then every day I'm going to check in, run evals and do that stuff. But then we also know that YOLO runs exist. So how do you square that?
Karina [00:13:32]: Well, I think it depends on how much compute you have, right? A lot of the question for researchers is, how do you most effectively use the compute that you have? Maybe you can have two or three runs that are YOLO runs, but if you don't have that luxury, you need to prioritize ruthlessly: what are the experiments that are most important to run? I think this is what research management basically is.
swyx [00:14:04]: Funding efforts. Prioritizing.
Karina [00:14:07]: Taking research bets and making sure you build conviction in those bets rapidly, such that if they work out, you double down on them.
swyx [00:14:15]: You almost have to ablate data sets too, do it on a side channel and then merge it in. It's kind of super interesting. Tell us more. I have the model card in front of me; you said constructing this table was slightly painful. Just pick a benchmark. What's an interesting story behind one of them?
Karina [00:14:33]: I would say GPQA was kind of interesting. I think we were the first lab, Anthropic was the first lab, to run it.
swyx [00:14:42]: Oh, because it was relatively new after NeurIPS?
Karina [00:14:45]: Yeah, the first to publish GPQA numbers. One of the things that I personally learned is that some evals are very high variance, and GPQA happened to be a hugely high-variance evaluation. So one thing we did was do five runs and take the average. But the hardest thing about the model card is that none of the numbers are apples to apples. Will knows this. You actually need to go back to, I don't know, the GPT-4 model card and read the appendix just to make sure that their settings are the same as the settings you're running. So it's never apples to apples.
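As an aside, the averaging trick Karina mentions for high-variance evals like GPQA can be sketched in a few lines. The `model`, `grade`, and question format below are hypothetical stand-ins for a real harness, not any lab's actual setup.

```python
import statistics

def run_eval_once(model, questions, grade):
    """One pass over the eval set; returns accuracy in [0, 1]."""
    correct = sum(grade(model(q["prompt"]), q["answer"]) for q in questions)
    return correct / len(questions)

def run_eval_averaged(model, questions, grade, runs=5):
    """Repeat a noisy eval several times and report the mean and its spread."""
    scores = [run_eval_once(model, questions, grade) for _ in range(runs)]
    return statistics.mean(scores), statistics.pstdev(scores)
```

Reporting the spread alongside the mean is what makes two labs' numbers comparable at all, which is the apples-to-apples problem she describes.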
But it's interesting how, when you market models as products, customers don't necessarily know that.
swyx [00:15:44]: They're just like, my MMLU is 99, what do you mean? Yeah, exactly. Why isn't there an industry-standard harness, right? There's the EleutherAI thing, which it seems like none of the model labs use. And then OpenAI put out simple-evals and nobody uses that. Why isn't there just one standard way everyone runs this? Because the alternative approach is you rerun your evals on their models, and obviously your numbers for them will be lower, and they'll be unhappy. So that's why you don't do that.
Karina [00:16:12]: It operates on an assumption that the next generation of the model, or the model that you produce next, is going to behave the same. For example, the way you prompt O1 or Claude 3 is going to be very different. There's a lot of prompting that you need to do to get the evals to run correctly. Sometimes the model will just output new lines and the way it's parsed will be incorrect or something. This happened with Stanford, I remember, when they were running benchmarks. HELM? Yeah, HELM. And somehow Claude was always not performing well, and that's because the way they prompted it was kind of wrong. So there are a lot of techniques. It's just very hard, because nobody even knows.
swyx [00:17:00]: Has that gone away with chat models instead of, you know, just raw completion models?
Karina [00:17:05]: Each eval can also be run in very different ways. Sometimes you can ask the model to output in XML tags, but some models are not really good at XML tags. So do you change the formatting per model, or do you run the same format across all models? And then the metrics themselves, right? Accuracy is one thing, but maybe you care about some other metric, like F-score or something else. It's hard. I don't know.
Alessio [00:17:36]: And talking about O1 prompting, we just had an O1 prompting post on the newsletter, which I think was...
swyx [00:17:42]: Apparently it went viral within OpenAI. I got pinged by other OpenAI people. They were like, is this helpful to us? I'm like, okay. Oh, nice.
Alessio [00:17:50]: I think it's maybe one of the top three most read posts now. And I didn't write it.
swyx [00:17:57]: Anyway, go ahead.
Alessio [00:17:57]: What are your tips on O1 versus Claude prompting, or what are the things you took away from that experience? Especially now, I know that with 4o for Canvas, you've done RL on the model afterwards. So just general learnings on how to think about prompting these models differently.
Karina [00:18:12]: I actually don't think I've even harnessed the magic of O1 prompting. But one thing I found is that if you give O1 hard constraints on what you're looking for, the model will have a much easier time selecting the candidates and matching the candidate that most fulfills the criteria you gave. And I think there's a class of problems like this that O1 excels at.
For example, if you have a bio question, or one in chemistry, where you have very specific criteria about the protein or some of the chemical bindings or something, then the model will be really good at determining the exact candidate that matches those criteria.
swyx [00:19:04]: I have often thought that we need a new IFEval for this, because this is basically instruction following, isn't it? But I don't think IFEval has multi-step instruction following. That's basically what I use AI News for: I have a lot of prompts and a lot of steps and a lot of criteria, and O1 just checks through each one systematically. And we don't have any evals like that.
Karina [00:19:24]: Yeah.
Alessio [00:19:25]: Does OpenAI know how to prompt O1? Sam is always talking about incremental deployments and having people get used to it. When you release a model, you obviously do all the safety testing, but do you feel like people internally know how to get a hundred percent out of the model? Or are you also spending a lot of time learning from the outside how to better prompt O1 and all these things?
Karina [00:19:50]: I certainly think that you learn so much from external feedback too. I don't fully know how people use O1. I think a lot of people use O1 for really hardcore coding questions. I don't fully know how to best use O1 myself, except that I use O1 to do some synthetic data explorations. But that's it.
Alessio [00:20:16]: Once a model is coming out, do people inside of OpenAI get a company-wide memo like, hey, this is how you should try to prompt this? Especially for people that might not be close to it during development. I don't know if you can share anything, but I'm curious how these things get shared internally.
Karina [00:20:34]: I feel like I'm in my own little corner in research. I don't really look at some of the Slack channels.
swyx [00:20:40]: It's very, very big.
Karina [00:20:41]: So I actually don't know if something like this exists. Probably. It might exist because we need to share guides with customers, like, how do you use this model? So probably there is.
swyx [00:20:56]: I often say this: the reason AI engineering can exist outside of the model labs is because the model labs release models with capabilities that they don't even fully know, because you never trained specifically for them. It's emergent. And you can rely on basically crowdsourcing the search of that behavior space to the rest of us. So you don't have to know. That's what I'm saying.
Karina [00:21:20]: I think an interesting thing about O1 is that it's really hard for the average human. Sometimes I don't even know whether the model produced the correct output or not. It's really hard for me to verify even hard STEM questions; if I'm not an expert, I usually don't know. So the question of alignment is actually more important for these complex reasoning models: how do we help humans verify the outputs of these models?
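As a rough sketch of the "multi-step IF eval" idea swyx floats above, a harness can check a response against a list of explicit constraints one by one. The specific criteria here are invented for illustration and are not an existing benchmark.

```python
# Each constraint is checked separately, so you see which instructions a response
# satisfies rather than a single opaque score. Criteria are made up for illustration.
CRITERIA = [
    ("starts_with_title", lambda text: text.lstrip().startswith("#")),
    ("under_200_words", lambda text: len(text.split()) < 200),
    ("cites_a_source", lambda text: "http" in text),
]

def check_response(text: str) -> dict:
    """Return a per-constraint pass/fail map for one model response."""
    return {name: bool(check(text)) for name, check in CRITERIA}

def fraction_satisfied(text: str) -> float:
    """Aggregate score: fraction of constraints the response meets."""
    results = check_response(text)
    return sum(results.values()) / len(results)
```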
And learning from external feedback is kind of cool.
swyx [00:21:56]: For sure. One last thing on Claude 3. You had a section on behavioral design. Anthropic is very famous for the HHH goals. What were your insights there? Maybe just talk a little bit about what you explored.
Karina [00:22:09]: I think behavioral design is really cool, and I'm glad I made it a section.
swyx [00:22:17]: Like you weren't going to publish one and then you insisted on it, or what?
Karina [00:22:20]: I just put the section in, and Jared, one of my most favorite researchers, was like, yeah, that's cool, let's do that. Nobody had this term, behavioral design, for models necessarily. It's kind of a new little field of extending product design into model design. How do you create a behavior for the model in certain contexts? For example, in Canvas, one of the things we had to think about is, okay, now the model enters a more collaborative environment, a more collaborative context. So what's the most appropriate behavior for the model to act as a collaborator? Should it ask more follow-up questions? What should the tone be? What is a collaborator's tone? It's different from a chat conversationalist versus a collaborator. So how do you shape the perspective, the persona and the personality, around that? It has some philosophical questions too. I guess I can talk more about the methods of creating the personality. Please. It's the same thing as creating a character in a video game or something.
swyx [00:23:39]: Charisma, intelligence. Yeah, exactly. Wisdom.
Karina [00:23:42]: What are the core principles? Helpful, harmless, honest. And obviously for Claude this was much easier, I would say, than for ChatGPT. For Claude, it's baked into the mission, right? Honest, harmless, helpful. But the most complicated thing about model behavior, or behavioral design, is that sometimes two values contradict each other. This happened in Claude 3: one of the main things we were thinking about was, how do we balance honesty versus harmlessness or helpfulness? We don't want the model to always refuse even innocuous queries, like some creative writing prompts, but you also don't want the model to act harmful or something. There's always a balance between those two, and it's more art than science. This is what data set craft is: more of an art than a literal science. You can definitely do empirical research on this, but this is the idea of synthetic data. If you look back at the constitutional AI paper, it's about how you create completions such that they agree with certain principles you want your model to follow. So if you create the core values of the model, how do you decompose those core values into specific scenarios? How does the model need to express its honesty in a variety of scenarios?
And this is where generalization happens, when you craft the persona of the model.
swyx [00:25:22]: It seems like what you described, behavior modification or shaping, was done as a side job. I think Anthropic focused on it first and the most, but now every lab sort of has one. The vibes officer for you guys is Amanda, for OpenAI it's Roon, and for Google it's Steven Johnson and Raiza, who we had on the podcast. Do you think this is a job? Like every company needs a tastemaker?
Karina [00:25:50]: I think the model's personality is actually a reflection of the company, or of the people who create that model. For Claude, I think Amanda was doing a lot of the Claude character work, and I was working with her at the time.
swyx [00:26:04]: But there's no team, right? Claude character work. Now there's a little bit of a team. Isn't that cool?
Karina [00:26:09]: But before that there was none. With Claude 3, we kind of doubled down on the feedback from Claude 2. We didn't even think about it, but people said Claude 2 was so much better at writing and had a certain personality, even though that was unintentional. We did not pay that much attention and didn't even know how to productionize this property of the model having a better personality. With Claude 3, we had to double down, because we knew that if we launched it in chat, we wanted Claude's honesty to be really good for enterprise customers. So we wanted to make sure that hallucinations would go down, that factuality would go up or something. We didn't have a team until after Claude 3, I guess.
swyx [00:26:58]: I mean, it's growing now, and I think everyone's taking it seriously.
Karina [00:27:00]: At OpenAI there is a team called Model Design. Joanne, the PM, is leading that team, and I work very closely with them. We worked together on the writing improvements that we did with ChatGPT last year, and then I was working on this collaboration piece: how do you make ChatGPT act like a collaborator for Canvas? We worked together on some of those projects.
swyx [00:27:25]: I don't think his actual name, other than Roon, is publicly known, but he's mostly doxxed.
Alessio [00:27:32]: We'll beep it and then people can guess. Do we want to move on to OpenAI and some of the recent work, especially Canvas? The first thing about Canvas is that it's not just a UX thing. You have a different model on the backend, which you post-trained on O1-preview distilled data, which was pretty interesting. Can you run people through it? You come up with a feature idea; how do you then decide what goes in the model, what goes in the product, and what does that process look like?
Karina [00:28:03]: I think the most unique thing about ChatGPT Canvas, what I really liked about it, was that the team was formed out of thin air. It was July 4th or something... Wow. During the break. On Independence Day.
swyx [00:28:17]: They just, like, okay.
Karina [00:28:18]: I think there was some company break or something. I remember I was taking a break, and I was pitching this idea to Barret Zoph. Barret Zoph, yeah.
Who was my manager at that time. I just wanted to create this Canvas thing or something, and I really didn't know how to navigate OpenAI. It was my first month at OpenAI and I really didn't know how to get product to work with me on some of these ideas. So I'm really grateful to Barret and Mira, who helped me staff this project, basically. It was this 4th of July, and Barret, yeah, actually, with an engineering manager, was like, yeah, we should staff this project with five or six engineers or something, and then Karina can be a researcher on this project. This is how the team was formed, kind of out of thin air. I didn't know anyone there at that time, except for Thomas Dimson. He did the first initial engineering prototype of the canvas, and it kind of reshaped from there. We learned a lot along the way about how to work together as product and research. I think this was one of the first projects at OpenAI where research and product worked together from the very beginning, and the reason we made it a successful project, in my opinion, is that designers, engineers, PM and the research team were all together. And we would push back on each other: if it doesn't make sense to do something on the model side, we would collaborate with applied engineers to make sure it gets handled on the applied side. But the idea is that you can only go so far with a prompted baseline. A prompted ChatGPT was the first thing we tried, Canvas as a tool or something: how do we define the behavior of the canvas? But then we found different edge cases that we wanted to fix, and the only way to fix some of those edge cases was actually through post-training. So what we did was retrain the entire 4o plus our Canvas stuff. There are two reasons we did this. The first is that we wanted to ship it as a beta model in the dropdown menu, so we could rapidly iterate on users' feedback as we shipped it, and not go through the entire integration process into the main model or something, which took some time. From beta to GA it took, I think, three months. So we wanted to ship our own model with that feature to learn from the user feedback very quickly. That was one of the decisions we made. And then with Canvas itself, we had a lot of different behavioral questions. Again, it's behavioral engineering, various behavioral craft: when does Canvas need to write comments? When does it need to update or edit the document? When does it need to rewrite the entire document versus edit the very specific section the user asks about? And when does it need to trigger the Canvas itself? Those were the behavioral engineering questions we had. At that time, I was also working on writing quality.
So that was the perfect way for us to both teach the model how to use Canvas and also improve writing quality, since writing was one of the main use cases for ChatGPT. I think that was the reasoning around it.
swyx [00:31:55]: There are so many questions. Oh my God. Quick one: what does improved writing quality mean? What are the evals?
Karina [00:32:01]: What are the evals? Yeah. The way I'm thinking about it, there are two directions. The first direction is, how do you improve the quality of the writing for the current use cases of ChatGPT? Most of those use cases are nonfiction writing: email writing, maybe blog posts, cover letters is one of the main ones. The second one is, how do we teach the model to literally think more creatively, to write in a more creative manner, such that it creates novel forms of writing? The second one is much more of a longer-term research question, while the first one is more like, okay, we just need to improve data quality for the writing use cases the models are used for. It's a more straightforward question. As for how we evaluated writing quality: I worked with Joanne's team on model design. They had a team of model writers, and we would work together, and it's just a human eval, an internal human eval, on the prompt distribution that we cared about. We wanted to make sure that the models we trained were always better or something.
swyx [00:33:20]: So like some test set of a hundred prompts that you want to make sure you're good on. I don't know how big the prompt distribution needs to be, because you are literally catering to everyone, right?
Karina [00:33:32]: Yeah. I think it was a much more opinionated way of improving writing quality, because we worked together with the model designers to come up with core principles of what makes a particular piece of writing good. What makes email writing good? We had to craft, literally, a rubric of what makes it good, and then make sure during the eval that we check the marks on this rubric.
swyx [00:33:58]: That's what I do. That's what school teachers do.
Karina [00:34:02]: Yeah. It's really funny.
swyx [00:34:03]: Like, yeah, that's exactly how we grade essays.
Karina [00:34:06]: Yeah.
Alessio [00:34:06]: I guess my question is, when do you work the improvements back into the model? The Canvas model is better at writing. Why not just make the core model better too? For example, I built this small podcasting thing for a podcast, and I have the 4o API, and I asked it to write a write-up about the episode based on the transcript. Then I've done the same in Canvas. The Canvas one is a lot better. The one from the raw 4o starts with "the podcast delves", and I was like, no, I'm not delving in the third word. Why not put that back into core 4o, or is there just...
Karina [00:34:38]: I think we put it back in the core now.
Alessio [00:34:40]: Yeah. So the 4o Canvas now is the same as 4o. You must've missed that update. What's the process? I think it's just like an A/B test almost, right? To me, it feels, I mean, I've only tried it like three times.
But the Canvas output feels very different from the API output.
Karina [00:35:01]: Yeah, there's always a difference in model quality. I would say the original beta model that we released with Canvas was actually much more creative than even right now when I use 4o with Canvas. I think it's just the complexity of the data; it's kind of a versioning issue. Your version 11 will be very different from your version 8, right? Even though the stuff you put in is the same or something.
swyx [00:35:32]: It's a good time to say that I have used it a lot more than three times. I'm a huge fan of Canvas. It's weird when I talk to my other friends; they don't really get it yet, or they don't really use it yet. I think it's maybe because it's sold as sort of writing help, when really it's the scratch pad. What are the core use cases? Or like, yeah.
Karina [00:35:53]: Oh yeah, I'm curious.
swyx [00:35:54]: Literally drafting. Drafting anything. If I want to draft copy for the conference I'm running, I'll put it there first, then I'll just have the Canvas up and say what I don't like about it, and it changes. I will maybe edit stuff there and paste it in. For example, I wanted to draft a brainstorm list of signs that you may be an NPC, just for fun, just a blog post for fun. And I was like, okay, I'll do 10 of these and then I want you to generate the next 10. So I wrote 10, I pasted them into ChatGPT, and it generated the next 10, and they all sucked, all horrible. But it also spun up the Canvas with the blog post, and I was like, okay, self-critique why your output sucks and then try again. And it just iterates on the blog post with me as a writing partner, and it is so much better than, I don't know, the intermediate steps. That would be my primary use case: literally drafting anything. The other way I'll put it, and I'm not putting words in your mouth, this is how I view what Canvas is and why it's so important: it's basically an inversion of what Google Docs wants to do with Gemini. That's Google Docs on the main screen and then Gemini on the side, and what ChatGPT has done is do the chat thing first and then the docs on the side. It's a reversal of what the main thing is. Google Docs starts with the canvas first that you can edit and whatever, and then maybe sometimes you call in the AI assistant, but ChatGPT is kind of AI first, with the side output being the Google Doc.
Karina [00:37:22]: I think we definitely want to improve the writing use case in terms of how we make it easier for people to format or do some of the editing. There is still a lot of room for improvement, to be honest. The other thing is coding, right? One of the things we'll be doubling down on is actually executing code inside the canvas. And there are a lot of questions about how you evolve this. It's kind of an IDE for both. Where I'm coming from is that ChatGPT evolves into this blank canvas.
It's kind of like the interface can morph itself into whatever you're trying to do: the model should try to derive your true intent and then modify the interface based on that intent. If you're writing, it should become the most powerful writing IDE possible. If it's coding, it should become a coding IDE or something.
swyx [00:38:14]: I think it's a little bit of an odd decision, for me, to call those two things by the same product name, because they're basically two different UIs. One is Code Interpreter plus plus, the other one is Canvas. I don't know if you have other thoughts on Canvas.
Alessio [00:38:27]: No, I'm just curious about some of the harder things. When I was reading, for example, about forcing the model to do targeted edits versus full rewrites, it sounds like it was really hard. In the AI engineer's mind, maybe it's just, pass one sentence in the prompt and it's just going to rewrite that sentence, right? But obviously it's harder than that. What are some of the hard things that people don't understand from the outside about building products like this?
Karina [00:38:50]: It's always hard with any new product feature, like Canvas or Tasks or any other new feature, that you don't know how people will use it. So how do you even build evaluations that simulate how people would use the feature? It's always really hard for us. Therefore we try to lean on iterative deployment in order to learn from user feedback as much as possible. Again, we didn't know that code diffs would be very difficult for the model, for example. So do we go back and fundamentally improve code diffs as a model capability, or do we do a workaround where the model just rewrites the entire document, which yields higher accuracy? Those are some of the decisions we had to make: how do you raise the bar on product quality, while making sure model quality is also part of it, and what kind of trade-offs are you okay with? I think this is a new way of product development, where product research, model training and product development go hand in hand. One of the hardest things is defining the entire model behavior; there are so many edge cases that might happen, especially when you combine Canvas with other tools, right? Canvas plus DALL-E, Canvas plus search. If you select a certain section and then ask for search, how do you build such evals? What kinds of features or behaviors do you care about the most? And this is how you build evals.
swyx [00:40:35]: You tested against every feature of ChatGPT? No. Oh, okay. I don't think there's that many that you can. It would take forever.
Karina [00:40:44]: But it's the same. The decision boundary between Python, ADA, advanced data analysis, versus Canvas is one of the trickiest decision-boundary behaviors we had to figure out: how do you derive the intent from the user's query? Deriving the intent, meaning does the user expect Canvas or some other tool, and then making sure it maximally matches that intent, is actually still one of the hardest problems. Especially with agents, right?
You don't want agents to go off for five minutes, do something in the background, and then come back with some mid answer that you could have gotten from a normal model, or an answer that you didn't even want because it didn't have enough context and didn't follow up correctly.
swyx [00:41:40]: You said the magic word. We have to take a shot every time you say it. You said agents.
swyx [00:41:46]: So let's move to Tasks. You just launched Tasks. What was that like? What was the story? I mean, it's your baby.
Karina [00:41:52]: Now that I have a team... Tasks was purely my resident's project. I was mostly a supervisor, so I delegated a lot of things to my resident, Vivek. This is one of the projects where I learned management, I would say. But it was really cool. It's a very similar model; I was trying to replicate the Canvas operational model, how we operate with product people, the product and applied orgs, together with research, and the same thing happened: I was trying to replicate the methods and the operational process with Tasks. Tasks was developed in less than two months. If Canvas took, I don't know, four months, then Tasks took two. And it's a very similar process of, how do we build evals? Some people ask for reminders in regular ChatGPT even though they know it doesn't work, so there is some demand, some desire from users, to do this. I feel like Tasks is a simple feature, in my opinion, something that you would want from any model. But the magic is that because the model is so general, it knows how to use search or Canvas or Code Interpreter. You can have it write stories or create Python puzzles, and when that is coupled with Tasks it actually becomes really, really powerful. It was the same idea of, how do we shape the behavior of the model? Again, we shipped it as a beta model in the model dropdown, and we are working towards getting that feature integrated into the core model. I feel like the principle is that everything should be in one model, but because of some of the operational difficulties, it's much easier to deploy a separate model first, learn from the user feedback, iterate very quickly, and then improve the core model, basically. Again, this was a project where, from the very beginning, designers, engineers and researchers were working all together, and together with the model designers we were trying to come up with evaluations and testing and bug bashing. There was a lot of cool synergy.
swyx [00:44:12]: Evals, bug bashing. I'm trying to distill, and I would love a canvas for this, what the ideal product management or research management process is. Start from: do you have a PRD? Do you have a doc with these things? Yes. And then from the PRD, you get funding, maybe, or staffing resources, whatever. Yes. And then a prototype, maybe.
Karina [00:44:37]: I would say the prototype was the prompted baseline. Everything starts with a prompted baseline. And then we craft certain evaluations that capture what you want.
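As a minimal sketch of the eval-gating logic Karina describes here (only train if the prompted baseline fails the new eval, and only accept a candidate that improves it without regressing elsewhere), something like the following could work; every function name and threshold is an illustrative assumption, not OpenAI's actual pipeline.

```python
def worth_training(feature_eval, baseline_model, threshold=0.9):
    """If the prompted baseline already clears the bar, there is nothing to hill-climb."""
    return feature_eval(baseline_model) < threshold

def accept_candidate(candidate, baseline, feature_eval, regression_evals, tolerance=0.01):
    """Accept a trained candidate only if the feature eval improves and the
    core-capability evals do not regress beyond a small tolerance."""
    if feature_eval(candidate) <= feature_eval(baseline):
        return False
    return all(ev(candidate) >= ev(baseline) - tolerance for ev in regression_evals)
```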
You want to measure progress, at least for the model, make sure the evals are good, and make sure the prompted baseline actually fails on those evals, because then you have something to hill-climb on. And then once you start iterating on the model training, it's actually very iterative: every time you train the model, you look at the benchmark, look at your evals, and if it goes up, that's good. But you also want to make sure it's not super overfitting, and that's where you run the other evals, right? Like intelligence evals or something.
swyx [00:45:20]: You don't want regressions on the other stuff. Right. Okay. Is that your job, or is that the rest of the company's job?
Karina [00:45:26]: I think it's mainly my... Really? The job of the people who...
swyx [00:45:30]: Because regressions are going to happen and you don't necessarily own the data for the other stuff.
Karina [00:45:34]: What's happening right now is that basically you only update your own data sets, right? So you compare against the baseline, you compare the regressions against the baseline model.
swyx [00:45:47]: Model training and then bug bash. And that's about it. And then ship.
Karina [00:45:50]: Actually, I did the course with Andrew Ng. There was one little lesson around this.
swyx [00:45:57]: I haven't seen it. Product research. You tweeted a picture with him and it wasn't clear if you were working on a course. I mean, it looked like the standard course picture with Andrew Ng. Okay, there was a course with him. What was it like working with him?
Karina [00:46:08]: No, I'm not working with him. I just did the course with him.
Alessio [00:46:11]: How do you think about Tasks? I started creating a bunch of them. Going back to composability, do you see these as being composable together later? Will you be able to schedule one task that does multiple tasks chained together? What's the vision?
Karina [00:46:27]: I would say Tasks is a foundational module, obviously, to generalize to all sorts of behaviors that you want. Sometimes I see people have three tasks.
Karina [00:46:41]: And right now I don't think the model handles that very well. Ideally we learn from the user behavior, and ideally the model will just be more proactive in suggesting, oh, I can do this for you every day because I've observed that you do that every day or something. So it becomes more of a proactive behavior. Right now you have to be more explicit, like, oh yeah, every day, remind me of this. But ideally the model will always think about you in the background and suggest, okay, I noticed you've been reading these particular articles, maybe I can suggest some every day or something. So it's much more of a natural friend, I think.
swyx [00:47:35]: Well, there is an actual startup called Friend that is trying to do that. We'll interview Avi at some point. But it sounds like the guiding principle is just what is useful to you.
It's a little bit B2C. Is there any B2B push at all, or do you not think about that?
Karina [00:47:51]: I personally don't think about that as much, but I definitely feel like B2B is cool. Again, I come back to Claude in Slack. It was one of the first interfaces where the model was operating inside your organization, right? It would be very cool for the model to handle that, to become a productive member of your organization, and even to process things. Right now I'm thinking about processing user feedback. It would be very cool if the model would just start doing this for us, so we don't have to hire a new person just for this or something. And you have very simple data analysis or data analytics on how the features are doing.
swyx [00:48:36]: Do you do this analysis yourself? Or do you have a data science team that tells you insights?
Karina [00:48:40]: I think there are some data scientists.
swyx [00:48:43]: I've often wondered, I think there should be some startup or something that does automated data insights. I just throw you my data, you tell me. Because that's what the data team at any company does, right? Give us your data, we'll make PowerPoints.
Karina [00:48:59]: That'd be very cool.
swyx [00:49:00]: I think that's a really good vision. You had thoughts on agents in general, some more proactive stuff. You actually tweeted a definition, which is kind of interesting.
Karina [00:49:09]: I did.
swyx [00:49:10]: Well, I'll read it out to you and you tell me if you still agree with yourself. This is five days ago: "Agents are a gradual progression of tasks, starting off with one-off actions, moving to collaboration, ultimately fully trustworthy long horizon delegation in complex environments like multiplayer, multi-agents. Tasks and canvases fall within the first two." I know it's uncomfortable to have your tweets read to you; I have had this done to me. What is the third one?
Karina [00:49:34]: One of my weaknesses is that I like writing long sentences. I feel like that's a good thing. I need to learn how to...
swyx [00:49:39]: That's fine. Is that your definition of agents? What are you looking for?
Karina [00:49:43]: I'm not sure if this is my definition of agents, but it's more how I think it makes sense, right? For me to trust an agent with my passwords or my credit card, I actually need to build trust with that agent that it will handle my tasks correctly and reliably. And the way I would go about this is how I would naturally collaborate with other people. On any project, when we first meet, we don't even know each other. We don't know each other's working styles, what I prefer, what they prefer, how they prefer to communicate, et cetera. So you spend the first, I don't know, two weeks just learning their style of working, and over time you adapt to their working style, and this is how you create the collaboration. At the beginning you don't have much trust, so how do you build more trust? It's the same thing as with a manager, right?
How do you build trust with your manager? What do they need to know about you? What do you need to know about them? Over time you build trust, and trust builds through collaboration, which is why I feel like building Canvas was kind of the first step towards more collaborative agents. As with humans, you need to show consistency, consistent effort towards each other, showing that you care about each other and work together well or something. So consistency and collaboration are what create trust. And then I will naturally try to delegate tasks to a model, because I know the model will not fail me or something. So it's about building the intuition for the form factor of new agents. Sometimes I feel like a lot of researchers, or people in the AI community, are so into "yeah, agents, delegate everything, blah, blah, blah", but on the way towards that, I think collaboration is actually one of the main milestones to get over, because that's how you learn some of the implicit preferences that help towards this full delegation model.
swyx [00:51:55]: Trust is very important. I have an AGI working for me and we're still working on the trust issues. Okay. We are recording this just before the launch of Operator. The other side of agents that is very topical recently is computer use, and Anthropic launched computer use recently. You're not saying this, but OpenAI is rumored to be working on things, and a lot of labs are exploring this sort of driving a computer generally. How important is that for agents?
Karina [00:52:23]: I think it will be one of the core capabilities of agents. Agents using the desktop, or your computer, is the delegation part. You might want to delegate to an agent: order a book for me, or search for a flight and then order things. This idea has been flying around for a long time, since at least 2022 or something, and finally we are here. There's a lot of lag between idea and full execution, on the order of two to three years.
swyx [00:53:01]: The vision models had to get better. A lot better.
Karina [00:53:04]: The perception and everything. But I think it's really cool. It has implications for consumers, definitely, for delegation. But again, I think latency is one of the most important factors here. You want to make sure that the model correctly understands what you want, and if it doesn't understand, or doesn't know the full context, it should ask a follow-up question and then use that to perform the task. The agent should know whether it has enough information to complete the task with maximal success or not. I think this is still an open research question. The second idea is that it also enables a new class of research questions about computer use agents: can we use it in RL, right? This is a very cool, nascent area of research.
swyx [00:53:59]: What's one thing?
What's one thing that you think people will be using computer use agents a lot for by the end of this year?
Karina [00:54:05]: I don't know. It's really hard to predict. I'm trying to think...
swyx [00:54:09]: Maybe for coding.
Karina [00:54:11]: I don't know.
swyx [00:54:11]: For coding?
Karina [00:54:12]: I think right now with Canvas we are thinking about this paradigm of going from real-time collaboration to asynchronous collaboration. It would be cool if I could just delegate to a model: okay, can you figure out how to do this feature or something? And then the model can test out that feature in its own virtual environment or something. Maybe this is a weird idea. Obviously, there will be a lot of consumer use cases, like, hey, shop for me or something.
swyx [00:54:43]: I was going to say, everyone goes to booking plane tickets. That's the worst example, because you only book plane tickets, what, two or three times a year? Or concert tickets.
Karina [00:54:50]: I don't know. Yeah.
swyx [00:54:51]: Concert tickets.
Karina [00:54:51]: Like Taylor Swift.
swyx [00:54:52]: I want a Facebook Marketplace bot that just scrolls Facebook Marketplace for free stuff, and then I just go and get it.
Karina [00:55:00]: I have a question. I don't know. What do you think?
swyx [00:55:01]: I have been very bearish on computer use because they're slow, they're expensive, they're imprecise; the accuracy is horrible. Still, even with Anthropic's new stuff. I'm really waiting to see what OpenAI might do to change my opinion. And really what I'm trying to do is, like January last year versus December last year, I changed a lot of opinions. What am I wrong about today? Computer use is probably one of them, where I don't know if by the end of the year we'll be using them. Will my ChatGPT, like every GPT instance, have a virtual computer? Maybe? I don't know. Coding? Yes, because he invested in a company that does that, the code sandboxes. There are a bunch of code sandbox companies; E2B is the name. But then in browsers, yes. Computer use is coding plus browsers plus everything else. There's a whole operating system, and you have to be pixel precise, you have to OCR. Well, I think OCR is basically solved, but pixel precise, and understanding the UI of what you're operating. And I don't know if the models are there. I don't know. There you go.
Karina [00:56:01]: Two questions. Do you think the progress of mini models, like O3-mini or O1-mini, and it comes back to Claude 3 Haiku, Claude Instant 1.2, this gradual progression of small models becoming really powerful and also very fast? I'm sure computer use agents would be able to couple with those small models, which will solve some of the latency issues, in my opinion. In terms of the operating system, I think a lot about it these days: we're entering this task-oriented operating system, or a generative OS. In my opinion, people in a few years will click on websites way less. I want to see the plot of website clicks over time, but my prediction is that it will...
It will go down, and people's access to the internet will be through the model's lens: either you see what the model is doing on the internet or you don't.
Alessio [00:57:10]: I think my personal benchmark for computer use this year is expense reports. I have to do my expense report every month. For example, I expense a lunch, I have to go back on the calendar and see who I was having lunch with, then I need to upload the receipt of the lunch and tag the person on the expense report, blah, blah, blah. It's very simple on a task by task basis, but you have to go to every app that I use. You have to go to the Uber app, you have to go to the camera roll to get the photo of the receipt, all these things. You cannot actually do it today, but it feels like a tractable problem. Probably by the end of the year we should be able to do it.
Karina [00:57:49]: Yeah. This reminds me of the idea that you kind of want to show computer use agents how you want things done, how you like booking your flights. It's kind of like a few-shot...
swyx [00:58:03]: Demonstration.
Karina [00:58:04]: Demonstrations. Maybe there is a more efficient way that you do things, and the model should learn to do it that way. So again, it comes back to personalized tasks too: right now Tasks is rudimentary, but in the future tasks should become much more personalized to your preferences.
swyx [00:58:27]: Okay. Well, we mentioned that. I'll also say that one takeaway I got from this conversation is that ChatGPT will have to integrate a lot more with my life. You will need my calendar, you will need my email. And maybe you use MCP, I don't know. Have you looked at MCP?
Karina [00:58:43]: No, I haven't.
swyx [00:58:44]: It's good. It's got a lot of adoption.
Alessio [00:58:47]: Anything else that we're forgetting about, or maybe something that people should use more, before we wrap on the OpenAI side of things?
Karina [00:58:56]: I think the search product is kind of cool, ChatGPT search. I'm thinking a lot about this: the magic of ChatGPT when it first came out was that you ask something, any instruction, and it would follow the instruction you gave the model, right? Write a poem and it gives you a poem. But I think the magic of the next generation of ChatGPT, and we're marching towards that, is that when you ask a question, the answer is not just going to be text output. The ideal output might be some form of a React app on the fly or something. This is happening with search, right? Give me the Apple stock and it gives you the chart, gives you this generative UI. And this is what I mean by the evolution of ChatGPT into more of a generative OS with a task orientation or something. And then the UI will adapt to what you like. If you really like 3D visualizations, I think the model should give you as much visualization as possible.
Alessio [00:58:47]: Anything else that we're forgetting about, or maybe something that people should use more, before we wrap on the OpenAI side of things?
Karina [00:58:56]: I think the search product is kind of cool, ChatGPT search. Right now I'm thinking a lot about the magic of ChatGPT when it first came out: you ask something, any instruction, and it would follow the instruction you gave the model, right? Write a poem and it will give you a poem. But I think the magic of the next generation of ChatGPT, and we're marching towards that, is that when you ask a question, the ideal output is not just going to be text. The ideal output might be some form of a React app generated on the fly. This is already happening with search, right? Give me Apple stock and it gives you the chart, this generative UI. That's what I mean by the evolution of ChatGPT becoming more of a generative OS with a task orientation. And then the UI will adapt to what you like. If you really like 3D visualizations, the model should give you as much visualization as possible. If you really like a certain kind of UI, maybe you like round corners or certain color schemes, the UI becomes more dynamic and becomes a custom, personal model. From the personal computer to a personal model, I think. Yeah.
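One way to read the generative UI idea: instead of answering in prose, the model returns a small structured spec and the client renders it as a widget. The toy sketch below is built on that assumption; the spec schema and the ask_model stub are invented for illustration, and the placeholder numbers are not real stock data.

```python
# Toy sketch of "generative UI": the model emits a structured spec instead of prose,
# and the client decides how to render it. Only matplotlib is a real dependency.
import matplotlib.pyplot as plt


def ask_model(query: str) -> dict:
    """Hypothetical model call that returns a UI spec rather than a text answer."""
    return {
        "component": "line_chart",
        "title": "AAPL (placeholder values)",
        "x": ["Mon", "Tue", "Wed", "Thu", "Fri"],
        "y": [1.0, 1.2, 1.1, 1.4, 1.3],  # placeholder, not real market data
    }


def render(spec: dict) -> None:
    """Map the spec onto a concrete widget, here a matplotlib chart."""
    if spec.get("component") == "line_chart":
        plt.plot(spec["x"], spec["y"])
        plt.title(spec["title"])
        plt.show()
    else:
        print(spec)  # fall back to dumping the spec as text


if __name__ == "__main__":
    render(ask_model("give me Apple stock"))
```

The same spec could just as easily be handed to a web front end that renders a React component, which is closer to the "app on the fly" framing above.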
swyx [01:00:20]: Takes overall: you are one of the rare few people, actually maybe not that rare, to work at both OpenAI and Anthropic.
Karina [01:00:28]: Not anymore. Yeah.
swyx [01:00:31]: Cultural differences. What are general takes that only people like you see?
Karina [01:00:35]: I love both places. I learned so much at Anthropic, I'm really, really grateful to the people, and I'm still friends with a lot of people there. And I was really sad when John left OpenAI, because I came to OpenAI because I wanted to work with him the most or something. What's he doing now? But I think it changed a lot. When I first joined Anthropic, they were, I don't know, 60 or 70 people. When I left, they were like 700 people, so massive growth. OpenAI and Anthropic are different in terms of product mindset, maybe. OpenAI is much more willing to take product risks and explore different bets. Anthropic is much more focused, and I think that's fine, they have to prioritize, but they definitely double down on enterprise more than consumers or something. Some of the product mindsets are different. On research, I've enjoyed both research cultures, at Anthropic and at OpenAI. On a daily basis, they are more similar than different.
swyx [01:01:50]: I mean, no surprise.
Karina [01:01:52]: How you run experiments is very similar. I'm sure the Anthropic...
swyx [01:01:55]: I mean, you know, Dario used to be VP of research, right? So he set the culture at OpenAI. So yeah, it makes sense. Maybe quick takes on people that you mentioned: Barret, you mentioned Mira. What's one thing you learned from Barret, Mira, Sam, maybe? One lesson that you would share with others.
Karina [01:02:13]: I wish I had worked with them way longer. I think what I've learned from Mira is her interdisciplinary mindset. She's really good at connecting the dots between product and research, balancing the two and creating a comprehensive, coherent story. Because sometimes there are researchers who really hate doing product and researchers who really love doing product, and there's a kind of dichotomy between the two, and safety is also part of this process. You want to create that coherent story, think from a systems perspective, think about the bigger picture. I learned a lot from her on that. I definitely feel like I have much more creative freedom at OpenAI, and that's because the environment the leaders set enables me to do that. So if I have an idea, if I want.
swyx [01:03:10]: Propose it. Yeah, exactly. In your first month.
Karina [01:03:11]: There's more creative freedom and resource reallocation. Especially in research, it's being adaptable to new technologies and changing your views based on that. Yeah. I've seen a lot of researchers who, based on empirical results, change their research directions. And sometimes I've seen researchers who just get stuck on the same direction for two to three years and it never works out, but they stay stubborn. So adaptability to new directions and new paradigms is one of those things that...
Alessio [01:03:42]: Is this a Barret thing or a general culture thing?
Karina [01:03:45]: A general kind of culture thing, I think. Cool.
Alessio [01:03:46]: Yeah. And just to wrap up, we usually have a call to action.
Alessio [01:03:52]: Do you want people to give you feedback? Do you want people to join your team?
Karina [01:03:56]: Oh yeah, of course. I'm definitely hiring research engineers who are more product-minded: people who know how to train the models but are also interested in deploying them into products and developing new product features. I'm definitely looking for those archetypes of research engineers or research scientists. So yeah, if you're looking for a job, if you're interested in joining my team, I'm really looking forward to that. Definitely feel free to just reach out, I guess.
swyx [01:04:24]: And then just generally, what do you want people to do more of in the world, whether or not they work with you? A call to action as in: everyone should be doing this.
Karina [01:04:32]: This is something I tell a lot of designers: people should spend more time just playing around with the models. The more you play with a model, the more creative ideas you'll get about new potential product features or new kinds of interaction paradigms that you might want to create with those models. I feel like we are bottlenecked by human creativity in completely changing the way we think about the internet, or the way we think about software. AI right now pushes us to rethink everything we've done before, in my view. And I feel like not enough people are either doubling down on those ideas, or I'm just not seeing a lot of human creativity in this interface design or product design mindset. So it would be really great for people to just do that. And especially right now, some research is becoming much more product-oriented, so you can actually train the models for the things you want to do in a product. Yeah.
swyx [01:05:41]: And you define the process now. This is my go-to for how to manage a process. I think it's pretty common sense, but it's nice to hear it from you because you actually did it. That's nice. Thank you for driving innovation in interface design and the new models at OpenAI and Anthropic. And we're looking forward to what you're going to talk about in New York. Yeah.
Karina [01:06:01]: Thank you so much for inviting me here. I hope my job will not be automated by then.
swyx [01:06:06]: Well, I hope you automate yourself, and we'll do whatever else you want to do. That's it. Thank you. Awesome. Thanks.

Get full access to Latent.Space at www.latent.space/subscribe