Horizon Newsletter • September 23, 2024
Horizon Live 2024 - "AI Moment" Presentation

In the lead-up to XBE's SuperPower announcement, Sean Devine delivered a presentation on the "AI moment" driving these innovations.

If you're seeking a concise overview of current AI model capabilities and insights into their significance for the XBE community, this presentation is for you.

Sean Devine Presentation

Watch the presentation on YouTube.

A transcript of the presentation is provided below.


All right, well, thanks everyone for all the participation today. I'm going to cover the next two sessions. This first one, as Grant said, is going to be a bit of a lead-up to make sure that we're all on the same page—both about what's currently true with regard to artificial intelligence and about what, based on the last two or three years, we can expect to happen over the next couple of years, or at least a range of expectations. Then we're going to take a short break, and that'll transition into our Horizon product announcement. But let's start on the AI side.

I started the day saying that my goal was to mismanage expectations, and I'm going to keep it going here to say that we are in the moment as it relates to AI. By that I mean I would bet my career that 15 years from now, we'll look back at this year as the year that everything changed permanently. Depending on your involvement in the details, that could be—or feel—very true to you already, or it could feel like that's crazy, or somewhere in between. So I wanted to sort of walk through what the current state of things is so that we're all on the same page, and then lead into where things are perhaps going after that.

Okay, so first, let's just level set on the current capabilities of the AI models. I know from speaking on the topic with many people around the country that a room like this is going to tend to have maybe 80% of folks who have some experience with the current AI models, often through something like ChatGPT. Way fewer—usually somewhere in the 10% to 20%—have exposure to the most capable AI models, which is kind of like saying that 80% of people have talked to a high school sophomore and 20% have talked to a grad student about an interesting topic. So I want to just sort of describe the capabilities of the current models, and then we'll get into some other details.

There's something called a foundation model in AI, and those are the core large language models that are behind most of the AI-related innovations that you see. The first big advancement in large language models was made by OpenAI. OpenAI was a research company originally set up by a number of people, including Elon Musk, that then changed into being more of a for-profit company over the last few years, and the ownership changed, etc.

OpenAI first built something called GPT-2, then GPT-3, and now they have a model called GPT-4. GPT-4 is the most capable—or at least until this past week, it was the most capable—large language model. Its capabilities are approximately equivalent knowledge-wise to a college graduate, and reasoning-wise to, say, a high school sophomore or so. In other words, it knows a ton, and it can apply that knowledge—somewhat inconsistently, but pretty well—though it can't reason about extremely complicated things, even though it knows a lot about them.

So OpenAI's GPT-4 now is the most capable large language model that exists. The way that GPT-4 was created is the same as all the rest of them, which is that basically OpenAI gathered all of the text on the internet, and then some—many open-source books, all of Wikipedia—so billions and billions of tokens of text. Then—and I won't go into the details of how this works—they trained it, which means they had a supercomputer work to understand the relationship between things and then represent that relationship in a billions-of-parameter model that sort of represents knowledge in this ultra-high-dimensional sort of space of numbers.
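
To make the idea of training a bit more concrete, here is a toy, count-based stand-in for next-token prediction, the core objective these models are trained on. The real thing uses a neural network with billions of parameters and gradient descent rather than counts; this only illustrates the shape of the task.

    # Toy illustration of "learn the relationships between tokens":
    # count how often each word follows each other word, then predict
    # the most likely next word. Real models replace these counts with
    # billions of learned parameters, but the task has the same shape.
    from collections import defaultdict, Counter

    text = "the quick brown fox jumps over the lazy dog and the quick fox runs".split()

    follows = defaultdict(Counter)
    for current, nxt in zip(text, text[1:]):
        follows[current][nxt] += 1            # "training": tally what follows what

    def predict_next(word):
        counts = follows.get(word)
        return counts.most_common(1)[0][0] if counts else None

    print(predict_next("the"))                # -> "quick", the most common continuation seen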

It's somewhat hard to believe that it all works, and in fact, we don't entirely understand how it does. I mean, there were theories about how neural networks work; those have panned out to be true. But it turns out that they scale much better than people imagined they would a few years ago. These scaling laws have been applied, and OpenAI first created this series of large language models that have an unbelievable knowledge about the world and, again, can reason a little bit, but they're more pattern-matching than reasoning.

So that's GPT-4, and that's by OpenAI, and that's what is underlying ChatGPT. So if you use ChatGPT, you're talking to GPT-4. Now, this may be the biggest breakthrough in the history of computing, and so obviously others are working on it too.

Google has an equivalent model to OpenAI's GPT-4 that's called Gemini. If you watch football or the Olympics or whatever, you've probably seen an advertisement that says the words "Google Gemini." So Google Gemini is both the name of the product that they're pitching out in the market, and it's also the name of the underlying large language model that's doing the thinking, so to speak.

So that's number two, and then the third is from a company called Anthropic, and Anthropic makes a model called Claude. There are different sizes of these, but Claude is approximately equivalent in capabilities to GPT-4 and to Gemini. The different models are good at different things—for example, Claude is excellent at programming, especially somewhat complicated things; GPT-4 is more consistently good at everything; Gemini has got a bit more of a personality that's sort of distinct.

So when people talk about AI, they are oftentimes specifically talking about one of those three large language models or products built on top of them.

Now, right now we're in a new phase where all three of those—well, two of the three models I just said, so GPT-4 and Gemini—have learned how to talk also. A lot of things I'm going to say sound sort of bananas, but you know, whatever, it's true. By "learn how to talk," what I mean is that it used to be that the models would transcribe audio into text and text back into audio. So they could kind of approximate a conversation just with that, right? Like I'd speak, it would record the audio, it would make that text, it would work on the text, it would generate text, and it would then create audio.

The newest models are what is called multimodal. What "multimodal" means is that the model thinks, so to speak, internally in both text and sound and pictures. So you don't have to do any translation back and forth. It's kind of like if you go to Spain and you learned Spanish in high school, you have to translate back and forth to English, whereas if you're fluent, you just think in Spanish. So the current models—the alpha version of them that aren't generally available but exist—can think not just in text but also in audio and also in pictures.

Now also, all three of the models that I said are multilingual. In other words, they naturally are fluent across dozens of languages at the very least and oftentimes many more than that. It started off that the earlier GPT, Claude, and Gemini series models could only operate primarily in English, and they were all text. Now, on the very cutting edge of it, they generally can speak multiple—or practically all—languages. The newest ones can think and create output in text, audio, and images.

Now on the image and audio side, the same techniques that have been used to create the large language models have been applied—and the main technique involved is, well, there are two. One is called a transformer model, and the key innovation related to a transformer model is that in the training and inference process, the model can pay attention to itself. So it can, in parallel, sort of watch the information around it and learn from that. That breakthrough, which again is called the transformer approach, is what enabled all of what came next to happen.
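
To give a rough sense of what "paying attention to itself" means mechanically, here is a minimal sketch of a single self-attention step: one head, no learned projection matrices, no masking, so a simplification rather than any particular model's implementation.

    # Minimal self-attention sketch: each token's vector becomes a weighted
    # blend of every token's vector, with the weights computed from how
    # strongly the tokens relate to one another.
    import numpy as np

    def self_attention(X):
        # X has one row of numbers (an embedding) per token in the sequence.
        scores = X @ X.T / np.sqrt(X.shape[1])            # pairwise relatedness of tokens
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True)     # softmax: each row sums to 1
        return weights @ X                                # blend token vectors by attention weight

    tokens = np.random.randn(5, 8)                        # 5 tokens, 8-dimensional embeddings
    print(self_attention(tokens).shape)                   # (5, 8): same shape, now context-aware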

There's another innovation that's sort of related to it but different, which is called diffusion, which the image models all use. The idea of diffusion was sort of built on the same innovations that came—or that led to—the big large language models that you're used to from ChatGPT, except they could work with images and take text and then sort of pull out of the noise of a blankish image pixels that were most related to the text that came in. Again, it's sort of hard to believe that it works, but many of the same sort of innovations that led to large language models also led to advancements in image generation.
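
Here is an equally rough sketch of the diffusion idea: start from pure noise and repeatedly subtract a little of the noise a model predicts, so that an image consistent with the prompt gradually emerges. The noise predictor below is a hand-written stand-in, not a trained network.

    # Conceptual diffusion sketch: the "image" starts as pure noise and is
    # nudged toward whatever the (stand-in) noise predictor says doesn't belong.
    import numpy as np

    def target(prompt):
        # Stand-in for "the clean image the prompt describes" (illustration only).
        rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
        return rng.standard_normal((8, 8))

    def predicted_noise(image, prompt):
        # A real model learns to predict the noise; here we cheat and compute it directly.
        return image - target(prompt)

    image = np.random.standard_normal((8, 8))             # start from pure noise
    for step in range(50):
        image = image - 0.1 * predicted_noise(image, "a paving crew at sunrise")

    print(np.abs(image - target("a paving crew at sunrise")).mean())  # ~0: the image has "emerged"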

Then next came audio generation. So at first, the models could speak very convincingly—you have definitely spoken to an AI voice and not known it at some point by now. They are indistinguishable from people. But the same techniques have also been used not just to generate voice but to generate emotional voice. In other words, it's not just that the current models can accurately produce the sounds that match the words—they can speak with emotion. If they're surprised, they sound surprised; if they're confused, they sound confused; and so on and so forth.

Those same techniques have, in fact, also been used so that these models can generate sounds—not just speaking sounds but sound effects like car engines running or birds chirping or whatever—and even write and generate music. All of this has been happening over the last couple of years, and if you're just paying attention to the news, you've seen some of it. But we are, as of September 2024, at a place where the current state of things, as it relates to text generation, image generation, and audio generation, is at a spectacular level on all three. We're almost to the spot—like literally it'll be this month—where those capabilities are getting merged into one, where instead of having different specialized tools that can do those things, you have a single model that can do all of them—all of them to a professional level of capability.

Now, I don't know if you've seen any videos of how that all plays out in what's called conversational AI, but I highly recommend after the event you take a look at the demos from OpenAI of the upcoming voice mode that some people have access to but isn't broadly available. You can watch the videos of the real thing, of someone talking to these models like it's a person. I'm telling you, it is shocking how real it feels—because, I mean, the knowledge has been real for a while—but the ability for the model to communicate back like a person is brand new and fundamentally different from what came before. That has kind of gotten layered on top, and you can see how we're stacking these capabilities that used to be separate into one unified thing, which is going to create a lot of the upside we're going to talk about.

Now, the timing of this conference is perfect because last week I think the single largest breakthrough in the last two years was announced and released. I believe—again, it's hard when you've got something that just happened to speak with total confidence about how we'll feel about it in a year—but I believe we'll look back at OpenAI's o1—it's a bad name, but here we go—o1 reasoning model as the most important thing that's happened in the last couple of years. Let me describe what it does.

Remember, I said that the core large language models—so that's GPT-4 and Gemini and Claude—they all are very knowledgeable but not that smart. In other words, they know everything, but their ability to apply it is—I mean, it's impressive—but it would be short of what you could do, for sure. There's been a lack of clarity in the research about whether or how to crack the problem of making the models not just more knowledgeable but smarter.

As you scale up the models—so right now, let's say you increased the size of the model by an order of magnitude—it would know much more, but it still would be pattern-matching on the world and not completely able to apply its training to new situations. It's like if you hire someone who went to school and is book-smart but can't really apply it in the real world—that's kind of what the models are like.

Anyways, the o1 model appears to have solved that problem, and they call it a reasoning model because of the following. The new model—instead of just starting to answer when you ask it a question—spends time thinking, you know, like people should before they start speaking, but, you know, so it goes with the models. You ask the question—I'll give some examples in a second—and you'll see it say "thinking," and in fact, OpenAI has exposed this model in ChatGPT. So if you use it—and I encourage you to do so tonight—you'll ask it a difficult question, and you can inspect its thinking process. It'll sort of, like a person would, say what it's working on: "I'm doing this step. Okay, now I'm looking at what I did and comparing it to the instructions. Okay, I see I made a mistake; I'm going to redo it." It's that kind of thought—basically a mental thought process.

So what it does is it thinks through the problem, it breaks it down, it goes step by step, it recurses—so it does that, it looks at what it's done, it starts over, it keeps going. By teaching the model to reason—that's, again, teaching the model not just to spit out what it knows but to take what it knows and churn on it a bit so that it can apply it in a new way—it can reason about problems.
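
To show the shape of that loop (and only the shape, since o1's actual mechanism is learned during training rather than hand-coded), here is a caricature in code: propose an answer, check it against the instructions, and revise until the check passes. The propose and verify functions are placeholders.

    # Caricature of the "think, check, revise" loop a reasoning model learns.
    # This is not how o1 is implemented; it only illustrates the control flow.

    def propose(question, attempt):
        # Placeholder for the model drafting an answer on a given attempt.
        return f"draft #{attempt} answer to: {question}"

    def verify(question, answer):
        # Placeholder for the model re-reading the instructions and checking its work.
        return answer.startswith("draft #3")              # pretend only the third draft passes

    def reason(question, max_attempts=10):
        for attempt in range(1, max_attempts + 1):
            answer = propose(question, attempt)
            if verify(question, answer):                  # only answer once the check passes
                return answer
        return "no confident answer"

    print(reason("Fill in the last across clue of this crossword."))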

So let me give an example. Even though the large language models like GPT-4 and Claude and Gemini are great at teaching you about the Roman Empire, if you ask them to count the number of R's in the word "strawberry," they'll get it wrong, because counting the number of R's in the word "strawberry" requires that you think about it—you look at it and think about it—but it doesn't know how to do that, right? Because "how many R's in the word 'strawberry'" isn't something that was in its pre-training.
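
For what it's worth, the check itself is trivial to do mechanically, which is what makes the failure so striking:

    # Counting letters is a one-liner when you can actually look and count.
    print("strawberry".count("r"))                        # 3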

If you ask it to solve a crossword puzzle—these are the pre-o1 models—it will fail, because solving a crossword puzzle requires that you guess and check and think things through and try things out and start over again. But a person—even a child—could solve a crossword puzzle that the current models, even though they know everything about the Roman Empire and every other topic under the sun, can't do—until now.

So the o1 model can solve the hardest crossword puzzle that the New York Times puts out. It can score 87, I think, on the Math Olympiad right now, even if you constrain it to the amount of time that is afforded through ChatGPT. If you give it more time to think—so in other words, don't change anything but say you can think for, let's say, 45 minutes on each problem—it'll score in the mid-90s on Math Olympiad. Math Olympiad is the hardest math contest in the world—literally the Olympics for math.

And so on and so forth. It can write software from scratch. If you said "implement the game Snake but do it with an Ozinga logo," it could do so on the first shot, and it would work. So it has the same knowledge that GPT-4 has, but it's got this new trick where it can stop and think things through.

It's funny, you know, for years one of my favorite videos to send to people, like work-wise, is this video of Cookie Monster, and he says, "Sometimes you've got to stop and think it through," and then he's teaching his sort of cookie brigade how to solve problems. It turns out that o1 is literally focused on that problem, which is to sort of turn the model in on itself and have it think. And it works.

Now, just for a little bit of background on this: if you remember, about nine months ago there was quite a bit of drama—and I doubt most people know this—around OpenAI, where even though it's worth over $100 billion, it looked like it might blow up. And the reason, as far as I can tell at least, was that when the discovery was made—which they called "Strawberry," which is fun—the discovery that led them to figure out how to teach the model to think, a couple of people related to it were so freaked out by how smart it got all of a sudden that they said, "I don't know, this is something else." And so, you know, there was negotiation about how to release it; it took a lot longer to release because they had to get it right, but last week they came out with o1.

And o1 has been used by very few people; I think it was literally released three or four days ago. The limit is, like, if you signed up today, you could ask 20 questions in a week. So it's very capacity constrained. But it is absolutely spectacular. Like, I've used it quite a bit, and you'll hear a bit more about that in a second. I'm completely convinced that it will solve the problem of how businesses get these models to actually do real work—not just provide reference back to people but take on tasks and think things through.

To tell you a little bit about our own experience with this, there is a product called Cursor. Because, you know, if you think about these AI innovations—and the reason I went through the details is the lower-level models are like the engines of it all. But just having a great engine doesn't make a great car—just ask Grant. Grant, am I allowed to make fun of your car? No.

Anyways, I mean, having a great engine is not going to make the car. You have to build functionality on top of it. So Milind and I have been using a new tool, and it's called Cursor. And Cursor is built on the best models that I mentioned—in fact, it can now use o1 Mini, the smaller version of o1—I didn't mention that, I should have.

So o1 comes in two flavors: Mini doesn't know that much but is very smart; the full-size one knows a lot and is smart. And it's interesting, actually, to figure out that a lot of the time you can get pretty far with something that's smart but doesn't know a lot. But anyhow, so this tool Cursor is now in the class of products that are built on top of these low-level capabilities, and Cursor helps you program.

So Milind and I have been working an unbelievable amount on some new things, and for the last week and a half we have used this new tool called Cursor. In part because I work too many hours, but mostly because of Cursor—for example, we keep track of this thing called the dev league, which records our scores every day, and three times inside of one week I recorded a daily score that was twice what I had ever recorded before.

My personal guess is that I'm twice as productive as I was 11 days ago, and that 11 days ago I was, say, 50% to 75% more productive than I was a year before that. So do the math—that's like three and change times as productive as two years ago. And it's not just that—and this part is a weird thing—people like to say that these models make them faster, which they do, but if Milind and I are being honest, they also make us a lot better. Not just that we're doing the same work we would have done before, but we're doing better work than I would have done in some areas. Milind, who knows? But anyways.

So we have—and that's with the GPT-4 class models. So in other words, this productivity bump is before we got the new class of reasoning models that o1 provides. So my guess is we're going to see something like another doubling in productivity once the reasoning models are incorporated. So that's doubling again—that would be seven times more productive than two and a half years ago by the end of the year, is my guess. And that's like lived experience—that's not me forecasting; that's saying I'm already three and a half, and it's pretty easy to see, because I've used o1 for other reasons, that it will double again pretty shortly.

So it's worth asking the question, well, that's all great, but like, is this really cheaper than, you know, just a person spending longer doing it? Now, the cost of an output token of GPT-4—which is, again, the best foundation model—has gone down 99% in the past 15 months; what used to be a buck is now a penny. There's no reason to believe that that won't continue—we sort of know how to optimize the models once they exist. That means that for the cost per hour—and I won't get into all the math on this—you can reason about how many tokens of thinking and production a person does in an hour, and then map that to these models.

But presently, the work that the models are capable of doing, they can do at something like 5% of the cost of a person—ish—5, 4, 3, you know, somewhere in the low to mid single-digit percentages. That's for the things that they could do. Now, what they could do has been limited to, you know, a number of good things—writing and research and summarization, etc.—but once you add thinking in, the list is about to grow by, you know, a minimum of an order of magnitude.
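
As a rough back-of-the-envelope version of that math, here is a sketch in which every number is an illustrative assumption rather than a quoted price or a measured figure:

    # Back-of-the-envelope model-vs-person cost comparison (all numbers assumed).
    price_per_output_token = 0.00003     # assumed dollars per token after the ~99% drop
    tokens_per_hour_of_work = 60_000     # assumed tokens of "thinking and production" per hour
    person_cost_per_hour = 50.00         # assumed fully loaded hourly cost of a person

    model_cost_per_hour = price_per_output_token * tokens_per_hour_of_work
    print(model_cost_per_hour)                            # about 1.8 dollars per hour
    print(model_cost_per_hour / person_cost_per_hour)     # about 0.036, roughly 4% under these assumptions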

So let's just talk a little bit more about that point. So all the things I've just said are currently true—in other words, there was no forecasting in what I said, except that I think I'll double my productivity again in the next few months, and I think that's pretty informed based on o1 Mini. So that's all where we are now.

Now, the question is, do we know enough to believe that this is going to sustain itself? That in a year we'll have advanced on these capabilities at about the same rate that we have over the past year or two years? There's been a lot of debate about this, but I think with the release of o1, it's fairly clear that we've unlocked the key missing ingredient for the next three years of development—I mean, past that, who knows—but it's clear both that the costs are coming down, but also the capabilities are going to continue to increase.

Okay, so let's talk a bit about us and what we have done related to this. Back about two years ago—maybe longer than that, maybe two and a half years ago—the scaling laws started to become clear: as long as enough compute was applied to all this, you were going to see this kind of trajectory. Not that I had a crystal ball on all the details or anything, but it was pretty clear that the trajectory was going to be up and to the right for a while. So we decided on one big rule—and I mean, a lot of this is just background so far, but this is a rule that applies to this room—which is that I said I will stop building for what exists today and start building for what we expect to exist.

So I'll give the first example of that. We released Hey Kayla, which is our AI support chatbot. We were done with it in February of '23, and it used, when it was released, GPT-3.5, which was not good enough for it. We started working on it months before that, but the reason we did that is because we had this conviction of the scaling laws. And I said we're just going to build it, and then we'll release it with whatever model's out at the time, and soon enough it'll get better. So we released it in February '23, and then, I think it was six or eight weeks after that, GPT-4 was released. And so all this work that we did, which was in the right direction and promising, instantly became good enough almost overnight for a lot of uses, at the very least.

So that gave me the confidence, okay, we're going to keep doing this—we're just going to skate to where the puck is. We're not going to ask the question, what can the models currently do? We're going to assume that the future is going to progress at a similar rate to what it has, and we're going to keep building that way.

Now, the reason I'm saying that point is worth taking away is that it's true for all of the businesses represented by you in this room. I've had many conversations—hundreds of conversations—with people who question, well, the model is only good at X or Y or Z right now, and we really need it to be good at D. I'm like, D is coming.

I mean, I remember going out to dinner with the Gallagher team—I'd forgotten about this story until now—back, how long ago was that now? A long time ago. Two years? We had this conversation then. We were sitting in that little room at that restaurant. We said, listen, I'm just going to build like the future is here already. And so that's the way we've approached the last two and a half years, which is just to expect that the future is going to look like it does now. And now we're building expecting it to look like something better than it is already.

So we talked about that, you know, anticipating. And again, I think the point here is you don't have to be able to anticipate the details to be able to anticipate three facts: one, the models will be smarter; two, the models will know more; and three, they'll be cheaper. And so the cost doesn't matter—whatever the cost is now. Again, it went down 99% since we released Hey Kayla—99%. So it got way cheaper, the current models know way more, and now with o1, they're like two orders of magnitude smarter.

So we've talked about these three a little bit, but let me just tell you a little more since not all the room would know the details. So Hey Kayla—that's the customer support chatbot—you know, you'll see these around in lots of products now. Basically, it has all of the content about what XBE can do—all the newsletters and release notes and other content—and it takes a question, looks up all the related content that it thinks may match, and answers the question and provides it back. So we released that; that's been a good success.
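
That pattern of looking up the most relevant content and handing it to the model along with the question is often called retrieval-augmented generation. Here is a minimal sketch of the shape; the scoring function, the documents, and the model call are all simplified placeholders, not XBE's actual implementation.

    # Minimal retrieval-augmented-generation shape (placeholders throughout).

    def most_relevant(question, documents, k=3):
        # Stand-in relevance score: count shared words. Real systems use embeddings.
        q_words = set(question.lower().split())
        ranked = sorted(documents, key=lambda d: -len(q_words & set(d.lower().split())))
        return ranked[:k]

    def call_language_model(prompt):
        # Placeholder for an API call to a model like GPT-4, Claude, or Gemini.
        return f"(model response to a {len(prompt)}-character prompt)"

    def answer(question, documents):
        context = "\n".join(most_relevant(question, documents))
        prompt = f"Using only this content:\n{context}\n\nAnswer this question: {question}"
        return call_language_model(prompt)

    docs = [
        "Release note: description of a new scheduling feature.",
        "Newsletter: Hey Kayla answers support questions using XBE's own content.",
    ]
    print(answer("How does Hey Kayla answer questions?", docs))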

We then worked, as Grant said, with the National Asphalt Pavement Association and now the National Ready Mixed Concrete Association to build industry-specific chatbots that have their entire library of content. We did that to kind of give something to the industry and get our name out there. And Hey NAPA has thousands of active users, and Ask Concrete is about to be released and is one of the key focal points of the upcoming ConcreteWorks.

This experience has been fantastic for us because it shows—well, it's taught us a lot in the details, right? Like how we actually pull all of this off at scale, and it's given us an idea of what people like, what they don't like, and what works. We've also taken various models—audio generation models, large language models, image generation models, and some more specific, typical machine learning stuff—and integrated them across XBE to do all sorts of stuff in the background: to suggest things that are smart, to review time cards—you know, audit time cards automatically—etc. I don't know how many, maybe like 15 or 16 different features that we've built that just make using XBE smoother, kind of in the background. It's not a thing you interact with, but it's a thing that's happening.

And all of this is to say that we have really been putting the work in to be smart about this. Maybe because I'm fixated on it, maybe because it's clear that this is the next big frontier in computing, but we've worked for two and a half years to understand what every AI model is capable of and to try in dozens of ways how to take advantage of them and to build features that people benefit from and interact with. Because again, there's a difference between the car and the engine.

So that's a long way to say that we're excited in a few minutes to talk about what's next for XBE and everyone in the room. So we'll be back in a few.