Podcast: Stanford Predicts The Next Human Evolution
In this episode, we explore how a generative AI tool is marking a major milestone in biology and accelerating advancements in healthcare, genetics, and drug development.
Episode Notes
In this episode, we explore how a generative AI tool is marking a major milestone in biology and accelerating advancements in healthcare, genetics, and drug development.
Become a founding reader of our newsletter: http://read.thenextbyte.com/
Transcript
What's up folks, welcome back to the Next Byte podcast. And in this one, we're talking about DNA sequencing. Specifically, we're talking about a team over at Stanford that has created a ChatGPT-like AI model that can predict what comes next if you give it any DNA sequence, including yours. If that's got you a little bit spooked or excited or just curious, well, then let's get into it.
What's up friends, this is The Next Byte Podcast where one gentleman and one scholar explore the secret sauce behind cool tech and make it easy to understand.
Farbod: This is another one of those biology ones. So, you already know I'm gonna struggle, but that's okay. We learn when we struggle. I was…
Daniel: You just need to test your limits, man!
Farbod: And that's what I'm doing. I'm trying here, but it was so interesting that I knew we needed to talk about it. So, for the audience, please bear with me. This is not my area of expertise, but God am I trying. So, for this episode, we're going to Stanford. You know the West Coast is the best coast, but these folks at Stanford, they're doing some crazy things with DNA sequencing. And to be completely honest, they've kind of done this before, but now they're doing it better. So, what am I talking about? They've developed a ChatGPT-like model, right? It's structured like a large language model, but instead of processing the meaning of the word that you're typing or the paragraph that you're writing or whatever and then giving you an answer or a follow-up or completing your train of thought, it takes as an input a sequence of nucleotides, a sequence of DNA, and it completes it for you or it makes sense of it and it tells you what genes might be associated with each other. Now, before we get into that, I think we got to take it a couple steps back, and by we, I mean me, because I know I need to take it a couple steps back. So, they did an amazing job in this article, by the way, explaining things like you're a five-year-old, which again is wonderful for me, and they have a three-minute video that I highly recommend you check out. It's going to be attached to the show notes. But they explained the issue at hand like this. They said, imagine you're reading a book, and this book has three billion letters in it, right? And every one of those letters is a nucleotide. These nucleotides make up the DNA that everything we know of in this world is pretty much composed of. And it's actually kind of straightforward. We represent them as four letters, A, C, G, and T. And standalone, they don't really mean much, but the sequences of them in different patterns together make up specific genes.
Now, as we know, if you've been watching the news, this gene causes this, and poor, broken nucleotide sequences can, for most people, be harmless. For some people, it can cause cancer. So, there's a lot of value in understanding what those sequences mean, how they're related to each other, et cetera, et cetera. Now, as I mentioned, this team had already worked on a ChatGPT-like model before called EVO1 that was trained using 300 billion nucleotides, mostly from bacteria and stuff like that, to help them proof of concept this idea of, if we had a model that could predict what comes next, could we better understand what the next evolutionary step would be for this plant or these bacteria or this human? Or could we better understand how the sequence that is at the one millionth element is related to the one at the 100,000th element? And if so, what does that tell us about the chances that you might have cancer? Or could we edit this out? The possibilities and the downsides of what already exists, basically.
Daniel: Well, and when we talked about the technological development of having sequenced the human genome several decades ago, that didn't necessarily crack the code for us on what everything in human DNA means. That was just us looking at the DNA and writing a list, writing that three billion letter long book of just A’s, C’s, G’s, and T’s. It's awesome that like there's only four letters, but essentially, we didn't understand how those letters comprise genes. And genes, they liken them in the article to words. So, like basically, the major development in sequencing the human genome was like figuring out the alphabet and the order of the letters in the alphabet. But we did not yet understand the vocabulary. We didn't understand the words. And beyond that, we don't understand the grammar about how you put several words together to create phrases. And that's how physical characteristics are expressed, through a combination of several different genes. So, this is essentially helping us bridge the gap and, like, in my mind, truly crack the code on the meaning of DNA. If we can start to predict protein form and function by looking at the order of the A's, C's, G's and T's, predict harmful slash beneficial genes, potentially create new genetic sequences for medicines or treatments, like, this is bridging the gap between the letters and the physical realm, whereas before it was awesome that we understood what order the letters were in.
Farbod: You know, just like EVO2 is making sense of these nucleotide sequences, I'm happy you're here to make sense of my train of thought about biology.
Daniel: EVO2 wouldn't exist without EVO1, buddy.
Farbod: You're right. And like, I guess that frame of thinking is even better when comparing it to something like ChatGPT or really any other large language model, because the key breakthrough there, I think we mentioned it five episodes ago. The key breakthrough there was the transformer architecture, which said in relation to this wider body of text, what does this letter mean and this letter mean, instead of trying to make sense of a sequence of words back-to-back, which then lost meaning once you reach the 50th word. So again, they've taken the same underlying engine here, but now they're applying it to what does this sequence of genes mean, what do these nucleotides mean. And I think I mentioned that EVO1 had 300 billion nucleotides, which already sounds quite impressive, but it was mostly bacteria and whatnot. Well, they upped it for EVO2, and now they include animals, plants. Overall, there's nine trillion nucleotides that are represented in this model. The only thing that they left out by design was viruses. The team said they didn't want this tool to be something that bad actors could use to develop and refine existing viruses, for which I personally thank them. I think no one wants another round of the coronavirus. So, kudos for thinking ahead on that.
Daniel: And this is like, it's pretty monumental here. So, in addition to like excluding viruses, right? Basically, except for viruses, the training data now is 27 times bigger than before. It includes all known living species that we have a DNA sequence for, plus some that are extinct. So, it's like, it's pretty much everything that we've ever sequenced ever, except for viruses so that we don't have people going, creating designer viruses to go try and get people sick. But every living being that we have the DNA sequence for has been included in the training data for this next generation. That's a big deal to increase your training data size, your training data set by a factor of 27, right? It's not like they've trained on 10% more sequences than before. This is a huge deal.
Farbod: Absolutely. And not only have they increased the number of data points that they're training on, but the input is also, what does ChatGPT call them, the number of tokens that you can input for a query? Yeah.
Daniel: Yeah.
Farbod: So, they've upped it to a one million letter sequence of nucleotides. So, you can have one million letters of combination of ACGT and it can then do whatever you want with it, analyze it, predict the next set of sequences, et cetera. And I was wondering, why one million? Apparently in the world of biology, the one million mark is critical because it's long enough to take into account genes that are significantly far away, rather than just taking into account the recent developments or deficiencies.
Daniel: That makes sense. I appreciate the significance around there. And this makes me feel more so. ChatGPT is an awesome analogy to give people, especially because I feel like a lot of people have had exposure to ChatGPT. But I'm hoping within our audience, maybe there's folks who've had a greater exposure to things like a Copilot in coding or things like Cursor Composer. This actually feels a lot like Cursor Composer to me where in the Cursor editor, you can go in the Composer bar and you can have your code base living there, which is like a sequence of DNA. And you can tell it, hey, I want to generate a change to do this to the code and Composer will go and make those changes based off of its understanding of the code. This feels a lot like that, right? So, some of the technical features that it has, similar actually again to Cursor Composer, is like autocomplete. They make a big deal about it in Cursor. They call them tabs, but basically like you're writing and it understands what you're trying to code and you can just hit tab and it completes it, right? Autocomplete for code. They have autocomplete for DNA here based off of patterns in DNA sequences. They also have prediction models that can understand real world function. So instead of saying like change all the A's to C's or you know, control F my ACGTs and change them to TCGAs, you can say, I actually just want to change how this physical characteristic appears in the real world. And then it'll go change the DNA to reflect that. And then the interesting part, which is really awesome in my mind, it tends to be somewhat controversial. But we've seen a lot of work with CRISPR gene editing, which basically allows us to like sequence DNA, fabricate it, and then inject it into a DNA sequence in a living being and basically edit a living being's DNA.
They're able to essentially predict what the outcome of CRISPR gene editing will look like in the real world, to propose edits, essentially to propose future experiments and tests.
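To make the "autocomplete for DNA" idea concrete, here is a minimal Python sketch of next-letter prediction. It is not EVO2 and not the Stanford team's method: it is a toy frequency model that counts which nucleotide tends to follow each three-letter context, and the function names, the context length, and the training string are all made up for illustration.

```python
from collections import Counter, defaultdict

def train_kmer_model(sequence, k=3):
    """Count which nucleotide follows each k-letter context (toy model, hypothetical)."""
    counts = defaultdict(Counter)
    for i in range(len(sequence) - k):
        context = sequence[i:i + k]
        counts[context][sequence[i + k]] += 1
    return counts

def autocomplete(model, prompt, n=5, k=3):
    """Extend the prompt by up to n letters, always picking the most frequent next letter."""
    out = prompt
    for _ in range(n):
        context = out[-k:]
        if context not in model:
            break  # the toy model has never seen this context, so it stops
        out += model[context].most_common(1)[0][0]
    return out

# Made-up training sequence; a real model would see trillions of nucleotides.
training = "ACGTACGTACGTACGT"
model = train_kmer_model(training, k=3)
print(autocomplete(model, "ACG", n=5))  # prints ACGTACGT
```

A real genomic language model replaces the counting table with a neural network and looks back over a far longer context, but the loop is the same shape: read what came before, predict the next letter, append it, repeat.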
Farbod: Yes. That's what I was gonna bring up. I think, you know, the audience might have mixed feelings about that, but what stood out to me? They were like, look, within this model, with EVO2, whenever you're predicting a new gene, we can tell you if it already exists, and if it doesn't, if you implement it, like what kind of behavior it could have, and then obviously you can do lab testing, put it into a living cell and see how it pans out. One thing that I thought was interesting in regards to CRISPR and gene editing, they were like, what if we could use this tool to modify bacteria to consume microplastics and to clean up oil spills, right? So again, I know it's a mixed feeling type of topic, especially as it relates to human beings, but the benefits of this tool are actually pretty significant. And even on the human-
Daniel: This doesn't necessarily dictate how the findings are applied, right? This is an engine for understanding DNA, right? That's what we should be excited about, is it's an engine for us to understand what's going on. Even if you decide not to go edit your genes, to get rid of a disease, you can understand what diseases you might be predisposed to based off of your DNA.
Farbod: Yeah, and you know, most of the stuff that they talked about are all the positives that we can all agree on, like modifying the bacteria. They said this can be used to significantly cut down on the discovery phase of new drug development because you can find which proteins work best for addressing whatever symptom. We talked about this two episodes ago, I think, on the snake venom episode, which has been a hit, by the way. They did mention that EVO2 could help with human evolution because we could predict what genes could be beneficial to us, but that's a topic for another day. I think what we should focus on here is that this is an incredible platform that they have open-sourced. It's available to all researchers, and they want others to use it in their workflow and build on top of it so that we can make better drugs and help our environment.
Daniel: Well, and I think two really important things to mention. And you mentioned one of them and I kind of want to double down on it. One of them is a direct quote from the researchers. They said it's really, really good at drug discovery, which is awesome. So obviously, helping identify treatments for potential diseases is awesome. But one of the other things that they mention is that this model can distinguish between harmless DNA mutations and disease-causing ones. So, where we've got a lot of concern around free radicals and/or mutations in your DNA, potentially cancer causing, as you mentioned before, is like, we're trying to limit exposure to anything that could cause any type of mutation in your DNA. And that's where we get a lot of conflicting advice from people like, oh, like don't put on sunscreen, but also don't get sunburned and don't eat this and don't eat that. And don't do that. Like it feels like you're, you're screwed either way. I feel like this will potentially help us get more clarity around which types of causes for DNA mutations are most likely to cause disease and then help us hone our focus on like, what's the 80-20? What's the Pareto principle? Like what's the 20% of effort that I can do to protect myself from 80% of bad outcomes? I feel like that's one of the most important things that you and I will experience in our day-to-day life, a little bit more clarity around a lot of the conflicting guidance that we get. I think the most specific one that I think of is like around sun exposure. Like they're like, you need sun exposure for vitamin D, but don't get so much sun exposure that it causes mutations in your skin that can cause melanoma. So, if you're outside, make sure you put on sunscreen, but also sunscreen can like cross through the blood brain barrier and cause disruptions in your hormones. This could try and give us some clarity around a lot of the conflicting advice like that, where it feels like you can't win either way.
Like, hopefully this will help us understand more what changes to make in our day-to-day life that actually have an impact on the presence of disease in our DNA.
Farbod: Dude. Yeah, I totally agree. That was a good hot take that I completely missed. And the sunscreen one, I'm totally with you. I think about it every morning when I put on my sunscreen. With that said, I'm going to wrap things up. So, folks, if you've heard about DNA sequencing and what a miracle it was that we cracked it a couple decades ago, you might be shocked to hear that it didn't crack the code all the way. Look, we understand what the nucleotides are that make up our DNA. You know, the four different ones, the A, C, G and T. But these folks over at Stanford are now developing a model, or have already developed a model, that can make sense of the different genes that are part of this DNA sequence and how they interact with each other. But more importantly, almost like you're using ChatGPT, it can predict what comes next. Now, why is this valuable? Well, by understanding the relation between the different genes, they can understand what is harmful, what can be used to develop new tools, how we can modify an existing sequence to give us beneficial properties. For example, one they called out was modifying bacteria to eat microplastics or clean up oil spills. And the next bit, it could even help us understand what good way we can go for evolving the human genome so that we get all the best properties possible.
Daniel: That's pretty fire, man.
Farbod: Sweet. That's the pod.
Daniel: All right. See you.
As always, you can find these and other interesting & impactful engineering articles on Wevolver.com.
To learn more about this show, please visit our shows page. By following the page, you will get automatic updates by email when a new show is published. Be sure to give us a follow and review on Apple podcasts, Spotify, and most of your favorite podcast platforms!
--
The Next Byte: We're two engineers on a mission to simplify complex science & technology, making it easy to understand. In each episode of our show, we dive into world-changing tech (such as AI, robotics, 3D printing, IoT, & much more), all while keeping it entertaining & engaging along the way.