An artist's rendering of esmGFP, a novel fluorescent protein created by ESM3. (Image credit: EvolutionaryScale)
Artificial intelligence (AI) simulated half a billion years of molecular evolution to develop the code for a previously unknown protein, according to a new study. The glowing protein, similar to those found in jellyfish and coral, could help create new drugs, the researchers say.
Proteins are some of the basic building blocks of life and perform many functions in the body, including building muscle and protecting against disease. The model protein, called esmGFP, exists only as computer code, but includes the blueprint for a new type of green fluorescent protein. In nature, green fluorescent proteins give fluorescent jellyfish and corals their unique glow.
The sequence of characters that form the instructions for making esmGFP is 58% identical to the closest known fluorescent protein, which is a human-modified version of a protein found in bubble sea anemones (Entacmaea quadricolor) — colorful jellyfish-like sea creatures with bubbles at the ends of their tentacles. The rest of the sequence is unique and would have required 96 different genetic mutations to evolve. The study indicates that it would have taken more than 500 million years for these changes to evolve naturally.
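To make the identity figure concrete, here is a minimal sketch of how percent identity between two aligned sequences can be computed. It assumes the 96 mutations are single-letter substitutions at aligned positions with no gaps, which the article does not spell out. The function names and the toy sequences are hypothetical and are not taken from the study.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned positions at which two equal-length sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def implied_length(differences: int, identity_percent: float) -> float:
    """Sequence length implied by a count of differing positions and an identity
    percentage, assuming every difference is a substitution (no gaps)."""
    return differences / (1.0 - identity_percent / 100.0)


# Toy example (made-up sequences): 8 of 10 positions match.
print(percent_identity("MSKGEELFTG", "MSKGAELFSG"))  # 80.0

# Under the no-gap assumption, 96 differences at 58% identity imply
# a protein a little under 230 amino acids long.
print(round(implied_length(96, 58.0)))
```

Because the reported 58% is rounded, the implied length is only approximate.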
Researchers at EvolutionaryScale demonstrated esmGFP and the AI model used to create it, ESM3, in a preliminary study last year. Now, independent scientists have peer-reviewed the results, which were published Jan. 16 in the journal Science.
ESM3 doesn't limit itself to the usual evolutionary framework for designing proteins. Instead, it's a problem solver that fills in the missing pieces of an incomplete protein code provided by researchers, and in doing so creates something that could exist based on all possible evolutionary paths.
“We found that ESM3 probes the fundamentals of biology and can generate functional proteins outside the space explored by evolution,” study co-author Alex Reeves, co-founder and chief scientific officer of EvolutionaryScale, told Live Science in an email.
The new research builds on work Reeves and his colleagues began at Meta, the parent company of Facebook and Instagram, before launching EvolutionaryScale in 2024. ESM3 is the latest version of their generative language model, similar to OpenAI's GPT-4, which powers ChatGPT, but with a focus on biology.
Proteins are made up of chains of molecules known as amino acids, the sequence of which is determined by genes. Different proteins have different amino acid sequences and structural differences, each folding into a unique shape to perform its function, according to Nature Education. To help ESM3 understand proteins, the researchers entered data about key features of a protein — its amino acid sequence, structure, and function — as a series of letters.
The team trained ESM3 on data from 2.78 billion proteins found in nature. The researchers then randomly obscured parts of the protein blueprint and had ESM3 fill in the missing pieces to complete the code based on what it learned.
“Just as a human can fill in the missing words in the phrase ‘to _ or not to _, that is _’, we can train a language model to fill in the gaps in proteins,” Reeves said. “Our study showed that by solving this simple task, information about the deep structure of the protein’s biology emerges in the network.”
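The fill-in-the-gaps task can be sketched in a few lines. This is a toy illustration only: the article does not describe ESM3's actual tokenization, masking schedule, or network, so the `_` placeholder, the `mask_fraction` parameter, and both function names are assumptions. It also assumes the input sequence is non-empty and does not itself contain `_`.

```python
import random

MASK = "_"  # hypothetical placeholder for a hidden position


def mask_sequence(sequence: str, mask_fraction: float, rng: random.Random):
    """Hide a random subset of positions and return the masked string
    together with the hidden letters a model would be trained to recover."""
    n_masked = max(1, round(mask_fraction * len(sequence)))
    positions = sorted(rng.sample(range(len(sequence)), n_masked))
    chars = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = chars[pos]  # the answer the model should predict
        chars[pos] = MASK
    return "".join(chars), targets


def fill_in(masked: str, predictions: dict) -> str:
    """Write predicted letters back into the masked positions."""
    chars = list(masked)
    for pos, residue in predictions.items():
        chars[pos] = residue
    return "".join(chars)


original = "MSKGEELFTGVVPILVELDG"  # made-up amino acid string
masked, targets = mask_sequence(original, 0.25, random.Random(0))
print(masked)
# A perfect model would predict exactly the hidden letters:
print(fill_in(masked, targets) == original)  # True
```

In training, a model's predictions would take the place of `targets` in `fill_in`, and the mismatch between the two would be the error signal it learns from.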
Scientists are already modifying natural proteins and creating new ones for a variety of purposes. For example, green fluorescent proteins are widely used in research labs. Their genetic code is often added to the ends of other DNA sequences so that the proteins those sequences encode glow green. This allows scientists to easily track proteins and cellular processes. Reeves emphasized that ESM3’s capabilities could accelerate a wide range of applications in protein engineering, including the development of
Source: www.livescience.com