In their search for new disease-fighting medicines, drug makers have long employed a laborious trial-and-error process to identify the right compounds. But what if artificial intelligence could predict the makeup of a new drug molecule the way Google figures out what you’re searching for, or email programs anticipate your replies—like “Got it, thanks”?
That’s the aim of a new approach that uses an AI technique known as natural language processing—the same technology that enables OpenAI’s ChatGPT to generate human-like responses—to analyze and synthesize proteins, which are the building blocks of life and of many drugs. The approach exploits the fact that biological codes have something in common with search queries and email texts: Both are represented by a series of letters.
Proteins are made up of dozens to thousands of small chemical subunits known as amino acids, and scientists use special notation to document the sequences. With each amino acid corresponding to a single letter of the alphabet, proteins are represented as long, sentence-like combinations.
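To make the letter notation concrete, here is a minimal sketch of how a protein sequence could be turned into the numeric tokens a language model consumes. It assumes the standard one-letter codes for the 20 common amino acids. The function names and the example sequence are hypothetical, chosen only for illustration.

```python
# The 20 standard amino acids, one letter each (A = alanine, C = cysteine, ...).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical vocabulary: each letter gets an integer id, like a word in a text model.
TOKEN_ID = {letter: i for i, letter in enumerate(AMINO_ACIDS)}


def encode(sequence):
    """Convert a protein written as letters into a list of integer tokens."""
    seq = sequence.upper()
    unknown = sorted(set(seq) - set(AMINO_ACIDS))
    if unknown:
        raise ValueError(f"not a standard amino-acid letter: {unknown}")
    return [TOKEN_ID[letter] for letter in seq]


def decode(token_ids):
    """Convert integer tokens back into the letter notation."""
    return "".join(AMINO_ACIDS[i] for i in token_ids)


# Made-up ten-letter fragment, used only to show the round trip.
example = "MKTAYIAKQR"
tokens = encode(example)
print(tokens)
print(decode(tokens))
```

Once a protein is a list of tokens, it can be handled by the same kind of software that processes sentences.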
Natural language algorithms, which quickly analyze language and predict the next step in a conversation, can also be applied to this biological data to create protein-language models. The models encode what might be called the grammar of proteins—the rules that govern which amino acid combinations yield specific therapeutic properties—to predict the sequences of letters that could become the basis of new drug molecules. As a result, the time required for the early stages of drug discovery could shrink from years to months.
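The idea of predicting the next letter can be illustrated with a deliberately tiny stand-in: a model that only counts which amino-acid letter tends to follow which. This is a sketch of the general principle, not the method used by any company named here. The training sequences are invented, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram(sequences):
    """Count, for each letter, how often each other letter comes right after it."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def next_letter_probs(counts, prev):
    """Turn the raw counts for one letter into probabilities of the next letter."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}
    return {letter: n / total for letter, n in followers.items()}


def most_likely_next(counts, prev):
    """Return the single most probable next letter, or None if prev was never seen."""
    probs = next_letter_probs(counts, prev)
    return max(probs, key=probs.get) if probs else None


# Invented toy "proteins"; real models learn from vast sequence databases.
toy_sequences = ["MKTA", "MKTL", "MKSA"]
model = train_bigram(toy_sequences)
print(next_letter_probs(model, "K"))
print(most_likely_next(model, "K"))
```

A model this small looks only one letter back. The protein-language models described in this article are far larger and learn much longer-range patterns, but the task of guessing what comes next in the sequence is the same in spirit.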
“Nature has provided us with tons of examples of proteins that have been designed exquisitely with a variety of functions,” says Ali Madani, founder of ProFluent Bio, a Berkeley, Calif.-based startup focused on language-based protein design. “We’re learning the blueprint from nature.”
Protein-based drugs are used to treat heart disease, certain cancers and HIV, among other illnesses. In the past two years, companies including Merck & Co., Roche Holding AG’s Genentech and a number of startups like Helixon Ltd. and Ainnocence have begun to pursue new drugs with natural language processing. The approach, they hope, will not only boost the effectiveness of existing drugs and drug candidates but also open the door to never-before-seen molecules that could treat diseases like pancreatic cancer or ALS, for which more effective medicines have remained elusive.
“Technologies like these are going to start addressing areas of biology that have been ‘undruggable,’” says Sean McClain, founder and CEO of Absci Corp., a drug discovery company in Vancouver, Wash.
The lab at Absci Corp., which is working with Merck to explore new methods of designing medicines.
Natural language processing for drug discovery still faces major hurdles, according to computational biologists. Tinkering too much with existing protein-based drugs could introduce unintended side effects, they say, and wholly synthetic molecules will require rigorous testing to make sure they’re safe for the human body.
But if the natural-language algorithms work as their adopters hope, they will bring new force to the promise of artificial intelligence to transform drug discovery. Previous attempts to use AI struggled with limitations in the technology or a lack of data. Recent advances in natural language processing and a dramatic drop in the cost of protein sequencing, which has yielded vast databases of amino-acid sequences, have largely overcome both problems, proponents say.
With the technology still in the early stages, companies for now are focused on using protein-language models to enhance known molecules, for example by improving the efficacy of drug candidates. Given, say, a naturally occurring monoclonal antibody as a starting point, the models can recommend tweaks to its amino acid sequence to improve its therapeutic benefit.
In a pre-print paper published online in August, researchers at Absci used this method to enhance the antibody-based cancer drug trastuzumab so that it binds more tightly to its target on the surface of cancer cells. A tighter bind could mean patients derive benefit from a lower dosage, shortening drug regimens and reducing side effects.
In another paper published in March in the Proceedings of the National Academy of Sciences, researchers from MIT, Tsinghua University and Helixon, which is based in Beijing, used protein-language models to transform a Covid-19 drug candidate that’s only effective against alpha, beta and gamma variants into one that could also treat delta.
Ainnocence, a startup that spans the U.S. and China, helps clients use such models to modify animal proteins, such as antibodies from rabbits—a common starting point for drug discovery—into forms compatible with human physiology, according to the company’s founder and CEO, Lurong Pan.
But even now drugmakers are setting their sights beyond the modification of known proteins to so-called de novo design, the process of synthesizing molecules from scratch.
Genentech says a recent experiment showed that it was possible to design an antibody to bind to the same cellular target as pertuzumab, a breast cancer drug on the market that Genentech sells under the brand name Perjeta, but with an entirely new amino acid sequence. Company scientists gave their protein-language models only the target and the antibody’s desired three-dimensional shape (the primary determinant of a protein’s function), says Richard Bonneau, a Genentech executive director who joined the company last year when it acquired his startup, Prescient Design.
Absci and Helixon are also working with drugmakers to design medicines for cancer and autoimmune diseases using de novo methods. Absci announced a partnership in January with Merck to go after three drug targets, according to Mr. McClain. A Merck spokesman said the company has entered into a number of collaborations to explore the potential of artificial intelligence in drug development. Helixon last month signed with two big pharma companies to tackle previously undruggable diseases, CEO and founder Jian Peng says.
“All the hard problems in drug discovery have been stuck there for a long time and have been waiting for a new wave of technology to solve it,” says Ainnocence’s Dr. Pan. “This is really a paradigm-shifting methodology.”
Ultimately, many computational biologists expect protein-language models to yield benefits beyond faster drug development. The same technique might be used to produce better enzymes for degrading plastics, treating wastewater and cleaning up oil spills, among other environmental applications, the biologists say.
“Proteins are the workhorses of life,” ProFluent Bio’s Dr. Madani says. “They enable us to breathe and see, they enable the environment to be sustained, they enable human health and disease. If we can design better workers or new workers altogether, that could have really wide-ranging applications.”