AI-powered language model generates functional protein sequences

The first time a language model was used to synthesize functional proteins. 

Jijo Malayil
A chain of amino acids forming a biomolecule called a protein.

Christoph Burgstedt/iStock 

Of late, AI models have really been flexing their muscles. We have recently seen ChatGPT become the poster child for platforms that comprehend human language. Now a team of researchers has tested a language model that creates amino acid sequences, showcasing AI's ability to mimic the proteins that biology and evolution produce. 

The language model, named ProGen, is capable of generating protein sequences with a certain degree of control. The result was achieved by training the model to learn the composition of proteins. The experiment marks the first time a language model has been used to synthesize functional proteins. 

A study describing the research was published in the journal Nature Biotechnology on Thursday. The project was a combined effort by researchers at the University of California, San Francisco, the University of California, Berkeley, and Salesforce Research, the science arm of a San Francisco-based software company. 

The significance of using a language model

Researchers say that a language model was used for its ability to generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural language sentences on diverse topics.

“In the same way that words are strung together one-by-one to form text sentences, amino acids are strung together one-by-one to make proteins,” Nikhil Naik, the Director of AI Research at Salesforce Research, told Motherboard. The team applied “neural language modeling to proteins for generating realistic, yet novel protein sequences.”
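That analogy maps directly onto how such a model produces a protein: one residue is sampled at a time, each choice conditioned on everything generated before it. The sketch below illustrates the autoregressive loop in Python; the uniform toy_next_token_probs function is a placeholder assumption standing in for a trained neural network, not ProGen's actual model or interface.

import random

# The 20 standard amino acids, each treated as a "token" in the model's vocabulary.
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
END_TOKEN = "<end>"
VOCAB = AMINO_ACIDS + [END_TOKEN]

def toy_next_token_probs(prefix):
    # Placeholder for a trained language model: given the residues generated so far,
    # return a probability for each possible next token. A real system such as ProGen
    # computes these with a large Transformer; a uniform distribution keeps this runnable.
    return {token: 1.0 / len(VOCAB) for token in VOCAB}

def generate_sequence(max_length=300):
    # Autoregressive generation: amino acids are sampled one-by-one, each choice
    # conditioned on the prefix, just as words are strung together into a sentence.
    sequence = []
    while len(sequence) < max_length:
        probs = toy_next_token_probs(sequence)
        token = random.choices(list(probs), weights=list(probs.values()))[0]
        if token == END_TOKEN:
            break
        sequence.append(token)
    return "".join(sequence)

print(generate_sequence())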

The model was trained on 280 million protein sequences from over 19,000 families, a dataset that was "augmented with control tags specifying protein properties."

According to Motherboard, the team's use of conditional language models allows significantly more control over which types of sequences are generated, making them more useful for designing proteins with specific properties.  
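To make that concrete, here is a hedged sketch of how a training example might be assembled when sequences are augmented with control tags. The tag format, function name, and example peptide are illustrative assumptions rather than the paper's actual data pipeline; the idea is simply that property tags are prepended to the residue tokens, so the model learns which kinds of sequences follow which tags, and at generation time the tags alone can be supplied to steer the output.

def build_training_example(sequence, family, keywords):
    # Prepend property tags (hypothetical names) to the amino acid tokens, so the
    # model learns the association between tags and the sequences that follow them.
    tags = [f"<family:{family}>"] + [f"<kw:{kw}>" for kw in keywords]
    return tags + list(sequence)

example = build_training_example(
    sequence="MKALIVLGLVLLSVTVQG",   # a short illustrative peptide, not a real training record
    family="lysozyme",
    keywords=["hydrolase", "secreted"],
)
print(example[:8])
# ['<family:lysozyme>', '<kw:hydrolase>', '<kw:secreted>', 'M', 'K', 'A', 'L', 'I']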

The use cases of such a development 

The flexibility of such a model to generate functional artificial proteins across protein families has promising applications. According to the team, "additional analyses suggest that our model has learned a flexible protein sequence representation that can be applied to diverse families like lysozymes, CM, and MDH" (CM and MDH being chorismate mutase and malate dehydrogenase).

Since proteins form the building blocks of the human body, further studies are investigating how ProGen could help identify treatments for disorders like rheumatoid arthritis and multiple sclerosis.

Abstract

Deep-learning language models have shown promise in various biotechnological applications, including protein design and engineering. Here we describe ProGen, a language model that can generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. The model was trained on 280 million protein sequences from >19,000 families and is augmented with control tags specifying protein properties. ProGen can be further fine-tuned to curated sequences and tags to improve the controllable generation performance of proteins from families with sufficient homologous samples. Artificial proteins fine-tuned to five distinct lysozyme families showed similar catalytic efficiencies as natural lysozymes, with sequence identity to natural proteins as low as 31.4%. ProGen is readily adapted to diverse protein families, as we demonstrate with chorismate mutase and malate dehydrogenase.
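As a note on the 31.4% figure, sequence identity is simply the fraction of aligned positions at which two sequences carry the same residue. A minimal sketch follows, assuming the two sequences have already been aligned to equal length; a real analysis would run an alignment tool first.

def sequence_identity(aligned_a, aligned_b):
    # Fraction of aligned positions with identical residues; gaps ("-") count as mismatches.
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return matches / len(aligned_a)

print(f"{sequence_identity('MKT-AYIAKQR', 'MKSEAYLAKQR'):.1%}")   # prints 72.7%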