Thesis

Distillation of Language Model Semantics to Folded Three-Dimensional Protein Structures

Allan Costa

Abstract

Determining the structure of proteins has been a long-standing goal in biology. Language models have recently been deployed to capture the evolutionary semantics of protein sequences and, as an emergent property, were found to be structural learners. When enriched with multiple sequence alignments (MSAs), these transformer models capture significant information about a protein’s tertiary structure. In this work, we show how such structural information can be recovered by processing language model embeddings, and we introduce a two-stage folding pipeline that directly estimates three-dimensional folded structures from protein sequences. We envision that this pipeline will provide a basis for efficient, end-to-end protein structure prediction through protein language modeling.
