Semi-supervised Retrieval of Functional Residues Through the Integration of Protein Language Models and Gene Ontology Data
Dickson, A. M., Mouline, S., Tamadon, A., Mofrad, M.
bioRxiv·2025
Motivation: Experimental studies of protein function often focus on mechanistic descriptions, characterizing how specific sites and residues contribute to activity. Abstractions such as domains and active sites enable quantitative descriptions of how protein features act biologically. Thanks to the abundance of high-quality sequence and function data, machine learning has achieved great success in directly predicting protein function. However, translating functional characterizations into mechanistic ones at the level of domains, binding sites, or motifs remains challenging. This represents a semi-supervised problem: sequences and global functional labels are available, but local annotations must be inferred.
Results: We investigate the unsupervised discovery of functionally active protein regions by integrating protein sequence models with functional information. We first formalize the residue-level functional annotation problem by constructing unified evaluation datasets that link Gene Ontology functions to annotated residues. Eight datasets are assembled, spanning levels of specificity from single active-site residues to domains covering up to 60% of a protein. We then introduce a new class of function-conditioned generative models that predict functionally important residues more accurately than existing approaches, including interpretability methods and PSSM entropy estimation, across multiple benchmark datasets.
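For readers unfamiliar with the entropy baseline mentioned above, the following is a minimal sketch of the general idea, not the authors' implementation. It scores each column of a multiple sequence alignment by Shannon entropy computed from raw residue frequencies and treats low-entropy (conserved) columns as candidate functional positions. A full PSSM would typically also use pseudocounts and background frequencies, which are omitted here. The function names `column_entropies` and `rank_conserved` and the toy alignment are hypothetical, introduced only for illustration.

```python
import math
from collections import Counter


def column_entropies(alignment, gap="-"):
    """Shannon entropy (in bits) of each alignment column, ignoring gaps.

    Simplification: uses raw residue frequencies, without the pseudocounts
    or background frequencies a real PSSM would include.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    entropies = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] != gap)
        total = sum(counts.values())
        if total == 0:
            entropies.append(float("nan"))  # column contains only gaps
            continue
        h = -sum((n / total) * math.log2(n / total) for n in counts.values())
        entropies.append(h)
    return entropies


def rank_conserved(alignment, top_k=3):
    """Indices of the top_k lowest-entropy columns (ties broken by position)."""
    ents = column_entropies(alignment)
    scored = [i for i, h in enumerate(ents) if not math.isnan(h)]
    return sorted(scored, key=lambda i: (ents[i], i))[:top_k]


# Hypothetical toy alignment, for illustration only.
toy_alignment = [
    "MKHSA",
    "MRHSG",
    "MKHTA",
    "MQHSA",
]

entropies = column_entropies(toy_alignment)
conserved = rank_conserved(toy_alignment, top_k=3)
```

In this toy alignment, columns 0 and 2 are fully conserved and rank first. A conservation score of this kind is function-agnostic: it assigns the same ranking regardless of which Gene Ontology term is of interest, which is the gap that function-conditioned models aim to address.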
Availability: github.com/mofradlab/go_interp
Contact: mofrad@berkeley.edu