Protein design

Protein design is the rational design of new protein molecules to design novel activity, behavior, or purpose, and to advance basic understanding of protein function. Proteins can be designed from scratch (de novo design) or by making calculated variants of a known protein structure and its sequence (termed protein redesign). Rational protein design approaches make protein-sequence predictions that will fold to specific structures. These predicted sequences can then be validated experimentally through methods such as peptide synthesis, site-directed mutagenesis, or artificial gene synthesis.

Rational protein design dates back to the mid-1970s. Recently, however, there were numerous examples of successful rational design of water-soluble and even transmembrane peptides and proteins, in part due to a better understanding of different factors contributing to protein structure stability and development of better computational methods.

Overview and history

The goal in rational protein design is to predict amino acid sequences that will fold to a specific protein structure. Although the number of possible protein sequences is vast, growing exponentially with the size of the protein chain, only a subset of them will fold reliably and quickly to one native state. Protein design involves identifying novel sequences within this subset. The native state of a protein is the conformational free energy minimum for the chain. Thus, protein design is the search for sequences that have the chosen structure as a free energy minimum. In a sense, it is the reverse of protein structure prediction. In design, a tertiary structure is specified, and a sequence that will fold to it is identified. Hence, it is also termed inverse folding. Protein design is then an optimization problem: using some scoring criteria, an optimized sequence that will fold to the desired structure is chosen.

When the first proteins were rationally designed during the 1970s and 1980s, the sequence for these was optimized manually based on analyses of other known proteins, the sequence composition, amino acid charges, and the geometry of the desired structure. In 2003, David Baker's laboratory designed a full protein to a fold never seen before in nature. In 2010, one of the most powerful broadly neutralizing antibodies was isolated from patient serum using a computationally designed protein probe. In 2024, Baker received one half of the Nobel Prize in Chemistry for his advancement of computational protein design, with the other half being shared by Demis Hassabis and John Jumper of Deepmind for protein structure prediction. Due to these and other successes (e.g., see examples below), protein design has become one of the most important tools available for protein engineering. There is great hope that the design of new proteins, small and large, will have uses in biomedicine and bioengineering.

Underlying models of protein structure and function

Protein design programs use computer models of the molecular forces that drive proteins in in vivo environments. In order to make the problem tractable, these forces are simplified by protein design models. Although protein design programs vary greatly, they have to address four main modeling questions: What is the target structure of the design, what flexibility is allowed on the target structure, which sequences are included in the search, and which force field will be used to score sequences and structures.

Target structure

thumb|left|The [[Top7 protein was one of the first proteins designed for a fold that had never been seen before in nature]]

Protein function is heavily dependent on protein structure, and rational protein design uses this relationship to design function by designing proteins that have a target structure or fold. Thus, by definition, in rational protein design the target structure or ensemble of structures must be known beforehand. This contrasts with other forms of protein engineering, such as directed evolution, where a variety of methods are used to find proteins that achieve a specific function, and with protein structure prediction where the sequence is known, but the structure is unknown.

Most often, the target structure is based on a known structure of another protein. However, novel folds not seen in nature have been made increasingly possible. Peter S. Kim and coworkers designed trimers and tetramers of unnatural coiled coils, which had not been seen before in nature.

Sequence space

thumb|FSD-1 (shown in blue, PDB id: 1FSV) was the first de novo computational design of a full protein. The target fold was that of the zinc finger in residues 33–60 of the structure of protein Zif268 (shown in red, PDB id: 1ZAA). The designed sequence had very little sequence identity with any known protein sequence.

In rational protein design, proteins can be redesigned from the sequence and structure of a known protein, or completely from scratch in de novo protein design. In protein redesign, most of the residues in the sequence are maintained as their wild-type amino-acid while a few are allowed to mutate. In de novo design, the entire sequence is designed anew, based on no prior sequence.

Both de novo designs and protein redesigns can establish rules on the sequence space: the specific amino acids that are allowed at each mutable residue position. For example, the composition of the surface of the RSC3 probe to select HIV-broadly neutralizing antibodies was restricted based on evolutionary data and charge balancing. Many of the earliest attempts on protein design were heavily based on empiric rules on the sequence space. Backbone-dependent rotamer libraries, in contrast, describe the rotamers as how likely they are to appear depending on the protein backbone arrangement around the side chain. Most protein design programs use one conformation (e.g., the modal value for rotamer dihedrals in space) or several points in the region described by the rotamer; the OSPREY protein design program, in contrast, models the entire continuous region. Backbone flexibility is especially important in protein redesign because sequence mutations often result in small changes to the backbone structure. Moreover, backbone flexibility can be essential for more advanced applications of protein design, such as binding prediction and enzyme design. Some models of protein design backbone flexibility include small and continuous global backbone movements, discrete backbone samples around the target fold, backrub motions, and protein loop flexibility.

Physics-based energy functions, such as AMBER and CHARMM, are typically derived from quantum mechanical simulations, and experimental data from thermodynamics, crystallography, and spectroscopy. These energy functions typically simplify physical energy function and make them pairwise decomposable, meaning that the total energy of a protein conformation can be calculated by adding the pairwise energy between each atom pair, which makes them attractive for optimization algorithms. Physics-based energy functions typically model an attractive-repulsive Lennard-Jones term between atoms and a pairwise electrostatics coulombic term between non-bonded atoms.

thumb|left|Water-mediated hydrogen bonds play a key role in protein–protein binding. One such interaction is shown between residues D457, S365 in the heavy chain of the HIV-broadly-neutralizing antibody VRC01 (green) and residues N58 and Y59 in the HIV envelope protein GP120 (purple).

Statistical potentials, in contrast to physics-based potentials, have the advantage of being fast to compute, of accounting implicitly of complex effects and being less sensitive to small changes in the protein structure. These energy functions are based on deriving energy values from frequency of appearance on a structural database.

Protein design, however, has requirements that can sometimes be limited in molecular mechanics force-fields. Molecular mechanics force-fields, which have been

used mostly in molecular dynamics simulations, are optimized for the simulation of single sequences, but protein design searches through many conformations of many sequences. Thus, molecular mechanics force-fields must be tailored for protein design. In practice, protein design energy functions often incorporate both statistical terms and physics-based terms. For example, the Rosetta energy function, one of the most-used energy functions, incorporates physics-based energy terms originating in the CHARMM energy function, and statistical energy terms, such as rotamer probability and knowledge-based electrostatics. Typically, energy functions are highly customized between laboratories, and specifically tailored for every design. Even though the class of problems is NP-hard, in practice many instances of protein design can be solved exactly or optimized satisfactorily through heuristic methods.

Algorithms

Several algorithms have been developed specifically for the protein design problem. These algorithms can be divided into two broad classes: exact algorithms, such as dead-end elimination, that lack runtime guarantees but guarantee the quality of the solution; and heuristic algorithms, such as Monte Carlo, that are faster than exact algorithms but have no guarantees on the optimality of the results. Exact algorithms guarantee that the optimization process produced the optimal according to the protein design model. Thus, if the predictions of exact algorithms fail when these are experimentally validated, then the source of error can be attributed to the energy function, the allowed flexibility, the sequence space or the target structure (e.g., if it cannot be designed for).

Some protein design algorithms are listed below. Although these algorithms address only the most basic formulation of the protein design problem, Equation (), when the optimization goal changes because designers introduce improvements and extensions to the protein design model, such as improvements to the structural flexibility allowed (e.g., protein backbone flexibility) or including sophisticated energy terms, many of the extensions on protein design that improve modeling are built atop these algorithms. For example, Rosetta Design incorporates sophisticated energy terms, and backbone flexibility using Monte Carlo as the underlying optimizing algorithm. OSPREY's algorithms build on the dead-end elimination algorithm and A* to incorporate continuous backbone and side-chain movements. Thus, these algorithms provide a good perspective on the different kinds of algorithms available for protein design.

In 2020 scientists reported the development of an AI-based process using genome databases for evolution-based designing of novel proteins. They used deep learning to identify design-rules. In 2022, a study reported deep learning software that can design proteins that contain pre-specified functional sites.

With mathematical guarantees

Dead-end elimination

The dead-end elimination (DEE) algorithm reduces the search space of the problem iteratively by removing rotamers that can be provably shown to be not part of the global lowest energy conformation (GMEC). On each iteration, the dead-end elimination algorithm compares all possible pairs of rotamers at each residue position, and removes each rotamer r′i that can be shown to always be of higher energy than another rotamer ri and is thus not part of the GMEC:

: <math> E(r^\prime_i) + \sum_{j\ne i} \min_{r_j} E(r^\prime_i,r_j) > E(r_i) + \sum_{j\ne i} \max_{r_j} E(r_i,r_j) </math>

Other powerful extensions to the dead-end elimination algorithm include the pairs elimination criterion, and the generalized dead-end elimination criterion. This algorithm has also been extended to handle continuous rotamers with provable guarantees.

Although the Dead-end elimination algorithm runs in polynomial time on each iteration, it cannot guarantee convergence. If, after a certain number of iterations, the dead-end elimination algorithm does not prune any more rotamers, then either rotamers have to be merged or another search algorithm must be used to search the remaining search space. In such cases, the dead-end elimination acts as a pre-filtering algorithm to reduce the search space, while other algorithms, such as A*, Monte Carlo, Linear Programming, or FASTER are used to search the remaining search space.

A popular search algorithm for protein design is the A* search algorithm.

Message-passing based approximations to the linear programming dual

ILP solvers depend on linear programming (LP) algorithms, such as the Simplex or barrier-based methods to perform the LP relaxation at each branch. These LP algorithms were developed as general-purpose optimization methods and are not optimized for the protein design problem (Equation ()). In consequence, the LP relaxation becomes the bottleneck of ILP solvers when the problem size is large. Recently, several alternatives based on message-passing algorithms have been designed specifically for the optimization of the LP relaxation of the protein design problem. These algorithms can approximate both the dual or the primal instances of the integer programming, but in order to maintain guarantees on optimality, they are most useful when used to approximate the dual of the protein design problem, because approximating the dual guarantees that no solutions are missed. Message-passing based approximations include the tree reweighted max-product message passing algorithm, and the message passing linear programming algorithm.

Optimization algorithms without guarantees

Monte Carlo and simulated annealing

Monte Carlo is one of the most widely used algorithms for protein design. In its simplest form, a Monte Carlo algorithm selects a residue at random, and in that residue a randomly chosen rotamer (of any amino acid) is evaluated.

FASTER

The FASTER algorithm uses a combination of deterministic and stochastic criteria to optimize amino acid sequences. FASTER first uses DEE to eliminate rotamers that are not part of the optimal solution. Then, a series of iterative steps optimize the rotamer assignment.

Belief propagation

In belief propagation for protein design, the algorithm exchanges messages that describe the belief that each residue has about the probability of each rotamer in neighboring residues. The algorithm updates messages on every iteration and iterates until convergence or until a fixed number of iterations. Convergence is not guaranteed in protein design. The message mi→ j(rj that a residue i sends to every rotamer (rj at neighboring residue j is defined as:

: <math>m_{i\to j}(r_j) = \max_{r_i} \Big(e^{\frac{-E_i(r_i)-E_{ij}(r_i,r_j)}{T\Big) \prod_{k \in N(i)\backslash j} m_{k\to i (r_i)}</math>

Both max-product and sum-product belief propagation have been used to optimize protein design.

Applications and examples of designed proteins

Enzyme design

The design of new enzymes is a use of protein design with huge bioengineering and biomedical applications. In general, designing a protein structure can be different from designing an enzyme, because the design of enzymes must consider many states involved in the catalytic mechanism. However protein design is a prerequisite of de novo enzyme design because, at the very least, the design of catalysts requires a scaffold in which the catalytic mechanism can be inserted.

Great progress in de novo enzyme design, and redesign, was made in the first decade of the 21st century. In three major studies, David Baker and coworkers de novo designed enzymes for the retro-aldol reaction, a Kemp-elimination reaction, and for the Diels-Alder reaction. Furthermore, Stephen Mayo and coworkers developed an iterative method to design the most efficient known enzyme for the Kemp-elimination reaction. Also, in the laboratory of Bruce Donald, computational protein design was used to switch the specificity of one of the protein domains of the nonribosomal peptide synthetase that produces Gramicidin S, from its natural substrate phenylalanine to other noncognate substrates including charged amino acids; the redesigned enzymes had activities close to those of the wild-type.

Semi-rational design

Semi-rational design is a purposeful modification method based on a certain understanding of the sequence, structure, and catalytic mechanism of enzymes. This method is between irrational design and rational design. It uses known information and means to perform evolutionary modification on the specific functions of the target enzyme. The characteristic of semi-rational design is that it does not rely solely on random mutation and screening, but combines the concept of directed evolution. It creates a library of random mutants with diverse sequences through mutagenesis, error-prone RCR, DNA recombination, and site-saturation mutagenesis. At the same time, it uses the understanding of enzymes and design principles to purposefully screen out mutants with desired characteristics.

The methodology of semi-rational design emphasizes the in-depth understanding of enzymes and the control of the evolutionary process. It allows researchers to use known information to guide the evolutionary process, thereby improving efficiency and success rate. This method plays an important role in protein function modification because it can combine the advantages of irrational design and rational design, and can explore unknown space and use known knowledge for targeted modification.

Semi-rational design has a wide range of applications, including but not limited to enzyme optimization, modification of drug targets, evolution of biocatalysts, etc. Through this method, researchers can more effectively improve the functional properties of proteins to meet specific biotechnology or medical needs. Although this method has high requirements for information and technology and is relatively difficult to implement, with the development of computing technology and bioinformatics, the application prospects of semi-rational design in protein engineering are becoming more and more broad.

Design for affinity

Protein–protein interactions are involved in most biotic processes. Many of the hardest-to-treat diseases, such as Alzheimer's disease, many forms of cancer (e.g., TP53), and human immunodeficiency virus (HIV) infection involve protein–protein interactions. Thus, to treat such diseases, it is desirable to design protein or protein-like therapeutics that bind one of the partners of the interaction and, thus, disrupt the disease-causing interaction. This requires designing protein-therapeutics for affinity toward its partner.

Protein–protein interactions can be designed using protein design algorithms because the principles that rule protein stability also rule protein–protein binding. Protein–protein interaction design, however, presents challenges not commonly present in protein design. One of the most important challenges is that, in general, the interfaces between proteins are more polar than protein cores, and binding involves a tradeoff between desolvation and hydrogen bond formation. To overcome this challenge, Bruce Tidor and coworkers developed a method to improve the affinity of antibodies by focusing on electrostatic contributions. They found that, for the antibodies designed in the study, reducing the desolvation costs of the residues in the interface increased the affinity of the binding pair.

Scoring binding predictions

Protein design energy functions must be adapted to score binding predictions because binding involves a trade-off between the lowest-energy conformations of the free proteins (EP and EL) and the lowest-energy conformation of the bound complex (EPL):

: <math>\Delta_G = E_{PL} - E_P - E_L </math>.

The K* algorithm approximates the binding constant of the algorithm by including conformational entropy into the free energy calculation. The K* algorithm considers only the lowest-energy conformations of the free and bound complexes (denoted by the sets P, L, and PL) to approximate the partition functions of each complex: Further, positive and negative design was also used by Anderson and coworkers to predict mutations in the active site of a drug target that conferred resistance to a new drug; positive design was used to maintain wild-type activity, while negative design was used to disrupt binding of the drug. Recent computational redesign by Costas Maranas and coworkers was also capable of experimentally switching the cofactor specificity of Candida boidinii xylose reductase from NADPH to NADH.

Protein resurfacing

Protein resurfacing consists of designing a protein's surface while preserving the overall fold, core, and boundary regions of the protein intact. Protein resurfacing is especially useful to alter the binding of a protein to other proteins. One of the most important applications of protein resurfacing was the design of the RSC3 probe to select broadly neutralizing HIV antibodies at the NIH Vaccine Research Center. First, residues outside of the binding interface between the gp120 HIV envelope protein and the formerly discovered b12-antibody were selected to be designed. Then, the sequence spaced was selected based on evolutionary information, solubility, similarity with the wild-type, and other considerations. Then the RosettaDesign software was used to find optimal sequences in the selected sequence space. RSC3 was later used to discover the broadly neutralizing antibody VRC01 in the serum of a long-term HIV-infected non-progressor individual.

Design of globular proteins

Globular proteins are proteins that contain a hydrophobic core and a hydrophilic surface. Globular proteins often assume a stable structure, unlike fibrous proteins, which have multiple conformations. The three-dimensional structure of globular proteins is typically easier to determine through X-ray crystallography and nuclear magnetic resonance than both fibrous proteins and membrane proteins, which makes globular proteins more attractive for protein design than the other types of proteins. Most successful protein designs have involved globular proteins. Both RSD-1, and Top7 were de novo designs of globular proteins. Five more protein structures were designed, synthesized, and verified in 2012 by the Baker group. These new proteins serve no biotic function, but the structures are intended to act as building-blocks that can be expanded to incorporate functional active sites. The structures were found computationally by using new heuristics based on analyzing the connecting loops between parts of the sequence that specify secondary structures.

Design of membrane proteins

Several transmembrane proteins have been successfully designed, along with many other membrane-associated peptides and proteins. Recently, Costas Maranas and his coworkers developed an automated tool to redesign the pore size of Outer Membrane Porin Type-F (OmpF) from E.coli to any desired sub-nm size and assembled them in membranes to perform precise angstrom scale separation.

Other applications

One of the most desirable uses for protein design is for biosensors, proteins that will sense the presence of specific compounds. Some attempts in the design of biosensors include sensors for unnatural molecules including TNT. More recently, Kuhlman and coworkers designed a biosensor of the PAK1.

In a sense, protein design is a subset of battery design.