Protein inverse folding problem is a natural inverse problem to protein structure prediction: given a target structure in three dimensions, we desire to design an amino acid sequence that is likely to fold to the given structure. However, for a structure of length N, there will be a total of 20N possible sequences even without considering different orientations of the side chains (“rotamer configuration”), since there are 20 amino acid types. Exhaustive enumeration of all possible sequences and then selection of the best sequence for given structure is beyond the scope of the computational power. Developing an appropriate model to study the protein inverse folding problem is challenging.
The inverse folding problem is considered as an optimization problem based on a fitness function. A model of Sun et al. (1995) casted this problem as an optimization problem on a space of sequences of hydrophobic (H) and polar (P) monomers; the goal is to find a sequence which achieves a dense hydrophobic core with few solvent-exposed hydrophobic residues without guarantee of optimality or near-optimality (Sun et al., 1995); Kleinberg converted this problem to an efficient flow network algorithm in order to construct optimal sequences (Kleinberg, 1999).
In this study, our tasks encompass finding out the optimal sequences for a given protein three-dimensional structure. After simulating and verifying Kleinberg’s flow network algorithm, we explore a new linear fitness function based on the GCSE model of Sun et al.(1995) and Kleinberg’s flow network transformation(1999), and we have solved this protein inverse folding problem by providing a simple and efficient linear programming method to construct optimal sequences. We demonstrate our model’s effectiveness by implementation on 23 structures drawn from the Protein Data Bank. Furthermore, we also consider the extensions of this linear programming method with specified residue composition in a sequence, as a way to overcome the limitations that a sharp imbalance in the ratio of H to P residues prevents designed sequences from having a high degree of agreement with the natural sequence in most cases.
The linear programming method for solving the inverse folding problem provides a general model for studying a variety of problems in protein design, including the design of new proteins and modification of existing proteins in order to alter their functions, structures, and folding properties. It can be further extended to solve the inverse binding problem, namely, to seek to design protein sequence for the optimal binding, which arises in many situations. This has significant importance in protein design, protein engineering, and also in searching therapeutic agents.