
A compilation of deep learning methods for protein design

mailmrcai/design_tools

 
 


💡 Notes

  • This list accompanies our preprint: https://www.biorxiv.org/content/10.1101/2022.08.31.505981v1. We focus on deep learning methods for protein design released since 2018 (mostly since 2019). The tables below complement Table 1 in our manuscript.

  • We curated this list manually, so it may be incomplete. Please email us or open an issue if your method is missing or described incorrectly.

  • We order the methods by release date (preprint date when available) and categorize them into four classes (for more details on these categories, see Figure 1 and the text of our preprint):

    • 1️⃣: 'Fixed-backbone' protein design; p(sequence|structure)
    • 2️⃣: Structure generation; p(structure)
    • 3️⃣: Sequence generation; p(sequence) or p(sequence|sequence*)
    • 4️⃣: Concomitant sequence and structure design; p(sequence and structure) (which can be constrained)
  • Others before us have also done fantastic work assembling deep learning methods for other protein-related problems, sometimes overlapping with this list. We link these lists here:

  • 💥 This work was recently highlighted in Nature
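
The four classes above can be related through the product rule of probability. The block below is a sketch in shorthand introduced here for illustration (s for sequence, x for structure, s* for a conditioning sequence); this notation is an assumption and is not taken from the preprint.

```latex
\begin{align}
\text{Class 1:}\quad & p(s \mid x) \\
\text{Class 2:}\quad & p(x) \\
\text{Class 3:}\quad & p(s) \ \text{or}\ p(s \mid s^{*}) \\
\text{Class 4:}\quad & p(s, x) = p(x)\, p(s \mid x)
\end{align}
```

Read this way, drawing a structure from a Class 2 model and then a sequence from a Class 1 model is one possible route to a sample from the joint distribution that Class 4 methods target directly.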

Contributors

Class I: Protein Sequence design ("Fixed-backbone")

Methods in this class attempt to solve the classical protein design problem: find an optimal sequence that adopts a predetermined 3D structure.

| Name | Architecture | Number of Parameters | User Input | Output | Training Dataset | Paper | Code | Release Month/Year |
|---|---|---|---|---|---|---|---|---|
| SPIN2 | FNN | ~105k | 3D structure | sequence | 1,532 X-ray structures | Paper | Code used to be here - no longer available | 2018/02 |
| SPROF | CNN-LSTM | - | 3D structure | sequence | 1,532 X-ray structures | Paper | Code, Web Server | 2019/08 |
| Ingraham et al. | modified Transformer | >3k | | sequence | CATH 4.2 40% sequences/structures | Paper | Code | 2019/12 |
| ProDCoNN | CNN | >28k | 3D structure | sequence | Two datasets: ID90TR: 17,044; ID30TR: 9,135 sequences/PDB pairs | Paper | Reimplementation | 2019/12 |
| Anand et al. | CNN | - | 3D structure | Amino acid and side chain conformation | 53,414 CATH domain structures | Paper | Code | 2020/01 |
| DenseCPD | CNN | 3M | 3D structure | sequence | 11,227 X-ray structures | Paper | Web server, Reimplementation | 2020/01 |
| ProteinSolver | GNN | - | 3D structure | sequence | 72,464,122 sequences/adjacency matrices pairs | Paper | Code | 2020/03 |
| Norn et al. | CNN | N/A | distances, angles, and dihedrals for every pair of residues (trRosetta) | sequence | N/A | Paper | Code | 2020/07 |
| GVP-GNN | GVP | - | 3D structure | sequence | CATH 4.2 40% sequences/structures | Paper | Code | 2020/09 |
| Fold2Seq | modified Transformer | - | 3D structure | sequence | 45,995 3D structures from CATH 4.2 filtered @ 100% | Paper | Code | 2021/06 |
| CNN_protein_landscape | CNN | >10M | 3D structure | sequence | 16,569 PDB chains | Paper | Code | 2021/08 |
| Orellana et al. | GCN | - | 3D structures | sequence | CATH 4.2 40% sequences/structures | Paper | - | 2021/11 |
| ABACUS-R | Transformer | 152M | 3D structures | sequence | CATH 4.2 | Paper | Code | 2022/02 |
| ESM-IF1 | GVP-Transformer | 142M | 3D structure | sequence | 16k X-ray structures + 1.2M AF2 predictions | Paper | Code | 2022/04 |
| TERMinator | GNN | - | 3D structures | sequences | CATH 4.2 40% sequences/structures | Paper | - | 2022/04 |
| McPartlon et al. | modified Transformer | - | 3D structures | sequences | 37k X-ray structures from BC40 | Paper | - | 2022/04 |
| MIF | Structured GNN | 6.8M | 3D structure | sequence | | Paper | Code | 2022/05 |
| ProteinMPNN | MPNN | 1.8M | 3D structure | sequence | CATH 4.2 40% sequences/structures | Paper | Code, Web Interface | 2022/06 |
| ProDESIGN-LE | Transformer + FNN | - | 3D structure | sequence | 5,867,488 residues from PDB40 | Paper | - | 2022/07 |
| TIMED | CNN | 3M | 3D structure | sequence | 32k structures from the PISCES server | Paper | Code | 2022/08 |
| PiFold | GNN | - | 3D structure | sequence | - | Paper | | 2022/09 |
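
To make p(sequence|structure) concrete, here is a minimal sketch of the final sampling step of a fixed-backbone designer. It assumes a model has already turned the input structure into one score (logit) per amino acid at every position. The function names, the toy logits, and the choice to sample each position independently are all assumptions made for illustration; several methods in the table instead choose residues one at a time, conditioning on earlier choices.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; a lower temperature sharpens them."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_sequence(per_position_logits, temperature=1.0, rng=None):
    """Draw one residue per position from p(residue | structure)."""
    rng = rng or random.Random()
    residues = []
    for logits in per_position_logits:
        probs = softmax(logits, temperature)
        residues.append(rng.choices(AMINO_ACIDS, weights=probs, k=1)[0])
    return "".join(residues)


# Hypothetical logits for a 3-residue backbone; a real model would compute
# these from the input structure.
toy_logits = [[0.0] * len(AMINO_ACIDS) for _ in range(3)]
for position, residue in enumerate("MKL"):
    toy_logits[position][AMINO_ACIDS.index(residue)] = 5.0

designed = sample_sequence(toy_logits, temperature=0.1, rng=random.Random(0))
print(designed)
```

Raising the temperature makes the sampled sequences more diverse, while lowering it concentrates sampling on the highest-scoring residue at each position.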

Class II: Structure generation

Methods in this class generate structures either unconditionally or conditioned on a set of secondary-structure constraints.

| Name | Architecture | Number of Parameters | User Input | Output | Training Dataset | Paper | Code | Release Month/Year |
|---|---|---|---|---|---|---|---|---|
| 64GAN | GAN | - | - | contact map (3D structure via ADMM) | 427,659 contact maps | Paper | - | 2018/12 |
| Anand et al. | GAN | - | - | distance map (3D structure via CNN) | 800,000 distance maps | Paper | | 2019/03 |
| RamaNet | LSTM | >2k | - | A sequence of φ and ψ angles | 607 helical structures | Paper | Code | 2019/06 |
| DECO-VAE | VAE | - | Structures represented as graphs | contact graph (translatable to contact map) | >650,000 contact graphs | Paper | Upon request | 2020/04 |
| SCUBA | NC-NN | ~20k | secondary structure motif | backbone | 12,465 structures | Paper | Code | 2022/02 |
| Ig-VAE | VAE | - | - | protein backbone coordinates | 10,768 individual immunoglobulin domains | Paper | Code | 2022/02 |
| GENESIS | VAE | - | secondary structure motif | contact map | 40,726 backbones with remodeled loops | Paper | - | 2022/03 |
| ProtDiff & SMCDiff | EGNN | - | Optional: structural motif | coordinates | 4,269 PDB structures | Paper | - | 2022/06 |
| Lai et al. | VAE | - | topology | protein backbone coordinates | CATH 4.2 40% sequences/structures | Paper | - | 2022/07 |
| ProteinSGM | SDE + RefineNet | - | optional: masked matrices | matrices describing distance and torsional angles | 10,361 CATH 4.3 95% structures | Paper | - | 2022/07 |
| FoldingDiff | Transformer | - | - | internal angles | CATH 4.2 40% structures | Paper | Code | 2022/09 |
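
Several methods in this class output contact maps or distance maps rather than coordinates. As a rough illustration of how these representations relate, the sketch below derives both maps from a list of per-residue coordinates. The 8 Å cutoff and the toy backbone are assumptions chosen for the example; individual methods may use different atoms, thresholds, or map definitions.

```python
import math


def distance_map(coords):
    """Pairwise Euclidean distances between per-residue coordinates."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def contact_map(coords, cutoff=8.0):
    """Binary map: 1 where two residues lie within `cutoff` angstroms."""
    return [[1 if d <= cutoff else 0 for d in row] for row in distance_map(coords)]


# Hypothetical trace: four residues spaced 3.8 angstroms apart on a line.
toy_backbone = [(3.8 * i, 0.0, 0.0) for i in range(4)]
contacts = contact_map(toy_backbone)
for row in contacts:
    print(row)
```

Going in the opposite direction, from a map back to 3D coordinates, is the harder step; the table notes that 64GAN and Anand et al. handle it with ADMM and a CNN, respectively.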

Class III: Sequence generation

Methods in this class generate sequences, usually with autoregressive language models, and can sometimes be conditioned (for example, on a starting sequence or a function label).

| Name | Architecture | Number of Parameters | User Input | Output | Training Dataset | Paper | Code | Release Month/Year |
|---|---|---|---|---|---|---|---|---|
| ProteinGAN | GAN | 60M | | sequence | 16,706 MDH sequences | Paper | Code | 2019/10 |
| ProGen | Transformer | 1.2B | Optional: sequence or function | sequence | 280M sequences | Paper | | 2020/03 |
| ProtXLnet | Transformer | 409M | Optional: sequence | sequence | UniRef100 | Paper | Code | 2020/07 |
| ProtXL | Transformer | 562M | Optional: sequence | sequence | BFD100 | Paper | | 2020/07 |
| ProtElectra-Generator | Transformer | 420M | Optional: sequence | sequence | UniRef100 | Paper | Code | 2020/07 |
| ProtT5 | Transformer | 11B | Optional: sequence | sequence | BFD100 | Paper | Code | 2020/07 |
| EVE | VAE | | MSA | Sequence | 3,219 MSAs extracted from UniRef100 | Paper | Code | 2020/12 |
| DARK3 | Transformer | 110M | Optional: sequence | sequence | 615,000 synthetic sequences | Paper | - | 2022/01 |
| ReLSO | Modified transformer | 110M | sequence | sequence and predicted value for label | directed evolution datasets | Paper | Code | 2022/02 |
| ProtGPT2 | Transformer | 739M | Optional: sequence | sequence | UniRef50 | Paper | Code | 2022/03 |
| RITA | Transformer | 1.2B | Optional: sequence | sequence | UniRef100 | Paper | Code | 2022/05 |
| Tranception | Transformer | 700M | Optional: sequence | sequence | UniRef100 | Paper | Code | 2022/05 |
| ProGen2 | Transformer | 6.4B | Optional: sequence | sequence | UniRef90+BFD30 | Paper | Code | 2022/06 |
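
The "Optional: sequence" input in many rows reflects how autoregressive models are typically used: the sequence is extended one residue at a time, optionally starting from a user-supplied prefix. The sketch below shows that loop with a toy stand-in for the trained model. `toy_next_token_probs`, the end token, and the length-dependent stop probability are all hypothetical and exist only to make the example runnable.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
END = "$"  # hypothetical end-of-sequence token


def toy_next_token_probs(prefix):
    """Stand-in for a trained language model: returns p(next token | prefix).

    This toy is uniform over residues, with a stop probability that grows
    with the prefix length and reaches 1 at 50 residues.
    """
    p_end = min(1.0, len(prefix) / 50)
    p_residue = (1.0 - p_end) / len(AMINO_ACIDS)
    probs = {aa: p_residue for aa in AMINO_ACIDS}
    probs[END] = p_end
    return probs


def generate(prompt="", max_length=100, rng=None, next_token_probs=toy_next_token_probs):
    """Extend `prompt` one token at a time until END is drawn or max_length is hit."""
    rng = rng or random.Random()
    sequence = prompt
    while len(sequence) < max_length:
        probs = next_token_probs(sequence)
        tokens = list(probs)
        token = rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]
        if token == END:
            break
        sequence += token
    return sequence


# Unconditional generation uses an empty prompt; here we condition on "MK".
generated = generate(prompt="MK", rng=random.Random(0))
print(generated)
```

Swapping `toy_next_token_probs` for a call into a real model is where the listed methods differ; the surrounding loop stays conceptually the same.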

Class IV: Sequence and structure design

Methods in this class generate sequences and structures concomitantly. They include hallucination methods and constrained generation (inpainting).

| Name | Architecture | Number of Parameters | User Input | Output | Training Dataset | Paper | Code | Release Month/Year |
|---|---|---|---|---|---|---|---|---|
| Hallucination | CNN (trRosetta) | N/A | random sequence | sequence/structure | N/A | Paper | Code | 2020/07 |
| Constrained hallucination | CNN (trRosetta) | N/A | sequence/structure | sequence/structure | N/A | Paper | Code | 2020/11 |
| Constrained hallucination2 | CNN (RoseTTAFold) | N/A | sequence/structure | sequence/structure | N/A | Paper | Code | 2021/11 |
| RFjoint | CNN (RoseTTAFold, finetuned) | N/A | sequence/structure | sequence/structure | Finetuned with 25% PDB version 02/2020 + 75% AF2 structures | Paper | Code | 2021/11 |
| Protein Diffusion | Diffusion model | - | Secondary structure motif sketches | sequence/structure | 53,414 3D structures (95% CATH 4.2 S95) | Paper | Code | 2022/05 |
| Roney | AlphaFold2 | N/A | random sequence | sequence/structure | N/A | Paper | Code | 2022/06 |
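
The hallucination rows list a random sequence as input. One way to picture such methods is as an iterative search that mutates the sequence and keeps changes that a structure predictor scores favorably. The sketch below is a minimal illustration under strong assumptions: it uses a Metropolis-style acceptance rule and a hypothetical `toy_score` in place of a real network such as trRosetta or AlphaFold2. The methods listed above may use different objectives and optimizers (for example, gradient-based ones).

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(sequence):
    """Hypothetical stand-in for a structure predictor's confidence.

    Here it is simply the fraction of residues drawn from the set "AELK".
    """
    return sum(residue in "AELK" for residue in sequence) / len(sequence)


def hallucinate(length=30, steps=2000, temperature=0.01, rng=None, score=toy_score):
    """Metropolis-style search starting from a random sequence."""
    rng = rng or random.Random()
    current = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
    current_score = score(current)
    best, best_score = current, current_score
    for _ in range(steps):
        # Propose a single-residue mutation at a random position.
        position = rng.randrange(length)
        candidate = current[:position] + rng.choice(AMINO_ACIDS) + current[position + 1:]
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse moves with small probability.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score


designed, final_score = hallucinate(rng=random.Random(0))
print(designed, final_score)
```

In a real setting the scoring function would also return the predicted structure, which is why these methods yield a sequence and a structure together.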
