Therapeutic monoclonal antibodies are among the most effective treatments available today for chronic inflammatory diseases such as Crohn's disease, lupus and multiple sclerosis. In these diseases, monoclonal antibodies can target and neutralize certain proteins involved in the pathology. In cancer, they can also be used to limit the supply of factors essential to tumor growth or to disrupt the tumor microenvironment. Monoclonal antibody-based serotherapy can also compensate for treatment shortfalls during fast-moving epidemics in which the pathogens have a high mutation rate, such as COVID-19.
Although monoclonal antibodies are promising and represent a major product class on the pharmaceutical market, only around thirty are currently available for chronic inflammatory diseases, and around ten for the treatment of cancer. This limited coverage stems from the many difficulties inherent in the in vitro and in silico design of these therapeutic molecules. Antibody design and optimization remain a real challenge, not least because the molecules must be effective, target-specific and deliverable to the organs being treated. Long and costly development times add to these difficulties.
To accelerate the development of therapeutic antibodies, in silico methods have been developed to shorten modeling times for these molecules while exploring the design possibilities more exhaustively. Although advantageous, these methods currently rely essentially on estimating the affinity between the antibody and its target by calculating the binding energy, which remains difficult to estimate computationally and extremely time-consuming to measure experimentally.
Two data sets are available: one covering multiple species, from SAbDab, and one covering COVID, from CoV-AbDab [https://opig.stats.ox.ac.uk/webapps/covabdab/].
All of the data mentioned above are freely accessible.
The data used here are part of SAbDab. They relate to immune complexes characterized through X-ray crystallography.
Two files have been collected:
-
All_PDB_files.txt: contains the ids of each protein-protein (antigen-antibody) structure. The first four characters of an id are the RCSB PDB entry code; the second part of the id lists the chains within the structure.
-
Positive_samples.txt: contains all pairs of positively interacting proteins, from the same or from different complexes. For instance,
4gms_J_N_E 2vir_B_A_C
describes an interaction between 4gms and 2vir. The order within a pair carries no meaning: either partner can act as the antigen or as the antibody, and the relation is reciprocal. Indeed, since both ids denote complexes, the antigenic chains of 4gms can form immune complexes with the antibody chains of 2vir, and the antigenic chains of 2vir can form immune complexes with the antibody chains of 4gms. The antigenic and antibody chains of a single RCSB id naturally form a complex as well.
Consequently, any two complex ids that are not paired in this file are considered unable to form complexes and serve as negative samples.
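The pairing rule above can be sketched in Python as follows. This is only an illustration, not code from this repository: the function names `parse_positive_pairs` and `negative_pairs` are hypothetical, and the sketch assumes that each line of Positive_samples.txt holds exactly two whitespace-separated ids.

```python
from itertools import combinations


def parse_positive_pairs(lines):
    """Return a set of unordered positive pairs of complex ids.

    Assumes each non-empty line holds two whitespace-separated ids,
    e.g. "4gms_J_N_E 2vir_B_A_C". The order is ignored (frozenset).
    """
    pairs = set()
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        pairs.add(frozenset(fields))
    return pairs


def negative_pairs(all_ids, positives):
    """Return every unordered pair of distinct ids not listed as positive."""
    return [
        (a, b)
        for a, b in combinations(sorted(set(all_ids)), 2)
        if frozenset((a, b)) not in positives
    ]


# Toy input: the first pair comes from the text above,
# "xxxx_A_B_C" is a made-up placeholder id.
example_lines = ["4gms_J_N_E 2vir_B_A_C"]
example_ids = ["4gms_J_N_E", "2vir_B_A_C", "xxxx_A_B_C"]

positives = parse_positive_pairs(example_lines)
negatives = negative_pairs(example_ids, positives)
```

Storing each pair as a `frozenset` reflects the fact that the order of the two partners does not matter. On the full data set the number of unpaired combinations grows quadratically, so negatives would probably need to be subsampled.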
All the structures are collected directly from SAbDab as FASTA files that look like this:
>1A2Y_1|Chain A|IGG1-KAPPA D1.3 FV (LIGHT CHAIN)|Mus musculus (10090)
DIVLTQSPASLSASVGETVTITCRASGNIHNYLAWYQQKQGKSPQLLVYYTTTLADGVPSRFSGSGSGTQYSLKINSLQPEDFGSYYCQHFWSTPRTFGGGTKLEIK
>1A2Y_2|Chain B|IGG1-KAPPA D1.3 FV (HEAVY CHAIN)|Mus musculus (10090)
QVQLQESGPGLVAPSQSLSITCTVSGFSLTGYGVNWVRQPPGKGLEWLGMIWGDGNTDYNSALKSRLSISKDNSKSQVFLKMNSLHTDDTARYYCARERDYRLDYWGQGTTLTVSS
>1A2Y_3|Chain C|LYSOZYME|Gallus gallus (9031)
KVFGRCELAAAMKRHGLANYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDVQAWIRGCRL
Header lines start with > and the following line gives the chain's sequence of residues (amino acids).
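A minimal sketch of how such records could be read in Python is given below. The function name `parse_fasta` is hypothetical, and splitting the header into four `|`-separated fields is an assumption based on the example above; other headers may be laid out differently.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header_fields, sequence) tuples.

    header_fields is the header line without ">" split on "|".
    Sequences spread over several lines are concatenated.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].split("|"), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>1A2Y_3|Chain C|LYSOZYME|Gallus gallus (9031)
KVFGRCELAAAMKRHGLANYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDVQAWIRGCRL
"""

records = parse_fasta(example)
fields, sequence = records[0]
# Assumes exactly four header fields, as in the example above.
entity_id, chain, description, species = fields
```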
The data used here are part of CoV-AbDab. They relate to immune complexes.
Three files are available:
-
positive dataset.txt: a three-column file giving the SARS-CoV identifier, the antibody sequence and the antigen sequence. These chains can form immune complexes.
-
negative dataset.txt: a three-column file giving the SARS-CoV identifier, the antibody sequence and the antigen sequence. These chains do not form immune complexes.
-
Independant test.txt :
For convenience, some scripts have been written to parse the SAbDab database, collect the FASTA files of all complexes, and build the interaction and sequence tables.
To get all the sequences, run ./scripts/download_pdb.py:
python download_pdb.py
All FASTA files will be saved in the ./data/SabDab/fasta folder.
To get the interaction table, run:
python get_interaction_table.py
This will create data.csv, a three-column tabular file containing the antibody identifier, the antigen identifier and 0 or 1 depending on the ability to form an immune complex. Identifiers are suffixed with |ag or |ab to denote the antigenic or antibody part of the complex.
The data look like this:
ab;ag;interaction
5kel|ab;5kel|ag;1
5kel|ab;6cwt|ag;0
...
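As an illustration, the interaction table could be loaded with Python's standard `csv` module, using `;` as the delimiter shown above. The function name `read_interactions` is hypothetical and the sample simply reuses the two rows from the excerpt.

```python
import csv
import io


def read_interactions(handle):
    """Read (ab, ag, label) triples from a ';'-separated interaction table."""
    reader = csv.DictReader(handle, delimiter=";")
    return [(row["ab"], row["ag"], int(row["interaction"])) for row in reader]


# In-memory sample reusing the rows shown above.
sample = io.StringIO(
    "ab;ag;interaction\n"
    "5kel|ab;5kel|ag;1\n"
    "5kel|ab;6cwt|ag;0\n"
)

rows = read_interactions(sample)
n_positive = sum(label for _, _, label in rows)
```

For the real file, the same function can be called on `open("data.csv", newline="")`, assuming the file sits in the current directory.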
To get the sequences table, run:
python get_seq_table.py
This will create sequences.csv, a three-column tabular file containing the extended identifier (such as abcd|ag or abcd|ab), the species it comes from and the chain of residues.
seq_id;specie;sequence
5kel|ag;Zaire ebolavirus (strain Mayinga-76) (128952);IPLGVIHNSTLQVSDVDKLVCRDKLSSTNQLRSVGLNLEGNGVATDVPSATKRWGFRSGVPPKVVNYEAGEWAENCYNLEIKKPDGSECLPAAPDGIRGFPRCRYVHKVSGTGPCAGDFAFHKEGAFFLYDRLASTVIYRGTTFAEGVVAFLILPQAKKDFFSSHPLREPVNATEDPSSGYYSTTIRYQATGFGTNETEYLFEVDNLTYVQLESRFTPQFLLQLNETIYTSGKRSNTTGKLIWKVNPEIDTTIGEWAFWETKKNLTRKIRSEELSFTVVSNGAKNISGQSPARTSSDPGTNTTTEDHKIMASENSSAMVQVHSQGREAAVSHLTTLATISTSPQSLTTKPGPDNSTHNTPVYKLDISEATQVEQHHRRTDNDSTASDTPSATTAAGPPKAENTNTSKSTDFLDPATTTSPQNHSETAGNNNTHHQDTGEESASSGKLGLITNTIAGVAGLITGGRRTRR
5kel|ag;Zaire ebolavirus (128952);EAIVNAQPKCNPNLHYWTTQDEGAAIGLAWIPYFGPAAEGIYTEGLMHNQDGLICGLRQLANETTQALQLFLRATTELRTFSILNRKAIDFLLQRWGGTCHILGPDCCIEPHDWTKNITDKIDQIIHDFVDKTLPDLEVDDDD
...
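Note that in the excerpt above the same identifier (5kel|ag) appears on two rows, one per chain. A small sketch of how the table could be loaded while keeping all chains of an identifier together is shown below; `load_sequences` is a hypothetical name, and the sequences in the sample are truncated to their first ten residues for brevity.

```python
import csv
import io
from collections import defaultdict


def load_sequences(handle):
    """Map each seq_id to the list of chain sequences found for it."""
    table = defaultdict(list)
    for row in csv.DictReader(handle, delimiter=";"):
        table[row["seq_id"]].append(row["sequence"])
    return dict(table)


# Sample rows from the excerpt above, sequences truncated for brevity.
sample = io.StringIO(
    "seq_id;specie;sequence\n"
    "5kel|ag;Zaire ebolavirus (strain Mayinga-76) (128952);IPLGVIHNST\n"
    "5kel|ag;Zaire ebolavirus (128952);EAIVNAQPKC\n"
)

sequences = load_sequences(sample)
```

Whether the chains of one identifier should then be concatenated or encoded separately is a modelling choice that this sketch leaves open.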
-
Chaos Game Representation
-
One Hot
https://github.com/anazhmetdin/protEncoder/tree/main/protencoder
-
k-mers
https://github.com/anazhmetdin/protEncoder/tree/main/protencoder
-
Prot-vec
https://github.com/anazhmetdin/protEncoder/tree/main/protencoder
-
Prot encoder
https://github.com/anazhmetdin/protEncoder/tree/main/protencoder
https://github.com/sebgra/Tensorflow_Advanced_Specialization/blob/main/C1/week_1/C1_W1_Lab_3_siamese-network.ipynb
-
Siamese Network
-
Double Channel Siamese Network
https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-08772-6
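Two of the encodings listed above, one-hot and k-mers, can be illustrated with a few lines of plain Python. This is a generic sketch and not the API of the protEncoder package linked above: the names `one_hot` and `kmer_counts` are hypothetical, and mapping non-standard residues to all-zero rows is an assumption made here.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Encode a sequence as a list of 20-dimensional 0/1 vectors.

    Non-standard residues (e.g. X) become all-zero rows (an assumption).
    """
    rows = []
    for residue in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        if residue in INDEX:
            row[INDEX[residue]] = 1
        rows.append(row)
    return rows


def kmer_counts(sequence, k=3):
    """Count overlapping k-mers of length k in a sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Toy inputs chosen for illustration only.
encoded = one_hot("DIV")
counts = kmer_counts("DIVLTQDIV", k=3)
```

One-hot encoding yields a matrix whose length depends on the sequence, so sequences usually need padding or truncation before being fed to a network, whereas k-mer counts give a representation whose size does not depend on sequence length.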
All the data that can be used for the challenge can be found on SAbDab.
The search module is used to access the data.
To get data containing both antibody and protein antigen sequences with affinity, use this - 468 entries.
To get data containing both antibody and protein antigen sequences without affinity, use this - 5092 entries.
To get data containing both antibody and not necessarily protein antigen sequences with affinity, use this - 737 entries.
To get data containing both antibody and not necessarily protein antigen sequences without affinity, use this - 7825 entries.
More criteria can be applied to select data from here
Backup data can be found here
Siamese Network https://www.frontiersin.org/journals/immunology/articles/10.3389/fimmu.2022.1053617/full#h6x
https://github.com/emersON106/AbAgIntPre/tree/main
- Get this data to have all the useful PDBs, then collect all the corresponding FASTA files.
- Parse all the FASTA files, grepping for "heavy chain", "light chain", "antibody" and "antigen", to create a dataset of sequences for both Ag and Ab.
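The keyword-based step in the last item could look roughly like the sketch below. The function name `classify_header` and the "unknown" fallback are assumptions made for illustration; the keywords are those listed in the item, matched case-insensitively.

```python
AB_KEYWORDS = ("heavy chain", "light chain", "antibody")
AG_KEYWORDS = ("antigen",)


def classify_header(header):
    """Label a FASTA header as "ab", "ag" or "unknown" from its keywords.

    Antibody keywords are checked first; headers matching no keyword
    are returned as "unknown" rather than guessed.
    """
    text = header.lower()
    if any(keyword in text for keyword in AB_KEYWORDS):
        return "ab"
    if any(keyword in text for keyword in AG_KEYWORDS):
        return "ag"
    return "unknown"


light = classify_header(
    ">1A2Y_1|Chain A|IGG1-KAPPA D1.3 FV (LIGHT CHAIN)|Mus musculus (10090)"
)
lysozyme = classify_header(">1A2Y_3|Chain C|LYSOZYME|Gallus gallus (9031)")
```

Headers such as the lysozyme chain carry none of the keywords and come back as "unknown", so keyword matching alone would probably not be enough to label every antigen chain.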