Extension of ModelCIF for AF3 quality estimates #21

Closed
gtauriello opened this issue Sep 26, 2024 · 9 comments

@gtauriello

gtauriello commented Sep 26, 2024

Related to #20 and the issues mentioned there, I suggest extending ModelCIF to capture all new types of quality estimates introduced with AlphaFold 3 (AF3). I also looked at RoseTTAFold-AllAtom, and the suggestions below would capture everything needed there as well. I believe this should also cover what is needed for chaidiscovery/chai-lab#52. Here are my suggested additions:

  1. Extend _ma_qa_metric.type to include:
    • "pLDDT to polymer" with detailed description "confidence score predicting accuracy according to lDDT with distances from each atom to CA or C1' of nearby polymer residues in [0,100]"
    • "boolean" with detailed description "0 or 1 depending on whether a check passed (1) or not (0)."
  2. Extend _ma_qa_metric.mode to include "per-chain", "per-chain-pairwise", "per-atom" and "per-atom-pairwise" (and yes I know it's a bit unfortunate that we used "local" for "per-residue" but ok...)
  3. New _ma_qa_metric_per_chain same as _ma_qa_metric_local but without label_comp_id and label_seq_id
  4. New _ma_qa_metric_per_chain_pairwise same as _ma_qa_metric_local_pairwise but without label_comp_id* and label_seq_id*
  5. New _ma_qa_metric_per_atom same as _ma_qa_metric_local but using atom_id (linked to _atom_site.id) instead of model_id and label_*
  6. New _ma_qa_metric_per_atom_pairwise same as _ma_qa_metric_local_pairwise but using atom_id_1 and atom_id_2 (linked to _atom_site.id) instead of model_id and label_*

Concretely for AF3 output (e.g. looking at the JSON files in one of their examples) here is how each of the scores would map to a _ma_qa_metric.mode and .type:

  • fraction_disordered: "global", "normalized score"
  • has_clash: "global", "boolean"
  • iptm: "global", "ipTM"
  • ptm: "global", "pTM"
  • ranking_score: "global", "normalized score"
  • chain_ptm: "per-chain", "pTM"
  • chain_iptm: "per-chain", "ipTM"
  • chain_pair_iptm: "per-chain-pairwise", "ipTM"
  • chain_pair_pae_min: "per-chain-pairwise", "PAE"
  • atom_plddts: "per-atom", "pLDDT to polymer"
  • contact_probs: "per-atom-pairwise", "contact probability"
  • pae: "per-atom-pairwise", "PAE"
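
The mapping above can be written down as a small lookup table. This is a minimal sketch that only restates the list; the names `AF3_SCORE_MAP` and `scores_by_mode` are hypothetical and not part of ModelCIF or AF3.

```python
# Sketch: AF3 score name -> (proposed _ma_qa_metric.mode, _ma_qa_metric.type).
# The entries restate the list above; AF3_SCORE_MAP and scores_by_mode are
# illustrative names only.
AF3_SCORE_MAP = {
    "fraction_disordered": ("global", "normalized score"),
    "has_clash": ("global", "boolean"),
    "iptm": ("global", "ipTM"),
    "ptm": ("global", "pTM"),
    "ranking_score": ("global", "normalized score"),
    "chain_ptm": ("per-chain", "pTM"),
    "chain_iptm": ("per-chain", "ipTM"),
    "chain_pair_iptm": ("per-chain-pairwise", "ipTM"),
    "chain_pair_pae_min": ("per-chain-pairwise", "PAE"),
    "atom_plddts": ("per-atom", "pLDDT to polymer"),
    "contact_probs": ("per-atom-pairwise", "contact probability"),
    "pae": ("per-atom-pairwise", "PAE"),
}


def scores_by_mode(mode):
    """Return the AF3 score names that map to the given mode, sorted by name."""
    return sorted(name for name, (m, _) in AF3_SCORE_MAP.items() if m == mode)


if __name__ == "__main__":
    for mode in ("global", "per-chain", "per-chain-pairwise"):
        print(mode, scores_by_mode(mode))
```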

Some caveats to consider:

  • contact_probs and pae above are defined per "token" pair, where a token is either a full residue (for standard amino and nucleic acids) or a single atom otherwise. In AF3, the per-residue tokens have a well defined "token centre atom" (CA for standard amino acids, C1' for standard nucleotides) which could be used in per-atom scores but this may be confusing.
  • The "per-chain" scores also apply to non-polymers, so the name may be confusing. Technically "per-asym-id" is more correct, although it may only be understandable to mmCIF experts.
  • For future applications in physics-based docking tools, we need to make sure that local scores can identify individual water molecules. In PDB entries, waters all share a label_asym_id and have no label_seq_id; one fix would be to give each water a separate label_asym_id in ModelCIF.

Alternative to the above (which simplifies some things and handles the per token scores):

  • Extend _ma_qa_metric_local and _ma_qa_metric_local_pairwise to include label_atom_id (linked to _atom_site.label_atom_id) which can be set to '.' for per-residue scores.
  • One could also handle per-chain scores by allowing label_comp_id and label_seq_id to be set to '.'.
  • With appropriate updates to the category and item descriptions, all types of local scores could be handled by the _ma_qa_metric_local and _ma_qa_metric_local_pairwise tables and no additional tables or _ma_qa_metric.mode values would be necessary.

@brindakv what are your thoughts on this?

@gtauriello
Author

gtauriello commented Oct 18, 2024

Notes from discussions with @benmwebb , @brindakv and @aozalevsky (on Oct. 16):

  • Not good to add link to _atom_site.label_atom_id to _ma_qa_metric_local and _ma_qa_metric_local_pairwise as it overloads the tables and still doesn't enable clean handling of non-polymers (which is critical for AF3)
  • Alternative discarded suggestion was to link to _atom_site.id with a flag for granularity (atom, residue or chain). Pro: easy to use and look at. Con: cannot generalize to other features (e.g. residue ranges, domains, ...) and ambiguous on how to define (e.g. which atom to pick).
  • Preferred solution is to use features as in IHM's _ihm_feature_list

Example AF3 output (cut to only include one model instead of 5): fold_test_fold_job_number_one_cut.zip. Info on content:

  • fold_test_fold_job_number_one_job_request.json is input to AF3 (can be uploaded to the AF-Server)
  • fold_test_fold_job_number_one_model_0.cif is a (not 100% compliant) ModelCIF file. Note that copies of the same molecule (HEM, MG, and NA in this example) are handled with multiple identical molecular entities (instead of a single entity with multiple instances).
  • fold_test_fold_job_number_one_summary_confidences_0.json contains global, per-chain and per-chain-pair scores (see "Summary outputs" in AF-server-FAQ). Note that some values can be "null".
  • fold_test_fold_job_number_one_full_data_0.json contains the per-atom pLDDT and per-token-pair PAE and contact probabilities (see "Full array outputs" in AF-server-FAQ). Tokens are either a full residue (for standard amino and nucleic acids) or a single atom otherwise. Order of values is implicit according to order in atom_site of .cif file.
  • Chains in the model:
    • A: polymer (polypeptide; seq: "PREACHINGS"), residues 1 and 5 modified (HY3, P1L)
    • B: polymer (polypeptide; seq: "REACHER")
    • C: non-polymer (ATP)
    • D: non-polymer (HEM)
    • E: non-polymer (HEM)
    • F: non-polymer (MG)
    • G: non-polymer (MG)
    • H: non-polymer (NA)
    • I: non-polymer (NA)
    • J: non-polymer (NA)
    • K: polymer (polydeoxyribonucleotide; seq: "GATTACA"), residues 1 and 2 modified (6OG, 6MA)
    • L: polymer (polydeoxyribonucleotide; seq: "TGTAATC")
    • M: polymer (polyribonucleotide; seq: "GUAC"), residues 1 and 4 modified (2MG, 5MC)
    • N: branched (NAG-NAG-BMA)
    • O: branched (BMA)
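
The token definition used for the PAE and contact-probability arrays (one token per standard residue, one token per atom otherwise) can be sketched as below. This is an illustration under stated assumptions, not AF3's implementation: the `Atom` record and the `standard` flag are hypothetical, and deciding which residues count as standard is left to the caller.

```python
from typing import NamedTuple, Optional


class Atom(NamedTuple):
    atom_id: int            # would correspond to _atom_site.id
    asym_id: str            # would correspond to _atom_site.label_asym_id
    seq_id: Optional[int]   # _atom_site.label_seq_id, None for non-polymers
    standard: bool          # assumed flag: standard amino/nucleic acid residue


def tokenize(atoms):
    """Group atom ids into tokens, in order of first appearance.

    Atoms of a standard residue share one token; every other atom
    (modified residue, ligand, ion, glycan) becomes its own token.
    """
    tokens = []
    residue_index = {}
    for atom in atoms:
        if atom.standard:
            key = (atom.asym_id, atom.seq_id)
            if key not in residue_index:
                residue_index[key] = len(tokens)
                tokens.append([])
            tokens[residue_index[key]].append(atom.atom_id)
        else:
            tokens.append([atom.atom_id])
    return tokens


if __name__ == "__main__":
    toy_atoms = [
        Atom(1, "A", 2, True),      # two atoms of a standard residue
        Atom(2, "A", 2, True),
        Atom(3, "A", 5, False),     # two atoms of a modified residue
        Atom(4, "A", 5, False),
        Atom(5, "F", None, False),  # a single-atom ion
    ]
    print(tokenize(toy_atoms))
```

With n tokens, the per-token-pair arrays are n × n, while the per-atom pLDDT array has one value per atom.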

Suggested ModelCIF extension:

  • Extend _ma_qa_metric.type as in first comment
  • Extend _ma_qa_metric.mode to include "per-feature" and "per-feature-pair"
  • New _ma_feature_list exactly like _ihm_feature_list, except that "branched" is added to entity_type and that feature_type uses the following controlled vocabulary:
    • atom: "feature is an atom or a set of atoms for any entity type"
    • residue: "feature is a residue or a set of residues from a polymeric entity"
    • asym_id: "feature is an instance of a molecular entity"
  • New _ma_atom_feature category:
    • Description: "Data items in this category provide the definitions required to select specific atoms independently of entity type."
    • Items:
      • ordinal_id (key): "A unique identifier for the category."
      • feature_id (mandatory): "An identifier for the selected feature. This data item is a pointer to _ma_feature_list.feature_id in the MA_FEATURE_LIST category."
      • atom_id (mandatory): "The identifier of the atom. This data item is a pointer to _atom_site.id in the ATOM_SITE category."
  • New _ma_poly_residue_feature category:
    • Description: "Data items in this category provide the definitions required to select specific polymer residues."
    • Items (similar to ma_qa_metric_local):
      • ordinal_id (key): "A unique identifier for the category."
      • feature_id (mandatory): "An identifier for the selected feature. This data item is a pointer to _ma_feature_list.feature_id in the MA_FEATURE_LIST category."
      • label_asym_id (mandatory): "The identifier for the asym id of the residue in the structural model. This data item is a pointer to _atom_site.label_asym_id in the ATOM_SITE category."
      • label_comp_id (mandatory): "The component identifier for the residue in the structural model. This data item is a pointer to _atom_site.label_comp_id in the ATOM_SITE category."
      • label_seq_id (mandatory): "The identifier for the sequence index of the residue in the structural model. This data item is a pointer to _atom_site.label_seq_id in the ATOM_SITE category."
  • New _ma_asym_id_feature category:
    • Description: "Data items in this category provide the definitions required to select specific instances of a molecular entity independently of entity type (e.g. a polymer chain or a copy of a non-polymer)."
    • Items (similar to _ma_poly_residue_feature):
      • ordinal_id (key): "A unique identifier for the category."
      • feature_id (mandatory): "An identifier for the selected feature. This data item is a pointer to _ma_feature_list.feature_id in the MA_FEATURE_LIST category."
      • label_asym_id (mandatory): "The identifier for the asym id of the entity instance in the structural model. This data item is a pointer to _atom_site.label_asym_id in the ATOM_SITE category."
  • New _ma_qa_metric_feature category (similar to ma_qa_metric_local):
    • Description: "Data items in this category capture QA metrics calculated per feature (as defined in _ma_feature_list)."
    • Items:
      • ordinal_id (key), metric_id, metric_value, model_id (all mandatory) exactly as in ma_qa_metric_local
      • feature_id (mandatory): "The identifier for the feature, for which QA metric is provided. This data item is a pointer to _ma_feature_list.feature_id in the MA_FEATURE_LIST category."
  • New _ma_qa_metric_feature_pairwise category (similar to ma_qa_metric_local_pairwise):
    • Description: "Data items in this category capture QA metrics calculated per pair of features (as defined in _ma_feature_list)."
    • Items:
      • ordinal_id (key), metric_id, metric_value, model_id (all mandatory) exactly as in ma_qa_metric_local_pairwise
      • feature_id_1 (mandatory): "The identifier for the first feature in the pair, for which QA metric is provided. This data item is a pointer to _ma_feature_list.feature_id in the MA_FEATURE_LIST category."
      • feature_id_2 (mandatory): "The identifier for the second feature in the pair, for which QA metric is provided. This data item is a pointer to _ma_feature_list.feature_id in the MA_FEATURE_LIST category."
  • Note: if it is preferred to use something else instead of "asym_id" in the category name and feature_type, that's also ok...
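
To make the proposed categories concrete, here is a rough sketch that prints what the loops might look like for three features from the example above (residue 2 of chain A, one atom of the MG in chain F, and the ATP instance in chain C). Everything here is hypothetical: the categories do not exist yet, the item order is a guess, the atom_id and metric_value are made up, and the `_ma_feature_list` items are assumed to mirror `_ihm_feature_list`. The helper `format_loop` is an illustrative name, not a library function.

```python
def format_loop(category, items, rows):
    """Render rows as a minimal mmCIF loop_ block.

    Simplification: values are joined with spaces and never quoted,
    so they must not contain whitespace.
    """
    lines = ["loop_"]
    lines += [f"{category}.{item}" for item in items]
    lines += [" ".join(str(value) for value in row) for row in rows]
    return "\n".join(lines)


feature_list = format_loop(
    "_ma_feature_list",
    ["feature_id", "feature_type", "entity_type"],
    [
        (1, "residue", "polymer"),
        (2, "atom", "non-polymer"),
        (3, "asym_id", "non-polymer"),
    ],
)
poly_residue_feature = format_loop(
    "_ma_poly_residue_feature",
    ["ordinal_id", "feature_id", "label_asym_id", "label_comp_id", "label_seq_id"],
    [(1, 1, "A", "ARG", 2)],
)
atom_feature = format_loop(
    "_ma_atom_feature",
    ["ordinal_id", "feature_id", "atom_id"],
    [(1, 2, 150)],  # atom_id 150 is a made-up _atom_site.id
)
asym_id_feature = format_loop(
    "_ma_asym_id_feature",
    ["ordinal_id", "feature_id", "label_asym_id"],
    [(1, 3, "C")],
)
qa_metric_feature = format_loop(
    "_ma_qa_metric_feature",
    ["ordinal_id", "model_id", "feature_id", "metric_id", "metric_value"],
    [(1, 1, 3, 1, 0.62)],  # made-up per-chain value for feature 3
)

if __name__ == "__main__":
    print("\n#\n".join([
        feature_list,
        poly_residue_feature,
        atom_feature,
        asym_id_feature,
        qa_metric_feature,
    ]))
```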

@aozalevsky

@gtauriello, I just wanted to follow up on this. With AF3 code and weights being released and with the recent addition of restraints to Chai-1, we can expect rapid growth in the number of deposited models. Would be nice to have the scores in those models.

@gtauriello
Author

I agree. @brindakv was waiting for me to decide on a separate issue that we wanted to address in the same ModelCIF update and now I added that here as issue #23 . Hence, I think that she can now do the updates according to the open issues here.

Afterwards, we can try to suggest changes in alphafold3/model/mmcif_metadata.py to include this (and check if other things are invalid in their files).

@brindakv
Contributor

@gtauriello please clarify my questions below.

  1. Do we need _ma_poly_residue_feature considering that ma_qa_metric_local sort of already handles it? The difference would be the ability to assign multiple residues to a feature. If this is a use case, then we can add it.
  2. Do we want ma_feature_list.feature_type to support contiguous residue ranges? If yes, then _ma_poly_residue_feature can have begin and end data items for seq_id and comp_id.
  3. What is the use case for ma_qa_metric.type = boolean? Should this be a separate data item elsewhere rather than an enumeration of ma_qa_metric.type?

@brindakv brindakv self-assigned this Nov 27, 2024
@gtauriello
Author

> 1. Do we need `_ma_poly_residue_feature` considering that `ma_qa_metric_local` sort of already handles it? The difference would be the ability to assign multiple residues to a feature. If this is a use case, then we can add it.

The main use case for it is to be able to handle pairs between an atom and a residue in ma_qa_metric_feature_pairwise (needed for AF3's PAE matrix). We would not be able to do it in any other way.
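
As an illustration of that use case, the sketch below flattens a token-pair matrix into rows shaped like the proposed _ma_qa_metric_feature_pairwise category, where each token already has a feature_id (residue features and atom features mixed). The function name `pairwise_rows`, the column order, and the toy values are assumptions made for this example. All n × n pairs are kept because PAE is not symmetric.

```python
def pairwise_rows(feature_ids, matrix, model_id=1, metric_id=1):
    """Flatten an n x n matrix into pairwise rows.

    Each row is (ordinal_id, model_id, feature_id_1, feature_id_2,
    metric_id, metric_value), with feature_ids[i] naming token i.
    """
    n = len(feature_ids)
    if len(matrix) != n or any(len(row) != n for row in matrix):
        raise ValueError("matrix must be n x n with n == len(feature_ids)")
    rows = []
    ordinal = 1
    for i, feature_1 in enumerate(feature_ids):
        for j, feature_2 in enumerate(feature_ids):
            rows.append(
                (ordinal, model_id, feature_1, feature_2, metric_id, matrix[i][j])
            )
            ordinal += 1
    return rows


if __name__ == "__main__":
    # Feature 1: a polymer residue; feature 2: a single ligand atom (toy values).
    toy_pae = [[0.5, 3.0],
               [4.0, 0.2]]
    for row in pairwise_rows([1, 2], toy_pae):
        print(*row)
```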

> 2. Do we want `ma_feature_list.feature_type` to support contiguous residue ranges? If yes, then `_ma_poly_residue_feature` can have begin and end data items for `seq_id` and `comp_id`.

This would make the main existing use case in AF3 more verbose than necessary (we need a feature for each polymer residue to handle the PAE matrix) while I currently do not have a use case for contiguous residue ranges. If we need those ranges in the future, I would prefer to have them in a separate table.

> 3. What is the use case for `ma_qa_metric.type` = `boolean`? Should this be a separate data item elsewhere rather than an enumeration of `ma_qa_metric.type`?

The default ranking score in AF3 is calculated as 0.8 × ipTM + 0.2 × pTM + 0.5 × disorder − 100 × has_clash. I would like to be able to properly store all components of that and has_clash is a boolean pass/fail score (1 = pass, 0 = fail).
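
For reference, the formula quoted above can be written as a small function. This is only a sketch of the expression as stated in this comment, assuming that "disorder" refers to the fraction_disordered score and that has_clash is given as 0 or 1; the function name `ranking_score` is chosen here for illustration.

```python
def ranking_score(iptm, ptm, fraction_disordered, has_clash):
    """Combine the four components using the weights quoted in this thread.

    has_clash is expected to be 0 or 1; a value of 1 lowers the score by 100.
    """
    return (
        0.8 * iptm
        + 0.2 * ptm
        + 0.5 * fraction_disordered
        - 100.0 * has_clash
    )


if __name__ == "__main__":
    # Toy component values, not taken from any real prediction.
    print(ranking_score(iptm=0.5, ptm=0.5, fraction_disordered=0.0, has_clash=0))
    print(ranking_score(iptm=0.5, ptm=0.5, fraction_disordered=0.0, has_clash=1))
```

Storing each component as its own _ma_qa_metric entry would let a reader recompute the ranking score from the file.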

@brindakv
Contributor

brindakv commented Nov 28, 2024

Thanks for clarifying @gtauriello.

> The default ranking score in AF3 is calculated as 0.8 × ipTM + 0.2 × pTM + 0.5 × disorder − 100 × has_clash. I would like to be able to properly store all components of that and has_clash is a boolean pass/fail score (1 = pass, 0 = fail).

Should the enumeration for ma_qa_metric.type be has_clash or boolean?

Never mind. Boolean is good.

@brindakv
Contributor

brindakv commented Nov 28, 2024

@gtauriello I suggest we add enumerations to _ma_associated_archive_file_details.file_content and _ma_entry_associated_files.file_content.

It can be generic (QA metrics) or specific (feature-based QA scores).

@gtauriello
Author

gtauriello commented Nov 28, 2024

For ma_qa_metric.type: yes for boolean as you concluded already.

For file_content: I had not noticed that one but it is an excellent point. I would go for the generic (QA metrics) option and add a note for local pairwise QA scores that this is deprecated in favor of QA metrics.

@brindakv
Contributor

Thanks @gtauriello. Updates have been committed, please see #25.
