Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Micro-PR to add alt_atom_id to Biotite's internal CCD #668

Merged
merged 4 commits into from
Oct 19, 2024

Conversation

Croydon-Brixton
Copy link
Contributor

Micro-PR to add alt_atom_id to Biotite's internal CCD:

Why? This is useful for resolving atom names in PDB files that do not use the standard atom ids, but use the alternative atom ids instead.

@Croydon-Brixton
Copy link
Contributor Author

@padix-key the Test interfaces to databases and applications test fails due to

Error: Unable to locate executable file: /usr/share/miniconda/condabin/mamba. Please verify either the file path exists or the file can be found within a directory specified by the PATH environment variable. Also check the file mode to verify the file is executable.

This sounds like a CI/CD issue, I'm not sure there's anything I can do about this?

Copy link

codspeed-hq bot commented Oct 4, 2024

CodSpeed Performance Report

Merging #668 will not alter performance

Comparing Croydon-Brixton:feat/alt_atom_id (b3708ab) with main (231eefe)

Summary

✅ 44 untouched benchmarks

@padix-key
Copy link
Member

Hi, I checked the increased file size of the components.bcif: It went from 35.6 MB to 40.8 MB, which is tolerable in my opinion, when there is reasonable benefit.

However, I am not sure in which cases the alt_atom_id is actually used by the PDB: I had the impression that the PDB only uses the atom_id to name atoms in its entries. Do you have an example where alt_atom_id is used instead as atom name?

If alt_atom_id is actually used, one would probably need to adapt downstream functionality, e.g. connect_via_residue_names() to be able to from bonds between atoms with alternate names as well. Do you have other scenarios in mind where 'resolving atom names' is necessary?

Copy link
Member

@padix-key padix-key left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the conceptual discussion above, the change looks good to me, thank you. There is only the small outlined nitpick.

To make the changes in the internal CCD take effect, we need the built structure/info/ccd directory as well. It is built as artifact in the CI. I will add it to this PR when the CI finishes successfully.

setup_ccd.py Outdated Show resolved Hide resolved
@padix-key
Copy link
Member

padix-key commented Oct 6, 2024

This sounds like a CI/CD issue, I'm not sure there's anything I can do about this?

I deem this a temporary CI hickup as well. If this persists in a few days, I will look deeper into this.

Update: This error seems to be persistent, but I added a fix in #661.

@nscorley
Copy link
Contributor

nscorley commented Oct 8, 2024

Hi,

Piling-on here: We've come across a number of entries where the atom_ids are annotated with the alternative atom ID's rather than the standard atom ID's, and even a few where a single residue mixes naming conventions.

An egregious example:

  • 6q9t, where the ligand QUJ uses "C10" (an alternative atom name) rather than "CA" (the standard name), but uses the "standard" names for the other atoms

Providing the flexibility to return the alternative names enables us to write code handling these edge cases without significantly increasing overhead

@nscorley
Copy link
Contributor

nscorley commented Oct 8, 2024

Hi, I checked the increased file size of the components.bcif: It went from 35.6 MB to 40.8 MB, which is tolerable in my opinion, when there is reasonable benefit.

However, I am not sure in which cases the alt_atom_id is actually used by the PDB: I had the impression that the PDB only uses the atom_id to name atoms in its entries. Do you have an example where alt_atom_id is used instead as atom name?

If alt_atom_id is actually used, one would probably need to adapt downstream functionality, e.g. connect_via_residue_names() to be able to from bonds between atoms with alternate names as well. Do you have other scenarios in mind where 'resolving atom names' is necessary?

In regards to connect_via_residue_names() -- will need to double-check our examples, but I'm 99% certain this problem is isolated to ligands; amino acids and other monomeric residues tend to be named correctly. Thus, when connecting polymers via canonical bonds, I don't think it'll matter. Broadly speaking, the most foolproof approach would be to check whether any of the provided atoms are non-standard at loading time, and to then map them to their standard equivalents; it's less clear to me whether this behavior should be the default given the additional overhead.

@Croydon-Brixton
Copy link
Contributor Author

I think this is especially an issue as the PDB keeps making updates. From some older cif files I've got lying around I can see the alt_atom_id being used where the fresh download from the rcsb PDB now uses a standard atom name. I'm not sure how it's decided which naming to use, but it might be that the alt atom id is used to reference legacy versions of ligands from the CCD?

@Croydon-Brixton
Copy link
Contributor Author

In any case -- I addressed the ordering of the blocks now ✅

@nscorley
Copy link
Contributor

nscorley commented Oct 9, 2024

Hi,

Piling-on here: We've come across a number of entries where the atom_ids are annotated with the alternative atom ID's rather than the standard atom ID's, and even a few where a single residue mixes naming conventions.

An egregious example:

  • 6q9t, where the ligand QUJ uses "C10" (an alternative atom name) rather than "CA" (the standard name), but uses the "standard" names for the other atoms

Providing the flexibility to return the alternative names enables us to write code handling these edge cases without significantly increasing overhead

Double-checked this example and it seems in the most recent PDB update that was fixed, as @Croydon-Brixton mentioned -- will look for others

@padix-key
Copy link
Member

@nscorley Thanks for the pointer to the example entry. I will look into it. @Croydon-Brixton If you do not mind, I would also add the required changes to connect_via_residue_names() in this PR.

@padix-key
Copy link
Member

I merged #661. After you rebase on the main branch, the test should work.

@Croydon-Brixton
Copy link
Contributor Author

Amazing, thank you @padix-key -- I'm aiming to have these PR's updated for you later today or tomorrow. Will ping you when ready!

@padix-key padix-key merged commit 3a7437d into biotite-dev:main Oct 19, 2024
25 of 26 checks passed
@padix-key
Copy link
Member

And its merged. Thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants