-
Notifications
You must be signed in to change notification settings - Fork 116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Neutralise systems with formal charges from the PDB #301
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay in getting back to you. Your implementation looks good, and as you say, it's probably better than not doing it.
I'm still a little concerned about residues that can appear in multiple forms. Take a look at the CCD definition of HIS, for example. (I know it's a standard residue, and therefore this code won't be used for it. I'm just using it as an example of what we can probably expect to find in nonstandard residues too.) It can be positive, negative, or neutral depending on which hydrogens are present. The CCD entry describes the positively charged, doubly protonated form. But that isn't the most common form at pH 7.
On the positive side, if PDBFixer just added the hydrogens based on the CCD entry, that's the form it will use. So this does ensure that if PDBFixer adds both hydrogens and solvent, they'll be consistent with each other.
The other thing I'm a bit unsure about is the handling of leaving atoms. You require that no leaving atoms be present. If any of them are, it won't match. Would it be better to make it match whether or not the leaving atoms are present?
pdbfixer/pdbfixer.py
Outdated
smilesCol = descriptorsData.getAttributeIndex("descriptor") | ||
smiles = None | ||
for row in descriptorsData.getRowList(): | ||
if row[typeCol] in ["SMILES", "SMILES_CANONICAL"]: | ||
smiles = row[smilesCol] | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You record the SMILES string but never use it anywhere. What is it for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I was using it in a previous iteration. I'll remove it.
pdbfixer/pdbfixer.py
Outdated
@@ -1448,20 +1567,34 @@ def _createForceField(self, newTopology, water): | |||
template = app.ForceField._TemplateData(resName) | |||
forcefield._templates[resName] = template | |||
indexInResidue = {} | |||
# If we can't find formal charges in the CCD, make everything uncharged | |||
formalCharges = defaultdict(lambda: 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or more simply defaultdict(int)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, fixed
# Capping with GLY because ACE/NME currently breaks addMissingHydrogens | ||
# Adding a cap keeps the protein compact and the addSolvent call quick |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In what way does it break it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was running into an issue where if I unconditionally capped the ends of a chain, chains that didn't have missing terminal loops would wind up with unprotonated caps. I mentioned it briefly in #299. I ended up attributing this to a bug in my own code and fixed it by only capping terminals with missing loops, and I've transferred that code over to this PR just now, but I can dig into it further if you like.
pdbfixer/pdbfixer.py
Outdated
checkCache : bool | ||
If ``False``, attempt to re-download the CCD entry regardless of | ||
what is in the cache. Defaults to ``True``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason for including this flag? You never specify it when calling this method, and it's a private internal method not meant to be called by users, so the value is never False.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another leftover from an earlier iteration and I've removed it.
Yay!
An example of this is the CCD definition of CIT, citric acid. At pH 7, citric acid is most commonly triply deprotonated, but the CCD entry is neutral. I agree it would be nice to handle this case more carefully, both here and in the protonation code. I can imagine some simple algorithm along the lines of "add 1 formal charge to any atom that has an extra hydrogen bonded to it relative to the CCD, and remove 1 formal charge from any atom that is missing a hydrogen" which might improve the behavior in this PR for "nonstandard" protonation states. This would have pathological behavior on a PDB file that is missing hydrogens though - carbanions aplenty! And if In general, I don't think this should hold up this PR, because the PR will only introduce regressions if the current behavior was correct by luck. I think detecting alternative forms is a substantial and orthogonal design problem. EDIT: And actually, even then there's no regression, as the set of atom names will differ so the formal charges won't be applied.
This'll need to be factored in if the hydrogen adding system becomes pH aware for CCD sourced residues - the same mechanism will want to be applied here.
Yeah I think I misunderstood what the point of leaving atoms were - they're for defining polymer linkages, but I thought they were about alternate protonation states. I'll fix this! |
The test case is failing:
|
I think you're right: this won't always get the right answer, but it's more likely to get the right answer than the current code, so it's still an improvement. I think it would be worth also allowing the user to pass a ForceField to |
Just checking in on this so it doesn't get forgotten. The test case is still failing. |
Sorry Peter, I've been distracted! The failing test was a timeout from code I didn't touch downloading PDB files from the PDB. Hopefully it'll succeed this time - I think this one's good to go. |
Just checking in again. :) See the one comment above, where if it can't download one definition, it won't return any others either. I think that line needs to be changed? |
Sorry I've been off sick - I agree, and have made that change. Thanks for being so patient! |
Thanks, that looks good. I'm sorry you've been sick! |
This PR addresses #299 by:
_downloadCCDDefinition()
that parses and caches the definition._downloadFormalCharges()
which gets the formal charges of the atoms in a residue from the CCD_downloadFormalCharges()
in_createForceField(water=True)
to provide charges only when the residue is missing from the base force field and the (non-leaving) atom names in the CCD exactly match the atom names in the PDBThe refactor avoids re-downloading a CCD entry up to 3 times (once for protonation, once for missing heavy atoms, and a third time for formal charge) and reduces duplicate code. However, it is not essential for the other functionality and can be removed or split into its own PR if desired.
This solution works around the issues discussed in #299 where CCD residues tend to hard-code a single protonation state by only using the CCD formal charge when the force field does not provide the residue.
My reasoning is that there are two cases: first, when the force field includes a residue, it is capable of identifying the protonation state and gives the correct net charge and so it can only introduce regressions to interfere with this. In the second case, when a residue is introduced to the force field by
_createForceField(water=True)
, the atom's partial charge is currently hard-coded to 0 and so using more information to sometimes set that to better values is appropriate.We can break down the effect of this change (modulo bugs) like so:
amber14-all.xml
, behavior is unchanged. Otherwise:I think 2. should be much more common than 3., so this is a step in the right direction. In particular, in the very common case that PDBFixer is used to protonate a residue (with
addMissingHydrogens()
), this change guarantees that the protonation state will be consistent with the assumed charge - either both will be taken from the CCD, or the assumed charge will be inferred from the protonation state by the force field.I think something like an
offsetCharge
argument toaddSolvent()
andaddMembrane()
which would allow a user to correct any remaining errors in neutralising charge would be great, but this requires changes toModeller
and so is not included in this PR.