Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

So what if i dont know the charges #43

Open
Ltwicke opened this issue Oct 18, 2024 · 6 comments
Open

So what if i dont know the charges #43

Ltwicke opened this issue Oct 18, 2024 · 6 comments

Comments

@Ltwicke
Copy link

Ltwicke commented Oct 18, 2024

Hello,

im a physicist and i learned about molecules in terms of quantum mechanical many body systems rather than in the good old chemical fashion that speaks bond orders and stereochemistry, so please be easy on me, if i got something fundamentally wrong.

I have a dataset of xyz file from qm calculations and i need to make mol objects out of them for further processing. Im using this:

mol = Chem.MolFromXYZBlock(xyz_block)
rdDetermineBonds.DetermineBonds(mol, charge=0)

as im expecting the molecules just to be neutrally charged, i have no information sadly. Now this procedure fails for some of my conformers:

image

with

ValueError: Final molecular charge (-2) does not match input (0); could not find valid bond ordering

Strangely enough, it fails for a certain conformer of a molecule, while it is able to process the others before. These are the atomic coordinates, in case someone can see something from them:

image

My question is: Is there a physical reason, why this algorithm cant find the bonds if the charge is unknown? I suppose it tries to distribute the given charge over the atoms and is unhappy if it cant match the charge with some pairs of coordinates based on some rules. Then i wonder; why is there no option to infer the most likely charge and just use this?

If it helps anything: im using version '2024.03.5' of rdkit.

Hope to hear from you,

Thanks!

Edit: For my total of 59k, it fails only 184 times, however, as stated before, it randomly fails for different conformers for the same molecule...

@nbehrnd
Copy link

nbehrnd commented Oct 18, 2024

The first screen photo looks like atoms of oxygen is assigned a negative charge each,

5 8 O chg: -1

If this snippet is about the complete molecule, the total charge would be equal to -4 which then would call for a

rdDetermineBonds.DetermineBonds(mol, charge=-4)

or

xyz2mol.py molecule.xyz --charge -4

respectively because the earlier line is about "create bonds for all molecule's atoms and an overall charge of ...".

However, for the size of the molecule, -4 is unusually high. In the chemistry lab, one deprotonates (removes hydrogens) from alcohols with sodium, then, the remaining oxygen atoms carry a negative charge. But this is at a high pH value not frequently seen in living organisms.

Share the xyz file in question, for instance within a zip then attached to the mask here.

@jhjensen2
Copy link
Member

@Ltwicke The default bond perception method uses interatomic distances together with cutoffs. The interatomic distances can be slightly different for different conformers, which can give different results for different conformers. You can try different value of covFactor or useHueckel=True to see if that helps.

It's also important that you use the correct charge, i.e. the same charge you used for your QM calculations

@Ltwicke
Copy link
Author

Ltwicke commented Nov 13, 2024

@nbehrnd Your not entirely correct, it also assigns the charge 1 to C twice (always below the two Os), so the total charge would therefore be -2. The xyz "file" is actually just a string block created shortly to make the mol object.

@jhjensen2 The questions still remains, why the bond perception method needs the charge to work. I assume that when for some conformers, the distance between two atoms becomes so large, the algorithm tries to justify it by assigning a charge to the molecule and then the charge parameter is more like an assertion that this assumption is right? Then i still wonder, why there is no option to just infer the charge in this sense...

@nbehrnd
Copy link

nbehrnd commented Nov 13, 2024

@Ltwicke Better to share these data as plain text, than by a screen photo. Request the script to write a .xyz file .or. to copy-paste it -- it is less prone to errors in transmission, and faster to then replicate your observations. Post it as a fenced code block (parlance of [GitHub flavored] markdown, cf. for instance with learnxinyminutes) after a revision (vide infra).

I run an online OCR (here) with the second screen photo of yours. The initial result (initial.txt) required some edits run incrementally with regex based substitutions -- think in line of sed 's/@/0/g' initial.txt > zeroes.txt. Eventually, presuming the sequence of the atoms in your first array matches the sequence in your second numpy like array -- I i) prepended atom types to the coordinates, ii) added the line about the number of atoms, and iii) inserted comment line to the .xyz file to obtain intermediate.xyz.

An initial inspection with Jmol of the structure -- without assignment of bond orders different to one. It rises an eyebrow because the motif of two oxygen atoms next to each other in an organic molecule (i.e., an organic peroxide) usually yields an unstable molecule with tendency to decompose. Even more so if this motif occurs twice per molecule. And check the peroxide oxygen atoms depicted as trivalent, too.
Note, the two terminal carbon atoms do not fulfill valency rules; this model equally appears incomplete. Did you loose or forget a few hydrogen atoms?

intermediate


Playground: I once wrote a moderator xyz2mol_b to easier test an overall charge different from zero. Against the odds, the submission of intermediate.xyz (virtual environment, Python 3.12.6, rdkit 2024.03.6 from PyPI) accepts -c -4, i.e.

python xyz2mol_b.py intermediate.xyz -c -4 > molecule_minus4.mol

were in line with a sum of 4 negative charges. The attempt with --charge -2 (to account for four negative charges on oxygen atoms partially compensated by two positive charges on carbon atoms) failed. But still, the chemistry ...

@Ltwicke
Copy link
Author

Ltwicke commented Nov 14, 2024

Hey, sorry, your right, i was already kind of giving up on the probem. Heres an actual example from my dataset:

This is the xyz string:

'26\n\nC 3.048553 -0.031229 0.757984\nC 2.511941 -0.828616 -0.409149\nO 3.206721 -1.692353 -0.970628\nN 1.181584 -0.712026 -0.833386\nC 0.171254 0.129608 -0.366781\nC -1.139305 -0.411149 -0.153312\nC -2.131229 0.413783 0.404978\nC -1.884858 1.743819 0.752641\nC -0.644931 2.305070 0.473622\nC 0.373153 1.521099 -0.110496\nN 1.543497 2.238024 -0.538759\nO 1.750514 3.353296 -0.110506\nO 2.333591 1.685481 -1.409296\nC -1.487353 -1.827544 -0.517160\nO -0.708038 -2.533700 -1.166728\nC -2.827449 -2.391606 -0.091887\nH 2.260289 0.388497 1.376909\nH 3.664566 -0.684247 1.370458\nH 3.683724 0.777107 0.402794\nH 0.829627 -1.550721 -1.327940\nH -3.122948 0.007975 0.574823\nH -2.668454 2.343480 1.197834\nH -0.448350 3.354552 0.666391\nH -2.883419 -3.440655 -0.365380\nH -3.639944 -1.863558 -0.587756\nH -2.972832 -2.294489 0.980732'

I then proceed to use

example_mol = Chem.MolFromXYZBlock(xyz_block)
rdDetermineBonds.DetermineBonds(example_mol, charge=0)

However, it tells me:

ValueError: Final molecular charge (0) does not match input (-1); could not find valid bond ordering

If i were to change the charge to -1, then i get the following response:

ValueError: Final molecular charge (-1) does not match input (-4); could not find valid bond ordering


When using another conformer of the same molecule however, with this following xyz string:

'26\n\nC -1.935455 -1.367446 1.551091\nC -1.145513 -1.991177 0.421168\nO -1.291752 -3.188290 0.120949\nN -0.146416 -1.284075 -0.257162\nC 0.223788 0.055350 -0.112289\nC 1.608755 0.418802 -0.060006\nC 1.990547 1.758481 0.182665\nC 1.029890 2.742352 0.360649\nC -0.324359 2.412579 0.236727\nC -0.747634 1.099760 -0.019541\nC -2.198708 0.866646 -0.379629\nO -2.522873 -0.075443 -1.096743\nC -3.240425 1.855507 0.114015\nH -3.075026 2.119007 1.156450\nH -3.191555 2.773606 -0.470310\nH -4.234264 1.435943 -0.000793\nH -1.063149 3.203195 0.305317\nH 1.325072 3.765648 0.554457\nH 3.048659 1.995623 0.221612\nN 2.680117 -0.514933 -0.229924\nO 3.813063 -0.192512 0.033971\nO 2.409476 -1.707898 -0.691785\nH 0.538618 -1.888031 -0.737694\nH -1.486071 -0.452125 1.925956\nH -2.002204 -2.084458 2.365521\nH -2.951581 -1.154254 1.228369'

The code works with charge = 0.

Thank you for taking your time to look into it.

@nbehrnd
Copy link

nbehrnd commented Nov 18, 2024

@Ltwicke Thank you for sharing the update, I can replicate your findings, i.e.

  • for the first conformer, the assigned charge over all equates to minus two
  • for the second conformer, over all, the molecule is neutral

The issues here are

  1. the implementation of xyz2mol
  2. nitro group (R-NO2 where R shall be an abbreviation of the rest of the molecule) itself
  3. unequal bond lengths N-O in your models
  4. a C-N bond length to the nitro group about 0.05 A shorter, than in the reference set

over which your influence varies.

  1. The implementation checks if two atoms are close enough for a single, double, triple, aromatic bond with reference values. This (partially) reflects in the second part of the mol file, e.g.
  1  2  2
  1  3  1

reading: "between atom 1 and 2 listed in the block above, there is a double bond/bond order of 2" and (next line) "between atom 1 and 3 is a single bond". There equally is a set of (empiric) valence rules RDKit checks against. I.e. in organic molecules (like here) carbon usually establishes four bonds (or the equivalent of four bonds; e.g. one triple and one single bond), nitrogen three, hydrogens only one. See for instance https://www.rdkit.org/docs/RDKit_Book.html#molecular-sanitization

  1. The "either a single, double, triple bond between two atoms" doesn't work well enough for N-O in a nitro group. After considering formal charges (vide infra), one N-O bond were single, the other double. This would equate to one long N-O bond, and one short N=O double bond, which however isn't typically observed. Instead, both N-O bonds' bond lengths is somewhat between the two because the charge is (cloud like) dissipated all among the nitro group. (Do you now about mesomerism?)

3),4) By 2) one could assume "bond order isn't 1, nor 2, so let's try 1.5". However as far as I know, fractional bond orders are not in the scope of a .mol file. xyz2mol eventually has to pick one N-O bond as a single, the other a double bond here (plus adjustment of formal charges) and by training chemists read nitro group with point 2) in mind. With numeric thresholds only, if distances are beyond tolerances -- N-O bond lengths not equal to each other / or/and different to reference data -- an algorithm can err on bond orders. And if an oxygen is too remote vs. a nitrogen to be a "classical N=O double bond", but still close enough for a N-O single bond, the formal charge on either N or O can be erroneous, too.

The total charge of the molecule (which in case of the moderator script were the optional flag -c, for instance for -c -2) is the sum of all atoms' individual formal charge. For the later, one calculates

formal charge = sum (all valence electrons of the atom in question) minus
                (the number any pi and sigma bonds of this atom with adjacent atoms) minus
                (the number of valence electrosn of this atom which are not engaged in a bond)

Your revised data: I reset the line feeds by

sed 's/\\n/\n/g' record_01.txt > record_01.xyz
sed 's/\\n/\n/g' record_02.txt > record_02.xyz

then submitted them to the conversion moderated xyz2mol_b. Names reflect if there was an overall charge (record_01_minus2.mol) to consider, or none (record_02.xyz.mol). With nitrobenzene (as a reference model independent to your work), I used Jmol to read out the intramolecular distances (see the ball stick models, 1A = 10**-10 m), and subsequently to assign the formal charges (wireframe models).

Not knowing your work flow, I suggest

  • attempt the molecule sanitation mentioned earlier to get rid of the total of two negative charges encoded in record_01_minus2.mol (line 57) reading

    M  CHG  2  12  -1  13  -1
    

    on atom 12 and 13. Note record_02.xyz.mol reads

    M  CHG  2  20   1  22  -1
    

    i.e. a positive charge on on atom 20 and a negative on atom 22; the two compensate each other. Since atom indices can differ from .mol to .mol file, it is safer to request RDKit for this to adjust.

  • check if you can introduce a constraint into your workflow which i) would keep the nitro group fix in regard of internal distances and its distance to the rest of the molecule and ii) still allows rotation around the C-N bond, scissor like conformational change along O-N-O (a bit like IR active vibrations).

2024-11-18_check02.zip

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants