What is mandatory #90

Open
rhfogh opened this issue Feb 7, 2017 · 5 comments

Comments

@rhfogh
Contributor

rhfogh commented Feb 7, 2017

Recently, I have seen test data with at least three mandatory elements missing:

  _nef_nmr_meta_data.format_version

the 'index' columns on restraints and peaks.

and

save_nef_chemical_shift_list

Clearly people cannot be prevented from (ab)using the format in whatever way they prefer, internally. The question, especially for the BMRB, is what will happen if any of these mandatory elements are missing.

The format_version is clearly necessary - otherwise we do not know which reader to use.

The index columns are less certain. Since they are in effect line numbers they can be inferred from the file.
Will the BMRB refuse to accept files without indices? Or should we reconsider whether they are mandatory?

Finally, the chemical_shift_list is mandatory so that there is a list of the atoms used, instead of having to dig them out of the restraint lists. The thing is that it is not mandatory that it be complete (that would be impossible to enforce), and people who do not have shifts will understandably be slow to add shift lists containing no useful shift values. Any comments?

@rhfogh
Contributor Author

rhfogh commented Feb 22, 2017

This discussion was continued offline. The result is given here, to make it public.

I would strongly recommend that we hold all discussions here, but when nobody answers I have sometimes sent individual mails to provoke a reaction.

Please keep answering - we are still not finished with this point.

===============================================================

From Kumaran Baskaran:

I prefer your second option:
"We can change NEF to make them optional, strongly recommend their inclusion, and say that they are mandatory for BMRB and wwPDB deposition"

Most of the software in the NEF consortium deals with data derived from the chemical shifts. In most cases these programs don't use chemical shift information at all, so we can't force them. However, if they want to deposit the data to wwPDB, then chemical shifts become mandatory. My only concern is the case in which these programs take a NEF file with chemical shifts as input and write out only the derived data.

============================================================

From Rasmus Fogh:
I do not think it would work to get the sequence only from the mmCif file. The NEF file does give the sequence order, simply by the (significant) order of the lines in the sequence loop. So if that order does not match the mmCif file, we should raise an error, not just ignore the nef file order.

The question is whether you actually need the index column for your readers to work properly, or if you could generate it yourself when it is missing, from the order of the lines in the file. CCPN can certainly manage without it. Maybe you could run incoming files through a regularisation step? But since the index column was added specifically as a BMRB and RCSB requirement, we need to hear from you whether we must insist on that index column always being present.
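
A regularisation step of the kind suggested here could look roughly like the sketch below. It is only an illustration: the function name `regularise_index` and the representation of loop rows as dictionaries are hypothetical, and it assumes that a valid index is simply the 1-based position of the row in the loop.

```python
def regularise_index(rows):
    """Return copies of the rows with an 'index' consistent with line order.

    `rows` is a list of dicts, one per loop row, in file order (a hypothetical
    in-memory representation). A missing index is filled in from the row
    position; an index that contradicts the row position raises ValueError.
    """
    regularised = []
    for position, row in enumerate(rows, start=1):
        given = row.get("index")
        if given is not None and int(given) != position:
            raise ValueError(
                f"row {position}: index {given!r} does not match line order"
            )
        # Copy the row so the caller's data is left untouched.
        regularised.append({**row, "index": position})
    return regularised


if __name__ == "__main__":
    rows = [{"residue_name": "ALA"}, {"residue_name": "CYS"}]
    print(regularise_index(rows))
```

Raising an error on a mismatch, rather than silently renumbering, is a design choice made for this sketch; a depositing database might prefer a warning instead.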

Regarding the 'index': if a NEF file comes with a proper mmCIF file, then the correct sequence order could be extracted from the mmCIF file, but I haven't seen a single mmCIF file (even from a CCPN project) with author tags mapped to NEF atoms.

===============================================================
From Eldon Ulrich:

I am a bit confused about the 'index' column for the sequence. I had originally thought that this was not going to be an index column but actually a sequence ordering column. Since in the examples I have seen, 'sequence numbering' can apparently include 6A, 6B, 6C and other possibilities, I cannot tell if this sequence numbering indicates an insertion or possible substitution at position 6 without parsing other information in the line. If I am dependent on line order for my sequence information, I would have to assume that the above was an insertion, but could it actually be a substitution (position 6 had either a THR or a SER or an ALA in the peptide)?

@rhfogh
Contributor Author

rhfogh commented Feb 22, 2017

On the nef_chemical_shift saveframe:

As mentioned by Kumaran in the last comment, I propose a format change, so that the nef_chemical_shift saveframe is strongly recommended, mandatory for data deposition, but technically optional in the file. It would actually be quite important that people kept the chemical shift list that ultimately is the basis of any NMR result as part of their data. Could some programs keep it as a chemical shift restraint list, somehow? But, as Kumaran says, it may not be realistic to insist that programs that do not work with chemical shifts should always output this saveframe.

Note that until such a time as this change is agreed and implemented, the chemical_shift_list saveframe is still mandatory.
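
To make the proposed rule concrete, here is a minimal sketch of how a checker might treat a missing chemical shift list differently for ordinary files and for deposition. It reflects the proposal only, not the current format, and the function name `check_saveframes`, the `for_deposition` flag, and the idea of passing in a set of saveframe category names are all assumptions made for illustration.

```python
def check_saveframes(categories, for_deposition=False):
    """Return (errors, warnings) for the saveframe categories found in a file.

    Under the proposed rule a missing nef_chemical_shift_list is only a
    warning for ordinary files, but an error when checking for deposition.
    """
    errors, warnings = [], []
    if "nef_chemical_shift_list" not in categories:
        message = "nef_chemical_shift_list is missing"
        if for_deposition:
            errors.append(message)
        else:
            warnings.append(message)
    return errors, warnings


if __name__ == "__main__":
    found = {"nef_nmr_meta_data", "nef_molecular_system"}
    print(check_saveframes(found))
    print(check_saveframes(found, for_deposition=True))
```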

@rhfogh
Contributor Author

rhfogh commented Feb 22, 2017

As I understand it, the ordering of the rows in the _sequence loop is what determines the sequence (see specifications/Overview.md, section 3 (Molecular System), subsections 4 and 6). The index column is simply a way of reflecting the order for "implementations like (deposition) databases that do not use ordered containers for the data"; it would follow that a file that did not give the index values as increasing integers starting at 1 would be in error. And there is no mechanism for specifying chain heterogeneity, except by giving a separate chain for each alternative sequence. The index column and the name 'index' came in as a fairly recent change at the suggestion of yourself and/or John Westbrook (I think you preferred 'index' to 'ordinal').

As I understand it, again, the combination of line ordering / index and 'linking' values specifies the connectivity of the sequence, and the sequence_code gives unique author-selected identifiers for the residues. There is no alternative identifier that would match the mmCif label_seq_id, so the entire nef file will use only the sequence_code. To illustrate, the loop below would correspond to the peptide 'ACKLVD', with an extremely silly but perfectly legal choice of author sequence codes, and with an unlinked hydroxyproline sharing the same chain code. And, as you see, the index column can be inferred from the ordering.

loop_
_nef_sequence.index
_nef_sequence.chain_code
_nef_sequence.sequence_code
_nef_sequence.residue_name
_nef_sequence.linking
1 A 19 HYP single
2 A -1 ALA start
3 A 1 CYS .
4 A 2A LYS .
5 A 2B LEU .
6 A 6 VAL .
7 A 5 ASP end
stop_
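
As a rough illustration of the point, the sketch below takes the rows of the example loop without their index column, recovers the index from line order, and reads off the linked peptide from the 'linking' values. The helper names (`infer_index`, `linked_fragments`) are hypothetical, and the linking logic handles only the values that appear in this example ('single', 'start', 'end' and '.'); it is not a full implementation of the specification.

```python
# Rows of the example loop, in file order, with the index column left out:
# (chain_code, sequence_code, residue_name, linking)
ROWS = [
    ("A", "19", "HYP", "single"),
    ("A", "-1", "ALA", "start"),
    ("A", "1", "CYS", "."),
    ("A", "2A", "LYS", "."),
    ("A", "2B", "LEU", "."),
    ("A", "6", "VAL", "."),
    ("A", "5", "ASP", "end"),
]

# One-letter codes for the linked residues in this example only.
ONE_LETTER = {"ALA": "A", "CYS": "C", "LYS": "K", "LEU": "L", "VAL": "V", "ASP": "D"}


def infer_index(rows):
    """The index of a row is its 1-based position in the loop."""
    return list(range(1, len(rows) + 1))


def linked_fragments(rows):
    """Collect linear fragments running from a 'start' row to the next 'end' row."""
    fragments = []
    current = None
    for _chain, _sequence_code, residue, linking in rows:
        if linking == "single":
            continue  # unlinked residue, not part of any fragment
        if linking == "start":
            current = [residue]
        elif current is not None:
            current.append(residue)
            if linking == "end":
                fragments.append(current)
                current = None
    return fragments


if __name__ == "__main__":
    print(infer_index(ROWS))
    for fragment in linked_fragments(ROWS):
        print("".join(ONE_LETTER[name] for name in fragment))
```

Note that the sequence codes (19, -1, 1, 2A, 2B, 6, 5) play no part in either function: only the row order and the linking values are used.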

To me this raises two questions:

  • Is this sufficiently clear from the specification, or what should we do to clarify it?
  • Can anyone propose a better way of doing this (and ultimately gather a consensus for it)?

@elulrich

elulrich commented Feb 22, 2017 via email

@rhfogh
Contributor Author

rhfogh commented Feb 28, 2017

Hi,

Thinking about it, you are clearly right, Eldon. 'ordinal' would have been a better choice for the sequence loop, and if we had thought about that while we were making the change we should have put it in that way. Where we are now, changing to ordinal would be another version change, which would need its own pull request and consensus. If you like, you can set that up yourself and I will vote in favour (this is a collaborative project - it would actually be quite good if someone other than me set up some pull requests). But for my part I think that the most urgent thing right now is to get people to implement the current format and get some test data from everybody. So I would leave the point till later, so we could get people's attention and discuss and pass a number of accumulated changes in one go. When you come to it, it should not be too hard to get people to change a single string in their code, once they agree to it.

Yours,
Rasmus
