What is mandatory #90
Comments
This discussion was continued off line. The result is given here to make it public. I would strongly recommend that we hold all discussions here, but when nobody answers I have sometimes sent individual mails to provoke a reaction. Please keep answering - we are still not finished with this point.
===============================================================
From Kumaran Baskaran:
I prefer your second option: most of the software in the NEF consortium deals with data derived from the chemical shifts. In most cases this software does not use chemical shift information at all, so we can't force it on them. However, if they want to deposit the data with the wwPDB, then chemical shifts become mandatory. My only concern is the case in which this software takes a NEF file with chemical shifts as input and writes out only the derived data.
===============================================================
From Rasmus Fogh:
The question is whether you actually need the index column for your readers to work properly, or whether you could generate it yourself when it is missing, from the order of the lines in the file. CCPN can certainly manage without it. Maybe you could run incoming files through a regularisation step? But since the index column was added specifically as a BMRB and RCSB requirement, we need to hear from you whether we must insist on that index column always being present.
Regarding the 'index': if a NEF file comes with a proper mmCIF file, then the correct sequence order could be extracted from the mmCIF file, but I haven't seen a single mmCIF file (even from a CCPN project) with author tags mapped to NEF atoms.
===============================================================
I am a bit confused about the 'index' column for the sequence. I had originally thought that this was not going to be an index column but actually a sequence ordering column.
Since, in the examples I have seen, 'sequence numbering' can apparently include 6A, 6B, 6C and other possibilities, I cannot tell whether this sequence numbering indicates an insertion or a possible substitution at position 6 without parsing other information in the line. If I am dependent on line order for my sequence information, I would have to assume that the above was an insertion, but could it actually be a substitution (position 6 had either a THR or a SER or an ALA in the peptide)?
On the nef_chemical_shift saveframe: As mentioned by Kumaran in the last comment, I propose a format change, so that the nef_chemical_shift saveframe is strongly recommended, mandatory for data deposition, but technically optional in the file. It would actually be quite important that people keep the chemical shift list, which is ultimately the basis of any NMR result, as part of their data. Could some programs keep it as a chemical shift restraint list, somehow? But, as Kumaran says, it may not be realistic to insist that programs that do not work with chemical shifts should always output this saveframe. Note that until such time as this change is agreed and implemented, the chemical_shift_list saveframe is still mandatory.
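To make the proposal concrete, here is a minimal sketch of how a reader could treat a missing shift list under the proposed rule (a warning for ordinary use, an error for deposition). The function name `check_shift_list`, the `for_deposition` flag and the assumption that saveframe names start with `nef_chemical_shift_list` are illustrative only; this is not existing BMRB or CCPN code, and the current format still makes the saveframe mandatory.

```python
# Hypothetical check for the PROPOSED rule: the shift list is recommended
# in general but required when the file is meant for deposition.
SHIFT_PREFIX = "nef_chemical_shift_list"  # assumed saveframe name prefix


def check_shift_list(saveframe_names, for_deposition=False):
    """Return (errors, warnings) describing a missing chemical shift list."""
    has_shifts = any(name.startswith(SHIFT_PREFIX) for name in saveframe_names)
    errors, warnings = [], []
    if not has_shifts:
        message = "no chemical shift list saveframe found"
        # Same finding, different severity depending on the use of the file.
        (errors if for_deposition else warnings).append(message)
    return errors, warnings


if __name__ == "__main__":
    names = ["nef_nmr_meta_data", "nef_molecular_system"]
    print(check_shift_list(names))
    print(check_shift_list(names, for_deposition=True))
```

The same file is reported with a warning when exchanged between programs and with an error when `for_deposition=True`, which is the distinction the proposal draws.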
As I understand it, the ordering of the rows in the _sequence loop is what determines the sequence (see specifications/Overview.md, section 3 (Molecular System), subsections 4 and 6). The index column is simply a way of reflecting the order for "implementations like (deposition) databases that do not use ordered containers for the data"; it would follow that a file that did not give the index values as increasing integers starting at 1 would be in error. And there is no mechanism for specifying chain heterogeneity, except by giving a separate chain for each alternative sequence.
The index column and the name 'index' came in as a fairly recent change on the suggestion of yourself and/or John Westbrook (I think you preferred 'index' to 'ordinal'). As I understand it, again, the combination of line ordering / index and 'linking' values specifies the connectivity of the sequence, and the sequence_code gives unique author-selected identifiers for the residues. There is no alternative identifier that would match the mmCIF label_seq_id, so the entire NEF file will use only the sequence_code.
To illustrate, the loop below would correspond to the peptide 'ACKLVD', with an extremely silly but perfectly legal choice of author sequence codes, and with an unlinked hydroxyproline sharing the same chain code. And, as you see, the index column can be inferred from the ordering.
loop_
_nef_sequence.index
_nef_sequence.chain_code
_nef_sequence.sequence_code
_nef_sequence.residue_name
_nef_sequence.linking
1 A 19 HYP single
2 A -1 ALA start
3 A 1 CYS .
4 A 2A LYS .
5 A 2B LEU .
6 A 6 VAL .
7 A 5 ASP end
stop_
To me this raises two questions:
* Is this sufficiently clear from the specification, or what should we do to clarify it?
* Can anyone propose a better way of doing this (and ultimately gather a consensus for it)?
Hi,
I think there was some confusion here. I prefer 'ordinal' if the idea is
to define that the order of the rows is important, for example defining
the residue sequence for a polymer. 'Index' would seem more appropriate
when simply clearly defining a list. The index can be very useful for
STAR file parsers, where data are expressed in a table like
presentation, but a line terminating control character is not defined in
the STAR specification. They also may be useful in providing a clear
primary key for a 'table' when moving data into and out of relational
databases.
The idea being that 'ordinal' preserves the order of the rows when
exchanging data and 'index' provides unique numbering for a list.
Cheers,
Eldon
On 2/22/17 10:21 AM, rhfogh wrote:
As I understand it, the ordering of the rows in the _sequence loop is
what determines the sequence (see specifications/Overview.md, section 3
(Molecular System), subsections 4 and 6). The index column is simply a
way of reflecting the order for "implementations like (deposition)
databases that do not use ordered containers for the data"; it would
follow that a file that did not give the index values as increasing
integers starting at 1 would be in error. And there is no mechanism for
specifying chain heterogeneity, except by giving a separate chain for
each alternative sequence. The index column and the name 'index' came in
as a fairly recent change on the suggestion of yourself and/or John
Westbrook (I think you preferred 'index' to 'ordinal'). As I understand
it, again, the combination of line ordering / index and 'linking' values
specifies the connectivity of the sequence, and the sequence_code gives
unique author-selected identifiers for the residues. There is /no/
alternative identifier that would match the mmCIF label_seq_id, so the
entire NEF file will use only the sequence_code. To illustrate, the loop
below would correspond to the peptide 'ACKLVD', with an extremely silly
but perfectly legal choice of author sequence codes, and with an
unlinked hydroxyproline sharing the same chain code. And, as you see,
the index column can be inferred from the ordering.
loop_
_nef_sequence.index
_nef_sequence.chain_code
_nef_sequence.sequence_code
_nef_sequence.residue_name
_nef_sequence.linking
1 A 19 HYP single
2 A -1 ALA start
3 A 1 CYS .
4 A 2A LYS .
5 A 2B LEU .
6 A 6 VAL .
7 A 5 ASP end
stop_
To me this raises two questions:
* Is this sufficiently clear from the specification, or what should we
do to clarify it?
* Can anyone propose a better way of doing this (and ultimately gather
a consensus for it)?
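As a worked illustration of the reading described in the quoted comment, the sketch below derives the peptide from row order alone, skipping the residue whose linking value is 'single'. The helper name `linked_sequence` and the small three-letter to one-letter table are assumptions made for this example; the rows are copied from the loop quoted above, and the sequence_code column is deliberately ignored when ordering.

```python
# One-letter codes for the residues that occur in the example only.
ONE_LETTER = {"ALA": "A", "CYS": "C", "LYS": "K",
              "LEU": "L", "VAL": "V", "ASP": "D"}

# Data rows of the example loop: index, chain_code, sequence_code,
# residue_name, linking.
ROWS = """\
1 A 19 HYP single
2 A -1 ALA start
3 A 1 CYS .
4 A 2A LYS .
5 A 2B LEU .
6 A 6 VAL .
7 A 5 ASP end
"""


def linked_sequence(text):
    """Return the one-letter sequence of the linked residues, in row order."""
    letters = []
    for line in text.splitlines():
        index, chain_code, sequence_code, residue_name, linking = line.split()
        if linking == "single":
            continue  # unlinked residue, not part of the polymer
        letters.append(ONE_LETTER[residue_name])
    return "".join(letters)


if __name__ == "__main__":
    print(linked_sequence(ROWS))  # prints ACKLVD
```

Note that sorting by sequence_code would give a different and wrong answer here (6 comes after 5, and 2A/2B are not integers), which is the point of the example.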
Hi,
Thinking about it, you are clearly right, Eldon. 'ordinal' would have been a better choice for the sequence loop, and if we had thought about that while we were making the change we should have put it in that way. Where we are now, changing to ordinal would be another version change, which would need its own pull request and consensus. If you like, you can set that up yourself and I will vote in favour (this is a collaborative project - it would actually be quite good if someone other than me set up some pull requests).
But for my part I think that the most urgent thing right now is to get people to implement the current format and get some test data from everybody. So I would leave the point till later, so that we could get people's attention and discuss and pass a number of accumulated changes in one go. When you come to it, it should not be too hard to get people to change a single string in their code, once they agree to it.
Yours,
Recently, I have seen test data with at least three mandatory elements missing:
the format_version,
the 'index' columns on restraints and peaks,
and
save_nef_chemical_shift_list
Clearly people cannot be prevented from (ab)using the format in whatever way they prefer, internally. The question, especially for the BMRB, is what will happen if any of these mandatory elements are missing.
The format_version is clearly necessary - otherwise we do not know which reader to use.
The index columns are less certain. Since they are in effect line numbers they can be inferred from the file.
Will the BMRB refuse to accept files without indices? Or should we reconsider whether they are mandatory?
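One way to picture the alternative to refusing such files is a regularisation step that adds the index when it is missing and checks it when it is present. The sketch below is only an illustration under assumptions: the function name `regularise_index` and the representation of a loop as a list of column names plus a list of rows of strings are invented here, and the check assumes that a valid index is exactly 1, 2, 3, ... in row order.

```python
def regularise_index(columns, rows):
    """Return (columns, rows) with an 'index' column numbered 1..n in row order.

    If the loop already has an index column, it is checked against the row
    order and a ValueError is raised on the first mismatch.
    """
    if "index" in columns:
        position = columns.index("index")
        for expected, row in enumerate(rows, start=1):
            if int(row[position]) != expected:
                raise ValueError(
                    f"row {expected} has index {row[position]!r}")
        return list(columns), [list(row) for row in rows]
    # No index column: generate it from the order of the rows.
    new_rows = [[str(number)] + list(row)
                for number, row in enumerate(rows, start=1)]
    return ["index"] + list(columns), new_rows


if __name__ == "__main__":
    columns = ["chain_code", "sequence_code", "residue_name"]
    rows = [["A", "19", "HYP"], ["A", "-1", "ALA"]]
    print(regularise_index(columns, rows))
```

A reader that runs every incoming loop through such a step could accept files with or without the column, at the cost of trusting the line order in the file.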
Finally, there is the chemical_shift_list, which is mandatory so that a list of the atoms used is available without having to dig them out of the restraint lists. The thing is that it is not mandatory that it be complete (that would be impossible to enforce), and people who do not have shifts will understandably be slow to add shift lists containing no useful shift values. Any comments?