document mapping between NMDC-style GFF3 and schema annotation component #184
Comments
Thanks so much, @cmungall. This is really helpful. Yes, col 9 is the annotation column. I have looked into the Python source code that is generated from the schema file and created a separate repository (microbiomedata/pynmdc) so that people can try out the code easily. You, @wdduncan, and @scanon have been added to this repo, among a few others. What I need help with is how to:
Thanks!
I was able to get the string "SO:0000316" from the SO-Ontologies but I still need to see an example of how to parse the correct mapping object to "type." Thanks, @cmungall @wdduncan.
@hubin-keio - all you need to do is fill in the SO ID for the type. Let me know if you want help mapping other types (will we have other types than CDS?). For 3, 'encodes' is the relationship between a feature like a CDS and the protein.
Thanks, @cmungall. I tried it using the schema (nmdc.py). Below is the code and error message:
nmdc_gf = schema.GenomeFeature(
ERROR: test_GenomeFeature (__main__.testMetadata)
Traceback (most recent call last):
Ah good catch, it looks like the schema was using a ControlledTermValue class which is a container for an OntologyClass. I'm fixing to just use OntologyClass directly. With my PR your code should work (the point of the container class is for biosample attributes, where every attribute assignment has specific provenance attached, and also allows storage of unnormalized string forms of structured values).
Aside: I didn't know you were using the python object model - great! But if you like you can just create json objects directly with Python dicts, the choice is yours.
Thanks. I just checked out branch issue-184-cv and it solved the problem of assigning a string value to "type." How do I assign the other features/properties to GenomeFeature objects, @cmungall? Using the example you provided above, these features are pasted below. I was thinking that using the Python object model generated from the schema may provide better data consistency. I can add separate functions to translate GenomeFeature objects to JSON.
Currently the schema doesn't support translation table or start_type. I don't see a use case for these in the immediate future so I suggest we proceed incrementally - don't include them in the json for now, and return to this later. For the source field, we have a provenance model (prov). If it's OK I'll return to describing how to fit the program used for prediction into this. We'll also want to record provenance on each individual functional annotation (whether it comes from prokka, an hmm, ...) but again I suggest returning to this. For all of the functional annotations, I suggest we use IDs with standardized identifiers.org / n2t.net prefixes. These are included in the yaml and also visible on the html docs. E.g. https://microbiomedata.github.io/nmdc-metadata/docs/OrthologyGroup
so rather than KO:K12960 we would use the more standard KEGG.ORTHOLOGY:K12960
Tip for aim3: all identifiers in NMDC should be resolvable via identifiers.org or n2t.net
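A minimal sketch of the prefix rewrite described above, assuming a simple lookup table. PREFIX_MAP and normalize_curie are hypothetical names; only the KO entry comes from this thread, and any other entries would need to be taken from the id_prefixes in the schema.

```python
# Hypothetical lookup from pipeline-style prefixes to registry-style prefixes.
# Only the KO entry is taken from this thread; extend from the schema's id_prefixes.
PREFIX_MAP = {
    "KO": "KEGG.ORTHOLOGY",
}


def normalize_curie(curie: str) -> str:
    """Rewrite the prefix of a prefix:local_id identifier if a replacement is known."""
    prefix, sep, local_id = curie.partition(":")
    if not sep:
        raise ValueError(f"not a prefixed identifier: {curie!r}")
    # Unknown prefixes pass through unchanged.
    return f"{PREFIX_MAP.get(prefix, prefix)}:{local_id}"


print(normalize_curie("KO:K12960"))
```

Unknown prefixes are passed through rather than rejected, so the converter keeps working while the table is still incomplete; a stricter version could raise instead.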
Is this correct? http://supfam.org/SUPERFAMILY/51338 gives a 404. Is it this? Or this?
Fixed annot schema to use OntologyClass rather than the ControlledTermValue holder object, see #184
On the call I volunteered me/@deepakunni3 to help @hubin-keio with the gff->json transform. To help, we would need some sample gff3 files. This is what I have from a random img analysis, is this representative?
@cmungall Here are a few of the 138661 lines of 1781_1000325_functional_annotation.gff (from /global/project/projectdirs/m3408/ficus/pipeline_products/1781_100325/annotation/ ). This is from one of the current Stegen metaG annotation workflow outputs.
@hubin-keio has provided examples here: https://github.com/microbiomedata/pynmdc/tree/main/src/nmdc/test_data
Can you (@cmungall) provide a complete JSON version of the original example (Ga0185794_41)? I have pulled 1000 lines of a gff file from an early run of the annotation workflow and it is available here: https://github.com/microbiomedata/pynmdc/tree/main/src/nmdc/test_data/MetaG_annotation I am still working on the converter. The unfinished version is here: I would like to see a standard JSON output example before finalizing the converter. Thanks.
@hubin-keio Quick observations:
For some examples of the JSON, see here:
I added Deepak's examples to the repo in the examples folder: https://github.com/microbiomedata/nmdc-metadata/tree/master/examples (note we also validate against all examples in this folder as unit tests and within github/travis CI)
Thanks for the comments. @deepakunni3, is your parser working? I have committed the last planned update before GSP this morning.
The "was_generated_by": "N/A" field is still there in your examples. Maybe you want to remove it in your code?
Yes, the "N/A" was a placeholder to remind us that this information is missing and needs to be incorporated. I will remove it from the script.
In the discussions in the Aim1_standards channel it was mentioned on 1/9: "yes, never use values like "N/A", always make it an explicit json null, or simply omit the key altogether." But I am fine with your parser solution as long as it is okay with Aims 1 and 3. Please post the location of your parser in the Aim 2 channel once it is done so that we can process the GFFs. Aim 3 needs the JSONs ready by this Friday (1/15).
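The "explicit null or omit the key" rule can be sketched as a small cleanup step run before serialization. This is only an illustration: drop_placeholders and PLACEHOLDERS are hypothetical names, and the subject and has_function keys in the sample record are assumed field names (only was_generated_by appears in this thread).

```python
import json

# Placeholder strings that should never reach the JSON output (assumed set).
PLACEHOLDERS = {"N/A"}


def drop_placeholders(record: dict, keep_null: bool = False) -> dict:
    """Return a copy of record with placeholder values omitted, or set to None."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str) and value.strip() in PLACEHOLDERS:
            if keep_null:
                cleaned[key] = None  # serialized by json.dumps as null
            continue  # otherwise the key is omitted altogether
        cleaned[key] = value
    return cleaned


annotation = {
    "subject": "nmdc:Ga0185794_41_48_1037",     # assumed field name
    "has_function": "KEGG.ORTHOLOGY:K12960",    # assumed field name
    "was_generated_by": "N/A",
}
print(json.dumps(drop_placeholders(annotation)))
print(json.dumps(drop_placeholders(annotation, keep_null=True)))
```

Omitting the key is the default here because it keeps the output smaller; keep_null=True is for cases where a consumer expects the key to be present.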
I am not sure what the expectation here is between your pynmdc converter vs my GFF3 converter. Perhaps we can talk more on the technical call today. Regarding the "N/A", thanks for clarifying. That makes sense. I can replace that with |
We will better document this in the schema (this answers @scanon's Q on the tech sync call)
The schema has inline docs detailing the mapping but we should provide a higher level guide. I will sketch it out in this ticket and then this can be turned into docs on the site. I'm doing this quickly so if anything is confusing, it's likely I made a mistake. I will also give examples in yaml but the json cognate should be obvious.
Example
This GFF line represents the output of structural annotation (the prediction of a CDS on sequence Ga0185794_41). This is given a protein ID (skolemized from reference + coordinates): Ga0185794_41_48_1037. Col9 represents the outputs of functional annotation. @scanon @hubin-keio do I have this right?
The core feature would be represented as an instance of
https://microbiomedata.github.io/nmdc-metadata/docs/GenomeFeature
so our initial object would look like:
Note all IDs are prefixed to conform to the NMDC identifiers standards doc
Note the 'encodes' field, to link to a GeneProduct (this will always be a Protein for the current pipeline, but in future we may have ncRNA annotations)
https://microbiomedata.github.io/nmdc-metadata/docs/GeneProduct
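As a rough illustration of the mapping from columns 1-8 to a GenomeFeature-like object, here is a sketch using a plain Python dict. Several things are assumptions rather than schema facts: the key names seqid, start, end, strand and phase, the "nmdc:" ID prefix, and the sample GFF line itself (only the sequence ID, the 48..1037 coordinates, the CDS type and SO:0000316 come from this thread). feature_from_gff_line is a hypothetical helper.

```python
# Only CDS is discussed in this thread; other feature types would need their own SO IDs.
SO_TYPE_IDS = {"CDS": "SO:0000316"}


def feature_from_gff_line(line: str, id_prefix: str = "nmdc:") -> dict:
    """Build a GenomeFeature-like dict from one GFF3 line (columns 1-8 only)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, _source, ftype, start, end, _score, strand, phase, _attrs = cols
    # Skolemized protein ID: reference sequence + coordinates.
    protein_id = f"{seqid}_{start}_{end}"
    feature = {
        "seqid": f"{id_prefix}{seqid}",
        "type": SO_TYPE_IDS[ftype],
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "encodes": f"{id_prefix}{protein_id}",
    }
    if phase != ".":  # omit the key rather than storing a placeholder
        feature["phase"] = int(phase)
    return feature


# Illustrative line only; the source, strand and phase values are made up.
sample = "Ga0185794_41\texample_source\tCDS\t48\t1037\t.\t-\t0\tID=Ga0185794_41_48_1037"
print(feature_from_gff_line(sample))
```

The source column is deliberately ignored here, since how the prediction program fits into the provenance model is still open.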
Currently the GeneProduct field is fairly bare, but in future additional fields could be added - e.g. AA seq. The GeneProduct is what functional annotations are attached to
https://microbiomedata.github.io/nmdc-metadata/docs/FunctionalAnnotation
Each ';'-separated section in col9 of the GFF would correspond to a separate annotation. The annotation links the gene product to the controlled term.
Minimally this looks like:
For people familiar with the GO annotation system, each entry here represents a line in GAF or GPAD format files.
There would be one entry for each annotation. Above I am only showing the KO annotation
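The col9 split can be sketched as follows. This is not the converter from either repo: the attribute key "ko", the annotation field names subject and has_function, and the sample col9 string are assumptions for illustration, and parse_attributes / annotations_from_col9 are hypothetical helpers.

```python
def parse_attributes(col9: str) -> dict:
    """Split a GFF3 attributes column into a key -> value dict."""
    attrs = {}
    for part in col9.strip().split(";"):
        if not part:
            continue  # tolerate a trailing ';'
        key, _, value = part.partition("=")
        attrs[key.strip()] = value.strip()
    return attrs


# Attribute keys assumed to carry controlled-term IDs; extend as needed.
ANNOTATION_KEYS = ("ko",)


def annotations_from_col9(col9: str, subject_id: str) -> list:
    """Emit one annotation dict per controlled term found in col9."""
    attrs = parse_attributes(col9)
    annotations = []
    for key in ANNOTATION_KEYS:
        if key not in attrs:
            continue
        # GFF3 allows several comma-separated values under one key.
        for term in attrs[key].split(","):
            annotations.append({"subject": subject_id, "has_function": term})
    return annotations


# Illustrative col9 value only.
col9 = "ID=Ga0185794_41_48_1037;product=hypothetical protein;ko=KO:K12960"
print(annotations_from_col9(col9, "nmdc:Ga0185794_41_48_1037"))
```

Keys that are not listed in ANNOTATION_KEYS (such as ID or product in the sample) are parsed but do not produce annotations.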
Note the string KO:K12960 is a key to a controlled term object. An example would be:
This is a minimal representation of the KEGG KO object. It can be linked to other controlled term objects. For example, it can include mappings to other systems (EC), parent/child hierarchies, links to pathways, etc., and these can all be traced to chemical entities (compounds). However, we will only support simple KO search for GSP, so I will not detail that here. Please refer to #176 for using pathway knowledge to implement more advanced search.
Annotations to other systems (e.g. Pfam) are handled analogously. Please see the id_prefixes field in the schema to see the canonical ID prefix for each system.
TODO: document how the feature connects to the metaG/T output
to be discussed