This repository has been archived by the owner on Apr 19, 2024. It is now read-only.

Define how the schema should be used to implement search based on metabolites, reactions, or pathways #176

Open
cmungall opened this issue Dec 11, 2020 · 5 comments

@cmungall
Contributor

cmungall commented Dec 11, 2020

[Image: annotation schema]

For orientation: the genome feature class corresponds to a main entry (one feature line) in a GFF3 file, and the link to a descriptor corresponds to column 9 (the attributes column).

See this doc for the source of the image

This shows a generalized schema for functional annotation in NMDC (todo: link to actual schema). It is neutral with respect to the annotation system used. The various subclasses of ControlledTerm cover different aspects of function, and different systems may cover them differently (see the courier font text to the side of each box). For example, KEGG has reactions, pathways, compounds, and links between them.

Very rough sketch of some user stories to help us think about how search would be implemented:

  • User is interested in ammonifying microbes in soil. They come to NMDC to find datasets or explore hypotheses.
    • Chemical search: They enter “NH3” as a search term and see all associated metaP/B/G/T datasets. Faceted search (from MIxS metadata) shows that these are 30% soil datasets, 20% ocean, and so on; they drill down on soil. They then see 80% metagenomic and 20% metaproteomic datasets, and drill down further.
      • The user’s term NH3 maps to equivalent IDs in KEGG.compound, CHEBI, …; if a metaB dataset is annotated with any of these, it is included.
      • The user’s term NH3 maps to equivalent IDs in KEGG.compound, CHEBI, …. These are linked from pathways and reactions in GO, KEGG, Rhea, and MetaCyc via substrate/product relations. These pathway/reaction IDs are used in metaP/G/T annotations (linked via protein IDs).
      • In both cases, the hierarchy needs to be used; e.g., data annotated to L-homocysteine will be returned by queries for homocysteine or amino acid.
    • Function search: They enter a function (reaction) term like “nitrogen fixation”, and see metaP/B/G/T datasets that are associated and drill down in the same way
      • The user’s search term maps to equivalent IDs in Rhea, KEGG (https://www.genome.jp/kegg-bin/show_module?M00175) and other databases.
      • We could potentially find the products (e.g., NH3) and return metaB datasets, but we would have to make clear why these were returned.
      • metaG/T/P queries would return gene products annotated to this or a descendant, and could link to datasets from here
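
A rough sketch of how the chemical-search story above might work, under stated assumptions: this is not the NMDC implementation, every identifier below is a placeholder rather than a real database ID, and the names `XREFS`, `PARENTS`, `descendants`, and `search_compound` are hypothetical. It shows the three pieces from the bullets: mapping the user's term to a canonical compound, expanding it through the hierarchy, and reaching datasets either directly (metaB) or through reactions that have the compound as a substrate or product (metaG/T/P). Each hit carries a reason, so the user can see why a dataset was returned.

```python
from collections import defaultdict

# Toy data only: every identifier is a made-up placeholder.
# Equivalent IDs from different systems, normalized to one canonical term.
XREFS = {
    "NH3": "ammonia",
    "kegg.compound:PLACEHOLDER_NH3": "ammonia",
    "chebi:PLACEHOLDER_NH3": "ammonia",
}

# Child -> parents (is_a) edges of a compound hierarchy.
PARENTS = {
    "L-homocysteine": ["homocysteine"],
    "homocysteine": ["amino acid"],
}

# Dataset -> compound terms it is annotated with (metaB-style).
DATASET_COMPOUNDS = {
    "metaB_dataset_1": {"L-homocysteine"},
    "metaB_dataset_2": {"ammonia"},
}

# Reaction -> compounds appearing as substrate or product.
REACTION_COMPOUNDS = {"rxn:toy_nitrogenase": {"ammonia"}}

# Dataset -> reaction terms its gene products are annotated with (metaG/T/P-style).
DATASET_REACTIONS = {"metaG_dataset_3": {"rxn:toy_nitrogenase"}}


def descendants(term, parents=PARENTS):
    """Return the term plus every term below it in the is_a hierarchy."""
    children = defaultdict(set)
    for child, term_parents in parents.items():
        for parent in term_parents:
            children[parent].add(child)
    seen, stack = {term}, [term]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def search_compound(query):
    """Map dataset -> reason it matched the compound query."""
    wanted = descendants(XREFS.get(query, query))
    hits = {}
    # Route 1: datasets annotated directly with the compound or a descendant.
    for dataset, terms in DATASET_COMPOUNDS.items():
        if terms & wanted:
            hits[dataset] = "compound annotation"
    # Route 2: datasets annotated with a reaction that involves the compound.
    linked = {r for r, cs in REACTION_COMPOUNDS.items() if cs & wanted}
    for dataset, reactions in DATASET_REACTIONS.items():
        if reactions & linked:
            hits.setdefault(dataset, "reaction with matching substrate/product")
    return hits
```

With this toy data, `search_compound("NH3")` returns the directly annotated metaB dataset and the metaG dataset reached through the reaction, while `search_compound("amino acid")` returns the dataset annotated to L-homocysteine via the hierarchy.
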
@cmungall self-assigned this Dec 11, 2020
@jeffbaumes

This is exactly the kind of picture I was thinking of, thanks. Makes sense that these relationships are independent of a particular system like KEGG.

Would Organic Matter Classification analysis also link to compound? It's my understanding that those analyses could include many more compounds with different IDs. Similar question for Lipidomics analysis. It's unclear whether all of these analysis types are targeted for Feb.

It would be good to start to understand the least-surprise joins that a user would expect upon search. So, for example, would searching by one Genome feature match any metaB analysis containing any Compounds linked to the Function Descriptor of the Genome feature? If you bounce around between joins enough (especially if they are many-to-many), you might start capturing much more in your search than you expected.
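
To make the fan-out concern concrete, here is a small sketch with entirely made-up links (the dictionaries and the `follow` helper are hypothetical, not part of any NMDC schema). It follows a chain of many-to-many joins from one genome feature and records how many items are matched after each hop.

```python
# Hypothetical many-to-many links; all names are illustrative placeholders.
FEATURE_TO_FUNCTIONS = {"feature_1": {"func_A", "func_B"}}
FUNCTION_TO_COMPOUNDS = {
    "func_A": {"water", "ammonia"},
    "func_B": {"water", "ATP"},
}
COMPOUND_TO_ANALYSES = {
    "water": {"metaB_1", "metaB_2", "metaB_3", "metaB_4"},
    "ammonia": {"metaB_2"},
    "ATP": {"metaB_3", "metaB_5"},
}


def follow(start, *hops):
    """Follow a chain of many-to-many maps from a set of start items.

    Returns the final matched set and the number of matches after each hop.
    """
    frontier = set(start)
    sizes = []
    for hop in hops:
        # Union of everything reachable from the current frontier in one join.
        frontier = set().union(*(hop.get(item, set()) for item in frontier))
        sizes.append(len(frontier))
    return frontier, sizes


analyses, sizes = follow(
    {"feature_1"},
    FEATURE_TO_FUNCTIONS,
    FUNCTION_TO_COMPOUNDS,
    COMPOUND_TO_ANALYSES,
)
```

In this toy example a single feature ends up matching every metaB analysis, and one widely shared compound accounts for most of them. One possible mitigation, offered only as an assumption, would be to limit the number of hops a search follows or to report which link produced each match.
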

@dehays
Contributor

dehays commented Dec 11, 2020

@jeffbaumes Yes, both organic matter characterization and lipidomics would follow the same path through that diagram as metabolomics, and Yuri intends to use the same structure for including compound terms in all three.

@kfagnan
Contributor

kfagnan commented Dec 11, 2020

This picture is really helpful, thanks, Chris!

For Feb we have discussed metaG, metaT (if we've got processed data), metaP, and metaB since we should have NMDC pipelines for each of those types.

For lipidomics and organic matter, we'd like to include the data (with guidance on which data would be useful), but only show the data as being connected to a study. If there are pipelines and appropriate annotation information, then functional links would be great. I don't think it's a priority for February.

@dehays
Contributor

dehays commented Dec 11, 2020

@jeffbaumes It looks like he is still working on it, but the document Chris linked has a user story section that you might find useful.

@cmungall
Contributor Author

@jeffbaumes @dehays don't worry about the linked document for now; I updated the first comment to include the relevant text.
