This repository has been archived by the owner on Apr 19, 2024. It is now read-only.

Extend schema to cover metabolomics output to enable search by compound #170

Closed
cmungall opened this issue Dec 5, 2020 · 3 comments


@cmungall
Contributor

cmungall commented Dec 5, 2020

C-MS based metabolomics workflow output examples from @corilo here:

https://drive.google.com/drive/u/1/folders/1_dHFvIK9PwJCKVJznwqWgqvfOIhTCkvx

Expected changes:

- Metadata JSON file: the structure will change after the adoption of the labels defined in AIM 1.
- DataTable CSV file: metabolite names will change (removing comments and extra fields), and more columns will be added to include the CAS number and KEGG compound ID.
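
As an illustration of how the added CAS and KEGG columns could support search by compound, here is a minimal sketch. The column names (`metabolite_name`, `cas`, `kegg_compound_id`) and the two sample rows are assumptions made for this example; they are not the actual DataTable layout.

```python
import csv
import io

# Hypothetical DataTable content; the real column names may differ.
SAMPLE_CSV = """metabolite_name,cas,kegg_compound_id
L-Alanine,56-41-7,C00041
D-Glucose,50-99-7,C00031
"""


def find_by_kegg(csv_text, kegg_id):
    """Return every row whose KEGG compound ID equals kegg_id."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["kegg_compound_id"] == kegg_id]


matches = find_by_kegg(SAMPLE_CSV, "C00041")
print([row["metabolite_name"] for row in matches])
```

Matching on a stable identifier such as the KEGG compound ID, rather than on the free-text metabolite name, avoids depending on how names are cleaned up.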

NOTE: for now, make comments on @dehays' Google Doc https://docs.google.com/document/d/15fga30d619WRxAUk8LyrojwIN1m89_K-sIrmUg4Y3tY/edit rather than commenting in this ticket. We will still use this ticket to track status.

As a first pass, we will not try to capture everything from a metabolomics workflow, just the aspects that are necessary for search.

@dehays
Contributor

dehays commented Dec 10, 2020

Yuri (@corilo) will produce a JSON document for a single execution of his workflow. We will then get feedback (Chris, Bill, David) and possibly iterate before continuing with the remaining metabolomics workflow instances. The resulting structure will then be used for lipidomics and OM workflow execution metadata.

@corilo
Member

corilo commented Dec 15, 2020

@dehays, @cmungall, @wdduncan
Here is the first draft of the JSON document. Please provide feedback on the structure, labels, and any missing data as required.

I forgot what "was_informed_by" is and what it should contain. Could you help?

https://drive.google.com/file/d/1BnP8q-iDQP2vmswN9v68WDQfNG4uDek5/view?usp=sharing

Thanks

@dehays
Contributor

dehays commented Dec 16, 2020

"has_input": [
            "emsl:sha256:37417faf2c1b07ef9c59868683e41577bb3a745128bdc88b6cc59e579b5b30d0"
        ]

This ID (the sha256 hash) doesn't match the ID of any instrument output EMSL has provided to NMDC. Those IDs use the dataset ID (which some parts of EMSL use and some parts do not) with a prefix of "output_", so they look like "emsl:output_500097". We can revisit how EMSL sets unique IDs, but the IDs need to be consistent so that there is a path back from analysis to sample and study.

"was_informed_by" is a relationship on an analysis execution activity (i.e. instance of running the metabolomics analysis workflow) that refers to the instrument run (OmicsProcessing) entity. Again, the ID currently looks like "emsl:500097" and uses the dataset ID of the instrument run.

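The ID conventions described above can be sketched as follows. The helper names and the record layout are hypothetical; only the "emsl:output_500097" and "emsl:500097" ID forms and the "has_input" / "was_informed_by" field names come from this thread.

```python
def instrument_output_id(dataset_id):
    """ID of the instrument output (data object), e.g. 'emsl:output_500097'."""
    return f"emsl:output_{dataset_id}"


def omics_processing_id(dataset_id):
    """ID of the instrument run (OmicsProcessing), e.g. 'emsl:500097'."""
    return f"emsl:{dataset_id}"


def link_analysis_to_run(dataset_id):
    """Build the two linking fields of an analysis activity record.

    Hypothetical helper: a real record would carry many more fields.
    """
    return {
        "has_input": [instrument_output_id(dataset_id)],
        "was_informed_by": omics_processing_id(dataset_id),
    }


record = link_analysis_to_run(500097)
print(record)
```

Deriving both IDs from the same dataset ID keeps them consistent, which is what provides the path back from analysis to sample and study.
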
The Metabolites object: I was expecting an array rather than an object, but a more important conversation is how this will relate to the structure @cmungall is describing in #176. I hope to understand this better after speaking with Chris on Wednesday.
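
To make the array-versus-object point concrete, here is a minimal sketch of converting an object keyed by metabolite name into an array of records. The draft JSON is not reproduced in this thread, so the keys and field names below are invented placeholders.

```python
def metabolites_to_array(metabolites_obj):
    """Turn {name: {fields...}} into [{"name": name, fields...}, ...]."""
    return [{"name": name, **fields} for name, fields in metabolites_obj.items()]


# Placeholder input; not taken from the actual draft document.
as_object = {
    "compound A": {"kegg_compound_id": "C00041"},
    "compound B": {"kegg_compound_id": "C00031"},
}

as_array = metabolites_to_array(as_object)
print(as_array)
```

An array of uniform records is generally easier to validate against a schema and to index for search than an object whose keys are data values.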

cmungall changed the title from "Extend schema to cover metabolomics output" to "Extend schema to cover metabolomics output to enable search by compound" on Dec 17, 2020