This repository has been archived by the owner on Apr 19, 2024. It is now read-only.

Which ID-like column from the Sample Metadata template's ProjectInformation tab? #384

Open
turbomam opened this issue Jul 26, 2021 · 3 comments

@turbomam
Member

turbomam commented Jul 26, 2021

This is one component of issue #375

The Sample Metadata Template has several columns that look like IDs to me, but I assume that only one or two will be filled for most rows

How do we decide which should fill the NMDC schema id slot?

  • EMSL Proposal/Study Number
  • GOLD Study ID
  • JGI Proposal ID
  • Umbrella Bio Project ID

See also
https://microbiomedata.github.io/nmdc-schema/Study.html#class-study
https://microbiomedata.github.io/nmdc-schema/id.html

Input from @wdduncan @dwinston or other welcome too!

@turbomam turbomam changed the title Which ID-like column from the template's ProjectInformation tab? Which ID-like column from the Sample Metadata template's ProjectInformation tab? Jul 26, 2021
@cmungall
Contributor

I suggest using the GOLD ID as the primary identifier.

The others should go into alternate_identifiers, and we also have specific fields for different databases:

#384

We don't have a field for emsl or jgi yet, but can add these.
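As a rough illustration of the split suggested above, here is a minimal sketch that takes one row from the ProjectInformation tab, uses the GOLD column as the primary ID, and collects every other non-empty ID into alternate_identifiers. The function name, the dict-shaped row, and the `gold:EXAMPLE` value are all hypothetical, and it assumes the cell values have already been normalized to prefixed identifiers.

```python
from typing import Dict, List, Optional, Tuple


def split_identifiers(
    row: Dict[str, str], primary_column: str = "GOLD Study ID"
) -> Tuple[Optional[str], List[str]]:
    """Return (primary_id, alternate_identifiers) for one template row."""
    # Treat empty or whitespace-only cells as missing.
    cleaned = {col: val.strip() for col, val in row.items() if val and val.strip()}
    primary = cleaned.get(primary_column)
    alternates = [val for col, val in cleaned.items() if col != primary_column]
    return primary, alternates


# Hypothetical row; "gold:EXAMPLE" is a placeholder, not a real GOLD study ID.
row = {
    "GOLD Study ID": "gold:EXAMPLE",
    "JGI Proposal ID": "",
    "Umbrella Bio Project ID": "bioproject:PRJNA594403",
}
primary, alternates = split_identifiers(row)
```

If the GOLD cell is empty, `primary` comes back as `None`, which leaves the choice of a fallback to the caller.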

@cmungall
Contributor

Looking at the spreadsheet:

Remember all identifiers used in NMDC must conform to

https://microbiomedata.github.io/nmdc-schema/identifiers

Umbrella Bio Project Name NCBI Accession: PRJNA594403                                                

This isn't a name. This is the INSDC bioproject identifier. The correct prefix for this is bioproject

E.g.
https://identifiers.org/bioproject:PRJNA594403
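To make the prefix/local-part structure concrete, here is a small sketch that turns a `prefix:local_id` string into the identifiers.org URL form shown above. The function name is made up for illustration, and it only checks the shape of the string, not whether the prefix is actually registered.

```python
def curie_to_url(curie: str) -> str:
    """Build an identifiers.org URL from a 'prefix:local_id' string."""
    # Split on the first colon only, so local parts may contain colons.
    prefix, sep, local_id = curie.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"expected 'prefix:local_id', got {curie!r}")
    return f"https://identifiers.org/{prefix}:{local_id}"


print(curie_to_url("bioproject:PRJNA594403"))
# prints https://identifiers.org/bioproject:PRJNA594403
```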

Umbrella Bio Project ID NCBI ID: 594403                                                

I don't think we should include this

JGI Proposal ID JGI:1781                                                

I don't believe this is registered in any prefix registry. If we want to include this, we should register it. I suspect it needs additional disambiguation in the ID, either in the prefix (e.g. jgi.proposal:1781) or the local part (e.g. JGI:proposal1781)

@dehays
Contributor

dehays commented Aug 4, 2021

Further comment on conforming identifiers:

'jgi' and 'jgi.proposal' are NOT registered CURIE prefixes. I would question whether NMDC needs to have any knowledge of JGI proposals. The supported identifiers would be GOLD study identifiers. (Which although not guaranteed to be 1:1 with JGI proposals, are usually 1:1. I believe GOLD studies can potentially span multiple JGI proposals - as proposals represent a funded unit of work.)

Similarly 'emsl' and 'uuid' are not registered CURIE prefixes so including those prefixes currently doesn't add any value. It would be good if 'emsl' was registered and that emsl prefixed identifiers were resolvable. 'uuid' describes an algorithm and not an identifier domain so it would not be registered. That doesn't mean that EMSL couldn't use one of the UUID algorithms to generate the local portion of the ID that is used within a valid FAIR identifier; i.e. emsl: where https://identifiers.org/emsl: resolved to the desired record.
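A check along these lines could flag unregistered prefixes before they get into the data. This is only a sketch: the function name is hypothetical, the `REGISTERED` set is an illustrative stand-in for a real registry lookup, and lowercasing the prefix before comparison is an assumption.

```python
from typing import Iterable, List, Set


def unregistered_prefixes(curies: Iterable[str], registered: Set[str]) -> List[str]:
    """Return the prefixes, in first-seen order, that are not in `registered`."""
    missing: List[str] = []
    for curie in curies:
        prefix, sep, _ = curie.partition(":")
        if not sep:
            raise ValueError(f"not a prefixed identifier: {curie!r}")
        prefix = prefix.lower()  # assumption: compare prefixes case-insensitively
        if prefix not in registered and prefix not in missing:
            missing.append(prefix)
    return missing


# Illustrative only; a real check would consult the prefix registry itself.
REGISTERED = {"bioproject"}
ids = ["bioproject:PRJNA594403", "JGI:1781", "jgi.proposal:1781"]
print(unregistered_prefixes(ids, REGISTERED))
# prints ['jgi', 'jgi.proposal']
```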


@cmungall recommends using the GOLD identifier as the primary identifier for NMDC sample, study and instrument process IDs.

However - these will not always exist. We already have samples that do not exist in GOLD. These currently have emsl: or igsn: identifiers as their primary ID. And of course, for new samples, there will be no GOLD, EMSL or IGSN identifier at all. So new NMDC identifiers would need to be created.

Also, in considering the proposal that NMDC sample identifiers be embedded in analysis identifiers, all samples in NMDC would need NMDC identifiers. These would be the primary ID within NMDC.
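One way to picture that rule: every sample ends up with an NMDC identifier in the id slot, and any GOLD, EMSL or IGSN identifier it already has moves to alternate_identifiers. The sketch below is hypothetical throughout. The `nmdc:` prefix, the `assign_nmdc_id` name, and the `mint` callable are assumptions, and the placeholder minter stands in for whatever ID service NMDC would actually use.

```python
from itertools import count
from typing import Callable, Dict, List


def assign_nmdc_id(existing_ids: List[str], mint: Callable[[], str]) -> Dict[str, object]:
    """Give a sample an NMDC primary ID; keep all other IDs as alternates."""
    # Reuse an NMDC ID if the sample already has one, otherwise mint a new one.
    nmdc_ids = [i for i in existing_ids if i.startswith("nmdc:")]
    primary = nmdc_ids[0] if nmdc_ids else mint()
    alternates = [i for i in existing_ids if i != primary]
    return {"id": primary, "alternate_identifiers": alternates}


# Placeholder minter for illustration only; not a real NMDC ID scheme.
_counter = count(1)


def fake_mint() -> str:
    return f"nmdc:placeholder-{next(_counter)}"


# "igsn:EXAMPLE" is a made-up value standing in for an existing primary ID.
record = assign_nmdc_id(["igsn:EXAMPLE"], fake_mint)
```

A brand-new sample with no external identifiers would simply be passed an empty list and come back with a minted ID and no alternates.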
