Submission Portal/ DH: samp_name should be required to be unique #581

Closed
3 tasks
Tracked by #587
mslarae13 opened this issue Dec 30, 2022 · 18 comments
Assignees
Labels
identifiers Tickets associated with identifier updates or needs nmdc-schema-mixs-submission

Comments

@mslarae13
Contributor

mslarae13 commented Dec 30, 2022

Issue:
samp_name (first column, sample name) passes validation even when its values are not unique. Like source_mat_id (globally unique ID), samp_name values should be required to be unique within a submission.

Completion:
The validation rule for samp_name is changed to require that its values be unique.

  • Provide Mark with some valid and invalid examples - Montana
  • Implement validation rule changes into the submission portal - Mark
  • Present updates to the team - Montana
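
For illustration only, here is a minimal Python sketch of the check requested above: it reports every samp_name value that occurs more than once in a submission. The function name, the row layout (one dict per sample row), and the sample values are assumptions made for this example; this is not the submission portal's or DataHarmonizer's actual validation code.

```python
from collections import defaultdict


def find_duplicate_values(rows, column="samp_name"):
    """Return {value: [row indices]} for values that occur more than once.

    `rows` is assumed to be a list of dicts, one per sample row.
    """
    seen = defaultdict(list)
    for index, row in enumerate(rows):
        value = row.get(column)
        if value is None or str(value).strip() == "":
            # An empty cell is a "required" problem, not a uniqueness problem.
            continue
        seen[str(value).strip()].append(index)
    return {value: idxs for value, idxs in seen.items() if len(idxs) > 1}


# Hypothetical valid and invalid examples.
valid_rows = [{"samp_name": "soil_A"}, {"samp_name": "soil_B"}]
invalid_rows = [
    {"samp_name": "soil_A"},
    {"samp_name": "soil_B"},
    {"samp_name": "soil_A"},  # repeats row 0
]

print(find_duplicate_values(valid_rows))
print(find_duplicate_values(invalid_rows))
```

Returning the offending row indices, rather than a bare pass/fail, would let a spreadsheet-style editor highlight the duplicated cells.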
@mslarae13
Contributor Author

@turbomam
Is this a schema, sheets and friends, or server issue?

@mslarae13
Contributor Author

@ssarrafan & @turbomam can we put this in for sprint 2? Jan 3-13? as part of the submission portal squad deliverable "Align validation with expected values"

@turbomam
Member

turbomam commented Jan 4, 2023

I agree that samp_name should be unique, but I don't yet have a direct answer to whether this is a schema, sheets and friends, or server issue. We currently enforce uniqueness by applying an identifier: true assertion to the source_mat_id slot. Unfortunately, only one slot in a class can be the identifier.

@cmungall has encouraged me to use unique_keys to express multi-field uniqueness constraints but I have dragged my feet until now.

I'm not convinced that unique_keys will solve this problem from a pure LinkML perspective, and I'm even less confident that DataHarmonizer would pick up on the unique keys.

This is currently illegal:

source_mat_id    samp_name
1                a
1                b

If we set source_mat_id and samp_name together as a single unique key, I think that will actually become legal, because the two rows have different composite keys: (1, a) and (1, b).

What we really want is for each slot to be individually unique:

source_mat_id    samp_name
1                a
2                b

I will have to investigate some more.
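
The difference between one composite key and two individually unique slots can be sketched in plain Python. This only models the reasoning above; it does not call LinkML or DataHarmonizer, and the helper names are made up for the example.

```python
def composite_key_is_unique(rows, slots):
    """True if no two rows share the same combination of values for `slots`."""
    keys = [tuple(row[slot] for slot in slots) for row in rows]
    return len(keys) == len(set(keys))


def each_slot_is_unique(rows, slots):
    """True if every slot, taken on its own, has no repeated value."""
    return all(composite_key_is_unique(rows, [slot]) for slot in slots)


slots = ["source_mat_id", "samp_name"]

# The rows described above as currently illegal: source_mat_id repeats.
illegal = [
    {"source_mat_id": "1", "samp_name": "a"},
    {"source_mat_id": "1", "samp_name": "b"},
]

# The rows we want to accept: both slots are unique on their own.
wanted = [
    {"source_mat_id": "1", "samp_name": "a"},
    {"source_mat_id": "2", "samp_name": "b"},
]

# A composite key accepts the illegal rows because (1, a) != (1, b).
print(composite_key_is_unique(illegal, slots))
# Checking each slot separately rejects them.
print(each_slot_is_unique(illegal, slots))
print(each_slot_is_unique(wanted, slots))
```

Under this model, the stricter requirement amounts to treating each slot as its own single-slot key rather than combining both slots into one key.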

Q from @sujaypatil96: could we make the parent slot from a slot group the identifier or the unique key?

@ssarrafan
Collaborator

ssarrafan commented Jan 4, 2023

@ssarrafan & @turbomam can we put this in for sprint 2? Jan 3-13? as part of the submission portal squad deliverable "Align validation with expected values"

Sure. I will put it under to do. @mslarae13 who should this be assigned to?

@mslarae13
Contributor Author

@ssarrafan Assign to me for now. I've added a couple sub tasks to get this complete. Then Mark for actual implementation.

@mslarae13 mslarae13 self-assigned this Jan 5, 2023
@mslarae13 mslarae13 mentioned this issue Jan 6, 2023
99 tasks
@mslarae13
Contributor Author

mslarae13 commented Jan 6, 2023

This is difficult to accomplish. LinkML allows only one identifier slot per class, and that is already source_mat_id, which must also be unique.
LinkML does offer unique keys, but they aren't checked.
This is a longer-lived issue and belongs in the backlog.

@ssarrafan with that, I recommend taking this out of the SubPort squad & this sprint, putting it into backlog. Requires LinkML updates

@turbomam , @cmungall do you agree?

@ssarrafan
Collaborator

Removing from sprint, adding backlog label
@mslarae13

@ssarrafan ssarrafan added the backlog Issue not assigned to a sprint or not completed during a sprint. Needs to be reprioritized. label Jan 14, 2023
@mslarae13 mslarae13 moved this from 🔖 Ready to 📋 Backlog in SubPort Squad Issues Jan 27, 2023
@turbomam
Member

samp_name is now considered a unique key. @pkalita-lbl has updated DataHarmonizer to apply the uniqueness constraint, but that has not appeared in a DH release yet.

@turbomam turbomam added nmdc-schema-mixs-submission identifiers Tickets associated with identifier updates or needs labels Feb 16, 2023
@mslarae13
Contributor Author

The GUID (source_mat?) column should no longer be required. It's optional: add a GUID if you have a registered GUID. If you do not, you'll get an NMDC-minted ID.
So, we should be able to make samp_name the unique field.

@pkalita-lbl will this interrupt the multiple tabs? That relies on GUID right now, right?

@pkalita-lbl
Collaborator

will this interrupt the multiple tabs? That relies on GUID right now, right?

Yes, correct. We'll need to take that into account if we make that change.

@mslarae13
Contributor Author

When does the NMDC identifier get minted for biosamples?

  • source_mat_id should be optional
  • rename globally unique ID back to source material id
  • multi-tab code should point to samp_name
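
As a rough sketch of what the three items above could mean in practice, the Python below links rows across tabs by samp_name and treats source_mat_id as optional. The tab names, field names other than samp_name and source_mat_id, and the function itself are hypothetical; the real multi-tab code in nmdc-server may be structured quite differently.

```python
def link_tabs_by_samp_name(tabs):
    """Group rows from several tabs under their shared samp_name.

    `tabs` is assumed to map a tab name to a list of row dicts.
    Returns {samp_name: {tab_name: row}}.
    """
    linked = {}
    for tab_name, rows in tabs.items():
        for row in rows:
            name = row["samp_name"]  # required: this is the linking key
            per_sample = linked.setdefault(name, {})
            if tab_name in per_sample:
                # samp_name must be unique within a tab for linking to work.
                raise ValueError(
                    f"duplicate samp_name {name!r} in tab {tab_name!r}"
                )
            per_sample[tab_name] = row
    return linked


# Hypothetical tabs. source_mat_id is present only when the submitter
# already has a registered identifier.
tabs = {
    "environment": [
        {"samp_name": "s1", "source_mat_id": "IGSN:XYZ"},
        {"samp_name": "s2"},
    ],
    "sequencing": [
        {"samp_name": "s1", "note": "metagenome"},
    ],
}

linked = link_tabs_by_samp_name(tabs)
print(sorted(linked["s1"]))
print(sorted(linked["s2"]))
```

Because the link depends only on samp_name, a row without a source_mat_id can still be matched across tabs, which is why the uniqueness of samp_name matters here.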

@mslarae13
Contributor Author

The NMDC biosample identifier minting task is separate. Is it captured in an issue?
It concerns the submission portal to Mongo connection.

@pkalita-lbl
Collaborator

Schema changes are in. Once we release a new version of submission-schema we'll need to bring it into nmdc-server and make the required code changes over there.

@ssarrafan
Collaborator

Moving to next sprint per @pkalita-lbl

@ssarrafan ssarrafan removed the backlog Issue not assigned to a sprint or not completed during a sprint. Needs to be reprioritized. label May 19, 2023
@mslarae13
Contributor Author

@pkalita-lbl should this be moved to in review? Or do we need another issue for the next step?

@pkalita-lbl
Collaborator

I'd rather not make a new issue since there's a lot of good context here. Paused or Blocked pending a new submission-schema release is probably a more accurate status than In Review.

@pkalita-lbl
Collaborator

This is still paused waiting for a new submission-schema release.

@pkalita-lbl
Collaborator

pkalita-lbl commented Jun 14, 2023

These changes are now available on data-dev.microbiomedata.org. PR: microbiomedata/nmdc-server#977

@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in SubPort Squad Issues Jun 14, 2023
Projects
Status: ✅ SubPort 1 - Done