Best practices and informed opinions on Event and Occurrence ID's in aligned, Darwin Core data #261
Comments
Follow-ups, in the order they were sent on Slack, each prefixed with the person who sent it:

@ymgan: I feel you Emilio … I have had very long IDs too. This reminds me of the discussions on this thread: tdwg/dwc#491. Sorry that I don’t have any solutions for this.

@emiliom: Thanks for pointing to that discussion, @ymgan. It looks really helpful. I'm not expecting solutions here; just pearls of wisdom and a sense of what's been found to be most useful and practical.

@jdpye: My strong preference for meaningful, data-derived IDs comes from a few places: the history of practice at OTN for making meaningful ID fields, the tendency of researchers to use very generic internal IDs for the components of their studies, and my need to find and amend records throughout the pipeline when source data changes. Using UUIDs just to guarantee uniqueness feels like we are avoiding the work of defining the set of things that makes our record unique. There's no performance penalty for having long ID fields, and they save us far more often than they would ever hinder us as human operators. So it's true, I've never been convinced by the UUID advice. tdwg/dwc#491 (comment): this guy knows what's up.

@albenson-usgs: Yeah, no need for me to rehash what I already said in that DwC thread, but just to say that I think this is a topic that is still very fraught and unresolved. My preference at this particular time is for human-resolvable IDs, but I know that's not everyone's preference.

@timvdstap: For what it's worth, I'm on the same page as Abby and Jon!

@ymgan: +1 from me! If they have an occurrence table in their database and adding a UUID field is easy, then we go for that. However, there were times when data providers did not have an occurrence table in their database, but rather constructed the occurrence view by joining multiple tables. I couldn’t find a way to track this with UUIDs every time they updated the dataset. In this case, we asked our data provider to use the columns that are least likely to change (NOT institutionCode, because the institute could be renamed; NOT triplets) to create a composite identifier for occurrenceID. Not everyone’s preference either …
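For anyone wanting to see what the composite-identifier approach might look like in code, here is a minimal Python sketch. It is only an illustration, not what any data provider in this thread actually runs: the function name `composite_occurrence_id`, the `:` separator, and the column names `cruise`, `sample_id`, and `specimen_no` are all hypothetical stand-ins for "the columns least likely to change".

```python
def composite_occurrence_id(record, fields, sep=":"):
    """Join the values of (hopefully stable) columns into one occurrenceID.

    `fields` is the ordered list of column names to combine; which columns
    are stable enough is a per-dataset decision, not something this decides.
    """
    parts = []
    for name in fields:
        value = record.get(name)
        # Refuse to build an ID from an empty component: a missing value
        # would silently produce colliding or misleading identifiers.
        if value is None or str(value).strip() == "":
            raise ValueError(f"missing value for {name!r}")
        parts.append(str(value).strip().replace(" ", "_"))
    return sep.join(parts)


# Hypothetical row from a joined "occurrence view".
row = {"cruise": "HX2019", "sample_id": "S042", "specimen_no": 7}
print(composite_occurrence_id(row, ["cruise", "sample_id", "specimen_no"]))
# HX2019:S042:7
```

Because the ID is rebuilt from the data on every export, it stays the same across dataset updates as long as the chosen columns do not change.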
My follow-up after the input received. Thanks again for everyone's input! I've been trying to digest the input here and the discussions in tdwg/dwc#491. There are just too many relevant topics that come to mind, so I'll stop trying to compile "all" relevant threads and considerations, and will list what I have:
Alright, enough on ID's! I already have work to do to lay out how my data-alignment code will need to be changed to ensure the ID's I generate on the first version submitted to OBIS are reused in future data-update versions.
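One possible way to get IDs that are reused across data-update versions, sketched here in Python purely as an illustration (it is not the alignment code referred to above), is to derive a name-based UUID (version 5) from a stable natural key. The same key then always yields the same UUID, so nothing needs to be stored between versions. The namespace URL and the key format below are made-up placeholders.

```python
import uuid

# Hypothetical per-dataset namespace; any fixed URL or name would do,
# as long as it never changes between versions of the dataset.
DATASET_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/datasets/my-dataset")


def stable_occurrence_id(natural_key):
    """Return a deterministic UUID string for a record's natural key."""
    return str(uuid.uuid5(DATASET_NAMESPACE, natural_key))


# Assumed natural key: the sample ID plus a taxon label, joined by "|".
first_version = stable_occurrence_id("SMP-000123|Calanus")
later_version = stable_occurrence_id("SMP-000123|Calanus")
print(first_version == later_version)  # True
```

The trade-off is the same one discussed in the thread: the work of deciding which fields make a record unique still has to be done, and if that key changes, so does the ID. When random (version 4) UUIDs were already published, keeping a saved key-to-ID lookup table from the first submission would be the alternative.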
From @albenson-usgs: Great summary and resources Emilio! I would definitely advocate for putting this in the issues so the conversation can be found later.
No surprise here, but there was already an issue on this topic in this repo, from 2021: #80
Memory lanes upon memory lanes. Most of my advice from back then stands, and I'm particularly fond of my field-by-field explainer in #80 (comment)
We had missed this resource from the OBIS Manual on "Constructing and using identifier codes"! https://manual.obis.org/identifiers.html
This thread started on the Standardizing Marine Biological Data Slack on March 20, 2024. As it's of general interest, I'm moving it here so it's accessible to others more openly.
I'm curious to hear what heuristics or rules of thumb others are using to create ID's for the aligned data. I've settled on using UUID's for occurrences and semi-intelligible ID's for events. But even for events it gets a bit crazy because I'm using a hierarchical set of event types (cruise > station visit > sample) and have tried to include some of that hierarchy into the first two types, so ID's get long; for sampling events, the data generator uses unique sample ID's, so I've reused those. I also have used a dataset prefix for event ID's in a probably silly attempt to have the ID's be kind of globally unique or at least easily recognized as belonging to the same dataset. But that also leads to long ID's, and I'm not sure if it's worth it. Thoughts? I know @jdpye had thoughts on this b/c we exchanged a couple of messages on this Slack (now hidden) ...
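To make the scheme described above concrete, here is a rough Python sketch of it: a dataset prefix plus the hierarchy levels for cruise and station-visit event IDs, the data generator's own sample ID reused for sampling events, and a random UUID for each occurrence. The prefix, the separator, the example codes, and the `level` key are all invented for illustration; only `eventID`, `parentEventID`, and `occurrenceID` are Darwin Core terms.

```python
import uuid

DATASET_PREFIX = "mydataset"  # hypothetical dataset prefix


def build_event_id(*levels, prefix=DATASET_PREFIX, sep=":"):
    """Concatenate the dataset prefix and the hierarchy levels into an eventID."""
    return sep.join([prefix, *map(str, levels)])


cruise_id = build_event_id("CR2019")           # mydataset:CR2019
station_id = build_event_id("CR2019", "ST05")  # mydataset:CR2019:ST05
sample_id = "SMP-000123"                       # reused as-is from the data generator

# "level" is just a label for this sketch, not a Darwin Core term.
events = [
    {"eventID": cruise_id, "parentEventID": None, "level": "cruise"},
    {"eventID": station_id, "parentEventID": cruise_id, "level": "station visit"},
    {"eventID": sample_id, "parentEventID": station_id, "level": "sample"},
]

# Occurrences get an opaque random UUID and point at their sampling event.
occurrence = {"occurrenceID": str(uuid.uuid4()), "eventID": sample_id}
```

The ID length grows with every level folded into the string, which is the "ID's get long" problem; since `parentEventID` already records the hierarchy, the levels embedded in the ID are there for human readability rather than out of necessity.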