Add category to dataset #247
Comments
Thanks!
[Tim: Thanks - Added above!] |
Question: should 4. metagenomic (eDNA) be two separate categories? There is quite a difference in interpretation of these data, even though they are both "sequence based" @ManonGros, would you comment? [Tim Edited to add: I've split them above now, but will change again based on more comments] |
Machine observation seems like a sub category of Sampling Event. |
That's ok isn't it? Because it's multivalue a dataset can be marked as both or just sampling event, or perhaps there are cases where a machine observation would be appropriate where no real sampling protocol is used. |
This new category would be free text using the vocab server? Or are we trying to have all the categories defined? |
Revised: I'd now suggest the vocabulary server, as detailed later in this thread. |
Great! I love the idea!
[Tim: Edited with suggestions expressed here - thanks, you indeed understood what I intended!] Perhaps @thomasstjerne has some thoughts on this? |
Added Targeted species detection (PCR-based assays) |
Thanks @timrobertson100 for making me aware of the thread, very exciting. So far, I found eight likely independent variables that may determine the evidence / dataset type in GBIF. I need to meditate a bit more before presenting my views here, and happy to brainstorm / whiteboard a bit if people are available? |
Keeping track of this as well |
Hello all, I like the idea of sorting datasets and types of evidence, but I am not sure it is most attractive for users to do so using a single filter / vocabulary (though I understand the feasibility argument as put by Tim). I drew some mind maps but don't have time to add pictures here, so I am just typing them out for your consideration. I started by asking why users would need to sort datasets / types of evidence. It is a quick way to in/exclude types of data that matter for your cases based on how the evidence was generated and its properties. I came up with 8 independent variables that cut across the suggested categorization of the dataset and the basisOfRecord vocabulary as we have today. Note that I think the word independent is important here, though some of the combinations of 1-8 below are impossible in real life. I am using loose words to describe my thinking, this is not a vocabulary I am suggesting, and there are some unresolved overlaps:
Once again, this is just a capture of unfinished thoughts; it would be nice to brainstorm / whiteboard what good categorization would look like. I was thinking of slicing it up because e.g. 1, 7, and 13 in the original post can be simultaneously true. If these are tags and overlap is no problem, then fine. But if this is a strict filter, we may need more than one field to capture types of preservation vs. generating community vs. ways of generating vs. quantitativeness etc. Feel free to discard if out of scope. I also did not find the collection of BoR discussions, which is partly applicable here. |
I assume the categorisations would come from us (at least that's how it is at the moment for citizen science datasets) but it would be great if other people could help with the curation as well. Just something to keep in mind. For example, let's say that we ask Node managers to check the datasets tagged "citizen science". We want:
|
Looking at this issue: gbif/portal-feedback#3381, |
Thanks @ManonGros
That is what this was intended to be:
(Related is that Plazi just proposed |
+1 @dmitry for one-to-many and using keyword tags (instead of a 1:1 core record to category). Remember also that a "dataset" (as in Darwin-Core-archive-dataset) can be a mixed bag of "evidence records" (aka core records, e.g. occurrences) of different categories, which matters if a category "tag" is designed to apply to all core records in a DwC-A. The de-normalization of the "evidence records" (core records) also means that one cannot be certain which class a given property linked to a core record is intended to be linked to. |
I really like this idea. Certainly the ALA has users who want a very simple way to select groupings of records across data providers. The group I hear this request from most are curators/researchers who ‘just’ want museum or herbarium specimens. A couple of suggestions:
|
Thanks, @dagendresen. My thinking here was to try and decouple this from the class/basisOfRecord issues in Darwin Core to be able to react to reporting/user needs quickly (e.g. introduce a new tag for datasets). Acknowledging that there can be "mixed bag" datasets, my intuition is that most users would appreciate broad filtering to e.g. "omit records that originate from datasets tagged as eDNA" even if there were a few entries in there that might be of some interest, or to produce reports (e.g. growth charts) based on e.g. data originating from datasets tagged as private-sector related. Does this seem reasonable, please?
Thanks, @elywallis - I'll add your input to the list at the top now.
I believe that was the intention, yes. I don't know the details, but I'm aware the data management team is increasingly running reports on trends using categories like this. I'll add your comments in the top list, without proposing a final decision. |
Slightly off-topic, but perhaps useful: It may not be known to many, but GBIF is progressively moving vocabularies like this into our integrated vocabulary server. This will allow data managers (e.g. including node managers @dagendresen ) to be involved in defining the concepts. Concepts can be hierarchical (e.g. finer categorizations of private data) and once a vocabulary version is released, it is picked up in the data processing pipelines. This is still evolving, but LifeStage is in production now. What this means relating to this issue, is that as we find new requirements to categorise datasets for a new report or community we see emerging, we'll have the tools in place to accommodate that without needing software developer involvement (only requires a vocabulary to be changed, and then proceed with tagging datasets). |
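To illustrate why hierarchical concepts matter for filtering, here is a minimal sketch, not the vocabulary server's API. The concept names and the child-to-parent mapping are hypothetical, and the hierarchy is assumed to be acyclic. It shows that a dataset tagged with a finer concept can still match a filter on the broader one.

```python
# Hypothetical concept hierarchy (child -> parent). These names are
# illustrative only and are not taken from the GBIF vocabulary server.
PARENT = {
    "PrivateSectorEIA": "PrivateSector",
    "PrivateSectorMonitoring": "PrivateSector",
}


def ancestors(concept, parent=PARENT):
    """Return the concept followed by all of its ancestors, nearest first.

    Assumes the hierarchy contains no cycles.
    """
    chain = [concept]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def matches(dataset_tags, wanted, parent=PARENT):
    """True if any tag on the dataset is `wanted` or a descendant of it."""
    return any(wanted in ancestors(tag, parent) for tag in dataset_tags)
```

With this shape, adding a finer concept only requires a new entry in the mapping; existing filters on the parent concept keep working without code changes.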
@timrobertson100 I would (if asked) completely agree that best practice is to avoid "mixed bag" datasets and that a "tag" to enable filter for a "purpose-of-reuse" would be very useful and welcome! And believe we could live well with such functionality not applying 100% to "mixed bag" datasets :-) (apropos -- GBIF Norway is "negotiating" with Norwegian data publishers to "break" up "mixed bag" datasets into smaller datasets that would be more homogenous) |
@timrobertson100 wrote:
Tim, can you see my <happy dance!>? At some point, we need something, a talk from GBIF, a TDWG Webinar, about this effort. I think the broader community will find it very enlightening about how we can use the data we have to improve and understand the data. |
Maybe this relates to this category and could potentially be a subcategory, but it would be great to be able to categorize datasets from e.g. drones. Other remote sensing data, e.g. radar, sonar etc. could be subcategories as well. However, drones for example can have subcategories of their own, e.g. UAV, UAS and ROV etc. To keep it simple, should tracking data perhaps be a subcategory of machine observations? |
Are catch and release style data (e.g. bird ringing) considered to be "tracking", or identifying an individual by sight (e.g. whale fin)? I genuinely don't know if that is tracking or not, but they wouldn't be machine observation. |
Alternatively: should we consider a breakdown like this (sub-categories of machine observations, or others) rather as a separate controlled/proposed vocabulary to be used under "methodology"? I do not have a full understanding of user needs here, but there seems to be a difference in purpose between setting simple, intuitive filters ("not eDNA" or "just tracking data"), and the more specialized breakdowns that serve a user being particularly interested in, say, data collected via drones. In the first case, categorizing at ingestion to serve search filters would be supporting most cases adequately, where more specific queries may be better served by supporting structured keywording of methods used in data collection (including publisher / user guidance on tagging datasets for more detailed methodological approaches). |
The purpose here, if I understand correctly, is to support users to include/exclude particular content, based on how it was derived. In that sense: I would value the fact that some users may want to exclude known, repeated observations / loggings of one and the same individual over time higher than how these data were collected "technically". |
True, they would not be machine observations so there would need to be a separation of the two. |
At what point is GBIF diverging from TDWG standards? How can we do things as a community if we are developing vocabularies in silos? How will this fit with LatimerCore and eventually whatever MaterialSample standards come out of TDWG? Sigh. |
I've left a comment on tdwg/material-sample#29 but will also note here. I'm not sure there is a TDWG standard that would cover this, but terms from various vocabularies could be used (relating to LatimerCore, Darwin Core etc). It's really intended to provide the means to codify datasets to allow easy filtering of data and driving reports on data seen in GBIF. We're asked to report on counts by e.g. private sector data etc, which is probably more specific to the GBIF network than the kind of problems TDWG's current task groups cover. There is of course a large overlap between the GBIF and TDWG communities, and GBIF (staff and network) promotes, implements, and contributes to standards, so it could be that one might emerge from this, but it's not immediately obvious. |
Also relevant for publishers, e.g. private sector publishers: https://docs.gbif.org/private-sector-data-publishing/2.0/en/#table-01 |
I have added the vocabulary now as DatasetCategory on UAT with the following changes:
I have added comments in brackets as |
The issue's name is "Add category to dataset" and the vocabulary is called "DatasetCategory", but as I read it, it is a multi-value field at occurrence level. Maybe we should consider renaming the field and issue to reflect that? |
As I read it, the main aim is to be able to provide intuitive filters for the users of the data. That is important to keep in mind, so we do not make it over-complicated. I believe Data Products / Helpdesk must have an intuitive feeling (at least) for which types of data users most often wish to focus on / exclude, and that those categories are the ones now finding their way into the vocabulary. I have some suggestions/comments on those suggested (later...). |
private sector serves a user need much like the wish to be able to filter on thematic types of data like fresh water, health, marine this issue, where the wish is to either produce reports/growth charts OR delimit classical data types of e.g. habitat relevance. I believe it is wise to think about these needs in the same work here (not sure if they should be included in the same overall field). |
If I understand it correctly, the consensus is that this field (at occurrence level) eventually contains values that are assigned based on some rules upon ingestion, minimizing the need for manual interaction/curation. Some thoughts on this: should we have a first brainstorm/meeting on what such rules could look like, both at a general level and to check that we can actually establish rules for the categories that have been proposed already? And then start designing those rules for real. Some early thoughts/examples on what might be used for rules: simple info about known sources, e.g.:
content of selected fields, e.g.:
taxon belongs to a selected checklist
spatial rules
auto-labelling from data formatting tools and similar
Positive/negative lists based on manual curation/refinement (e.g. "no this is not citizen science although the rule suggests so" or "this IS citizen science although the rule suggests it is not") ...? And combinations of the above, including procedures like the Clustering Algorithm. Simpler rules are of course preferable, and could help refine the categories of the vocabulary? |
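The combination of ingestion rules with manually curated positive/negative lists could be sketched roughly as below. This is only an illustration under stated assumptions: the rule predicates, category names, dataset keys and metadata fields are all hypothetical, and the choice that a negative-list entry wins over everything else is one possible design, not an agreed one.

```python
from typing import Callable, Dict, Set

Dataset = Dict[str, str]

# Hypothetical rules: category -> predicate over dataset metadata.
RULES: Dict[str, Callable[[Dataset], bool]] = {
    "CitizenScience": lambda d: d.get("publisher") in {"iNaturalist"},
    "eDNA": lambda d: "metabarcoding" in d.get("description", "").lower(),
}

# Hypothetical manual curation lists, keyed by dataset key.
FORCE_INCLUDE: Dict[str, Set[str]] = {"ds-2": {"CitizenScience"}}
FORCE_EXCLUDE: Dict[str, Set[str]] = {"ds-1": {"CitizenScience"}}


def categorize(key: str, dataset: Dataset) -> Set[str]:
    """Apply the rules, then the curated lists; exclusions are applied last."""
    inferred = {cat for cat, rule in RULES.items() if rule(dataset)}
    included = inferred | FORCE_INCLUDE.get(key, set())
    return included - FORCE_EXCLUDE.get(key, set())
```

For example, `categorize("ds-1", {"publisher": "iNaturalist"})` returns an empty set, because the negative list overrides the rule ("no, this is not citizen science although the rule suggests so").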
The field will contain information at a record level about how the dataset was compiled so it is pointing to the dataset source in a way. However, most users will not access data on GBIF by downloading specific datasets, but rather query across datasets and this is why the information has to be at record level. The original proposal was to call the field |
We could maybe have concepts like Also, are the thematic area filter options more of internal relevance or of public relevance? The scope of these categories should be for external end-users, not for internal GBIFS relevance. |
Should we create a new issue for implementation and automated categorization perhaps @timrobertson100? |
OK, then I did misunderstand. If the values/categories have to be the same across all records in a dataset, then we can of course not use the same approach for "thematic data" which varies within datasets (e.g. rats are health relevant, but not all iNaturalist is health relevant. Brown Trout is fresh water but not all iNaturalist is fresh water, ....). Also ENA/INSDC datasets have a mixture of the categories of DNA-associated data, that would make it difficult to categorize at dataset level. I understand that most datasets are of a single category, but I am not sure if I understand why the category classification needs to refer to dataset level (again with the user in perspective). Some categories will only be possible to infer (from rules) by looking at the single occurrences anyway. |
All data yes, and the themes are of user relevance (also/primarily) |
Sorry for expanding the issue into the topic on making it operational. As I indicate, the attempt to design the rules may affect the actual delimitation of categories. But no need to mix in same issue, I guess. Sorry. |
The current Dataset has type and subtype which is slightly problematic.
Type is really indicating the row format used in the DwC-A and causes problems since a checklist can have occurrences, and an occurrence dataset can in fact be the output of sampling event data.
Better use of SubType may help, but I feel it could add to more confusion due to the overlap (e.g. an occurrence dataset with subtype sampling event).
Since the API is now so well used and changing this is disruptive, I propose to introduce a new multi-value field named category to categorize datasets. In time we can deprecate type and subtype.
The categories would include the likes of (edited to include suggestions that came in from chat below):
a. Consider separating out fossils as a separate category, to avoid accidental misuse
a. Consider adding tissue sample as well (which may or may not be sequenced) to aid discovery of preserved tissue without drawing on ambiguous other terms
a. Consider splitting this into finer categories (e.g. proponent data for environmental impact assessment prior to development) versus other categories (to be defined)
The multiple categories would be added to each occurrence record at indexing, allowing an intuitive filter to be added in GBIF.org so people can select on/off the dataset categories that interest them.
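As a rough sketch of what this could mean in practice (not the actual indexing pipeline): the dataset's categories are copied onto each occurrence record at indexing time, and a filter then includes or excludes records by category. The registry contents, the category names and the `datasetCategory` field name are assumptions made for illustration.

```python
# Hypothetical registry: dataset key -> categories assigned to that dataset.
DATASET_CATEGORIES = {
    "ds-a": ["eDNA", "SamplingEvent"],
    "ds-b": ["PreservedSpecimen"],
}


def index_occurrence(record, dataset_categories=DATASET_CATEGORIES):
    """Return a copy of the record with its dataset's categories attached."""
    categories = dataset_categories.get(record["datasetKey"], [])
    # "datasetCategory" is an assumed field name for this sketch.
    return {**record, "datasetCategory": list(categories)}


def filter_occurrences(records, include=None, exclude=None):
    """Keep records matching any `include` category and no `exclude` category.

    An empty or missing `include` means "no inclusion constraint".
    """
    include, exclude = set(include or ()), set(exclude or ())
    kept = []
    for record in records:
        cats = set(record.get("datasetCategory", ()))
        if include and not (cats & include):
            continue
        if cats & exclude:
            continue
        kept.append(record)
    return kept
```

Because the value is multi-valued, a record from a dataset tagged with both a sampling-event and an eDNA category would be dropped by `exclude={"eDNA"}` even though it also carries the other tag; datasets without any category pass an exclude-only filter.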
CC @ahahn-gbif @MortenHofft for comments in particular