GBIF vs. LatimerCore vs. MaterialSample #29

Closed
Jegelewicz opened this issue Nov 22, 2022 · 22 comments

Comments

@Jegelewicz
Collaborator

I am concerned that GBIF is going to do whatever it wants and we are wasting our time - gbif/registry#247

How can we get everyone pulling in the same direction?

@albenson-usgs

albenson-usgs commented Nov 22, 2022

To be fair that ticket pre-dates this task team so I think it's a bit strange to say that "GBIF is going to do whatever it wants". Seems to me like this is an opportunity for harmonization across multiple idea spaces on this topic 🤷

@Jegelewicz
Collaborator Author

But GBIF has motivations and if they decide to do what they are proposing, it sorta doesn't matter what we do? I guess I am tired of working on this thing that seems like others have already worked on and I am just now seeing it (even though it was 2 or 14 years ago). And while I agree about harmonization - I am not being paid for any of this, but GBIF people are. I feel it is quite likely that they can make a decision and move on before we can even agree if we are talking about MaterialSample or PhysicalResource. I guess I am wondering if we really need this task group at this point?

@timrobertson100
Member

timrobertson100 commented Nov 22, 2022

Obviously, I'll have a biased view as someone working at GBIF but since I collected the original info for that issue I'll reply.

gbif/registry#247 is specifically about categorizing the various different types of datasets that are registered within the GBIF network. The motivation is to respond to a desire from users to be able to filter out certain types of data, and to simplify the many summary data reports asked of us - e.g. things like "what is the growth of data published by the private sector in my country that relates to eDNA studies compared to...". Those kinds of questions are fairly specific to GBIF simply because of the diversity of datasets shared in the GBIF network and the variety of people scrutinizing it. As the vocabulary matures I guess it might become a classification scheme for datasets that people may want TDWG to standardize, but I'm not sure I'd see it as an obvious candidate. It's really more a flexible set of tags to codify datasets so that the GBIF aggregation of data can be subset in different ways.

The MaterialSample group, as I understand it, is focused more on hardening up the terms and vocabularies needed to document material entities, their provenance and their relationships with other kinds of entities. LatimerCore is about describing collections, which has some overlap but is more focused on aggregate metrics than the details of individual Material records.

I guess I am wondering if we really need this task group at this point?

I would say yes. The new model exploration, which will allow us to break away from the constraints of the star schema in the DwC-A, will need material-specific terms, such as a material type vocabulary. Those seem likely to be important for many groups beyond GBIF's wishes, so it makes sense to standardize them in TDWG.

@Jegelewicz
Collaborator Author

@timrobertson100 thanks for the reply, but

categorizing the various different types of datasets that are registered within the GBIF network

Isn't that part of what LatimerCore is supposed to be doing? In my view, a dataset and a collection are equivalent when it comes to museum stuff. My fear is that the expectations of GBIF publishing and LatimerCore collection descriptions will diverge from the beginning, once again resulting in the need for collections to create multiple kinds of metadata in order to meet everyone's needs.

LatimerCore is much more than describing collections in my view and I am just feeling overwhelmed with the expectations that may be placed on individuals struggling to create and manage data over the long term. Everything is starting to feel too complex and disjointed. Probably I am just having a bad day, but I cannot seem to get all of the cats in one room at any given time and I am beginning to despair.

@Jegelewicz
Collaborator Author

material type vocabulary

Many of these terms do that - https://github.com/tdwg/cd/labels/Class%3AObjectGroup

@albenson-usgs

albenson-usgs commented Nov 22, 2022

Many of these terms do that - https://github.com/tdwg/cd/labels/Class%3AObjectGroup

Sort of! As I outlined at our last meeting, I do not feel these cover material samples that are NOT part of a collection very well, such as eDNA water or soil samples that are discarded after being processed and not stored in a physical collection.

This is what I get concerned about when parts of the community home in on one particular piece without considering the larger community that would have to shoehorn themselves into something that wasn't developed with that particular use case in mind.

@timrobertson100
Member

timrobertson100 commented Nov 22, 2022

Isn't that part of what LatimerCore is supposed to be doing?

Yes, although I understand it to be more focused on data originating from Natural History Collections than on a holistic view of all biodiversity-related open data projects. If not, GBIF would certainly promote and use it, but my guess is it will inform part of what GBIF seeks. (edited to add: GBIF anticipates using Latimer Core within GRSciColl for collection descriptions as well, and helping promote that, producing tools to support it, training, etc.)

I am sorry you are feeling overwhelmed and feel things are disjointed. My feeling is that this group is the BEST place to help address the issue that much of the Material data currently shared using Darwin Core is poorly formatted as Occurrence data, and we can help address some of that. In my mind that involves the following (some of which is underway, of course):

  1. By reviewing the use of the existing DwC-A format for documenting Material records. This might involve creating a new core type for Material and guidelines for its use (a rough sketch of what such a record might look like follows below this list).
  2. By reviewing the fields in DwC and proposing changes or additions to fields and vocabularies
  3. By contributing to the design of a new data publishing model more suited to Material (and their various relationships) that is not constrained by the star-schema of the DwC-A. My biased self believes the work GBIF has begun on this could be a good starting point (which of course you have influenced greatly @Jegelewicz )
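
To make point 1 a little more concrete, here is a rough, purely illustrative sketch of what a row-based Material record could look like if it were serialized like a DwC-A data file. It reuses existing Darwin Core terms where they exist (materialSampleID, eventID, preparations, disposition); "materialType" is an assumed placeholder for whatever a future material type vocabulary might standardize, not an existing term.

```python
import csv

# Hypothetical sketch only: two Material rows written as a DwC-A-style data file.
# Column names marked "assumed" are illustrative, not ratified terms.
rows = [
    {
        "materialSampleID": "urn:uuid:example-0001",  # identifier of the physical thing
        "eventID": "field-event-2022-11-22-01",       # link to the collecting event
        "materialType": "WholeOrganism",               # assumed vocabulary value
        "preparations": "pinned",
        "disposition": "in collection",
    },
    {
        "materialSampleID": "urn:uuid:example-0002",
        "eventID": "field-event-2022-11-22-01",
        "materialType": "EnvironmentalSample",         # e.g. an eDNA water sample
        "preparations": "filtered water",
        "disposition": "consumed",                      # discarded after processing
    },
]

with open("material.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
```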

@Jegelewicz
Collaborator Author

All of that makes sense to me, BUT I do think that the "object stuff" that is describing WHAT is held by a collection in LatimerCore would ideally be a roll-up of each THING the collection holds. If that is true, then each THING needs to include the terms used in LatimerCore if one expects to get good information at the collection level (I don't think the LatimerCore people see it that way, but that is how I would figure out what to put in some of their terms). Likewise, if a dataset is to include CATEGORY, then each THING in the dataset should include the CATEGORY (which is, I think, what is proposed in the GBIF issue that started this).

I definitely see an overlap between the CATEGORY stuff and the object description stuff and I want to ensure that collections don't have to add extraneous numbers of terms in order to publish usable data. I hope that makes sense, because right now it looks like we could be creating things that are sort-of overlapping "categorization" terms that EVERY record published will need to carry. I don't think we can afford that!

@RogerBurkhalter

@Jegelewicz I think the discussion is healthy but avoids a serious division in the data (as I see it). Natural History collections hold vouchered entities. I see vouchered entities (my bias) as a "gold standard" because they are held in perpetuity and are not fleeting observations or based on discarded material. As such, the voucher is always available for study and re-study. I do not mean to minimize non-vouchered material, it certainly has and retains scientific value. Perhaps a simple addition making clear that an entity is vouchered or not is in order. I understand that not all of the recorded data at a natural history museum is vouchered in that some collections may have machine or human observational data, non-vouchered molecular data, or eDNA data. These data, which I understand dominate the GBIF dataset, are mostly immutable because no true vouchers exist.

In the case of vouchered entities, the data is a proxy of an object stored on a shelf or cabinet; as revisions occur to the understanding of a species, genus, family or whatever classification, those data can be updated with an examination of the actual entity and modified to reflect those opinions. Data based on vouchered entities reflect current knowledge of our biological and paleontological world. Non-vouchered samples may be better for ecological, population dynamics or environmental studies, which are the main data environmental researchers want. Flagging vouchered vs. non-vouchered may be advantageous to each respective group.

@stanblum
Member

Just quickly (I hope); a few points:

I don't think our deliberations have been pointless or misdirected, given what is going on in other Interest or Task Groups, or GBIF, or any other organization. I do think we tend to get distracted by issues that aren't that important. (To me the "sample" versus "object" versus "specimen" debate is one of those discussions.) If our recommendations are well-formed and easily understood, they will be ratified and followed by the rest of our community. (Though some reconciliation might be needed.)

I would like to see us stay focused on the main issue(s) at hand: what are the major categories of physical stuff (aka materialSamples) that we use in biodiversity science or other kinds of natural science (yes, we are making an effort here to admit geo, physical anthro, and archaeological collections; and definitely NOT excluding environmental samples that are being used in metagenomics). Our goal should be to have about 10-30 higher categories, and let the narrower disciplines within our larger community define the next level of categories. Also note that no one should be labeling their material as a "materialSample". Rather, everyone should use the most appropriate, lowest-level category.

The nature of the material and its existence in a collection are distinct concepts. I think we should look to another field/property to indicate the existence of material in a collection (e.g., disposition). An environmental sample processed and discarded and an environmental sample deposited in a collection are both environmental samples. One is consumed, the other is preserved in some way and "accessioned." Both samples, or at least their identifiers, serve as the respective focal points for linking together downstream data. The fact that one sample can't be studied further (disposition = consumed/discarded) doesn't invalidate what it was.

@jbstatgen

@timrobertson100 Your posts in this thread (one and two) yesterday, as well as your answer on the GBIF thread make me wonder.

A lot of development has happened within the Material Sample group over the past two months regarding terms that provide more detailed information along different dimensions of physical entities. It's ok for GBIF staff not to have caught up yet; none of the staff and associates whose work areas overlap with the Material Sample topics are contributing regularly. Currently, as a Material Sample group, we are starting to discover more connections and overlap with quite a few fellow standards groups and processes. Generally, when we get in touch with them, or they with us, there seems to be a shared understanding of the complexity of reality, as well as of the processes in a standards community maintained and driven by volunteers, leading to a response of "Let's talk" and see what we can do so that we are or get aligned.

Reality certainly is more complex, though to phrase it rather pointedly: your responses seem to suggest that for you there isn't a world (at eye level) outside of GBIF. Thus, you might have lost touch with the community and not be aware of this situation at all. Instead of a response of "let's get together, update each other and align", it's one of: as soon as GBIF staff get around to the matter, they will (finally) provide solutions to the insufficient, outdated standards TDWG produces (yes, the community is allowed to be grateful). Sure, you have taken and will take advantage of any good ideas the community might have, though, really, the great minds and breakthrough ideas are at GBIF.

I'm exaggerating, though these are part of the vibes that arrive on my side from your communication. These impressions aren't specific to the situation at hand, but rather something that I'm encountering across the bench in interactions with GBIF staff and members. Yet, given that we face a biodiversity, climate, you-name-it crisis, it's important that we find ways that enable us to work together.

@cboelling
Member

All of that makes sense to me, BUT I do think that the "object stuff" that is describing WHAT is held by a collection in LatimerCore would ideally be a roll-up of each THING the collection holds. If that is true, then each THING needs to include the terms used in LatimerCore if one expects to get good information at the collection level (I don't think the LatimerCore people see it that way, but that is how I would figure out what to put in some of their terms). Likewise, if a dataset is to include CATEGORY, then each THING in the dataset should include the CATEGORY (which is, I think, what is proposed in the GBIF issue that started this).

I definitely see an overlap between the CATEGORY stuff and the object description stuff and I want to ensure that collections don't have to add extraneous numbers of terms in order to publish usable data. I hope that makes sense, because right now it looks like we could be creating things that are sort-of overlapping "categorization" terms that EVERY record published will need to carry. I don't think we can afford that!

If I understand the above argument correctly, then I fully agree that those parts of a description of a collection that describe what kind(s) of objects make up the collection should be consistent with, or, one could also argue, should emerge from, the description of the objects at the individual object level itself - possibly even using the same terms where this makes sense. To me, this would be a natural consequence of the fact that a collection basically is the sum of the objects it contains. This approach would also have attractive consequences for implementing corresponding descriptive standards in collection management software: the description of a collection could be generated from the descriptions of the objects in it, rather than having to be created independently, saving time and resources, and ensuring congruence between object descriptions and collection descriptions.
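
As a minimal sketch of that idea - with hypothetical object records, and field names ("category", "discipline") that are purely illustrative rather than terms from any standard - a collection-level description could be derived mechanically from the object-level records:

```python
from collections import Counter

# Hypothetical object-level records; field names are assumed for illustration.
objects = [
    {"objectID": "obj-001", "category": "Fossil", "discipline": "Paleontology"},
    {"objectID": "obj-002", "category": "Fossil", "discipline": "Paleontology"},
    {"objectID": "obj-003", "category": "RockSample", "discipline": "Geology"},
]

def roll_up(objects):
    """Derive a collection-level description from the objects it contains."""
    return {
        "objectCount": len(objects),
        "categories": dict(Counter(o["category"] for o in objects)),
        "disciplines": sorted({o["discipline"] for o in objects}),
    }

print(roll_up(objects))
# {'objectCount': 3, 'categories': {'Fossil': 2, 'RockSample': 1}, 'disciplines': ['Geology', 'Paleontology']}
```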

To me this means that this task group should definitely consider what is being proposed in LatimerCore and other efforts for standardizing collection description. If, as a result, it turns out that there are already object categories / properties, or entire schemes, that are useful at the individual object level, by all means, let's incorporate those. This would be a welcome step towards synchronizing both levels of description, with palpable benefits for implementation and use of these description schemes as described above.

On a side note, this of course doesn't, as has been hinted at already in other comments, prohibit this task group (or an individual adopter, for that matter) from adding further categories, especially if these are subcategories of a given category. Such additional categories do not invalidate the rest of the description based on a given standard.

@baskaufs

I agree with many of the previous comments, but would say that what @cboelling said above states my viewpoint almost exactly.

@baskaufs

Please note this proposal: tdwg/dwc#421, which may help us to add clarity to the discussion.

@jbstatgen

As I understand it, Latimer Core has the perspective that ideally information from the individual objects or entities within a "collection" should "roll up", as @Jegelewicz described it, to form the descriptions at the ObjectGroup level, i.e. of the collection. Since @essvee provided their thumbs-up to what @cboelling wrote, this perspective seems to have wider support in the Latimer Core group.

Emergence, referred to by @cboelling, describes a slightly different process in the transition from one scale to a larger one. Here, the whole "is larger" than the sum of its parts; see the en.wikipedia entry for "Emergence".

Looking for examples, emergent properties in the context of moving from objects to groups often seem to be quality states: e.g. a complete set of bones of a whale individual, the best-preserved collection of ..., the most important, the only, a full, ...

Taking this a step further: "From the complete set of ice cores from drill site x on the Greenland ice sheet emerges a climate reconstruction covering the past y millennia." or "The aggregate of fossilized organisms on this slab of rock represents an aquatic ecosystem."

LtC currently offers two terms within the class ltc:ObjectGroup to capture these emergent properties: ltc:ObjectGroup.description and ltc:ObjectGroup.thematicFocus (not part of version 1). Alternatively, the classes ltc:ObjectClassification and ltc:CollectionStatusHistory can be used to describe emergent characteristics of ltc:ObjectGroups.
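
As a rough, hypothetical sketch only - the shapes below are assumptions for illustration and not the normative Latimer Core serialization - such emergent, group-level statements might sit alongside roll-up metrics like this:

```python
# Hypothetical shapes only; not the normative Latimer Core serialization.
object_group = {
    # Metrics derivable by rolling up the individual object records:
    "objectCount": 1200,
    # Emergent, group-level statements that cannot be derived by rolling up:
    "ltc:ObjectGroup.description": (
        "Complete set of ice cores from drill site x, supporting a climate "
        "reconstruction covering the past y millennia."
    ),
    "ltc:ObjectGroup.thematicFocus": "paleoclimate",  # not part of version 1
}
print(object_group["ltc:ObjectGroup.description"])
```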

@smrgeoinfo

smrgeoinfo commented Nov 28, 2022

I'd suggest that the categorization of a collection should be consistent with the kind of objects in the collection, but recognize that a collection (as I understand it) is a contingent grouping of things for some purpose. Speaking as a geologist, a collection supporting the interpretation of some past aquatic ecosystem might include not only fossils, but also rock samples used to interpret paleoclimate, paleolatitude, sediment sources, depositional environment, etc. Thus the categorization of individual sample objects should be consistent with the intention of a collection, but categorization of collections can be (not always!) orthogonal to categorization of individual samples.

@jbstatgen

Hi @smrgeoinfo , Thanks a lot for your thoughts.

Thus the categorization of individual sample objects should be consistent with the intention of a collection, but categorization of collections can be (not always!) orthogonal to categorization of individual samples.

Orthogonal means independent, correct?

Not sure that I understand everything, though I think I agree with you. Here is an example that tries to pick up on what you wrote. Please let me know if that is what you had in mind.

A multidisciplinary team works at a field site. The paleontologists bring the collected fossils to their museum and deposit them in the paleontological collection. Digitized, they become part of the LtC record of the museum's paleo ltc:ObjectGroup. The geologists deposit their sedimentary rock samples in their institution's scientific collection, which expands the associated ltc:ObjectGroup focused on the institution's geological samples. The geochemists take samples at the field site and process them for all kinds of analyses, e.g. isotopes. However, as @albenson-usgs pointed out, they don't store vouchers. They also produce an ltc:ObjectGroup; however, that one represents only a group of digital information artifacts (the results of the analyses), which are anchored by material entities that no longer exist, though their metadata are by now fully virtually represented.

In addition, LtC allows the construction of an ltc:ObjectGroup that runs "orthogonal" to all of the previous collections. It brings together all the materials/samples collected at the field site and their associated analyses. This ltc:ObjectGroup is a virtual representation of the work at the field site.

That virtual ObjectGroup for the field site might itself become part of (several) much larger collections underlying reconstructions of paleo-ecosystems, paleo-climate, etc.
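
A minimal sketch of that "orthogonal" grouping, with assumed field names (holdingCollection, fieldSite) used purely for illustration: the same sample records can be grouped by the institution that holds them or, independently, by the field site they came from.

```python
# Illustrative records only; "holdingCollection" and "fieldSite" are assumed field names.
samples = [
    {"id": "s1", "kind": "fossil", "holdingCollection": "Museum paleo collection", "fieldSite": "Site X"},
    {"id": "s2", "kind": "sedimentary rock", "holdingCollection": "Institute geology collection", "fieldSite": "Site X"},
    {"id": "s3", "kind": "geochemistry result", "holdingCollection": None, "fieldSite": "Site X"},  # voucher consumed
]

def group_by(records, key):
    """Group record ids by the value of one field."""
    groups = {}
    for r in records:
        groups.setdefault(r[key], []).append(r["id"])
    return groups

print(group_by(samples, "holdingCollection"))  # the institutional ObjectGroups
print(group_by(samples, "fieldSite"))          # the "orthogonal" field-site ObjectGroup
```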

@smrgeoinfo

@jbstatgen yes, I think we're on the same wavelength. I think your analysis highlights an interesting issue -- some ltc:ObjectGroups (e.g. the fossil collection, the geologist's sedimentologic samples...) are homogeneous: the members are all the same object type. On the other hand, the virtual ObjectGroup for the field site would include samples of various sorts (fossils, rock types, digital observation results). Do these kinds of ObjectGroups need to be distinguished? (... maybe they are already?)

@RogerBurkhalter

@smrgeoinfo I had the same question. In my CMS I distinguish between the various object types from a single sampled horizon. A 3 kg sample may yield micro-fossils, palynomorphs, mega fossils, whole-rock geochemistry, thin sections, sedimentary structures (observations), isotopes (whole rock and component parts, i.e. conodonts or brachiopod shell), and other information such as grain size, sorting, associated but not collected biota, etc., which I currently record as observations but which could certainly be quantitative. My CMS records a UUID for the parent sample that I use to relate the various components; some may end up in associated collections that have a differing CMS (e.g. Paleobotany/Micropaleontology).
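
A minimal sketch of that parent/child linkage - field names are assumed for illustration and do not reflect any particular CMS schema:

```python
import uuid

# Hypothetical sketch: derived preparations/subsamples point back to the
# parent bulk sample via its UUID.
parent = {"sampleID": str(uuid.uuid4()), "description": "3 kg bulk sample from a single horizon"}

derived = [
    {"sampleID": str(uuid.uuid4()), "parentSampleID": parent["sampleID"], "kind": "micro-fossil residue"},
    {"sampleID": str(uuid.uuid4()), "parentSampleID": parent["sampleID"], "kind": "thin section"},
    {"sampleID": str(uuid.uuid4()), "parentSampleID": parent["sampleID"], "kind": "whole-rock geochemistry split"},
]

# Components curated in different collections (even different CMSs) can be
# reassembled by following parentSampleID.
components = [d for d in derived if d["parentSampleID"] == parent["sampleID"]]
print(len(components))  # 3
```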

@mswoodburn

@smrgeoinfo @RogerBurkhalter there's some info on the Latimer Core wiki that might help to answer that question. We've aimed to be flexible in this area - you can either make your ObjectGroups more granular to keep them homogeneous, describe larger heterogeneous ObjectGroups using repeatable properties or associations to other classes, or use various combinations of the two.

@debpaul

debpaul commented Dec 2, 2022

Hi @Jegelewicz, you wrote

In my view, a dataset and a collection are equivalent when it comes to museum stuff.

Interesting, for me, this is clearly not always true, but can be true. And it's another reason why we need Latimer Core (and @mswoodburn, et al may want to add something here). I'm essentially saying (I think) what he said above, but with a concrete example.

Briefly, a (physical) collection may be not / partly / completely digitized, right?

  • in the current EML data that museums provide and publish, the data collected about geographic or taxonomic scope, for example, often conflate or confuse whether the scopes described refer to the physical objects held in that institution, or to the scopes contained in the provided dataset.
  • Latimer Core makes it possible to be very clear about whether you're providing metrics about the dataset, or about your physical collection.
    • With Latimer Core, then, it provides something that EML does not, namely "denominators." These denominators (when providers share them) give us the power to provide local-to-global context and metrics. In other words, with those numbers, we can more effectively say something about (for example) a museum collection's digitization status (how complete? what groups are unique to this collection compared to others? etc.).
    • With Latimer Core then, it becomes possible to share more effectively what you have not digitized yet.

As we progress with digitization, more and more of the above will be deducible from the vouchered-specimen record set. Still, imagine if you provide a dataset with 100K records. We can't know you have 200K physical objects unless you tell us so (not in a text EML description -- where we can't do any math).
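
As a minimal sketch of the "denominator" idea, reusing the numbers from the example above:

```python
# Record-level data alone gives only the numerator; a collection-level count
# (the kind of denominator Latimer Core can carry) makes the metric possible.
published_records = 100_000   # records in the published dataset
reported_holdings = 200_000   # physical objects the collection reports holding

digitization_status = published_records / reported_holdings
print(f"{digitization_status:.0%} digitized")  # 50% digitized
```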

I hope this helps clarify at least this one feature / function and very clear purpose of Latimer Core.

Of course (and @mswoodburn @smrgeoinfo @RogerBurkhalter @stanblum allude to this next point) Latimer Core is flexible -- it has to be. And as usual, people will or have to group-the-groups differently (hah!, ya' think 😁). So the metrics Latimer Core makes possible are also going to have to be understood in their contexts and interpreted carefully. Again, metadata -- critical -- and that's what this standard is for.

As to aligning our efforts and working on compatibility, that's what we're all here for and working in-concert as best we can to align and learn from each other. It will all work out -- take heart!

@Jegelewicz
Collaborator Author

Closing as out of scope - adding terms for describing material is the next step in this process.
