Bubble-up publications #3

Open
mslw opened this issue Sep 15, 2023 · 2 comments

Comments

@mslw
Collaborator

mslw commented Sep 15, 2023

This is a nice-to-have eventually but not a priority at the time of writing.

The SFB catalog is structured in the following way:

landing page
└── Project (project metadata here)
    └── Research dataset (dataset metadata here)

Currently, publications can be added to a Project dataset (via user-submitted RIS / nbib files or web-scraped JSON files) or to a Research dataset (e.g. via a tabby record). There is an expectation that tabby records would declare at least one publication.

It could be neat if the publications added to a Research dataset could be automatically reported one level up (at the Project). This of course goes beyond the "each (sub)dataset is standalone" approach of the catalog itself, but it reflects how things work in practice: a publication related to a dataset is also a publication of its project.
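A minimal sketch of what such bubbling up could look like, assuming a simple in-memory tree. The `Dataset` class and the `bubble_up` function are hypothetical names used only for illustration; they are not part of the catalog schema or of the existing script.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical stand-in for a catalog entry (not the catalog schema)."""
    name: str
    publications: list = field(default_factory=list)
    children: list = field(default_factory=list)


def bubble_up(dataset):
    """Return the dataset's own publications plus those of all descendants.

    The children's lists are left untouched, so each subdataset stays
    standalone; only the list reported for the parent grows.
    No deduplication is attempted here.
    """
    collected = list(dataset.publications)
    for child in dataset.children:
        collected.extend(bubble_up(child))
    return collected


research = Dataset("Research dataset", publications=["Paper B"])
project = Dataset("Project", publications=["Paper A"], children=[research])
print(bubble_up(project))
```

Because the function recurses, the same sketch would also cover deeper hierarchies, should they appear.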

As a side note, and an extended problem, making a reverse connection has also been proposed: sfb1451/metadata-catalog#24

When adding things up in the hierarchy, there is duplication to deal with: a publication can be listed both in a project and in its subdataset (maybe even in 100% of cases, which would make this issue irrelevant). The catalog does not handle this on its own, so it would be up to the script here. Can DOIs be reliably used as identifiers for deduplication, considering that we treat them as optional for tabby?
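To make the DOI question concrete, here is a rough sketch of DOI-keyed deduplication. It assumes publications are plain dicts with an optional `doi` key; that layout, and the names `normalize_doi` and `deduplicate`, are assumptions for illustration. DOIs are case-insensitive and are often written with a resolver prefix, so they are normalized before comparison. Records without a DOI cannot be matched this way and are always kept.

```python
import re


def normalize_doi(doi):
    """Lower-case a DOI and strip common prefixes; return None if missing."""
    if not doi:
        return None
    doi = doi.strip().lower()
    doi = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", "", doi)
    return doi or None


def deduplicate(publications):
    """Keep the first record per DOI; keep every record that has no DOI."""
    seen = set()
    unique = []
    for pub in publications:
        key = normalize_doi(pub.get("doi"))
        if key is not None:
            if key in seen:
                continue  # same DOI already reported
            seen.add(key)
        unique.append(pub)
    return unique
```

With this approach, two DOI-less records for the same paper would both survive, which is exactly the weak spot raised by the question above.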

@jsheunis
Contributor

jsheunis commented Oct 6, 2023

I think two approaches are relevant:

  1. As you say, the script can be updated to let publications bubble up to (grand)parent datasets. Wrt identifying publications, I guess a step-wise approach could work, i.e. first try using DOI and if that is unavailable try some sort of text matching in the citation or title (if that exists). Or maybe there's an API that can take a citation and return a DOI?
  2. In the context of generating a catalog from linked data (see "Look at catalog rendering concept from semantic data view", metadata-catalog#46, and "Accept and render JSON-LD metadata", datalad/datalad-catalog#341), the whole concept of the catalog schema (and in the current case specifically, a publication that is a property of a dataset) will likely change. It's likely that publications will stand on their own as entities with ontology-based definitions, and that they will have many possible relationships (in the sense of semantic data triples) to other semantic entities such as datasets. In this scenario, the "bubbling up" process would probably translate to adding another triple in the graph / metadata.
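The step-wise idea from option 1 could be sketched roughly as follows: compare DOIs when both records have one, and otherwise fall back to fuzzy title matching. The dict layout, the function names, and the 0.9 similarity threshold are all assumptions for illustration; the threshold in particular would need tuning on real records. The fuzzy comparison uses `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher


def normalize_title(title):
    """Lower-case, replace punctuation with spaces, collapse whitespace."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in title)
    return " ".join(cleaned.split())


def same_publication(a, b, threshold=0.9):
    """Decide whether two records describe the same publication.

    Step 1: if both records have a DOI, trust the DOI comparison.
    Step 2: otherwise compare normalized titles with a similarity ratio.
    If neither is possible, report no match.
    """
    doi_a, doi_b = a.get("doi"), b.get("doi")
    if doi_a and doi_b:
        return doi_a.lower() == doi_b.lower()
    title_a, title_b = a.get("title"), b.get("title")
    if title_a and title_b:
        ratio = SequenceMatcher(
            None, normalize_title(title_a), normalize_title(title_b)
        ).ratio()
        return ratio >= threshold
    return False
```

Title matching is only a heuristic: a preprint and its published version often share a title, so a match here does not guarantee that the two records have the same DOI.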

Regarding timelines, I think option 1 is more sensible for a short-term deliverable.

@mslw
Collaborator Author

mslw commented Oct 6, 2023

> Or maybe there's an API that can take a citation and return a DOI?

There is api.crossref.org/works?query.bibliographic, which I used through the habanero Python package here to scrape the non-standardized list of SFB publications. It's surprisingly good, but requires some processing: because citations are free-form, it returns matches with scores, and sometimes e.g. a publication and its preprint score similarly and have to be distinguished by type. See the docstrings in that file to get a clue.
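For illustration, a rough sketch of that kind of post-processing. It is not the code from the linked file, and it uses only the standard library instead of habanero. It assumes each candidate in the Crossref response (`message.items`) carries `DOI`, `score`, and `type` fields, with `journal-article` for published papers and `posted-content` for preprints. The 10% score margin and the names `build_query_url`, `fetch_candidates`, and `pick_best` are assumptions.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(citation, rows=5):
    """Build a bibliographic query URL for a free-form citation string."""
    params = {"query.bibliographic": citation, "rows": rows}
    return CROSSREF_WORKS + "?" + urlencode(params)


def fetch_candidates(citation, rows=5):
    """Query Crossref (network call) and return the candidate works."""
    with urlopen(build_query_url(citation, rows)) as response:
        return json.load(response)["message"]["items"]


def pick_best(items, preferred_type="journal-article", margin=0.1):
    """Pick a candidate by score, preferring a type among near-ties.

    Candidates scoring within `margin` (as a fraction) of the top score
    count as near-ties; among those, the first one of the preferred type
    wins. Otherwise the top-scoring candidate is returned.
    """
    if not items:
        return None
    ranked = sorted(items, key=lambda i: i.get("score", 0.0), reverse=True)
    cutoff = ranked[0].get("score", 0.0) * (1 - margin)
    for item in ranked:
        if item.get("score", 0.0) >= cutoff and item.get("type") == preferred_type:
            return item
    return ranked[0]
```

A typical use would be `pick_best(fetch_candidates("free-form citation text"))`, followed by a manual check of low-scoring matches, since a high relative rank does not guarantee a correct hit.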

> Regarding timelines, I think option 1 is more sensible for a short-term deliverable.

Yes, but I am not 100% sure that we need to go for this deliverable right now.
