Version dataset metadata independent of imports #1358

Open
yroskov opened this issue Sep 30, 2024 · 9 comments
Labels: API (API model related), feature request (proposing a new feature that does not currently exist), importer (dataset import related issues), metadata

Comments

@yroskov

yroskov commented Sep 30, 2024

Describe the bug

The September 2024 CoL release contains wrong metadata for World Plants and World Ferns.

The actual versions of both GSDs are 19.4, Jun 2024 / 2024-06-30. (New data versions were indeed imported into CLB in September 2024, but I did not sync them into the September CoL release.) However, the incorrect version (24.9, Sep 2024) is shown in the GSD metadata in the September release:

image

https://www.catalogueoflife.org/data/dataset/1140

https://www.catalogueoflife.org/data/dataset/1141

@yroskov yroskov added the bug label Sep 30, 2024
@yroskov
Author

yroskov commented Sep 30, 2024

@thomasstjerne & @mdoering, could you please fix this long-standing bug? GSD metadata in the CoL should reflect the version that was synced into the project, not the version currently imported into CLB.

(Just in case: the WFerns GSD was synced on 2024-07-09, WPlants on 2024-07-08.)

@mdoering
Member

The September edition was released on 2024-09-25.

Ferns were last imported 18th September and in July before that:
image

The fern sectors were synced last on the 30th September:
https://api.checklistbank.org/dataset/3/sector/sync?datasetKey=1140
datasetAttempt: 66 # this is the version of the dataset import:
https://api.checklistbank.org/dataset/1140/66.json

Before that on the 9th of July.
datasetAttempt: 65
https://api.checklistbank.org/dataset/1140/65.json

The metadata for import 65 indeed looks odd:

"attempt":65,
"issued":"2024-09-18",
"version":"24.9, Sep 2024",
"created":"2024-09-18T14:01:09.739245",
"imported":"2024-07-08T14:10:37.195155"

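As an illustration of why these values look odd, here is a minimal Python sketch (not part of ChecklistBank) that flags an archived metadata record whose `created` or `issued` date is later than its `imported` timestamp. The function name and the heuristic itself are assumptions; the field names and values are taken from the attempt 65 record above.

```python
from datetime import date, datetime


def metadata_postdates_import(meta: dict) -> bool:
    """Heuristic (assumption): metadata created or issued after the
    import it is attached to probably belongs to a later version."""
    imported = datetime.fromisoformat(meta["imported"])
    created = datetime.fromisoformat(meta["created"])
    issued = date.fromisoformat(meta["issued"])
    return created > imported or issued > imported.date()


# Values copied from the attempt 65 metadata quoted in this comment.
attempt_65 = {
    "attempt": 65,
    "issued": "2024-09-18",
    "version": "24.9, Sep 2024",
    "created": "2024-09-18T14:01:09.739245",
    "imported": "2024-07-08T14:10:37.195155",
}

print(metadata_postdates_import(attempt_65))  # True
```

Here the record was created and issued in September but is attached to the July import, which is the mismatch reported in this issue.
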
@yroskov this problem was never mentioned to me before, and I am very surprised to see it now. It has been working for more than 2 years.

@yroskov
Author

yroskov commented Sep 30, 2024

> The fern sectors were synced last on the 30th September

Yes, that is my work from today, for the October CoL.

> this problem was never mentioned to me before and I am very surprised to see this now. It was working now for more than 2 years.

I raised this many times during our stand-ups (especially in relation to IRMNG).

@mdoering
Member

I believe I know what's going on. If you download the latest archives, they all lack metadata!
That must be linked to the wrong archival of metadata versions. I will look into this more tomorrow.

@mdoering
Member

mdoering commented Sep 30, 2024

> I raised this many times during our stands up... (especially, in relation to IRMNG)

Can you point me to an old issue please?

@mdoering
Member

mdoering commented Oct 1, 2024

Dataset metadata is only archived during imports, i.e. when the imported archive includes no metadata, no metadata version is archived. And as the dataset metadata version is tied to the import attempt, changing that requires considerable refactoring. The idea was that we do not want to archive every manual edit made to a dataset, but instead allow manual changes via the UI or API and only write a final version to the archive when a new one shows up through an import.

It seems we now rather need an independent metadata versioning system that has its own version number and will be triggered to archive a version when:

  • a new import with metadata happens
  • a sector sync happens with modified metadata since the last archival

Every import and sync would then refer to a specific metadata version which can be retrieved from the archive.
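
The proposal above could be sketched roughly as follows. This is a hypothetical in-memory Python model, not the ChecklistBank implementation: the class `MetadataArchive`, its method names, and the choice to let an import without metadata simply refer to the latest archived version are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MetadataArchive:
    """Hypothetical model: metadata versions are numbered independently of import attempts."""
    current: dict = field(default_factory=dict)   # live metadata, editable via UI or API
    versions: list = field(default_factory=list)  # archived snapshots; version n is versions[n - 1]

    def _archive(self) -> int:
        # Store a copy so later edits do not alter the archived snapshot.
        self.versions.append(dict(self.current))
        return len(self.versions)

    def _modified_since_archival(self) -> bool:
        return not self.versions or self.versions[-1] != self.current

    def edit(self, **changes) -> None:
        """Manual edit: changes the live metadata but archives nothing."""
        self.current.update(changes)

    def on_import(self, metadata: Optional[dict]) -> int:
        """Trigger 1: an import that carries metadata archives a new version."""
        if metadata is not None:
            self.current = dict(metadata)
            return self._archive()
        # Assumption: an import without metadata refers to the latest archived version.
        return len(self.versions)

    def on_sync(self) -> int:
        """Trigger 2: a sector sync archives only if metadata changed since the last archival."""
        if self._modified_since_archival():
            return self._archive()
        return len(self.versions)

    def get(self, version: int) -> dict:
        """Retrieve an archived metadata version by its number (starting at 1)."""
        if not 1 <= version <= len(self.versions):
            raise KeyError(f"no archived metadata version {version}")
        return dict(self.versions[version - 1])


# Illustrative scenario loosely modelled on this issue (the sequence of events is assumed).
archive = MetadataArchive()
v_import = archive.on_import({"version": "19.4, Jun 2024"})  # import with metadata
v_sync_july = archive.on_sync()                               # sync, nothing changed
archive.edit(version="24.9, Sep 2024")                        # metadata changes later
archive.on_import(None)                                       # import without metadata
v_sync_sept = archive.on_sync()                               # sync after modification

print(v_sync_july, archive.get(v_sync_july)["version"])  # 1 19.4, Jun 2024
print(v_sync_sept, archive.get(v_sync_sept)["version"])  # 2 24.9, Sep 2024
```

In this model a release built from the July sync would keep pointing at metadata version 1, regardless of what is imported or edited afterwards.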

@mdoering
Member

mdoering commented Oct 1, 2024

@yroskov @gdower a quick fix from my side is not possible, this will take longer.
Maybe we can add metadata.yaml files to these sources?

@mdoering mdoering transferred this issue from CatalogueOfLife/portal Oct 1, 2024
@mdoering mdoering changed the title from "Incorrect GSD metadata are applied in the CoL portal" to "Version dataset metadata independent of imports" Oct 1, 2024
@mdoering mdoering added the importer, API, feature request, and metadata labels and removed the bug label Oct 1, 2024
@mdoering mdoering moved this to Todo in Software Development Oct 1, 2024
@mdoering mdoering self-assigned this Oct 1, 2024
@yroskov
Author

yroskov commented Oct 1, 2024

> Maybe we can add metadata.yaml files to these sources?

Unfortunately, this can happen with any source. For example, quite often we get a notification about a new ITIS version and do an import a few days before the release, without including that update in the release.

...and this happens to almost all GSDs imported by "third parties" outside our control, e.g. WCVP, WFO, Bryonames, all Lepidoptera, etc.

@mdoering
Member

mdoering commented Oct 1, 2024

But datasets whose imports include metadata are versioned fine; they are not a problem!
