Catalog entrypoint

A relevant question is where the entrypoint for such a catalog would/could be. This has been discussed in #11. See specifically the comment: #11 (comment). This suggests the archive dataset as the entrypoint for a catalog, although this does not necessarily have to be the case for the relevant workflow to generate metadata.
Workflow entrypoint
Let's say we assume a recursive super-sub-dataset hierarchy as follows: an archive dataset at the top, with distribution datasets below it, which in turn contain the package and builder datasets (note that [...] as explained in this comment, but this is ignored in the short term).

So, for metadata extraction, where does metadata related to any particular type of dataset come from? And can this information be extracted using a dataset- or file-level extractor with metalad? If we start from the bottom (kind of):
For builder metadata, there is a recent issue (Implement extractor for builder metadata #92) to create a builder-metadata extractor (working mainly from the Singularity recipe). For the catalog specifically, this only has to be extracted once per dataset version, and the catalog will take care of representing the builder as a linked subdataset of both the distribution and package datasets.
For distribution metadata, the source of metadata is an open question. Technically, this could also come from the extracted builder metadata, since there is a one-to-one relationship between a builder and a distribution. This means we do not necessarily need a distribution-dataset extractor, and could use some sort of adapted aggregation/extraction process to make this information available at the distribution-dataset level.
For archive metadata, what needs to be represented here?
(PS: it is assumed that the metalad_core extractor will be run on the dataset and file level for all datasets in this hierarchy, in order to be able to represent dependent-dataset linkage as well as file trees.)
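For illustration, dataset- and file-level extraction with metalad's CLI would look something like the following (a minimal sketch, assuming datalad-metalad is installed, distribution is an installed dataset, and some/file is a placeholder path):

    # dataset-level extraction with the metalad_core extractor
    datalad meta-extract -d distribution metalad_core

    # file-level extraction for a single file in the same dataset
    datalad meta-extract -d distribution metalad_core some/file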
Taking these levels of metadata into account, it could be straightforward to run a workflow that traverses the hierarchy in a top-down direction and extracts the relevant dataset- and file-level metadata at each level.
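A rough sketch of what such a top-down traversal could look like, assuming the hierarchy has been installed recursively under archive and all extracted records are collected in the superdataset's metadata store (the clone URL is a placeholder; depending on the metalad version, meta-add may need its id-mismatch override option for records that describe subdatasets):

    # install the full hierarchy (datasets only, no file content)
    datalad clone <archive-url> archive
    datalad -C archive get -n -r .

    # walk all datasets top-down, extract dataset-level metadata,
    # and add each record to the superdataset's metadata store
    for ds in archive $(datalad -f '{path}' -C archive subdatasets -r); do
        datalad meta-extract -d "$ds" metalad_core \
            | datalad meta-add -d archive -
    done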
Examples of relevant WIP implementations or related issues:
FAIRly big catalog workflow (this also includes metadata translation to the catalog schema; additional translators would have to be implemented for Debian-related metadata)
easy-catalog-extract functionality for one-command extraction and generation: datalad/datalad-catalog#91
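As a very rough sketch, the catalog-generation end of such a workflow could look like the following (command forms follow datalad-catalog's WIP CLI and may differ between versions; the translation step is only indicated, since the Debian-specific translators mentioned above do not exist yet):

    # dump all previously added metadata records
    datalad meta-dump -d archive -r > extracted.jsonl

    # ... translate records to the catalog schema here ...

    # create a catalog and add the (translated) metadata to it
    datalad catalog create -c debian-catalog
    datalad catalog add -c debian-catalog -m extracted.jsonl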
Open questions

Are there any other sources of metadata on the distribution and archive level that aren't mentioned here?
Is the important provenance information related to package builds contained within the package-level datalad dataset? Asking since this would be useful metadata to extract as well (using the runprov extractor?) and to represent in a catalog.
Where should metadata be aggregated to?
Where should metadata be added and stored? Either purely for metadata storage purposes, or for the purpose of generating a catalog?
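To make this last question concrete, the two obvious placements would look something like the following sketch with metalad's current commands (whether a record describing one dataset can be added to another dataset's store may depend on metalad's id-mismatch handling):

    # option 1: store each record in the dataset it describes
    datalad meta-extract -d distribution metalad_core \
        | datalad meta-add -d distribution -

    # option 2: aggregate all records into one dedicated store,
    # e.g. the archive superdataset
    datalad meta-extract -d distribution metalad_core \
        | datalad meta-add -d archive -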
So, to summarise my understanding, we'd be able to find multi-level metadata from different sources, including the package datalad datasets themselves as well as a distribution-level "Packages" file.
This suggests that it might be useful to have an extractor for the "Packages" file, but raises the question of what the resulting metadata will look like and how/where it will exist, since it references multiple packages that might already be linked as subdatasets of the relevant distribution. I.e., should we generate separate package-specific metadata items from such a "Packages" file, and should these items reference the specific datalad_id of the package datalad dataset that they relate to?
Or should all of the info extracted from a "Packages" file just remain distribution-dataset-level metadata, for representation in something like a catalog?
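To illustrate the first option, a package-specific metadata item derived from a "Packages" file could be fed to metalad roughly as follows. This is a sketch only: debian_packages_file is a hypothetical extractor name, all field values are made-up placeholders, and the layout follows metalad's generic metadata-record format. Given a record like this in a file named package-item.json:

    {
      "type": "dataset",
      "dataset_id": "<datalad_id of the package dataset>",
      "dataset_version": "<commit sha of the package dataset version>",
      "extractor_name": "debian_packages_file",
      "extractor_version": "0.1.0",
      "extraction_parameter": {},
      "extraction_time": 1656000000.0,
      "agent_name": "<name>",
      "agent_email": "<email>",
      "extracted_metadata": {
        "Package": "hello",
        "Version": "2.10-2",
        "Architecture": "amd64"
      }
    }

it could then be added to the distribution dataset's metadata store with: datalad meta-add -d distribution package-item.json. Whether extracted_metadata should carry the full "Packages" stanza or only a pointer to the package dataset is exactly the open question above.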