
Explore and implement multi-level metadata extraction and aggregation workflow #93

Open · jsheunis opened this issue Jul 12, 2022 · 2 comments
Labels
enhancement New feature or request


jsheunis commented Jul 12, 2022

This is for representing the resulting metadata in something like a catalog.

Catalog entrypoint

A relevant question is where the entrypoint for such a catalog would/could be. This has been discussed in #11; see specifically the comment: #11 (comment). It suggests the archive dataset as the entrypoint for a catalog, although that does not necessarily have to be the case for the workflow that generates the metadata.

Workflow entrypoint

Let's say we assume a recursive super-sub-dataset hierarchy as follows:

archive
├── distribution
│   ├── builder
│   └── package
│       └── builder
(Note that, as explained in this comment, "for technical/legal reasons, this [archive] dataset may have some components organized in subdatasets (e.g., non-free)", but this is ignored in the short term.)

So, for metadata extraction, where does metadata related to any particular type of dataset come from?

And can this information be extracted using a dataset- or file-level extractor with metalad? Starting (roughly) from the bottom (see the command sketch after this list):

  1. For package metadata, we have an issue (Implement extractor for package metadata #30) for building a package metadata extractor. This is in the works in this fork+branch.
  2. For builder metadata, there's a recent issue (Implement extractor for builder metadata #92) to create a builder metadata extractor (mainly from the Singularity recipe). For the catalog specifically, this only has to be extracted once (per dataset version), and the catalog will take care of representing it as a linked subdataset of both the distribution and package datasets.
  3. For distribution metadata, the source of metadata is an open question. Technically, this could also come from the extracted metadata of the builder, since there is a one-to-one relationship between a builder and a distribution. This means we do not necessarily need a distribution-dataset extractor, and could use some sort of adapted aggregation/extraction process to get this information onto the distribution-dataset level.
  4. For archive metadata, what needs to be represented here?

(PS: it is assumed that the metalad_core extractor will be run on the dataset and file level for all datasets in this hierarchy, in order to be able to represent dataset dependencies (subdataset linkage) as well as file trees.)
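
As a rough command-level sketch of the above, assuming DataLad with the datalad-metalad extension installed (metalad_core is an existing metalad extractor; the package and builder extractor names are placeholders for the work in #30 and #92):

    # Dataset-level extraction with the stock core extractor:
    datalad meta-extract -d archive/distribution metalad_core

    # Hypothetical extractors from #30 and #92 (names are placeholders):
    datalad meta-extract -d archive/distribution/package package_extractor
    datalad meta-extract -d archive/distribution/package/builder builder_extractor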

Taking these levels of metadata into account, it could be straightforward to run a workflow that traverses the hierarchy top-down and extracts the relevant dataset- and file-level metadata at each level.
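
A minimal sketch of such a traversal, assuming the datalad-metalad CLI and (arbitrarily, pending the open questions below) the archive dataset as the aggregation target:

    # Walk the hierarchy top-down; extract dataset-level metadata at each
    # level and feed the resulting records to the archive superdataset.
    for ds in archive $(datalad -f '{path}' subdatasets -d archive --recursive); do
        datalad meta-extract -d "$ds" metalad_core | datalad meta-add -d archive -
    done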

Examples of relevant WIP implementations or related issues:

Open questions

  1. Are there any other sources of metadata on the distribution and archive level that aren't mentioned here?
  2. Is the important provenance information related to package builds contained within the package-level datalad dataset? Asking since this would be useful metadata to extract as well (using the runprov extractor? see the sketch after this list) and to represent in a catalog.
  3. Where should metadata be aggregated to?
  4. Where should metadata be added and stored? Either for pure metadata storing purposes, or for the purpose of generating a catalog?
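
Regarding question 2, a sketch of what such an extraction could look like; metalad ships a metalad_runprov extractor that reports datalad run provenance records:

    # Extract run-provenance (datalad run records) from a package dataset:
    datalad meta-extract -d archive/distribution/package metalad_runprov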

mih commented Jul 12, 2022

The /archive/www subdataset contains the regular Debian package dists/pool data structure. Among other things, it has the full list of included packages and their versions; see for example https://neuro.debian.net/debian/dists/bullseye/main/binary-amd64/Packages
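
For illustration, the package list in such a file can be pulled out with a few lines of shell (a sketch, using the file linked above):

    # List package name/version pairs from a Debian "Packages" index:
    curl -s https://neuro.debian.net/debian/dists/bullseye/main/binary-amd64/Packages \
      | awk -F': ' '$1 == "Package" {pkg = $2} $1 == "Version" {print pkg, $2}'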

jsheunis (author) commented

So, to summarise my understanding: we'd be able to find multi-level metadata from different sources, including the package datalad datasets themselves as well as a distribution-level "Packages" file.

This suggests that it might be useful to have an extractor for the "Packages" file, but it raises the question of what the resulting metadata would look like and how/where it would exist, since the file references multiple packages that might already be linked as subdatasets of the relevant distribution. That is: should we generate separate package-specific metadata items from such a "Packages" file, and should these items reference the specific datalad_id of the package datalad dataset they relate to?

Or should all of the info extracted from a "Packages" file just remain distribution-dataset-level metadata?
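
To make the first option concrete, a hedged sketch of adding one package-specific record at the distribution level. The extractor name debian_packages and the extracted_metadata payload are illustrative, not an existing schema; the envelope fields are my reading of what metalad's meta-add expects, and values in <...> are placeholders:

    # Pipe a single per-package metadata record into the distribution dataset
    # ("-" makes meta-add read the record from stdin):
    echo '{"type": "dataset",
           "extractor_name": "debian_packages",
           "extractor_version": "0.0.1",
           "extraction_parameter": {},
           "extraction_time": 1657577600,
           "agent_name": "A. Name",
           "agent_email": "a.name@example.com",
           "dataset_id": "<datalad_id of the package dataset>",
           "dataset_version": "<commit sha of the package dataset>",
           "extracted_metadata": {"Package": "somepackage", "Version": "1.0-1"}}' \
      | datalad meta-add -d distribution -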
