Metalad-Hackathon: apply metalad to a datalad-debian distribution dataset #265

christian-monch opened this issue Jul 13, 2022 · 1 comment
christian-monch commented Jul 13, 2022

Why?

Metalad has gained a lot of functionality that supports anticipated use cases. It would be very valuable to apply it to metadata processing in the context of the datalad-debian project in order to:

  • verify its general applicability
  • identify shortcomings and useful extensions
  • gain more feedback on usability
  • increase metalad knowledge in the group
  • identify bugs (very few ;-))
  • progress with datalad-debian

What?

Create a datalad-debian distribution dataset and use metalad on it to extract, process, and create all metadata that is required for further processing, e.g. for website/catalog creation. This would include the following high-level tasks:

  1. Define metadata that should be generated from distribution datasets. For example, metadata that is required by catalog for web-site rendering (Explore and implement multi-level metadata extraction and aggregation workflow psychoinformatics-de/datalad-debian#93)
  2. Create a representative distribution dataset. Ideally it would contain our packages for juseless or the neurodebian packages. Alternatively, we could use a sizeable, representative subset of the 100,000 packages in http://deb.debian.org/debian/pool/main
  3. Improve the datalad-debian extractor to emit the necessary metadata (this is a work in progress: WIP: implement package metadata extractor psychoinformatics-de/datalad-debian#112).
  4. Add additionally required extractors (maybe Implement extractor for builder metadata psychoinformatics-de/datalad-debian#92)
  5. Write a pipeline or implement filters to perform the necessary metadata-processing.
  6. Identify pipeline execution improvements, e.g. better parallelization, meta-configuration. Improve pipeline configurability #214

Metalad info

TL;DR: a very short explanation of metalad concepts and the kind of infrastructure that is available in metalad. More info: the project's README.md and this Gist.

Most metalad commands are "self-contained", i.e. they perform exactly one task. For example, meta-extract executes an extractor on a dataset or file and writes the extracted metadata, as a JSON string, to stdout. The command meta-add reads a JSON string that defines metadata and adds it to the dataset. Higher-level operations can be created by combining lower-level ones: for example, you can pipe the output of meta-extract into meta-add, thereby combining metadata extraction and adding.
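The handoff between the two commands is just a JSON record passed from stdout to stdin. A minimal Python sketch of that contract (the field names and values here are illustrative, not metalad's actual record schema):

```python
import json

# Hypothetical "extractor": produces a metadata record as a JSON string,
# mimicking what meta-extract writes to stdout. Field names are made up
# for illustration.
def extract(path):
    return json.dumps({"type": "file", "path": path,
                       "extractor_name": "demo",
                       "extracted_metadata": {"size": 42}})

# Hypothetical "adder": parses the JSON string and stores the record,
# mimicking what meta-add does with the JSON it reads.
def add(json_string, store):
    record = json.loads(json_string)
    store.append(record)
    return record

store = []
record = add(extract("data/file.dat"), store)
print(record["path"])  # → data/file.dat
```

Because the interface is plain JSON text, any tool that emits a conforming record can stand in for the extractor side of the pipe.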

Pipelines

Another method for combining metalad commands in a flexible way is pipelines. A pipeline typically consists of an element that creates data (the provider), a number of elements that process data (processors), and optionally a component that gathers data from the processors and stores it (the consumer). The individual processors are usually existing components, e.g. extractors, wrapped in a thin data-routing layer that selects the correct data from previous pipeline elements as input for the current pipeline element.

Pipelines are very flexible but also complex. There are some predefined pipelines that, for example, perform extraction of metadata from a dataset and its sub-datasets and are able to get and drop data on demand.
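The provider → processors → consumer layout can be sketched with plain Python generators (this models the concept only; metalad's actual pipeline classes and configuration look different):

```python
# Provider: creates the initial data elements.
def provider(paths):
    for p in paths:
        yield {"path": p}

# Processor: a wrapped "extractor" that adds metadata to each element;
# the thin routing layer here is simply picking the "path" field from
# the previous stage's output. The metadata computed is a toy example.
def extractor_processor(elements):
    for element in elements:
        element["metadata"] = {"name_length": len(element["path"])}
        yield element

# Consumer: gathers the processed elements and stores them.
def consumer(elements):
    return list(elements)

results = consumer(extractor_processor(provider(["a.txt", "bc.txt"])))
```

Each stage only consumes what the previous stage yields, which is what makes swapping in different providers or processors cheap.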

Filters

While pipelines map from arbitrary data (defined by the providers) to metadata, filters process metadata and create metadata, i.e. they map from metadata to metadata. For example, filters can be used to implement k-anonymity (there is an example filter in the datalad-metalad repository that performs the first step of k-anonymization, i.e. it assembles all values of all metadata fields of a dataset). Filters might come in handy when combining different metadata records into new metadata records (I use "combining" to avoid "aggregating", because that term has a different connotation in metalad).
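That first k-anonymization step, assembling all values of all metadata fields, can be sketched as a small metadata-to-metadata function (a conceptual model, not the example filter's actual code; the record fields are invented):

```python
from collections import defaultdict

# Hypothetical filter: maps a set of metadata records to a new metadata
# record that lists every value observed for every field -- the first
# step of k-anonymization described above.
def collect_field_values(records):
    values = defaultdict(set)
    for record in records:
        for field, value in record.items():
            values[field].add(value)
    # Sort for deterministic output.
    return {field: sorted(vals) for field, vals in values.items()}

records = [
    {"species": "human", "age_group": "20-29"},
    {"species": "human", "age_group": "30-39"},
]
summary = collect_field_values(records)
```

The input and output are both metadata records, which is exactly the filter contract: later steps could then check that each value combination occurs at least k times.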


Who?

Everybody who is interested in metalad, or in datalad-debian. The available tasks will vary widely in their nature. There will be a need to analyze the required metadata structure and the processes that are necessary to create them. There will be coding tasks involving metadata extractors and possibly metalad-core components. There might be conceptualization rounds, e.g. for pipeline execution machines or pipeline definition systems.

When?

This is definitely a two-day hackathon; on-boarding will probably take half a day.
Sometime in the next few weeks would make the most sense, because I would like to finalize version 1.0 of metalad, and all observations are valuable for that. As part of the ongoing datalad-metalad development I will otherwise proceed as outlined in the section "What?" above; that will be slower and probably not as insightful, though. Nevertheless, the hackathon will still be valuable later: it will definitely help to familiarize the team a little more with datalad, although the application target might have changed by then.

Where?

Online, using Jitsi

@christian-monch christian-monch changed the title Metalad-Hackathon: apply metalad to datalad-debian Metalad-Hackathon: apply metalad to a datalad-debian distribution dataset Jul 13, 2022
@christian-monch christian-monch self-assigned this Jul 13, 2022
@Manukapp

I am interested!
