Metalad-Hackathon: apply metalad to a datalad-debian distribution dataset #265

christian-monch opened this issue Jul 13, 2022 · 1 comment
christian-monch commented Jul 13, 2022

Why?

Metalad has gained a lot of functionality that supports anticipated use cases. It would be very valuable to apply it to metadata processing in the context of the datalad-debian project in order to:

  • verify its general applicability
  • identify shortcomings and useful extensions
  • gain more feedback on usability
  • increase metalad knowledge in the group
  • identify bugs (very few ;-))
  • progress with datalad-debian

What?

Create a datalad-debian distribution dataset and use metalad on it to extract, process, and create all metadata that is required for further processing, e.g. for website/catalog creation. This would include the following high-level tasks:

  1. Define metadata that should be generated from distribution datasets. For example, metadata that is required by catalog for web-site rendering (Explore and implement multi-level metadata extraction and aggregation workflow psychoinformatics-de/datalad-debian#93)
  2. Create a representative distribution dataset. Ideally it would contain our packages for juseless or the neurodebian packages. Alternatively, we could use a sizeable, representative subset of the 100,000 packages in http://deb.debian.org/debian/pool/main
  3. Improve the datalad-debian extractor to emit the necessary metadata (this is a work in progress: WIP: implement package metadata extractor psychoinformatics-de/datalad-debian#112).
  4. Add additionally required extractors (maybe Implement extractor for builder metadata psychoinformatics-de/datalad-debian#92)
  5. Write a pipeline or implement filters to perform the necessary metadata-processing.
  6. Identify pipeline execution improvements, e.g. better parallelization, meta-configuration. Improve pipeline configurability #214

Metalad info

TL;DR: a very short explanation of metalad concepts and the kind of infrastructure that is available in metalad. More info: the project's README.md and this Gist.

Most metalad commands are "self-contained", i.e. they perform exactly one task. For example, meta-extract executes an extractor on a dataset or file and writes the extracted metadata, as a JSON string, to stdout. The command meta-add reads a JSON string that defines metadata and adds it to the dataset. Higher-level operations can be created by combining lower-level ones: for example, you can pipe the output of meta-extract into meta-add, thereby combining metadata extraction and adding.
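The handoff between the two commands is just a JSON record passed from stdout to stdin. A minimal Python sketch of that contract (the field names and values here are illustrative, not metalad's actual record schema):

```python
import json

# Hypothetical "extractor": produces a metadata record as a JSON string,
# mimicking what meta-extract writes to stdout. Field names are made up
# for illustration.
def extract(path):
    return json.dumps({"type": "file", "path": path,
                       "extractor_name": "demo",
                       "extracted_metadata": {"size": 42}})

# Hypothetical "adder": parses the JSON string and stores the record,
# mimicking what meta-add does with the JSON it reads.
def add(json_string, store):
    record = json.loads(json_string)
    store.append(record)
    return record

store = []
record = add(extract("data/file.dat"), store)
print(record["path"])  # → data/file.dat
```

Because the interface is plain JSON text, any tool that emits a conforming record can stand in for the extractor side of the pipe.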

Pipelines

Another method for combining metalad commands in a flexible way is pipelines. A pipeline typically consists of an element that creates data (the provider), a number of elements that process data (processors), and optionally a component that gathers data from the processors and stores it (the consumer). The individual processors are usually existing components, e.g. extractors, wrapped in a thin data-routing layer that selects the correct data from previous pipeline elements as input for the current pipeline element.

Pipelines are very flexible but also complex. There are some predefined pipelines that, for example, perform extraction of metadata from a dataset and its sub-datasets and are able to get and drop data on demand.
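The provider → processors → consumer layout can be sketched with plain Python generators (this models the concept only; metalad's actual pipeline classes and configuration look different):

```python
# Provider: creates the initial data elements.
def provider(paths):
    for p in paths:
        yield {"path": p}

# Processor: a wrapped "extractor" that adds metadata to each element;
# the thin routing layer here is simply picking the "path" field from
# the previous stage's output. The metadata computed is a toy example.
def extractor_processor(elements):
    for element in elements:
        element["metadata"] = {"name_length": len(element["path"])}
        yield element

# Consumer: gathers the processed elements and stores them.
def consumer(elements):
    return list(elements)

results = consumer(extractor_processor(provider(["a.txt", "bc.txt"])))
```

Each stage only consumes what the previous stage yields, which is what makes swapping in different providers or processors cheap.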

Filters

While pipelines map from arbitrary data (defined by the providers) to metadata, filters process metadata and create metadata, i.e. they map from metadata to metadata. For example, filters can be used to implement k-anonymity (there is an example filter in the datalad-metalad repository that performs the first step of k-anonymization, i.e. it assembles all values of all metadata fields of a dataset). Filters might come in handy when combining different metadata records into new metadata records (I use "combining" to avoid "aggregating", because that term has a different connotation in metalad).
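That first k-anonymization step, assembling all values of all metadata fields, can be sketched as a small metadata-to-metadata function (a conceptual model, not the example filter's actual code; the record fields are invented):

```python
from collections import defaultdict

# Hypothetical filter: maps a set of metadata records to a new metadata
# record that lists every value observed for every field -- the first
# step of k-anonymization described above.
def collect_field_values(records):
    values = defaultdict(set)
    for record in records:
        for field, value in record.items():
            values[field].add(value)
    # Sort for deterministic output.
    return {field: sorted(vals) for field, vals in values.items()}

records = [
    {"species": "human", "age_group": "20-29"},
    {"species": "human", "age_group": "30-39"},
]
summary = collect_field_values(records)
```

The input and output are both metadata records, which is exactly the filter contract: later steps could then check that each value combination occurs at least k times.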


Who?

Everybody who is interested in metalad, or in datalad-debian. The available tasks will vary widely in their nature. There will be a need to analyze the required metadata structure and the processes that are necessary to create them. There will be coding tasks involving metadata extractors and possibly metalad-core components. There might be conceptualization rounds, e.g. for pipeline execution machines or pipeline definition systems.

When?

This is definitely a two-day hackathon; on-boarding will probably take half a day.
Sometime in the next few weeks would make the most sense, because I would like to finalize version 1.0 of metalad, and all observations are valuable for that. As part of the ongoing datalad-metalad development I will otherwise proceed as outlined in the section "What?" above; that will be slower and probably not as insightful, though. Nevertheless, the hackathon will still be valuable later: it will definitely help to familiarize the team a little more with datalad, although the application target might have changed by then.

Where?

Online, using Jitsi

@christian-monch christian-monch changed the title Metalad-Hackathon: apply metalad to datalad-debian Metalad-Hackathon: apply metalad to a datalad-debian distribution dataset Jul 13, 2022
@christian-monch christian-monch self-assigned this Jul 13, 2022
@Manukapp

I am interested!
