Why?
Metalad has gained a lot of functionality that supports anticipated use cases. It could be very valuable to apply it to metadata processing in the context of the datalad-debian project in order to:
verify its general applicability
identify shortcomings and useful extensions
gain more feedback on usability
increase metalad knowledge in the group
identify bugs (very few ;-))
progress with datalad-debian
What?
Create a datalad-debian distribution dataset and use metalad on it to extract, process, and create all metadata that is required for further processing, e.g. for website/catalog creation. This would include the following high-level tasks:
Create a representative distribution dataset. In the best case, it would contain our packages for juseless or the neurodebian packages. Alternatively, we could use a sizeable, representative subset of the ~100,000 packages in http://deb.debian.org/debian/pool/main (a sketch of this step follows below).
Extract, process, and store the metadata required for a catalog and for web-site rendering (Explore and implement multi-level metadata extraction and aggregation workflow psychoinformatics-de/datalad-debian#93).
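A minimal sketch of the first task, assuming datalad and datalad-debian are installed; the command names follow the datalad-debian README, and the distribution and package names ("bullseye-subset", "hello") are purely illustrative:

```sh
# Create a new distribution dataset (the name is illustrative):
datalad deb-new-distribution bullseye-subset
cd bullseye-subset

# Add a package dataset for one representative package; repeat for a
# sizeable subset of the archive (the package name is illustrative):
datalad deb-new-package hello
```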
Metalad info
TL;DR: a very short explanation of metalad concepts and of the kind of infrastructure that is available in metalad. More info: in the project's README.md and this Gist.
Most metalad commands are "self-contained", i.e. each performs exactly one task. For example, meta-extract executes an extractor on a dataset or file and writes the extracted metadata, as a JSON string, to stdout. The command meta-add reads a JSON string that defines metadata and adds it to a dataset. Higher-level operations can be created by combining lower-level operations; for example, you can pipe the output of meta-extract into meta-add and thus combine metadata extraction and adding, as in the sketch below.
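A minimal sketch of such a combination, assuming a dataset in the current directory and the built-in metalad_core extractor; check `datalad meta-extract --help` and `datalad meta-add --help` for the exact options of your metalad version:

```sh
# Extract dataset-level metadata with the built-in metalad_core extractor
# and feed the resulting JSON record to meta-add via stdin ("-"):
datalad meta-extract -d . metalad_core | datalad meta-add -d . -
```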
Pipelines
Another method for combining metalad commands in a flexible way is pipelines. A pipeline typically consists of an element that creates data (a provider), a number of elements that process data (processors), and optionally a component that gathers data from the processors and stores it (a consumer). The individual processors are usually existing components, e.g. extractors, that are wrapped in a thin data-routing layer, which takes care of selecting the correct output of previous pipeline elements as input for the current pipeline element.
Pipelines are very flexible but also complex. There are some predefined pipelines that, for example, perform extraction of metadata from a dataset and its subdatasets and are able to get and drop data on demand (see the sketch below).
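Predefined pipelines are run with meta-conduct. A hedged sketch, assuming one of the extraction pipelines shipped with metalad; the pipeline name and the element.parameter=value arguments below follow the metalad documentation but may differ between versions, so consult `datalad meta-conduct --help` for the exact interface:

```sh
# Traverse a dataset tree and run the metalad_core extractor on every
# dataset encountered (pipeline and parameter names may vary by version):
datalad meta-conduct extract_metadata \
    traverser.top_level_dir=/path/to/dataset \
    extractor.extractor_name=metalad_core
```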
Filters
While pipelines map from arbitrary data (defined by the providers) to metadata, filters are used to process metadata and create new metadata, i.e. they map from metadata to metadata. For example, filters can be used to implement k-anonymity (there is an example filter in the datalad-metalad repository that performs the first step of k-anonymization, i.e. it assembles all values of all metadata fields of a dataset). Filters might come in handy when combining (I use "combining" to avoid "aggregating", which has a different connotation in metalad) different metadata records into new metadata records.
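Filters are run with the meta-filter command. A sketch only; the filter name and metadata-path arguments below are placeholders rather than verified names, so check `datalad meta-filter --help` for the exact interface and for the filters installed in your environment:

```sh
# Run a metadata filter over stored metadata and add its output back to
# the dataset (<filter-name> and <metadata-path> are placeholders):
datalad meta-filter <filter-name> <metadata-path> | datalad meta-add -d . -
```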
More info
More info on metalad can be found on the project homepage (README.md) and in this Gist.
Who?
Everybody who is interested in metalad or in datalad-debian. The available tasks will vary widely in nature. There will be a need to analyze the required metadata structures and the processes that are necessary to create them. There will be coding tasks involving metadata extractors and possibly metalad core components. There might be conceptualization rounds, e.g. for pipeline execution engines or pipeline definition systems.
When?
This is definitely a two-day hackathon. On-boarding will probably take half a day.
Sometime in the next few weeks would make the most sense, because I would like to finalize version 1.0 of metalad, and all observations from the hackathon would be valuable input. As part of the ongoing datalad-metalad development I will in any case proceed as outlined in the section "What?" above, but that will be slower and probably not as insightful. Nevertheless, a later hackathon would still be valuable: it would definitely help to familiarize the team a little more with datalad, although the target of the application might have changed by then.
Where?
Online, using Jitsi