[ENH] introduce GeneratedBy to "core" BIDS

bids-standard#487 (and originally bids-standard#439) is a `WIP ENH` to introduce standardized provenance capture/expression for BIDS datasets. This PR just follows the idea of bids-standard#371 (small atomic ENHs), and is based on the current state of the specification, where we have GeneratedBy to describe how a BIDS derivative dataset came into existence.

## Rationale

As I have previously stated in many (face-to-face, when it was still possible ;)) conversations, in my view any BIDS dataset is a derivative dataset. Even if it contains "raw" data, it is never given by gods, but is the result of some process (let's call it a pipeline for consistency) which produced it out of some other data. That is why there is 1) `sourcedata/` to provide placement for such original data ("raw" in terms of processing, but "raw"er in terms of its relation to the actual data acquired by equipment), and 2) `code/` to provide placement for scripts used to produce or "tune" the dataset.

Typically "sourcedata" is either a collection of DICOMs or a collection of data in some other format (e.g. NIfTI) which is then either converted or just renamed into the BIDS layout.

When encountering a new BIDS dataset, ATM it requires forensics and/or data archaeology to discover how this BIDS dataset came about, e.g. to figure out the source of the buggy (meta)data it contains.

At the level of individual files, some tools already add ad-hoc fields during conversion to the sidecar .json files they produce,

<details>
<summary>e.g. dcm2niix adds ConversionSoftware and ConversionSoftwareVersion</summary>

```shell
(git-annex)lena:~/datalad/dbic/QA[master]git
$> git grep ConversionSoftware | head -n 2
sub-amit/ses-20180508/anat/sub-amit_ses-20180508_acq-MPRAGE_T1w.json:  "ConversionSoftware": "dcm2niix",
sub-amit/ses-20180508/anat/sub-amit_ses-20180508_acq-MPRAGE_T1w.json:  "ConversionSoftwareVersion": "v1.0.20170923 (OpenJPEG build) GCC6.3.0",
```

</details>

ATM I need to add such metadata to datasets produced by heudiconv to make sure that in case of incremental conversions there is no switch in versions of the software.
yarikoptic · Oct 11, 2021 · 75f90e6
1 parent 94112b6
commit 75f90e6
Showing 2 changed files with 39 additions and 23 deletions.
diff --git a/src/03-modality-agnostic-files.md b/src/03-modality-agnostic-files.md
@@ -28,9 +28,22 @@ Every dataset MUST include this file with the following fields:
       "EthicsApprovals": "OPTIONAL",
       "ReferencesAndLinks": "OPTIONAL",
       "DatasetDOI": "OPTIONAL",
+      "GeneratedBy": "RECOMMENDED",
+      "SourceDatasets": "RECOMMENDED",
    }
 ) }}
 
+Each object in the `GeneratedBy` list includes the following REQUIRED, RECOMMENDED
+and OPTIONAL keys:
+
+| **Key name** | **Requirement level** | **Data type** | **Description**                                                                                                                                                                                           |
+|--------------|-----------------------|---------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Name         | REQUIRED              | [string][]    | Name of the pipeline or process that generated the outputs. Use `"Manual"` to indicate the derivatives were generated by hand, or adjusted manually after an initial run of an automated pipeline.        |
+| Version      | RECOMMENDED           | [string][]    | Version of the pipeline.                                                                                                                                                                                  |
+| Description  | OPTIONAL              | [string][]    | Plain-text description of the pipeline or process that generated the outputs. RECOMMENDED if `Name` is `"Manual"`.                                                                                        |
+| CodeURL      | OPTIONAL              | [string][]    | URL where the code used to generate the dataset may be found.                                                                                                                                         |
+| Container    | OPTIONAL              | [object][]    | Used to specify the location and relevant attributes of software container image used to produce the dataset. Valid keys in this object include `Type`, `Tag` and [`URI`][uri] with [string][] values. |
+
 Example:
 
 ```JSON
@@ -57,37 +70,45 @@ Example:
     "Alzheimer A., & Kraepelin, E. (2015). Neural correlates of presenile dementia in humans. Journal of Neuroscientific Data, 2, 234001. doi:1920.8/jndata.2015.7"
   ],
   "DatasetDOI": "doi:10.0.2.3/dfjj.10",
-  "HEDVersion": "7.1.1"
+  "HEDVersion": "7.1.1",
+  "GeneratedBy": [
+    {
+      "Name": "reproin",
+      "Version": "0.6.0", 
+      "Container": {
+        "Type": "docker",
+        "Tag": "repronim/reproin:0.6.0"
+      }
+    }
+  ],
+  "SourceDatasets": [
+    {
+      "URL": "s3://dicoms/studies/correlates",
+      "Version": "April 11 2011"
+    }
+  ]
 }
 ```
 
-#### Derived dataset and pipeline description
+#### Pipeline description
 
 As for any BIDS dataset, a `dataset_description.json` file MUST be found at the
 top level of every derived dataset:
 `<dataset>/derivatives/<pipeline_name>/dataset_description.json`.
 
-In addition to the keys for raw BIDS datasets,
-derived BIDS datasets include the following REQUIRED and RECOMMENDED
-`dataset_description.json` keys:
+In contrast to raw BIDS datasets, derived BIDS datasets MUST include
+`GeneratedBy` key:
 
 {{ MACROS___make_metadata_table(
    {
-      "GeneratedBy": "REQUIRED",
-      "SourceDatasets": "RECOMMENDED",
+      "GeneratedBy": "REQUIRED"
    }
 ) }}
 
-Each object in the `GeneratedBy` list includes the following REQUIRED, RECOMMENDED
-and OPTIONAL keys:
-
-| **Key name** | **Requirement level** | **Data type** | **Description**                                                                                                                                                                                           |
-|--------------|-----------------------|---------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| Name         | REQUIRED              | [string][]    | Name of the pipeline or process that generated the outputs. Use `"Manual"` to indicate the derivatives were generated by hand, or adjusted manually after an initial run of an automated pipeline.        |
-| Version      | RECOMMENDED           | [string][]    | Version of the pipeline.                                                                                                                                                                                  |
-| Description  | OPTIONAL              | [string][]    | Plain-text description of the pipeline or process that generated the outputs. RECOMMENDED if `Name` is `"Manual"`.                                                                                        |
-| CodeURL      | OPTIONAL              | [string][]    | URL where the code used to generate the derivatives may be found.                                                                                                                                         |
-| Container    | OPTIONAL              | [object][]    | Used to specify the location and relevant attributes of software container image used to produce the derivative. Valid keys in this object include `Type`, `Tag` and [`URI`][uri] with [string][] values. |
+If a derived dataset is stored as a subfolder of the raw dataset, then the `Name` field
+of the first `GeneratedBy` object MUST be a substring of the derived dataset folder name.
+That is, in a directory `<dataset>/derivatives/<pipeline>[-<variant>]/`, the first
+`GeneratedBy` object should have a `Name` of `<pipeline>`.
 
 Example:
 
@@ -120,11 +141,6 @@ Example:
 }
 ```
 
-If a derived dataset is stored as a subfolder of the raw dataset, then the `Name` field
-of the first `GeneratedBy` object MUST be a substring of the derived dataset folder name.
-That is, in a directory `<dataset>/derivatives/<pipeline>[-<variant>]/`, the first
-`GeneratedBy` object should have a `Name` of `<pipeline>`.
-
 ### `README`
 
 Every BIDS dataset SHOULD come with a free form text file (`README`) describing the dataset in more detail.
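
The folder-naming rule relocated in this diff (the `Name` of the first `GeneratedBy` object MUST be a substring of the derived dataset folder name) lends itself to a mechanical check. Below is a minimal Python sketch, assuming the contents of `dataset_description.json` have already been loaded into a dict. The function name `first_generator_matches_folder` and the `mypipeline-variant1` folder are hypothetical illustrations, not part of the specification or of any BIDS tool.

```python
from pathlib import PurePosixPath


def first_generator_matches_folder(description, dataset_path):
    """Check that the first GeneratedBy Name is a substring of the folder name.

    `description` is the parsed dataset_description.json (a dict);
    `dataset_path` is the path to the derived dataset folder, e.g.
    "<dataset>/derivatives/<pipeline>[-<variant>]/".
    """
    generated_by = description.get("GeneratedBy") or []
    if not generated_by or not isinstance(generated_by[0], dict):
        return False
    name = generated_by[0].get("Name", "")
    # PurePosixPath drops a trailing slash, so .name is the last folder.
    folder = PurePosixPath(dataset_path).name
    return isinstance(name, str) and bool(name) and name in folder


# Hypothetical derived dataset stored under the raw dataset.
description = {"GeneratedBy": [{"Name": "mypipeline", "Version": "1.0.0"}]}
print(first_generator_matches_folder(description, "ds/derivatives/mypipeline-variant1/"))
```

The check only applies when the derived dataset is stored as a subfolder of the raw dataset. A standalone derived dataset is not constrained by this rule.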

diff --git a/src/schema/objects/metadata.yaml b/src/schema/objects/metadata.yaml
@@ -866,7 +866,7 @@ Funding:
 GeneratedBy:
   name: GeneratedBy
   description: |
-    Used to specify provenance of the derived dataset.
+    Used to specify provenance of the dataset.
     See table below for contents of each object.
   type: array
   minItems: 1
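
Taken together, the schema entry above (`type: array`, `minItems: 1`) and the key table in the first file's diff describe a simple structure. Below is a minimal Python sketch of a checker that relies only on what those two places state. `validate_generated_by` is a hypothetical helper, not the BIDS validator, and it ignores any constraint this diff does not show.

```python
# Expected Python type for each key, following the table in the diff.
KEY_TYPES = {
    "Name": str,         # REQUIRED
    "Version": str,      # RECOMMENDED
    "Description": str,  # OPTIONAL
    "CodeURL": str,      # OPTIONAL
    "Container": dict,   # OPTIONAL
}
# Keys named in the table for Container; each holds a string value.
CONTAINER_KEYS = ("Type", "Tag", "URI")


def validate_generated_by(value):
    """Return a list of human-readable problems (empty if none were found)."""
    if not isinstance(value, list) or len(value) < 1:
        return ["GeneratedBy must be an array with at least one item"]
    problems = []
    for i, entry in enumerate(value):
        if not isinstance(entry, dict):
            problems.append(f"item {i}: expected an object")
            continue
        if "Name" not in entry:
            problems.append(f"item {i}: missing REQUIRED key 'Name'")
        for key, expected in KEY_TYPES.items():
            if key in entry and not isinstance(entry[key], expected):
                problems.append(f"item {i}: '{key}' has the wrong type")
        container = entry.get("Container")
        if isinstance(container, dict):
            for key in CONTAINER_KEYS:
                if key in container and not isinstance(container[key], str):
                    problems.append(f"item {i}: Container '{key}' must be a string")
    return problems


# The GeneratedBy value from the example added in this commit.
example = [
    {
        "Name": "reproin",
        "Version": "0.6.0",
        "Container": {"Type": "docker", "Tag": "repronim/reproin:0.6.0"},
    }
]
print(validate_generated_by(example))
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a single pass. Whether a missing `GeneratedBy` is itself an error depends on the dataset type (RECOMMENDED for raw, REQUIRED for derived), so that decision is left to the caller.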