From 75f90e6ae4c5cafff2bae4a7c3900381208e5608 Mon Sep 17 00:00:00 2001
From: Yaroslav Halchenko <debian@onerussian.com>
Date: Mon, 11 Oct 2021 14:06:50 -0400
Subject: [PATCH] [ENH] introduce GeneratedBy to "core" BIDS

#487 (and originally #439) is a `WIP ENH` to introduce standardized provenance
capture/expression for BIDS datasets.  This PR just follows the idea of #371
(small atomic ENHs), and is based on the current state of the specification,
where we have GeneratedBy to describe how a BIDS derivative dataset came into
existence.

## Rationale

As I have previously stated in many (face-to-face, when it was still
possible ;)) conversations, in my view any BIDS dataset is a derivative
dataset.  Even if it contains "raw" data, it is never given by gods, but is the
result of some process (let's call it a pipeline for consistency) which produced
it out of some other data. That is why there is 1) `sourcedata/` to hold such
original data (just as "raw" in terms of processing, but "raw"er in terms of
its relation to the data actually acquired by the equipment), and 2) `code/` to
hold the scripts used to produce or "tune" the dataset.  Typically
"sourcedata" is either a collection of DICOMs or a collection of data in some
other format (e.g. NIfTI) which is then either converted or just renamed into
the BIDS layout. ATM, encountering a new BIDS dataset requires forensics
and/or data archaeology to discover how it came about, e.g. to figure out
the source of the buggy (meta)data it contains.

At the level of individual files, some tools already add ad-hoc fields
to the sidecar .json files they produce during conversion,

<details>
<summary>e.g. dcm2niix adds ConversionSoftware and ConversionSoftwareVersion</summary>

```shell
(git-annex)lena:~/datalad/dbic/QA[master]git
$> git grep ConversionSoftware | head -n 2
sub-amit/ses-20180508/anat/sub-amit_ses-20180508_acq-MPRAGE_T1w.json:  "ConversionSoftware": "dcm2niix",
sub-amit/ses-20180508/anat/sub-amit_ses-20180508_acq-MPRAGE_T1w.json:  "ConversionSoftwareVersion": "v1.0.20170923 (OpenJPEG build) GCC6.3.0",

```
</details>

ATM I need to add such metadata to datasets produced by heudiconv to make
sure that in case of incremental conversions there is no switch in versions of
the software.
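
To illustrate that use case, here is a minimal sketch (not actual heudiconv
code) of how a converter could record itself in `GeneratedBy` and refuse an
incremental conversion under a different version. The helper name
`record_generated_by` and the version strings are hypothetical; only the
`GeneratedBy`, `Name` and `Version` keys come from the specification.

```python
import json
from pathlib import Path


def record_generated_by(dataset_dir, name, version):
    """Record a GeneratedBy entry, refusing a silent switch of versions.

    Hypothetical helper for illustration only.
    """
    path = Path(dataset_dir) / "dataset_description.json"
    # Start from the existing description if there is one.
    description = json.loads(path.read_text()) if path.exists() else {}
    entries = description.setdefault("GeneratedBy", [])
    for entry in entries:
        if entry.get("Name") == name:
            if entry.get("Version") != version:
                raise RuntimeError(
                    f"{name} {entry.get('Version')} already produced this "
                    f"dataset; refusing to continue with {version}"
                )
            return description  # already recorded, nothing to write
    entries.append({"Name": name, "Version": version})
    path.write_text(json.dumps(description, indent=2) + "\n")
    return description
```

Calling it again with the same name and version leaves the file untouched,
whereas a different version raises an error before any new data is converted.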
---
 src/03-modality-agnostic-files.md | 60 +++++++++++++++++++------------
 src/schema/objects/metadata.yaml  |  2 +-
 2 files changed, 39 insertions(+), 23 deletions(-)
diff --git a/src/03-modality-agnostic-files.md b/src/03-modality-agnostic-files.md
index 7833f96b60..e3c88e98aa 100644
--- a/src/03-modality-agnostic-files.md
+++ b/src/03-modality-agnostic-files.md
@@ -28,9 +28,22 @@ Every dataset MUST include this file with the following fields:
       "EthicsApprovals": "OPTIONAL",
       "ReferencesAndLinks": "OPTIONAL",
       "DatasetDOI": "OPTIONAL",
+      "GeneratedBy": "RECOMMENDED",
+      "SourceDatasets": "RECOMMENDED",
    }
 ) }}
 
+Each object in the `GeneratedBy` list includes the following REQUIRED, RECOMMENDED
+and OPTIONAL keys:
+
+| **Key name** | **Requirement level** | **Data type** | **Description**                                                                                                                                                                                           |
+|--------------|-----------------------|---------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Name         | REQUIRED              | [string][]    | Name of the pipeline or process that generated the outputs. Use `"Manual"` to indicate the derivatives were generated by hand, or adjusted manually after an initial run of an automated pipeline.        |
+| Version      | RECOMMENDED           | [string][]    | Version of the pipeline.                                                                                                                                                                                  |
+| Description  | OPTIONAL              | [string][]    | Plain-text description of the pipeline or process that generated the outputs. RECOMMENDED if `Name` is `"Manual"`.                                                                                        |
+| CodeURL      | OPTIONAL              | [string][]    | URL where the code used to generate the dataset may be found.                                                                                                                                         |
+| Container    | OPTIONAL              | [object][]    | Used to specify the location and relevant attributes of software container image used to produce the dataset. Valid keys in this object include `Type`, `Tag` and [`URI`][uri] with [string][] values. |
+
 Example:
 
 ```JSON
@@ -57,37 +70,45 @@ Example:
     "Alzheimer A., & Kraepelin, E. (2015). Neural correlates of presenile dementia in humans. Journal of Neuroscientific Data, 2, 234001. doi:1920.8/jndata.2015.7"
   ],
   "DatasetDOI": "doi:10.0.2.3/dfjj.10",
-  "HEDVersion": "7.1.1"
+  "HEDVersion": "7.1.1",
+  "GeneratedBy": [
+    {
+      "Name": "reproin",
+      "Version": "0.6.0",
+      "Container": {
+        "Type": "docker",
+        "Tag": "repronim/reproin:0.6.0"
+      }
+    }
+  ],
+  "SourceDatasets": [
+    {
+      "URL": "s3://dicoms/studies/correlates",
+      "Version": "April 11 2011"
+    }
+  ]
 }
 ```
 
-#### Derived dataset and pipeline description
+#### Pipeline description
 
 As for any BIDS dataset, a `dataset_description.json` file MUST be found at the
 top level of every derived dataset:
 `<dataset>/derivatives/<pipeline_name>/dataset_description.json`.
 
-In addition to the keys for raw BIDS datasets,
-derived BIDS datasets include the following REQUIRED and RECOMMENDED
-`dataset_description.json` keys:
+In contrast to raw BIDS datasets, derived BIDS datasets MUST include the
+`GeneratedBy` key:
 
 {{ MACROS___make_metadata_table(
    {
-      "GeneratedBy": "REQUIRED",
-      "SourceDatasets": "RECOMMENDED",
+      "GeneratedBy": "REQUIRED"
    }
 ) }}
 
-Each object in the `GeneratedBy` list includes the following REQUIRED, RECOMMENDED
-and OPTIONAL keys:
-
-| **Key name** | **Requirement level** | **Data type** | **Description**                                                                                                                                                                                           |
-|--------------|-----------------------|---------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| Name         | REQUIRED              | [string][]    | Name of the pipeline or process that generated the outputs. Use `"Manual"` to indicate the derivatives were generated by hand, or adjusted manually after an initial run of an automated pipeline.        |
-| Version      | RECOMMENDED           | [string][]    | Version of the pipeline.                                                                                                                                                                                  |
-| Description  | OPTIONAL              | [string][]    | Plain-text description of the pipeline or process that generated the outputs. RECOMMENDED if `Name` is `"Manual"`.                                                                                        |
-| CodeURL      | OPTIONAL              | [string][]    | URL where the code used to generate the derivatives may be found.                                                                                                                                         |
-| Container    | OPTIONAL              | [object][]    | Used to specify the location and relevant attributes of software container image used to produce the derivative. Valid keys in this object include `Type`, `Tag` and [`URI`][uri] with [string][] values. |
+If a derived dataset is stored as a subfolder of the raw dataset, then the `Name` field
+of the first `GeneratedBy` object MUST be a substring of the derived dataset folder name.
+That is, in a directory `<dataset>/derivatives/<pipeline>[-<variant>]/`, the first
+`GeneratedBy` object should have a `Name` of `<pipeline>`.
 
 Example:
 
@@ -120,11 +141,6 @@ Example:
 }
 ```
 
-If a derived dataset is stored as a subfolder of the raw dataset, then the `Name` field
-of the first `GeneratedBy` object MUST be a substring of the derived dataset folder name.
-That is, in a directory `<dataset>/derivatives/<pipeline>[-<variant>]/`, the first
-`GeneratedBy` object should have a `Name` of `<pipeline>`.
-
 ### `README`
 
 Every BIDS dataset SHOULD come with a free form text file (`README`) describing the dataset in more detail.
diff --git a/src/schema/objects/metadata.yaml b/src/schema/objects/metadata.yaml
index a8e572a878..1a7e44c271 100644
--- a/src/schema/objects/metadata.yaml
+++ b/src/schema/objects/metadata.yaml
@@ -866,7 +866,7 @@ Funding:
 GeneratedBy:
   name: GeneratedBy
   description: |
-    Used to specify provenance of the derived dataset.
+    Used to specify provenance of the dataset.
     See table below for contents of each object.
   type: array
   minItems: 1