[ENH] Add Sources link to all raw datatypes sidecar JSON #906

adam2392 · 2021-10-21T01:05:29Z

Fixes: #905

Add Sources to the sidecar json that point to the original source file.

Remi-Gau

I suspect this should also apply to all datatypes:

mri
pet
physio
beh

Remi-Gau · 2021-10-21T06:08:22Z

Also because this refers to source data, should we mention that file names should not contain sensitive information like say the participant name?

sappelhoff · 2021-10-21T10:36:15Z

I suspect this should also apply to all datatypes:

agreed

Also because this refers to source data, should we mention that file names should not contain sensitive information like say the participant name?

I think we can make a note of this, to make users aware of anonymization implications - but I don't think we need to make a MAY, SHOULD, MUST thing out of this

effigies · 2021-10-21T12:40:42Z

I would just use Sources. My inclination, with [ENH] Major update to pointing to files within, outside of, or remote to a current BIDS dataset #820, is to deprecate RawSources, and have a single Sources list that can contain any URI.
I think this has to be an OPTIONAL, as there's no expectation that there's a linkable reference for raw data. SHOULD is supposed to come with validator warnings, and this would be very noisy.
And agree that this should apply to all data files. I would be inclined to put it into common principles as a generally applicable piece of metadata, though this is something we don't currently have, so that may be a different conversation...

adam2392 · 2021-10-21T13:59:36Z

I would just use Sources. My inclination, with [ENH] Major update to pointing to files within, outside of, or remote to a current BIDS dataset #820, is to deprecate RawSources, and have a single Sources list that can contain any URI.

Done.

I think this has to be an OPTIONAL, as there's no expectation that there's a linkable reference for raw data. SHOULD is supposed to come with validator warnings, and this would be very noisy.

Done.

And agree that this should apply to all data files. I would be inclined to put it into common principles as a generally applicable piece of metadata, though this is something we don't currently have, so that may be a different conversation...

Yes perhaps this would be a different issue altogether?

Remi-Gau · 2021-10-22T06:51:05Z

Yes perhaps this would be a different issue altogether?

Fine with me. We can add this to the list of things that should be "refactored" in the spec.

src/CHANGES.md

Remi-Gau

Looks good to me.

I suspect we can add the 'anonymity warning' mentioned in the discussion when we refactor this into common principles.

guiomar · 2021-10-22T07:37:31Z

Thnaks Remi for pointing to this one.

I don't think I fully agree with the proposed change. If a dataset is self contained, what would this new field provide as an additional information about the data? A path to a file that you don't want to use anymore because is has already been converted? I do see the need of having such an information to track the data, but could be in a conversion table outside of the curated dataset, eg. /sourcedata. That one could keep as a useful resource for the lab logistics, but not necessarily share and distribute along the data.

Maybe I am missing some use cases where its important to have this inside each data json file

Remi-Gau · 2021-10-22T08:40:33Z

Should we ping some of the people involved in "provenance" issues / BEP to get more inputs on this ?

tsalo · 2021-10-22T13:59:13Z

Maybe I am missing some use cases where its important to have this inside each data json file

@guiomar the big one I'm thinking of is BIDS-MEGA (#880). They'll be working with a lot of non-compliant datasets. Most of them will be derivative datasets where Sources is already allowed, but I believe BEP team did bring up working with shared raw data, and they want to ensure that the mapping from the original files to the BIDS-compliant copies is clear.

adam2392 · 2021-10-22T18:18:04Z

So what I'm hearing is this feature should either be:

in each sidecar JSON file as an OPTIONAL
or in some JSON/TSV mapping inside sourcedata/

For sidecar JSONs, it can be stored in the Sources: key. But if it was a running list in sourcedata/, what would it look like?

say... a tsv with two columns:

filename | Sources
-------------------
         |

?

sappelhoff

For sidecar JSONs, it can be stored in the Sources: key. But if it was a running list in sourcedata/, what would it look like?

I find making use of Sources more elegant and minimalist than adding rules about sourcedata/

sappelhoff · 2021-10-26T13:05:05Z

src/04-modality-specific-files/01-magnetic-resonance-imaging-data.md

@@ -562,7 +566,8 @@ Example:
   "InstitutionName": "Stanford University",
   "InstitutionAddress": "450 Serra Mall, Stanford, CA 94305-2004, USA",
   "DeviceSerialNumber": "11035",
-   "B0FieldSource": ["phasediff_fmap0", "pepolar_fmap0"]
+   "B0FieldSource": ["phasediff_fmap0", "pepolar_fmap0"],
+   "Sources": ["dicom01", "dicom02", ..., "dicom150"]


This "breaks" the JSON syntax, see screenshot:

Maybe this is clear enough without breaking the syntax

Suggested change

"Sources": ["dicom01", "dicom02", ..., "dicom150"]

"Sources": ["dicom01", "dicom02", "...", "dicom150"]

I really don't think we should include this before #820, so that we don't introduce yet another ambiguous reference method. And then, this is an example, we can make the list short without ellipses:

Suggested change

"Sources": ["dicom01", "dicom02", ..., "dicom150"]

"Sources": ["bids::/sourcedata/dicom01", "bids::/sourcedata/dicom02"]

This is not necessary anymore under the pattern of the MRI metadata descriptions.

I would also think that we can definitely merge this in first before #820 where #820 tells us the different ways of linking datasets, but this PR is meant to just close the gap between BIDS converted files and their original source files.

effigies · 2021-10-26T13:10:06Z

src/04-modality-specific-files/01-magnetic-resonance-imaging-data.md

@@ -562,7 +566,8 @@ Example:
   "InstitutionName": "Stanford University",
   "InstitutionAddress": "450 Serra Mall, Stanford, CA 94305-2004, USA",
   "DeviceSerialNumber": "11035",
-   "B0FieldSource": ["phasediff_fmap0", "pepolar_fmap0"]
+   "B0FieldSource": ["phasediff_fmap0", "pepolar_fmap0"],
+   "Sources": ["dicom01", "dicom02", ..., "dicom150"]


I really don't think we should include this before #820, so that we don't introduce yet another ambiguous reference method. And then, this is an example, we can make the list short without ellipses:

Suggested change

"Sources": ["dicom01", "dicom02", ..., "dicom150"]

"Sources": ["bids::/sourcedata/dicom01", "bids::/sourcedata/dicom02"]

effigies · 2021-10-26T13:11:35Z

src/04-modality-specific-files/01-magnetic-resonance-imaging-data.md

@@ -492,6 +492,10 @@ combined image rather than an image from each coil.

 ### Other RECOMMENDED metadata

+#### Source Filename


I would put this under common metadata, not "Task imaging data".

effigies · 2021-10-26T14:13:14Z

src/04-modality-specific-files/01-magnetic-resonance-imaging-data.md

@@ -492,6 +492,10 @@ combined image rather than an image from each coil.

 ### Other RECOMMENDED metadata

+#### Source Filename
+
+Similar to [derivatives](../05-derivatives/02-common-data-types.md), it is OPTIONAL to include ``Sources`` as a key in the sidecar JSON, specifying the filename(s) of the source file used to generate this dataset. If the filename(s) contains patient identifiable information, then it should not be stored in ``Sources``.


I would suggest that we do this as a normal metadata field:

Key name Requirement level Data type Description

Sources OPTIONAL array of strings URI of source file used to generate the current file. Care should be taken not to leak patient identifiable information for publicly shared datasets.

Done and added this table to the rest of the modalities too

effigies · 2021-10-26T14:16:14Z

or in some JSON/TSV mapping inside sourcedata/

For sidecar JSONs, it can be stored in the Sources: key. But if it was a running list in sourcedata/, what would it look like?

say... a tsv with two columns:
filename | Sources
-------------------
         |

We should not specify anything in /sourcedata or /code. These have been specifically set aside for non-BIDS resources that are helpful for understanding/reconstructing the provenance of the dataset. It would be up to a tool that uses such a TSV to decide how to format it.

src/CHANGES.md

adam2392 · 2021-10-26T16:12:20Z

This is good to go on my end. I think it's a clean addition of a common principle (i.e. "Sources") that is already present in derivatives and can help in creating more robust datasets preventing accidental BIDS conversion of the same dataset to different file names.

Re anonymization concerns: I think it's already considered bad practice to have identifying information within the source filenames, so generally that shouldn't even occur. On the other hand, if the source filenames are clean of PHI, then it is desirable to have an OPTIONAL backwards trace for your dataset, especially during the initial analysis.

However, this issue is also present if you create derivatives/ of a dataset. There is no guarantee the naming system of the BIDS dataset itself is devoid of PHI in the filenames. Thus, derivatives themselves are also at risk. I don't see an easy way around the incorrect file naming for both sourcedata and derivatives that stems from sourcedata until some automated NLP solution comes about.

adam2392 · 2021-10-28T18:19:21Z

I was thinking this through again, and I wonder if perhaps adding it as an OPTIONAL column to the scans.tsv is not more appropriate? I realize that scans is already an OPTIONAL file, but so is Sources in these sidecar JSONs.

The reason that scans seems more appropriate is that, when writing files, one can simply check the scans.tsv to see if the Sources column already has the source file that is to be written. If it does and the new BIDS file path does not match what is already written, then an error should be raised because that means one is writing two files from the same source file.

This also simplifies deleting the entire Sources column, because you only need to do it in the scans.tsv for every subject/session rather then for every sidecar JSON. Moreover, this would be modality agnostic and also make BIDS more thoughtful of the downstream implementation.

Thoughts?

sappelhoff · 2021-11-01T14:18:22Z

Thoughts?

technically, nothing is preventing you to already add a sources column to scans.tsv and to document it in scans.json. If you want to make this a standardized "optional" column, I think that'd be fine. It'd be similar to how we recommend a "handedness" column in participants.tsv, and suggest some values to put there (source), only that a sources column would be optional.

Not sure whether that solution would be preferable to extending the derivative Sources to raw BIDS data. We could also do both. To me, both changes feel like they would not destroy anything and/or make anything harder downstream, so personally I am open to whatever you think solves your problem best.

adam2392 · 2021-11-01T18:24:12Z

I think adding to scans.tsv is a good addition too. I'll modify it to include both possibilities. I think downstream in MNE-BIDS for example, we would just write it in scans.tsv and check there for existing files when writing.

This is good to go for me, unless there are other objections.

adam2392 · 2021-11-04T16:56:14Z

@effigies, @sappelhoff just pinging to see if there's any other issues here for me to address, so I don't forget it :p. Thanks!

adam2392 · 2021-11-04T17:07:29Z

Not sure why link checker is breaking.

sappelhoff · 2021-11-04T17:24:42Z

Not sure why link checker is breaking.

see #910 - nothing you need to fix here :-)

just pinging to see if there's any other issues here for me to address, so I don't forget it :p. Thanks!

will take a look soon!

sappelhoff

Having read about the Sources field in JSON files again (see Sources), I think that a simple sources column in scans.tsv may be a better solution, see my review comment.

I think the Sources field would be a bit unwieldy for the goal you have ... it'd mean having to create a JSON sidecar for each file you want to describe a source of 🤔 That'd quickly become messy when you have lots of task and run entities.

thus I suggest to remove all "Sources" JSON references. But @adam2392 please hold off any more changes, I plan to discuss this with the other maintainers tomorrow evening, so my opinion might change :-) (don't want you to do more work than necessary)

sappelhoff · 2021-11-08T13:18:17Z

src/03-modality-agnostic-files.md

@@ -452,6 +452,8 @@ For example, the [EDF](https://www.edfplus.info/)
 data format can only contain recording dates after 1985.
 Shifting dates is RECOMMENDED, but not required.

+The source file path of each recording is also OPTIONAL and can be stored under the ``sources`` column. Similar to [derivatives](./05-derivatives/02-common-data-types.md), it is OPTIONAL to include ``sources``, specifying the filename(s) of the source file used to generate this dataset. If the filename(s) contains patient identifiable information, then it should not be stored.


double backticks render code in RST, but for MD, it's single backticks 😉

Also: the files under the filename column are relative to the scans.tsv file (this is implicit unfortunately, but relatively clear from the context that this is meant by the text on line 435). --> What should the files under the sources column be relative to? ... should we wait for #918 and make this BIDS-URIs? ... or should it be relative to the dataset root - as you seem to do in your current example?

adam2392 · 2021-11-10T15:36:23Z

Having read about the Sources field in JSON files again (see Sources), I think that a simple sources column in scans.tsv may be a better solution, see my review comment.

Sounds good. I agree here.

thus I suggest to remove all "Sources" JSON references. But @adam2392 please hold off any more changes, I plan to discuss this with the other maintainers tomorrow evening, so my opinion might change :-) (don't want you to do more work than necessary)

Okay, any update here? Thanks!

sappelhoff · 2021-11-11T21:26:38Z

Okay, any update here? Thanks!

After some discussion, the consensus was that this problem at this point is too little a user-issue to warrant making some sources column a "first-class-citizen" in the BIDS spec, when instead, users can just add a sources column themselves and document it in their JSON file.

In the original issue, you make these two arguments:

... many times you'll have new datasets coming in, or maybe you want to determine if the file you uploaded with some filename (e.g. subject001_eeg_001.edf) was converted or not.

This is something you can easily fix and track in your conversion scripts, I don't think this needs to be done in BIDS.

Moreover, many of my collaborators (i.e. clinicians) only remember their original file naming scheme, not the organized BIDS files. Unfortunately, then there's a lot of back and forth about which file is which unless there is a backwards trace of which BIDS file corresponds to which source file. There is no easy way to check this right now.

You are the first to bring up this issue (thanks for that) - but it's arguably a relatively small issue (and apparently not widespread), and I pointed out the solution in my first few sentences above (using arbitrary columns).

@adam2392 I think we could add support for this in mne-bids without having to have this explicitly in the spec.

Overall, I suggest closing the issue and this PR and to solve your problems from the tooling side.

adam2392 · 2021-11-11T21:59:51Z

Sounds good to me if we can PR mne-bids.

adam2392 requested review from sappelhoff, tsalo and effigies October 21, 2021 01:05

adam2392 mentioned this pull request Oct 21, 2021

[ENH] Include RawSources in sidecar JSON files mne-tools/mne-bids#887

Closed

Remi-Gau requested changes Oct 21, 2021

View reviewed changes

adam2392 requested a review from chrisgorgo as a code owner October 21, 2021 13:59

adam2392 requested a review from Remi-Gau October 21, 2021 15:18

adam2392 changed the title ~~[ENH] Add RawSources link to MEG/EEG/iEEG sidecar JSON~~ [ENH] Add Sources link to all raw datatypes sidecar JSON Oct 21, 2021

adam2392 requested a review from CPernet as a code owner October 21, 2021 17:25

adam2392 mentioned this pull request Oct 21, 2021

[MRG] Adds Sources for scans.tsv files mne-tools/mne-bids#890

Merged

6 tasks

Remi-Gau mentioned this pull request Oct 22, 2021

Adding sourcedata filename to a column in the scans.tsv file #905

Closed

Remi-Gau reviewed Oct 22, 2021

View reviewed changes

src/CHANGES.md Outdated Show resolved Hide resolved

Remi-Gau approved these changes Oct 22, 2021

View reviewed changes

sappelhoff reviewed Oct 26, 2021

View reviewed changes

effigies reviewed Oct 26, 2021

View reviewed changes

adam2392 mentioned this pull request Oct 26, 2021

Add Sources key for sidecar JSON bids-standard/legacy-validator#1373

Closed

adam2392 added 4 commits October 26, 2021 11:58

Add RawSources

534053d

Adding to all datatypes

bf54d1f

Adding to all datatypes

5e0f0eb

Add changelog

7cd3668

adam2392 added 3 commits October 26, 2021 11:58

Adding updated array

7160c56

Fix CHANGE

a68699c

Adding table for clarification

047deb2

adam2392 force-pushed the sources branch from 4f8d53b to 047deb2 Compare October 26, 2021 16:00

adam2392 commented Oct 26, 2021

View reviewed changes

src/CHANGES.md Outdated Show resolved Hide resolved

Update src/CHANGES.md

a4cd06e

adam2392 requested review from effigies and sappelhoff October 28, 2021 18:14

Adding to scans

ec898a3

Merge branch 'master' into sources

d9b5123

sappelhoff reviewed Nov 8, 2021

View reviewed changes

adam2392 closed this Nov 11, 2021

adam2392 mentioned this pull request Nov 11, 2021

Adding source filename/path in scans.tsv mne-tools/mne-bids#905

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ENH] Add Sources link to all raw datatypes sidecar JSON #906

[ENH] Add Sources link to all raw datatypes sidecar JSON #906

adam2392 commented Oct 21, 2021 •

edited

Loading

Remi-Gau left a comment

Remi-Gau commented Oct 21, 2021

sappelhoff commented Oct 21, 2021

effigies commented Oct 21, 2021

adam2392 commented Oct 21, 2021

Remi-Gau commented Oct 22, 2021

Remi-Gau left a comment

guiomar commented Oct 22, 2021

Remi-Gau commented Oct 22, 2021

tsalo commented Oct 22, 2021

adam2392 commented Oct 22, 2021 •

edited by sappelhoff

Loading

sappelhoff left a comment

sappelhoff Oct 26, 2021

effigies Oct 26, 2021

adam2392 Oct 26, 2021

adam2392 Oct 26, 2021

effigies Oct 26, 2021

effigies Oct 26, 2021

adam2392 Oct 26, 2021

effigies Oct 26, 2021

adam2392 Oct 26, 2021

effigies commented Oct 26, 2021

adam2392 commented Oct 26, 2021

adam2392 commented Oct 28, 2021

sappelhoff commented Nov 1, 2021

adam2392 commented Nov 1, 2021 •

edited

Loading

adam2392 commented Nov 4, 2021

adam2392 commented Nov 4, 2021

sappelhoff commented Nov 4, 2021

sappelhoff left a comment

sappelhoff Nov 8, 2021

adam2392 commented Nov 10, 2021

sappelhoff commented Nov 11, 2021

adam2392 commented Nov 11, 2021

	"Sources": ["dicom01", "dicom02", ..., "dicom150"]
	"Sources": ["dicom01", "dicom02", "...", "dicom150"]

	"Sources": ["dicom01", "dicom02", ..., "dicom150"]
	"Sources": ["bids::/sourcedata/dicom01", "bids::/sourcedata/dicom02"]

		@@ -492,6 +492,10 @@ combined image rather than an image from each coil.

		### Other RECOMMENDED metadata

		#### Source Filename

[ENH] Add Sources link to all raw datatypes sidecar JSON #906

[ENH] Add Sources link to all raw datatypes sidecar JSON #906

Conversation

adam2392 commented Oct 21, 2021 • edited Loading

Remi-Gau left a comment

Choose a reason for hiding this comment

Remi-Gau commented Oct 21, 2021

sappelhoff commented Oct 21, 2021

effigies commented Oct 21, 2021

adam2392 commented Oct 21, 2021

Remi-Gau commented Oct 22, 2021

Remi-Gau left a comment

Choose a reason for hiding this comment

guiomar commented Oct 22, 2021

Remi-Gau commented Oct 22, 2021

tsalo commented Oct 22, 2021

adam2392 commented Oct 22, 2021 • edited by sappelhoff Loading

sappelhoff left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

effigies commented Oct 26, 2021

adam2392 commented Oct 26, 2021

adam2392 commented Oct 28, 2021

sappelhoff commented Nov 1, 2021

adam2392 commented Nov 1, 2021 • edited Loading

adam2392 commented Nov 4, 2021

adam2392 commented Nov 4, 2021

sappelhoff commented Nov 4, 2021

sappelhoff left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

adam2392 commented Nov 10, 2021

sappelhoff commented Nov 11, 2021

adam2392 commented Nov 11, 2021

adam2392 commented Oct 21, 2021 •

edited

Loading

adam2392 commented Oct 22, 2021 •

edited by sappelhoff

Loading

adam2392 commented Nov 1, 2021 •

edited

Loading