Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Add Sources link to all raw datatypes sidecar JSON #906

Closed
wants to merge 10 commits into from

Conversation

adam2392
Copy link
Member

@adam2392 adam2392 commented Oct 21, 2021

Fixes: #905

Add Sources to the sidecar json that point to the original source file.

Copy link
Collaborator

@Remi-Gau Remi-Gau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect this should also apply to all datatypes:

  • mri
  • pet
  • physio
  • beh

@Remi-Gau
Copy link
Collaborator

Also because this refers to source data, should we mention that file names should not contain sensitive information like say the participant name?

@sappelhoff
Copy link
Member

I suspect this should also apply to all datatypes:

agreed

Also because this refers to source data, should we mention that file names should not contain sensitive information like say the participant name?

I think we can make a note of this, to make users aware of anonymization implications - but I don't think we need to make a MAY, SHOULD, MUST thing out of this

@effigies
Copy link
Collaborator

  • I would just use Sources. My inclination, with [ENH] Major update to pointing to files within, outside of, or remote to a current BIDS dataset #820, is to deprecate RawSources, and have a single Sources list that can contain any URI.
  • I think this has to be an OPTIONAL, as there's no expectation that there's a linkable reference for raw data. SHOULD is supposed to come with validator warnings, and this would be very noisy.
  • And agree that this should apply to all data files. I would be inclined to put it into common principles as a generally applicable piece of metadata, though this is something we don't currently have, so that may be a different conversation...

@adam2392
Copy link
Member Author

Done.

  • I think this has to be an OPTIONAL, as there's no expectation that there's a linkable reference for raw data. SHOULD is supposed to come with validator warnings, and this would be very noisy.

Done.

  • And agree that this should apply to all data files. I would be inclined to put it into common principles as a generally applicable piece of metadata, though this is something we don't currently have, so that may be a different conversation...

Yes perhaps this would be a different issue altogether?

@adam2392 adam2392 changed the title [ENH] Add RawSources link to MEG/EEG/iEEG sidecar JSON [ENH] Add Sources link to all raw datatypes sidecar JSON Oct 21, 2021
@Remi-Gau
Copy link
Collaborator

Yes perhaps this would be a different issue altogether?

Fine with me. We can add this to the list of things that should be "refactored" in the spec.

src/CHANGES.md Outdated Show resolved Hide resolved
Copy link
Collaborator

@Remi-Gau Remi-Gau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

I suspect we can add the 'anonymity warning' mentioned in the discussion when we refactor this into common principles.

@guiomar
Copy link
Collaborator

guiomar commented Oct 22, 2021

Thnaks Remi for pointing to this one.

I don't think I fully agree with the proposed change. If a dataset is self contained, what would this new field provide as an additional information about the data? A path to a file that you don't want to use anymore because is has already been converted? I do see the need of having such an information to track the data, but could be in a conversion table outside of the curated dataset, eg. /sourcedata. That one could keep as a useful resource for the lab logistics, but not necessarily share and distribute along the data.

Maybe I am missing some use cases where its important to have this inside each data json file

@Remi-Gau
Copy link
Collaborator

Should we ping some of the people involved in "provenance" issues / BEP to get more inputs on this ?

@tsalo
Copy link
Member

tsalo commented Oct 22, 2021

Maybe I am missing some use cases where its important to have this inside each data json file

@guiomar the big one I'm thinking of is BIDS-MEGA (#880). They'll be working with a lot of non-compliant datasets. Most of them will be derivative datasets where Sources is already allowed, but I believe BEP team did bring up working with shared raw data, and they want to ensure that the mapping from the original files to the BIDS-compliant copies is clear.

@adam2392
Copy link
Member Author

adam2392 commented Oct 22, 2021

So what I'm hearing is this feature should either be:

  1. in each sidecar JSON file as an OPTIONAL
  2. or in some JSON/TSV mapping inside sourcedata/

For sidecar JSONs, it can be stored in the Sources: key. But if it was a running list in sourcedata/, what would it look like?

say... a tsv with two columns:

filename | Sources
-------------------
         |

?

Copy link
Member

@sappelhoff sappelhoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For sidecar JSONs, it can be stored in the Sources: key. But if it was a running list in sourcedata/, what would it look like?

I find making use of Sources more elegant and minimalist than adding rules about sourcedata/

@@ -562,7 +566,8 @@ Example:
"InstitutionName": "Stanford University",
"InstitutionAddress": "450 Serra Mall, Stanford, CA 94305-2004, USA",
"DeviceSerialNumber": "11035",
"B0FieldSource": ["phasediff_fmap0", "pepolar_fmap0"]
"B0FieldSource": ["phasediff_fmap0", "pepolar_fmap0"],
"Sources": ["dicom01", "dicom02", ..., "dicom150"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This "breaks" the JSON syntax, see screenshot:

image

Maybe this is clear enough without breaking the syntax

Suggested change
"Sources": ["dicom01", "dicom02", ..., "dicom150"]
"Sources": ["dicom01", "dicom02", "...", "dicom150"]

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't think we should include this before #820, so that we don't introduce yet another ambiguous reference method. And then, this is an example, we can make the list short without ellipses:

Suggested change
"Sources": ["dicom01", "dicom02", ..., "dicom150"]
"Sources": ["bids::/sourcedata/dicom01", "bids::/sourcedata/dicom02"]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not necessary anymore under the pattern of the MRI metadata descriptions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also think that we can definitely merge this in first before #820 where #820 tells us the different ways of linking datasets, but this PR is meant to just close the gap between BIDS converted files and their original source files.

@@ -562,7 +566,8 @@ Example:
"InstitutionName": "Stanford University",
"InstitutionAddress": "450 Serra Mall, Stanford, CA 94305-2004, USA",
"DeviceSerialNumber": "11035",
"B0FieldSource": ["phasediff_fmap0", "pepolar_fmap0"]
"B0FieldSource": ["phasediff_fmap0", "pepolar_fmap0"],
"Sources": ["dicom01", "dicom02", ..., "dicom150"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't think we should include this before #820, so that we don't introduce yet another ambiguous reference method. And then, this is an example, we can make the list short without ellipses:

Suggested change
"Sources": ["dicom01", "dicom02", ..., "dicom150"]
"Sources": ["bids::/sourcedata/dicom01", "bids::/sourcedata/dicom02"]

@@ -492,6 +492,10 @@ combined image rather than an image from each coil.

### Other RECOMMENDED metadata

#### Source Filename
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would put this under common metadata, not "Task imaging data".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -492,6 +492,10 @@ combined image rather than an image from each coil.

### Other RECOMMENDED metadata

#### Source Filename

Similar to [derivatives](../05-derivatives/02-common-data-types.md), it is OPTIONAL to include ``Sources`` as a key in the sidecar JSON, specifying the filename(s) of the source file used to generate this dataset. If the filename(s) contains patient identifiable information, then it should not be stored in ``Sources``.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest that we do this as a normal metadata field:

Key name Requirement level Data type Description
Sources OPTIONAL array of strings URI of source file used to generate the current file. Care should be taken not to leak patient identifiable information for publicly shared datasets.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done and added this table to the rest of the modalities too

@effigies
Copy link
Collaborator

  1. or in some JSON/TSV mapping inside sourcedata/

For sidecar JSONs, it can be stored in the Sources: key. But if it was a running list in sourcedata/, what would it look like?

say... a tsv with two columns:

filename | Sources
-------------------
         |

We should not specify anything in /sourcedata or /code. These have been specifically set aside for non-BIDS resources that are helpful for understanding/reconstructing the provenance of the dataset. It would be up to a tool that uses such a TSV to decide how to format it.

src/CHANGES.md Outdated Show resolved Hide resolved
@adam2392
Copy link
Member Author

This is good to go on my end. I think it's a clean addition of a common principle (i.e. "Sources") that is already present in derivatives and can help in creating more robust datasets preventing accidental BIDS conversion of the same dataset to different file names.

Re anonymization concerns: I think it's already considered bad practice to have identifying information within the source filenames, so generally that shouldn't even occur. On the other hand, if the source filenames are clean of PHI, then it is desirable to have an OPTIONAL backwards trace for your dataset, especially during the initial analysis.

However, this issue is also present if you create derivatives/ of a dataset. There is no guarantee the naming system of the BIDS dataset itself is devoid of PHI in the filenames. Thus, derivatives themselves are also at risk. I don't see an easy way around the incorrect file naming for both sourcedata and derivatives that stems from sourcedata until some automated NLP solution comes about.

@adam2392
Copy link
Member Author

I was thinking this through again, and I wonder if perhaps adding it as an OPTIONAL column to the scans.tsv is not more appropriate? I realize that scans is already an OPTIONAL file, but so is Sources in these sidecar JSONs.

The reason that scans seems more appropriate is that, when writing files, one can simply check the scans.tsv to see if the Sources column already has the source file that is to be written. If it does and the new BIDS file path does not match what is already written, then an error should be raised because that means one is writing two files from the same source file.

This also simplifies deleting the entire Sources column, because you only need to do it in the scans.tsv for every subject/session rather then for every sidecar JSON. Moreover, this would be modality agnostic and also make BIDS more thoughtful of the downstream implementation.

Thoughts?

@sappelhoff
Copy link
Member

Thoughts?

technically, nothing is preventing you to already add a sources column to scans.tsv and to document it in scans.json. If you want to make this a standardized "optional" column, I think that'd be fine. It'd be similar to how we recommend a "handedness" column in participants.tsv, and suggest some values to put there (source), only that a sources column would be optional.

Not sure whether that solution would be preferable to extending the derivative Sources to raw BIDS data. We could also do both. To me, both changes feel like they would not destroy anything and/or make anything harder downstream, so personally I am open to whatever you think solves your problem best.

@adam2392
Copy link
Member Author

adam2392 commented Nov 1, 2021

I think adding to scans.tsv is a good addition too. I'll modify it to include both possibilities. I think downstream in MNE-BIDS for example, we would just write it in scans.tsv and check there for existing files when writing.

This is good to go for me, unless there are other objections.

@adam2392
Copy link
Member Author

adam2392 commented Nov 4, 2021

@effigies, @sappelhoff just pinging to see if there's any other issues here for me to address, so I don't forget it :p. Thanks!

@adam2392
Copy link
Member Author

adam2392 commented Nov 4, 2021

Not sure why link checker is breaking.

@sappelhoff
Copy link
Member

Not sure why link checker is breaking.

see #910 - nothing you need to fix here :-)

just pinging to see if there's any other issues here for me to address, so I don't forget it :p. Thanks!

will take a look soon!

Copy link
Member

@sappelhoff sappelhoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having read about the Sources field in JSON files again (see Sources), I think that a simple sources column in scans.tsv may be a better solution, see my review comment.

I think the Sources field would be a bit unwieldy for the goal you have ... it'd mean having to create a JSON sidecar for each file you want to describe a source of 🤔 That'd quickly become messy when you have lots of task and run entities.

thus I suggest to remove all "Sources" JSON references. But @adam2392 please hold off any more changes, I plan to discuss this with the other maintainers tomorrow evening, so my opinion might change :-) (don't want you to do more work than necessary)

@@ -452,6 +452,8 @@ For example, the [EDF](https://www.edfplus.info/)
data format can only contain recording dates after 1985.
Shifting dates is RECOMMENDED, but not required.

The source file path of each recording is also OPTIONAL and can be stored under the ``sources`` column. Similar to [derivatives](./05-derivatives/02-common-data-types.md), it is OPTIONAL to include ``sources``, specifying the filename(s) of the source file used to generate this dataset. If the filename(s) contains patient identifiable information, then it should not be stored.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double backticks render code in RST, but for MD, it's single backticks 😉

Also: the files under the filename column are relative to the scans.tsv file (this is implicit unfortunately, but relatively clear from the context that this is meant by the text on line 435). --> What should the files under the sources column be relative to? ... should we wait for #918 and make this BIDS-URIs? ... or should it be relative to the dataset root - as you seem to do in your current example?

@adam2392
Copy link
Member Author

Having read about the Sources field in JSON files again (see Sources), I think that a simple sources column in scans.tsv may be a better solution, see my review comment.

Sounds good. I agree here.

thus I suggest to remove all "Sources" JSON references. But @adam2392 please hold off any more changes, I plan to discuss this with the other maintainers tomorrow evening, so my opinion might change :-) (don't want you to do more work than necessary)

Okay, any update here? Thanks!

@sappelhoff
Copy link
Member

Okay, any update here? Thanks!

After some discussion, the consensus was that this problem at this point is too little a user-issue to warrant making some sources column a "first-class-citizen" in the BIDS spec, when instead, users can just add a sources column themselves and document it in their JSON file.

In the original issue, you make these two arguments:

... many times you'll have new datasets coming in, or maybe you want to determine if the file you uploaded with some filename (e.g. subject001_eeg_001.edf) was converted or not.

This is something you can easily fix and track in your conversion scripts, I don't think this needs to be done in BIDS.

Moreover, many of my collaborators (i.e. clinicians) only remember their original file naming scheme, not the organized BIDS files. Unfortunately, then there's a lot of back and forth about which file is which unless there is a backwards trace of which BIDS file corresponds to which source file. There is no easy way to check this right now.

You are the first to bring up this issue (thanks for that) - but it's arguably a relatively small issue (and apparently not widespread), and I pointed out the solution in my first few sentences above (using arbitrary columns).


@adam2392 I think we could add support for this in mne-bids without having to have this explicitly in the spec.

Overall, I suggest closing the issue and this PR and to solve your problems from the tooling side.

@adam2392
Copy link
Member Author

Sounds good to me if we can PR mne-bids.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Adding sourcedata filename to a column in the scans.tsv file
6 participants