[ENH] Add Sources link to all raw datatypes sidecar JSON #906
I suspect this should also apply to all datatypes:
- mri
- pet
- physio
- beh
Also, because this refers to source data, should we mention that file names should not contain sensitive information, such as the participant's name?
agreed
I think we can make a note of this, to make users aware of anonymization implications - but I don't think we need to make a MAY, SHOULD, MUST thing out of this.
Done.
Done.
Yes, perhaps this would be a different issue altogether?
Fine with me. We can add this to the list of things that should be "refactored" in the spec.
Looks good to me.
I suspect we can add the 'anonymity warning' mentioned in the discussion when we refactor this into common principles.
Thanks Remi for pointing to this one. I don't think I fully agree with the proposed change. If a dataset is self-contained, what additional information about the data would this new field provide? A path to a file that you don't want to use anymore because it has already been converted? I do see the need for such information to track the data, but it could live in a conversion table outside of the curated dataset, e.g. /sourcedata. One could keep that as a useful resource for lab logistics, but not necessarily share and distribute it along with the data. Maybe I am missing some use cases where it's important to have this inside each data JSON file.
Should we ping some of the people involved in "provenance" issues / BEP to get more input on this?
@guiomar the big one I'm thinking of is BIDS-MEGA (#880). They'll be working with a lot of non-compliant datasets. Most of them will be derivative datasets where Sources is already allowed, but I believe the BEP team did bring up working with shared raw data, and they want to ensure that the mapping from the original files to the BIDS-compliant copies is clear.
So what I'm hearing is this feature should either be:
For sidecar JSONs, it can be stored in the say... a tsv with two columns:
? |
For sidecar JSONs, it can be stored in the Sources: key. But if it was a running list in sourcedata/, what would it look like?
I find making use of `Sources` more elegant and minimalist than adding rules about `sourcedata/`.
@@ -562,7 +566,8 @@ Example:
 "InstitutionName": "Stanford University",
 "InstitutionAddress": "450 Serra Mall, Stanford, CA 94305-2004, USA",
 "DeviceSerialNumber": "11035",
-"B0FieldSource": ["phasediff_fmap0", "pepolar_fmap0"]
+"B0FieldSource": ["phasediff_fmap0", "pepolar_fmap0"],
+"Sources": ["dicom01", "dicom02", ..., "dicom150"]
I really don't think we should include this before #820, so that we don't introduce yet another ambiguous reference method. And then, this is an example, we can make the list short without ellipses:
-"Sources": ["dicom01", "dicom02", ..., "dicom150"]
+"Sources": ["bids::/sourcedata/dicom01", "bids::/sourcedata/dicom02"]
This is not necessary anymore under the pattern of the MRI metadata descriptions.
@@ -492,6 +492,10 @@ combined image rather than an image from each coil.

 ### Other RECOMMENDED metadata

+#### Source Filename
I would put this under common metadata, not "Task imaging data".
Done.
@@ -492,6 +492,10 @@ combined image rather than an image from each coil.

 ### Other RECOMMENDED metadata

+#### Source Filename
+
+Similar to [derivatives](../05-derivatives/02-common-data-types.md), it is OPTIONAL to include ``Sources`` as a key in the sidecar JSON, specifying the filename(s) of the source file used to generate this dataset. If the filename(s) contains patient identifiable information, then it should not be stored in ``Sources``.
I would suggest that we do this as a normal metadata field:
| Key name | Requirement level | Data type | Description |
|---|---|---|---|
| Sources | OPTIONAL | array of strings | URI of source file used to generate the current file. Care should be taken not to leak patient identifiable information for publicly shared datasets. |
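For illustration only, here is a minimal Python sketch of how a conversion script might add the proposed `Sources` key (an array of strings, per the table above) to an existing sidecar. The helper name `add_sources`, the sidecar filename, and the source paths are hypothetical; nothing here is an existing BIDS tool API, and the key reflects this PR's proposal rather than released specification text.

```python
import json
import tempfile
from pathlib import Path


def add_sources(sidecar_path, sources):
    """Add a `Sources` list to a JSON sidecar, creating the file if needed.

    Hypothetical helper for illustration; not part of any BIDS library.
    """
    sidecar_path = Path(sidecar_path)
    metadata = {}
    if sidecar_path.exists():
        metadata = json.loads(sidecar_path.read_text(encoding="utf-8"))
    # Store an array of strings, matching the proposed data type.
    metadata["Sources"] = [str(s) for s in sources]
    sidecar_path.write_text(json.dumps(metadata, indent=4) + "\n", encoding="utf-8")
    return metadata


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sidecar with one pre-existing field.
    sidecar = Path(tmp) / "sub-01_task-rest_bold.json"
    sidecar.write_text(json.dumps({"InstitutionName": "Stanford University"}), encoding="utf-8")
    result = add_sources(sidecar, ["sourcedata/dicom01", "sourcedata/dicom02"])
    reloaded = json.loads(sidecar.read_text(encoding="utf-8"))
```

Existing keys are preserved and only `Sources` is added or overwritten. A real converter would also want to check the source filenames for identifying information before writing them, as noted in the description column.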
Done and added this table to the rest of the modalities too
We should not specify anything in |
This is good to go on my end. I think it's a clean addition of a common principle (i.e. "Sources") that is already present in derivatives and can help in creating more robust datasets preventing accidental BIDS conversion of the same dataset to different file names. Re anonymization concerns: I think it's already considered bad practice to have identifying information within the source filenames, so generally that shouldn't even occur. On the other hand, if the source filenames are clean of PHI, then it is desirable to have an OPTIONAL backwards trace for your dataset, especially during the initial analysis. However, this issue is also present if you create |
I was thinking this through again, and I wonder if perhaps adding it as an OPTIONAL column to the The reason that scans seems more appropriate is that, when writing files, one can simply check the This also simplifies deleting the entire Thoughts? |
technically, nothing is preventing you to already add a Not sure whether that solution would be preferable to extending the derivative |
I think adding to This is good to go for me, unless there are other objections. |
@effigies, @sappelhoff just pinging to see if there are any other issues here for me to address, so I don't forget it :p. Thanks!
Not sure why the link checker is breaking.
see #910 - nothing you need to fix here :-)
will take a look soon!
Having read about the `Sources` field in JSON files again (see Sources), I think that a simple `sources` column in `scans.tsv` may be a better solution, see my review comment.
I think the `Sources` field would be a bit unwieldy for the goal you have ... it'd mean having to create a JSON sidecar for each file you want to describe a source of 🤔 That'd quickly become messy when you have lots of task and run entities.
Thus I suggest removing all "Sources" JSON references. But @adam2392 please hold off on any more changes; I plan to discuss this with the other maintainers tomorrow evening, so my opinion might change :-) (don't want you to do more work than necessary)
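As a rough sketch of the `scans.tsv` alternative suggested here, the snippet below writes a tab-separated table with an extra `sources` column. The function name `write_scans_tsv` and all file paths are hypothetical, and what the `sources` paths should be relative to was still an open question in this thread, so the values are placeholders.

```python
import csv
import io


def write_scans_tsv(rows, fileobj):
    """Write scans rows with a `filename` column and an extra `sources` column."""
    writer = csv.DictWriter(
        fileobj,
        fieldnames=["filename", "sources"],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


rows = [
    # Hypothetical entries; "n/a" marks a recording with no recorded source.
    {"filename": "eeg/sub-01_task-rest_run-01_eeg.edf", "sources": "sourcedata/recording_a.edf"},
    {"filename": "eeg/sub-01_task-rest_run-02_eeg.edf", "sources": "n/a"},
]
buffer = io.StringIO()
write_scans_tsv(rows, buffer)
scans_text = buffer.getvalue()
```

One row per recording keeps the mapping in a single file per session, which avoids creating a JSON sidecar for every task and run combination.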
@@ -452,6 +452,8 @@ For example, the [EDF](https://www.edfplus.info/)
 data format can only contain recording dates after 1985.
 Shifting dates is RECOMMENDED, but not required.

+The source file path of each recording is also OPTIONAL and can be stored under the ``sources`` column. Similar to [derivatives](./05-derivatives/02-common-data-types.md), it is OPTIONAL to include ``sources``, specifying the filename(s) of the source file used to generate this dataset. If the filename(s) contains patient identifiable information, then it should not be stored.
double backticks render code in RST, but for MD, it's single backticks 😉
Also: the files under the `filename` column are relative to the `scans.tsv` file (this is implicit unfortunately, but relatively clear from the context that this is meant by the text on line 435). --> What should the files under the `sources` column be relative to? ... should we wait for #918 and make this BIDS-URIs? ... or should it be relative to the dataset root - as you seem to do in your current example?
Sounds good. I agree here.
Okay, any update here? Thanks!
After some discussion, the consensus was that this problem at this point is too little a user-issue to warrant making some In the original issue, you make these two arguments:
This is something you can easily fix and track in your conversion scripts, I don't think this needs to be done in BIDS.
You are the first to bring up this issue (thanks for that) - but it's arguably a relatively small issue (and apparently not widespread), and I pointed out the solution in my first few sentences above (using arbitrary columns). @adam2392 I think we could add support for this in mne-bids without having this explicitly in the spec. Overall, I suggest closing the issue and this PR and solving your problems from the tooling side.
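To make the tooling-side idea concrete, here is a hedged sketch of how a conversion script could check an arbitrary `sources` column in `scans.tsv` before converting a file again. The function `find_existing_conversion` and the example paths are hypothetical; this is not the mne-bids API, just one way a script could track the mapping.

```python
import csv
import io


def find_existing_conversion(source_path, scans_tsv_text):
    """Return the BIDS filename already recorded for `source_path`, or None."""
    reader = csv.DictReader(io.StringIO(scans_tsv_text), delimiter="\t")
    for row in reader:
        # `sources` is an arbitrary extra column; rows without it never match.
        if row.get("sources") == source_path:
            return row["filename"]
    return None


# Hypothetical scans.tsv content with one already-converted recording.
scans_tsv_text = (
    "filename\tsources\n"
    "eeg/sub-01_task-rest_run-01_eeg.edf\tsourcedata/recording_a.edf\n"
)
existing = find_existing_conversion("sourcedata/recording_a.edf", scans_tsv_text)
missing = find_existing_conversion("sourcedata/recording_b.edf", scans_tsv_text)
```

A converter could skip or warn when `existing` is not `None`, which addresses the concern about converting the same source file twice under different names without any change to the spec.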
Sounds good to me if we can PR mne-bids.
Fixes: #905
Add `Sources` to the sidecar JSON that points to the original source file.