-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Propose new mechanism to record metadata across separate data files, incorporate use of dwimap
#90
Conversation
Based on recent additions to the BIDS extensions principles in bids-standard/bids-extensions#26. Along the way, we follow the inheritance principle by reducing the number of metadata files in each directory to one per model, across all of the model parameters, and eliminating the distinction between model fit and model derived parameters.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## bep-016 #90 +/- ##
===========================================
+ Coverage 87.88% 87.93% +0.05%
===========================================
Files 14 16 +2
Lines 1279 1351 +72
===========================================
+ Hits 1124 1188 +64
- Misses 155 163 +8 ☔ View full report in Codecov by Sentry. |
sub-01_model-DTI_param_Sxx_dwimap.nii.gz | ||
sub-01_model-DTI_param_Sxy_dwimap.nii.gz | ||
sub-01_model-DTI_param_Sxz_dwimap.nii.gz | ||
sub-01_model-DTI_param_Syy_dwimap.nii.gz | ||
sub-01_model-DTI_param_Syz_dwimap.nii.gz | ||
sub-01_model-DTI_param_Szz_dwimap.nii.gz |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want for it to be possible / permissible to define the individual components of the tensor model as individual NIfTI images, I think that warrants its own separate discussion. This would in the immediate case lie outside of the specification. The appropriate interpretation of how those files combine together to form a tensor representation would need to be somehow specified, in a way that establishes correspondence between tensor components and file names (unlike #74, where "tensor" is a keyword establishing a mapping from volume index to symmetric rank-2 tensor coefficient). Also the reference orientations could be the image axes rather than scanner axes, in which case "x" / "y" / "z" don't apply.
If the goal here is simply to have an example showing how there can be one JSON file and multiple NIfTIs, that can be achieved without introducing this new problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am definitely not attached to the idea that the tensor could be saved in six different files (and doubt anyone really wants to do that). It was indeed the goal to show that one json can refer to multiple niftis, but I agree that this just added more problems, so will remove.
- Only one metadata file, "`<source_keywords>[_space-<space>]_model-<label>_dwimap.json`" | ||
is allowed for each set of files that has a specific `model-<label>` entity. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't workable as the rest of the BEP currently stands. Take for example the ball-and-sticks model (just the mean fit, ignoring the bootstrap realisations). You have some images that yield a scalar per voxel, and then you have the stick orientations, which are natively spherical coordinates. The specification needs to provide a way with which to encode the fact that the appropriate interpretation of the former is as a scalar parameter, whereas the latter is as spherical orientations, with a specific reference frame.
(While it's technically true that here you could distinguish between these two cases based on the dimensionality of the image, eg. 3D image = Scalar, 4D image with 2 volumes = spherical orientations, it doesn't extend to more complex cases. Mixture models could have all sorts of different components with different ways of encoding anisotropic information. Indeed you could just have different scalar model parameters that have different units. Then there's the fact that if you don't distinguish between model fit and model-derived parameters, you need to be able to avoid such ambiguities / conflicts for all possible derivative parameters.)
Previously I've tried to do this through complex inheritance. There would be one JSON file encoding everything that applied to all model parameters, and then a separate JSON for any parameter that required additional information necessary for the interpretation of those data that don't apply to other parameters from the same model. If you want to force a single metadata file per model, you would need to within that file define a dictionary containing parameter names and their corresponding metadata. While it is not my preference, since it would represent quite a fundamental change to how metadata is specified in BIDS (as opposed to complex inheritance which is just extension of an existing mechanism), it would be at least functional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - the intent is that the one json for the model would have sub-dicts, which can be used to refer to different dwimap
s. I'll clarify.
sub-01_model-DTI_dwimap.nii.gz # Contains the six diffusion tensor components | ||
sub-01_model-DTI_dwimap.json | ||
sub-01_model-DTI_param-FA_dwimap.nii.gz | ||
sub-01_model-DTI_param-MD_dwimap.nii.gz |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is implicitly dependent on the very distinction between model fit and model-derived parameters that it seeks to remove.
Here there are three model-DTI
dwimaps
:
- "
FA
" - "
MD
" - unnamed.
If one knows that "the tensor" is the six coefficients of a symmetric rank-2 tensor that are fit to the image data, and parameters FA and MD are calculated from that, then promoting the former to be an unnamed parameter makes sense. But as soon as everything becomes a dwimap
, everything needs to be named.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah - agreed. See above, I think that we could do param-tensor
to indicate the six tensor components.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per #74 & other comments here, my intuition is to not use param-tensor
. The tensor is a data representation by which anisotropic information on S2 can be encoded. Also there's other discrete mixture models that may yield multiple tensors. Ultimately the quantitative parameter that these data are encoding is diffusion coefficient; it's just using a tensor encoding to represent the anisotropy of such.
In cases where a set of different parameters are contained within a single image | ||
file, entity "`_param-`" MUST NOT be included; e.g.: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this may be an erroneous translation of the purpose of the text being replaced into the newer paradigm.
The original text can be re-phrased as follows. The purpose of the _param-
entity (as with basically all entities in BIDS) is disambiguation: to separate different parameters that are a part of the same model. In the scenario where all model fit parameters are encoded into a single data file, such disambiguation is no longer necessary, and so it makes reasonable sense to omit it.
The problem here then is two-fold:
-
As a nuanced criticism of the current phrasing, it's not specifically that "multiple parameters" are encoded in a single data file; it's specifically that "all parameters" are encoded in a single data file.
There may be cases where a single data file may indeed "contain" multiple parameters. I tried to transition the proposal away from such as much as possible, since if different parameters have different interpretations in terms of units or how they encode anisotropy it's more sensible to split them across data files. But maybe there's some sensible use case where a data file may contain multiple parameters but not be the only data file. In that hypothetical case I'd argue that omitting theparam
entity for that data file doesn't really communicate what's going on. Indeed it sort of implicitly provides the argument against such internal combination of distinct parameters. -
There were reasonable scenarios where this condition would previously have applied, since it's possible that for some models (eg. tensor, single-tissue CSD) the entirety of the model fit may be encoded in a single image file. However where
dwimap
provides no distinction between model fit and model-derived, for this condition to apply it would be necessary both for this to be the case and for there to be no other parametric maps yielded from that model (ie. formerly model-derived parameters). So it might be simpler overall if theparam
entity is simply compulsory.
As proof that this doesn't oblige nonsensical entity values in trivial cases:- A tensor image encodes diffusivity; it just uses a tensor representation (according to DWI derivatives: Make tensor a data representation #74).
- A single-tissue ODF encodes AFD; it just uses a SH representation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. How about param-tensor
for the six tensor components and param-sh
for the spherical harmonic coefficients?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
param-tensor
and param-sh
is the opposite of what I'm advocating. Those are data representations describing how anisotropic information on S2 may be encoded. They don't say anything about the quantitative parameter being encoded. Those are the diffusion coefficient and AFD respectively.
(This might seem pedantic for these easier cases, but I've spent a lot of time thinking about the consequences of these decisions for more complex models)
sub-01_model-DTI_dwimap.json | ||
sub-01_model-DTI_param-FA_dwimap.nii.gz | ||
sub-01_model-DTI_param-MD_dwimap.nii.gz | ||
sub-01_model-DKI_dwimap.nii.gz # Contains the fifteen kurtosis tensor components |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I don't think it makes sense to have valid interpretation of a DKI model being predicated on the contents of a DTI model that is explicitly encoded as being a different model. The fact that these data are split across different data files with different _model-
entity values to me means they are independent.
I never committed to how to deal with kurtosis because even for the tensor it didn't reach the state I wanted (#74). With DKI I would want the logic of #74 to be extended to cover the rank-4 tensor also being a data representation, with the specification indicating the ordering of tensor coefficients across image volumes.
There might be a debate to be had over whether the six rank-2 coefficients should be in one file and the fifteen extra coefficients in a different file, versus all coefficients encoding the rank-4 tensor being in a single image (or having the specification facilitate both alternatives). But the six rank-2 coefficients should not IMO be distanced so far from the kurtosis model fit so as to be attributed to an entirely different model. For one it wouldn't be possible to calculate from just the 15 kurtosis coefficients values of FA and MD as shown here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry - this was just an error on my part. I got carried away wanting to show how this would generalize to DKI as well. This should have been "the twenty one tensor components" - I was not envisioning separating the six from the fifteen in some over clever way. Here, I was imagining a situation where someone runs a DTI fit and then also a DKI fit and has both the results from the DTI and the DKI side-by-side. I'll try to clarify.
@@ -486,14 +486,8 @@ another. | |||
"Parameters": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Relates to earlier comment above.
For items that relate to how a model was fit to empirical data, probably require a more specific dictionary name than just "Parameters
". I wasn't happy with this previously, but now if the intention is to provide metadata that applies to specific parameters within a single model JSON, the ambiguity is even worse.
The more fundamental discussion here is the approach of metadata placement. What's being proposed is taking metadata that is requisite for interpretation of a specific data file, and instead of placing it in a paired sidecar metadata file, instead placing it into a metadata file that is applicable to multiple data files; then, for a given data file, to determine what sidecar information is vs. is not applicable to that data file, one must seek correspondence between sub-dictionary name and the value of specifically the _param-
entity. To me personally, this isn't just a minor "trick" to get things working; it's a complete paradigm shift in BIDS metadata encoding, that would have massive consequences for every BEP that follows, and it's hard to envisage the potential downstream problems of any such things. I appreciate that a lot of people haven't followed my logic on inheritance / "parameters" / etc., but this is one of many alternatives that I considered and judged myself to be more problematic. It's certainly not my choice, hence why I've tried to enumerate / elucidate the different prospects and their pros & cons. Indeed personally I'd take neglecting the use of inheritance and just duplicating metadata across sidecar files over this option.
But if you want to go with this option, there would at least need to be other changes in the specification to say "if you have an entity-linked file collection, and want to specify some metadata that applies to the whole collection and other metadata that applies to individual files, you can't do that via inheritance, so instead the way to do it is to construct within your single metadata file a dictionary where there is correspondence between the name of the dictionary and the value of the data file entity". And you'd need to think about both whether those sub-dictionaries should be at root level or within some fixed dictionary intended for specifying per-data-file metadata, and whether it needs to be explicit about which filename entity is encoded within the dictionary.
Perhaps the obligation of explaining in the specification exactly what is being proposed here might demonstrate why I personally believe that achieving this by generalizing the inheritance principle is in fact the more elegant solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that this has implications beyond the scope of this BEP and would be really happy for input on this issue from a wider set of contributors / maintainers / stake-holders. Maybe an issue on the spec repo for further hashing this out? Do you have a link to any previous discussions where the implications of this were fleshed out? In the long run, would bids-standard/bids-2-devel#65 (in any interpretation) solve this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have a link to any previous discussions where the implications of this were fleshed out?
What you're doing here is Decision 1 Option 5 in #50.
In the long run, would bids-standard/bids-2-devel#65 (in any interpretation) solve this?
It has the prospect of "solving" in so far as that:
- My desired structure can only work if more advanced inheritance is enabled
(specifically removing the inability to have more than one metadata field applicable to a data file in a single filesystem hierarchy level) - There has been resistance to such from others
(To the point where a completely different proposal to solve the same problem was attempted in ENH: Proposal formodel-<modelname>
directory bids-specification#1280) - If removing the ability to overload metadata fields would reduce the resistance in 2. to the change in 1., then that would facilitate my proceeding with that structure.
There's still limited feedback on that proposal, and there's also the fact that it's currently only discussed in the context of BIDS 2, and therefore if adopted as a solution for BEP016 then BEP016 would have BIDS 2 as a prerequisite. However I do think that that change could be reasonably proposed as a standalone change to BIDS 1 (potentially even one that those who don't like the inheritance principle would support, since it "weakens" the capabilities of inheritance). Maybe if that were fleshed out fully (including discussion of what might be required for the validator), and accepted, then I could re-attempt proposal of the augmentation of permitting multiple inheritance at a single directory level.
A further complication is that in my prior proposals, I utilised this advanced inheritance in conjunction with the MFP / MDP distinction to have the ability to associate metadata fields only with the immediate outputs of the model fit, and for them to not be associated with anything further downstream. Sometimes things that are computed from the results of a model fit may themselves have their own parameters describing how they were themselves calculated; therefore I wanted to separate them out. However it may turn out to be acceptable enough to take the metadata relevant to a given model fit, and associate that metadata with both the "primary" (MFP) and "secondary" (MDP) outcomes of that model fit.
Dimensions of NIfTI image "`sub-01_model-bs_param-sticks_dwimap.nii.gz`": *I*x*J*x*K*x9 ([spherical coordinates](#data-spherical), distance from origin encodes fibre volume fraction) | ||
Dimensions of NIfTI image "`sub-01_model-bs_param-sticksboostrap_dwimap.nii.gz`": *I*x*J*x*K*x9x50 ([spherical coordinates](#data-spherical), distance from origin encodes fibre volume fraction; 50 bootstrap realisations) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dealing with bootstrapped vs. aggregated model fits probably needs to be re-thought in the context of this PR; I don't think it's as simple as taking the prior example and renaming files.
Where there is a distinction between model fit and model-derived parameters, one utility of that distinction is that it is wholly applicable to the separation of bootstrap realisations from the aggregate across realisations. The same software command may yield both, but that's just a detail of the current state of the interface of that specific software command. From a data processing perspective, what is yielded from the process of fitting the model to the empirical data is the bootstrap realisations. Once you have those, you can freely compute statistics across realisations; reference to the empirical image data is no longer required.
In the absence of that distinction, you need some other way of distinguishing between the individual realisations and the aggregate. Trouble is, the "parameter" of the model being encoded is exactly the same; it's a question of whether you have multiple instances or just one. Therefore I'm not a fan of that distinction appearing as part of the "param
" entity. I had previously in #61 played with the prospect of using a "_stat-
" entity to express the fact that there's some aggregate statistic being computed across some data dimension; in this scenario it's not a trivial "average the values in the image", it's computing an average orientation in S2 across bootstrap realisations, but the concept still stands. Naively it feels to me a more appropriate place to encode / disambiguate this information, but conversely it potentially has other consequences in terms of eg. expressing not only the statistic that was computed but along which dimension, and then it starts to look more like provenance. Another alternative would be to have the bootstrap realisations vs. mean across bootstraps as different "_model-
"'s, though I'm unsurprisingly not a terribly big fan of that either. I suspect that at the time I found the MDP vs. MFP distinction to be the most functional here, even if it doesn't conform to people's preconceptions of what these data are; so if that's no longer there, this needs a fresh contemplation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather leave this up to the researchers using these files. Presumably they need to understand the difference between bootstrap samples and summaries across these samples if they are using this model, and they can choose how they would like to handle any of these issues. We might want to clarify in the metadata how the summarization (into sticks
) happened (median bs? mean bs? A run on the intact sample?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deferred thread to #38.
A bit of clarification about metadata sub-dictionaries in 05e94e5: Through a particular lens / for certain contexts, having metadata for different model parameters all encoded within a single model metadata file can seem preferable. A particular example that comes to mind is response functions for a multi-tissue CSD fit. Having the response function for one individual tissue stored as metadata for just that one tissue is potentially misleading, as that one response function in isolation is not sufficient to explain how the ODF was generated. What defines the model inputs is the complete set of response functions used. In which case, there's an argument for defining those data all together in a single model metadata file, rather than allowing them to separate from one another across the individual ODF sidecars. The converse situation is the one that in my opinion provides the strongest argument against this structure. Take an ODF, sticking with the MSMT CSD example. In order to robustly interpret an individual ODF data file, you must know:
The advantage of my prior structure using the (augmented) inheritance principle is that this information, specific to that one model parameter and crucial for its robust interpretation, would be what appeared exclusively in the sidecar JSON for that one parameter. With this proposal as it currently stands, that information is being deferred to a metadata file that applies to all parameters yielded by that model (or even the derivatives of such), therefore requiring some form of sub-directory referencing (which would probably need to be more robustly defined & systematized than what is in 05e94e5) to infer which metadata fields are relevant vs. not relevant to that data file. What confuses me most here is the difference in preference between what I have historically advocated for vs. what's being proposed here. Note that what I'm speaking about here specifically is not actually tied to the adoption of the I established a long time ago that there was an intrinsic conflict between:
Given my attempts at making the requisite changes to the inheritance principle didn't succeed, an attempt at an alternative solution was made in bids-standard/bids-specification#1280, and that didn't pass the steering committee. This PR, while currently titled as targeting the This is unfortunately quite personally frustrating as I've written extensively on the need for a decision to come from beyond myself on how to deal with this complexity, have either not received feedback or been explicitly told that it's not a problem, and now an alternative is being pushed in a PR that on the surface purports to be doing something else. I appreciate the desire to push this forward to a functional solution; I've truly been trying to clear my plate to get back to this but am so far behind on obligations I've not even been on my own software forum in 2 years. I nevertheless still have to argue that doing this via removing a (potentially ill-conceived) restriction on the inheritance principle is a better long term solution than some kind of completely novel one-to-many cross-referencing mechanism between metadata and data files that would directly compete with the inheritance principle. At the very least, this PR needs to either:
|
dwimap
.dwimap
, propose new mechanism to record metadata across separate data files.
dwimap
, propose new mechanism to record metadata across separate data files.dwimap
Thanks for all the input! I changed the title to reflect and edited the comment at the top to emphasize what is probably the more substantial change proposed. I am sorry about the frustration that you are feeling, and share some of the frustration that you feel as well. I think that part of the challenge here is that we are working on this separate repository, away from the core of the work on the bids-standard/bids-specification repo. To advance the discussion, I would like to suggest the following: how about if I rebase this branch on top of the main spec branch and make a pull request from this branch against bids-standard/bids-specification:main? This way, we could get input from a broader set of contributors, but I want to make sure that you don't feel like you are being sidelined and/or ignored, or that you feel like anyone is trying to make an end-run over your objections. I should also mention that @effigies and @rwblair are both in Seattle at the moment, and we did discuss this change in person today. They did not seem too averse to it, but I think that it would be good if they could hear your objections and the alternatives that you propose more directly. We can point to your previous efforts (particularly in #50), so they can see the pros/cons for these different proposed changes and we could discuss implications for other modalities/models, and how to move forward here. |
Hi @Lestropie : just checking in to see whether you've had the opportunity to consider my comment from last week. How would you feel about moving this whole discussion to the main spec repo, where we could hopefully get more visibility on the issues. Even if we end up rejecting this proposal, based on your criticism, it might help raise the visibility of the issues, tie it to the broader issues that you are pointing out, and help folks understand better when we make suggestions changes with broader impacts (e.g., changes to inheritance, adding nested folders, etc.). I don't want to do this over your objections, though, because I don't want you to perceive this as an attempt to over-rule the objections you raised here, so I'd appreciate your thoughts just on this procedural issue. Maybe just a PR to introduce the idea of nested JSON meta-data for now (if only to give visibility to the objections you raised in #50, which may have gone un-noticed by the broader community, because it's buried in this repo)? |
It is probably the case that I need to re-write #50 on the main spec repo and attach a poll to it. I'd prefer not to attach any specific proposed modification to it as that may bias the result toward whatever is proposed there. I have a couple of urgent tasks right now but I will try to get to it. |
That's a good suggestion If you're up for it that would be a good way forward, I think. Maybe we can focus on decision 1 first? |
Pretty confident there's consensus that we don't want to be creating a unique suffix per parameter. There's potentially going to be problems around using one suffix for all parameters (as opposed to two, for which in retrospect I think I failed to make the strongest case), but as a first pass I'd rather present this issue on the presumption of use of |
Having commenced writing something, I'm now reminded of bids-standard/bids-specification#1339. The Steering Committee has explicitly rejected the prospect of a more complex Inheritance Principle on the basis that existing software may make erroneous conclusions about associations between data and metadata; presumably mitigating against such via version compatibility checks was considered too optimistic. But I would have thought then that your proposal of having, within a metadata file applicable to a given data file, metadata fields that are explicitly not applicable to that data file, would be rejected on the very same basis. So I'm trying to describe the situation before listing candidates, but I'm not sure if there's even multiple candidates left to be offered predicated on BIDS 1.0 compatibility. |
I do believe that's why we chose to go with what we have here, but I think that it could use some reconsidering. I think that your objections do make sense, and maybe we need to revisit bids-standard/bids-specification#1280, as the introduction of |
Not necessarily, due to their interaction. Part of the reasoning behind the In the final linked comment you suggest resolving through the use of suffices. But as described elsewhere that comes with its own set of problems. I don't think that it is a necessary consequence that rejection of the So there's maybe two different questions here:
|
Correction, already exists: bids-standard/bids-2-devel#54 |
Closed in favour of #92. |
Attempts to follow the inheritance principle by reducing the number of metadata files in each directory to one per model, across all of the model parameters, using hierarchical organization within the side-car files.
Also: eliminates the distinction between model fit and model derived parameters, based on recent additions to the BIDS extensions principles in bids-standard/bids-extensions#26.
EDIT: Edited to reflect that the change to metadata structure is probably the more substantial change proposed here.