-
Notifications
You must be signed in to change notification settings - Fork 124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Q: Flatten metadata keys? #694
Comments
@oesteban @effigies note that automatic flattening would also solve the problem in #682 for free, as "run" would now be "runDescription", and hence would never be scanned in the offending code inside |
What about I guess it does mean that |
Fine with me. I agree that adds clarity.
I don't, and TBH I feel like we should maybe scale back on a lot of the aliases and helper functionality in the next major release. We should probably save a list of proposed breaking changes somewhere for 1.0. I would also add removal of the |
I believe this test https://github.com/bids-standard/pybids/pull/684/files#diff-9562e9d99fe83db8bacdd0aa90936114ad0a87d40bdb12a470381e19e5e8774cR460 would fail currently with master (while #683 is not merged). And that test is not featuring examples of invalid BIDS metadata. Let me resync #683 with master and replicate the problem. Then, if BIDS-Derivatives starts making use of nested dictionaries, there's potential that we will need to backtrack the flattening. Finally, it seems that querying is working fine - as you said in the other thread, we really need to fix the test, not the tool :D. And a test that breaks things is a valuable test, perhaps invalids from the BIDS spec perspective, but very representative of how people may stretch pybids. In favor of dropping features that don't really offer great value (e.g., return types). But I'm not fully confident that flattening dictionaries is going to make a dent in the maintenance burden. I would keep the idea and wait for BIDS-derivatives to evolve and see how complex metadata can get. |
Actually, this problem would happen again if some metadata file overshadows potential So, if the user has a BIDS directory with 2 different run ids under {
"run": ["unrelated", "values"]
} Then this flattening would fail to avoid the clobbering, and so: >>> set(layout.get(return_type='id', target='run')) would break unless #684 is merged, after which: >>> set(layout.get(return_type='id', target='run'))
{("unrelated", "values"), 1, 2} which IMHO is the most appropriate outcome for someone doing this stupidity. |
The current tests failures are due to internal problems with handling collections (instead of basic dtypes), so your patch won't fix them. E.g., one of the failures is arising because we assign the metadata value inside pandas DFs, and it doesn't like when the value is a dictionary. This can be worked around on a case-by-case basis, but I would prefer not to do that—I don't like the idea of dicts being a legal value for essentially any entity. I would rather force the user to (a) drop nested values, (b) flatten them, or (c) accept that things might break because they insist on passing around collections internally. Basically the problem is that if we don't do something to handle dict/list metadata values, PyBIDS will break in certain cases. So our options are to either adopt something like the proposal above, or sweep through the codebase and handle a bunch of special cases (the most common class probably being when converting between the SQLAlchemy representation and pandas structures, which happens a lot), add another abstraction layer, etc. My preference would be for the former, but I could be convinced to do the latter, I suppose. |
Okay, on revisiting the various threads, I guess we're still kind of stuck. @oesteban's patch in #684 fixes the specific problem that spawned this whole discussion (in #681), but it doesn't address the general problem of dealing with nested collections, which could theoretically arise in many places throughout the codebase. Also, it makes returned collections immutable, which is on the one hand a pretty sane way to deal with the problem, but on the other hand is counterintuitive for users, and also adds a non-standard type ( By contrast, the approach I proposed in the first comment of this thread is more principled, but it only works for dictionaries—we can't really flatten lists. (I don't think we want to represent I think maybe the best way forward here is to combine the two approaches by (1) treating dictionaries per my proposal, and (2) making lists hashable by converting them to tuples, as per Oscar's patch in #684. The only case this leaves unhandled is where a value in a JSON sidecar is a list containing collections as elements. In that case, an exception would still be raised. But that latter scenario seems to me pretty unusual, so I incline to ignore it for the time being. |
Sorry for not getting to this more quickly. I think it makes sense, and can't readily think of a better approach. |
It seems like part of the problem is that we are mashing together meta-data with entities. Entities are a strict set of reserved key words, which are defined as having a string or scalar value. Conceptually that seems odd to me. Otherwise, I would be fine with flattening meta-data keys, that would make it relatively easy to use nested meta-data in filters and in practice solve most problems. Also, are there that many places in the code base where non-hashable types will cause problems? Perhaps we could also just ignore non-hashable types in those places (versus making them hashable). |
I would rather say that BIDS doesn't disallow or even discourage defining new metadata fields with three name of an existing entity. While this is relatively obvious, it will easily become a rabbit hole with derivatives. Flattening solves the DB face of things. A more robust solution would be that entity regexps were applied to filter all these shorthand queries. |
Agreed its a spec problem, but can you elaborate what you mean by "shorthand queries"? Without flattening, if somebody said |
Sorry, I mean
Probably not, because there's nothing to |
Ah, sounds good to me. More generally, we could say that if the About |
Question: Would it ever make sense to specify a BIDS entity (i.e. If so, an easier solution is that if I'll push a PR with that. |
An issue that's cropped up a couple of times, and is the underlying root cause of the current testing failures (see peripherally related discussion in #682) is that we have no good way of dealing with JSON metadata that have collections as values.
Currently, these are naively stored in the DB as dictionaries or lists, but that can cause problems throughout the codebase, as it's not really reasonable to check in a whole bunch of places to make sure that
Tag
values are hashable and respond accordingly. Plus, it's unpredictable from the user's perspective.The proposal is to either (a) ignore nested metadata; (b) automatically flatten them (e.g.,
{"KeyA": {"KeyB": "myvalue"}}
becomes{"KeyAKeyB": "myvalue"}
; or (c) do both and let the user control that with an optional flag passed at indexer initialization (default value being to ignore). We could even have a third mode that does nothing (i.e., current behavior), allowing the user to access values as collections, but with the expectation that failures might occur unpredictably.Thoughts? I incline towards (c) (without a "do nothing" option).
The text was updated successfully, but these errors were encountered: