Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX: Make lists and dicts hashable #684

Merged
merged 2 commits into from
Dec 13, 2024

Conversation

oesteban
Copy link
Collaborator

Resolves: #683.

bids/utils.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@tyarkoni tyarkoni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks. I'm wondering if we should kill off the 'dir' and 'id' return types in the next major (breaking) release... they seem to cause a lot of problems, and the efficiency gains are probably not meaningful compared to just looping over objects. But we can take that up elsewhere.

@oesteban oesteban force-pushed the fix/issue683-more-hashable branch from 9fb388b to 0b7dcfc Compare January 12, 2021 20:30
@oesteban oesteban force-pushed the fix/issue683-more-hashable branch from 6455905 to c48076f Compare January 12, 2021 21:48
@oesteban
Copy link
Collaborator Author

Rebased on top of #695 - please merge that one first.

@tyarkoni
Copy link
Collaborator

tyarkoni commented Mar 3, 2021

@effigies @oesteban what did we decide to do about this? I'm okay merging it, but have a vague recollection there might have been some concern (possibly on my side, but if so, I can no longer reconstruct it).

@effigies
Copy link
Collaborator

effigies commented Mar 3, 2021

I think #694 was the discussion that postponed this decision. I've also forgotten some context here...

@tyarkoni
Copy link
Collaborator

tyarkoni commented Mar 3, 2021

Oh, right, thanks. I'll try to work up the proposal I suggested there.

@adelavega
Copy link
Collaborator

@oesteban, @effigies and I chatted and decided that the frozendicts may cause problems for users, since its unfamiliar.
Implementing flattening dicts is probably a better way to resolve this, and enable querying for nested dicts. We can put it on the roadmap.

Do you agree? If so I think we can close this PR.

@oesteban
Copy link
Collaborator Author

oesteban commented Jul 3, 2021

First things first, and as I told @tyarkoni on one of the related issues, I'm likely not going to maintain this. So you shouldn't take it if you think this will be too onerous to maintain (although, honestly, I don't see a big burden - but I'm not to decide).

That said, I think the flattening solution only addresses the inner part of the problem and conflates two different indefinitions of BIDS: i) the one about whether one can define dicts as metadata values; and ii) whether it is allowed to name a metadata piece with the same name of an entity. Both problems arise in pybids under this unhashable type (ii only if there are dicts defined, but there's the question of whether pybids returns the adequate values when it doesn't crash in this case).

Flattening dictionaries would address i) and incidentally solve ii) partially. But it is just a data representation problem of which the user doesn't need to know. This means, are we going to "unflatten" when returning these values? Because I bet users will be more comfortable receiving a dict RatherThanSomeCamelCaseFlatteningOfSomething (which we don't know yet the whole domain of possible values nested and number of nestings, btw). Nipype's results file had to deal with a similar situation, and the resulting code is a bit brittle and hard to maintain. I wonder if alquemy would take frozen dicts right away.

Then there is the question of having to return unique values of a given entity. I honestly think that returning something that looks like a dict, behaves like a dict and (if we cast the dict back before returning) is sometimes an actual dict is more convenient/intuitive to the user than the flattened values. This also ensures the sanity of the values in all get_(metadata)s. Say we have sessions with the entity key overloaded in some TSV+JSON files. In such a situation, we may get:

>>> get_sessions()
['01', '02', {'Time': '00:00', 'Duration': 30}]

In this case, to give the user something useful, we don't even need to apply the entity's regexp to return the right thing. It would be more difficult when the "session" in the TSV+JSON file is some string value:

>>> get_sessions()
['01', '02', 'comes-from-metadata']

But none of the other solutions would solve this problem, as it stems directly from the indefinition of BIDS.

I admit though that this solution does nothing to fix the problem of inserting dicts in the db. Unless, inserting frozen dicts would be an acceptable option, which would doubly justify this PR.

And this is my 2ct.

@adelavega
Copy link
Collaborator

adelavega commented Jul 3, 2021

Oscar, thanks for the spirited defense.

To be clear, my suggestion to solve (ii) with respect to "shorthand" functions for BIDS entity is primarily dealt by #749. I think this PR (or flattening dicts) takes care of (i) (how to represent hashable types).

> >>> get_sessions()
> ['01', '02', {'Time': '00:00', 'Duration': 30}]

For this example, I prefer the results of #749:

>>> get_sessions()
['01', '02']

For shorthand functions (i.e. get_(entity)s, where entity is a proper BIDS entity), get would only look at values coming from filenames. I believe according to BIDS, entities are by definition only defined in filenames, not meta-data sidecars.

Arguably, get_sessions should return all values it finds, even the entity-like value session defined in meta-data. But I think it's fair to just say that shorthand functions for Entities, should only look at values defined in filenames. I don't see a scenario where getting {'Time': '00:00', 'Duration': 30} as a result is useful, and it could cause problems for automated tools that rely on pybids for ingestion. I think it's fair for pybids to decide this is how get works, even if the spec has some indefinition.

Also, since filtering happens only on the get function, this value ``{'Time': '00:00', 'Duration': 30}` is associated with the corresponding file, if needed.

If we agree on #749, then the question is how to represent dicts for actual meta-data (to avoid crashing on shorthand functions such as get_SliceTiming). My main bias for flattening was simplicity but I'm open to your solution. I agree for arbitrarily nested dicts, frozendicts may be a nicer result.

Three questions remain for me about frozendicts:

  1. Do they cause any problems in the db?
  2. Do they represent lists? get_SliceTiming crashes now bc results are lists. Flattening dicts would not fix this.
  3. What if I wanted to query a nested element?
    For example: layout.get("PipelineDescription.Version"="20.0.5")

If 1) and 2) are not problems with frozendicts then I think #749 + this PR are compatible and would solve our problems.

@oesteban
Copy link
Collaborator Author

I'm trying to wrap my head around rebasing this, @effigies . I thought it would be more straightforward, but the new "id" and "dir" filtering is not so clear to me yet.

@oesteban oesteban force-pushed the fix/issue683-more-hashable branch from c48076f to 80e9d2e Compare April 14, 2022 12:33
@oesteban
Copy link
Collaborator Author

I think this is it. I have removed one level of branching which I hope doesn't break tests. There is one new unit test that was made the code fail with IntendedFor, but it'd be nice to also add another test with the B0FieldIdentifier and B0FieldSources (#834) to make sure it doesn't break.

cc/ @effigies

@codecov
Copy link

codecov bot commented Apr 14, 2022

Codecov Report

Attention: Patch coverage is 78.12500% with 7 lines in your changes missing coverage. Please review.

Project coverage is 86.07%. Comparing base (5ea3103) to head (9a887bf).
Report is 643 commits behind head on master.

Files with missing lines Patch % Lines
bids/layout/layout.py 72.72% 3 Missing and 3 partials ⚠️
bids/utils.py 90.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #684      +/-   ##
==========================================
+ Coverage   86.03%   86.07%   +0.03%     
==========================================
  Files          35       35              
  Lines        3924     3934      +10     
  Branches     1017     1023       +6     
==========================================
+ Hits         3376     3386      +10     
- Misses        355      356       +1     
+ Partials      193      192       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@bpinsard
Copy link
Contributor

bpinsard commented Nov 5, 2024

Any roadmap for that fix to be merged?
This would be very much needed for sdcflows wrangler to work properly on some BIDS-valid datasets with list B0 intent.
Thanks!

@adelavega
Copy link
Collaborator

Oof, this probably needs a major re-write since it's a pretty old PR.

@oesteban any recollection as to why this wasn't merged despite getting a PR approval?

@oesteban
Copy link
Collaborator Author

oesteban commented Nov 5, 2024

hehehe I don't remember 😓

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BIDSLayout.get -- TypeError: unhashable type: 'list'
5 participants