FIX: Associate "is_metadata" with Tag, not Entity; and only return non-metadata entries for core Entities in get(return_type='id') #749

Merged
merged 7 commits into from
Jul 26, 2021

Collaborator

@adelavega adelavega commented Jul 2, 2021

If return_type = 'id' ("shorthand queries") or 'dir' and "target" is a core BIDS entity (i.e. one that is derived from filenames, not from meta-data), then only look at results that are not from meta-data.

If you ask for layout.get_runs(), it will only look at values that came from files, not meta-data.
This should prevent collisions from similarly named meta-data values.

However, if you ask for a meta-data target, then it will attempt to take a set across all values it finds:

In [5]: layout.get(target='SliceThickness', return_type='id')
Out[5]: [1, 2.5]
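
A minimal sketch of the selection rule described above, using plain dicts rather than the pybids data model. The `files` list, the `CORE_ENTITIES` set, and `unique_values` are hypothetical names introduced only for illustration:

```python
# Hypothetical stand-in for indexed files: each file has values parsed from
# its filename and values read from its metadata sidecar.
files = [
    {"filename": {"run": 1}, "metadata": {"SliceThickness": 1}},
    {"filename": {"run": 2}, "metadata": {"SliceThickness": 2.5, "run": "second"}},
]

# Assumed subset of core (filename-derived) entities, for illustration only.
CORE_ENTITIES = {"subject", "task", "run"}


def unique_values(files, target):
    """Collect the unique values of `target`, mimicking return_type='id'."""
    if target in CORE_ENTITIES:
        # Core entity: ignore anything that came from metadata.
        sources = [f["filename"] for f in files]
    else:
        # Metadata target: take a set across every value found.
        sources = [{**f["metadata"], **f["filename"]} for f in files]
    return sorted({s[target] for s in sources if target in s})


print(unique_values(files, "run"))             # [1, 2]
print(unique_values(files, "SliceThickness"))  # [1, 2.5]
```

In this toy data the metadata value `"second"` for `run` never reaches the result, which is the collision the PR is meant to avoid.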

Related #694

@adelavega adelavega requested review from effigies and oesteban July 2, 2021 02:50
@oesteban
Collaborator

oesteban commented Jul 2, 2021

Unfortunately, my knowledge about pybids falls short to make a good assessment of this PR (without spending a whole day to catch up with code). But I would say this will help with the API errors from the user perspective. I don't think it will address the DB problems though (see conversation triggered by #682 (comment)).

I believe that, in addition to testing the type of metadata being retrieved (i.e., whether it is an entity value or not), it would be beneficial and more robust to filter out values that do not match the corresponding regexp.
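
As a rough illustration of the regexp filtering suggested here (not code from this PR), something along these lines could work. The `ENTITY_PATTERNS` table and `matches_entity_pattern` are hypothetical, and the patterns are simplified assumptions rather than the ones pybids ships in its config:

```python
import re

# Assumed, simplified patterns for illustration only.
ENTITY_PATTERNS = {
    "task": re.compile(r"[a-zA-Z0-9]+"),
    "run": re.compile(r"\d+"),
}


def matches_entity_pattern(name, value):
    """Return True if `value` is a plausible value for entity `name`."""
    pattern = ENTITY_PATTERNS.get(name)
    if pattern is None:
        return True  # no known pattern, nothing to check
    # Dicts, lists, etc. coming from JSON can never be valid entity values.
    if not isinstance(value, (str, int)):
        return False
    return pattern.fullmatch(str(value)) is not None


candidates = ["sherlock", {"Description": "tasks collected"}, "lucy"]
print([v for v in candidates if matches_entity_pattern("task", v)])
# ['sherlock', 'lucy']
```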

@adelavega
Collaborator Author

Ah, sorry I missed that PR.

I think Chris summarized it nicely here: #682 (comment)

This PR is in the spirit of Tal's suggestion to "finesse" the logic of __repr__ (which actually relies on get) to avoid crashing on meta-data coming from JSON and not file names.

The alternative is to catch this on ingestion of entities. That is, not read in TSV sidecars at all, like Tal suggested, and enforce type and regex on incoming entities prior to adding them to the Tag table in the db (although arguably isn't this the job of the validator?).

The one issue I have with this (aside from this fix being easier), is that then if you call get_metadata on a TSV file, it will not return those values, because get_metadata relies on the entities that were read in. That is a minor issue but still.

@effigies WDYT?

Comment on lines 678 to 679
base_entities = self.get_entities(metadata=False)
metadata = False if target in base_entities else True
Collaborator

This ternary is just a `not in` test, and we don't reuse base_entities. Maybe just add a comment?

Suggested change
- base_entities = self.get_entities(metadata=False)
- metadata = False if target in base_entities else True
+ # Fetch metadata if target is not a filename entity
+ metadata = target not in self.get_entities(metadata=False)

- results = [x for x in results if target in x.entities]
+ base_entities = self.get_entities(metadata=False)
+ metadata = False if target in base_entities else True
+ results = [x for x in results if target in x.get_entities(metadata=metadata)]
Collaborator

The number of times we're calling get_entities feels high, and I think it's a DB query. What if we dropped this and did:

if return_type == "id":
    ent_iter = (x.get_entities(metadata=metadata) for x in results)
    results = list({
        ents[target] for ents in ent_iter
        if isinstance(ents.get(target, {}), Hashable)  # from collections.abc
    })
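
To see why the `ents.get(target, {})` default works as a filter, here is a self-contained sketch on plain dicts (the `entity_dicts` data is made up for illustration). An empty dict is not `Hashable`, so files that lack the target are dropped by the same check that drops dict-valued metadata:

```python
from collections.abc import Hashable

# Made-up per-file entity dicts, standing in for x.get_entities(...).
entity_dicts = [
    {"task": "sherlock"},
    {"task": "lucy"},
    {"task": {"Description": "tasks collected"}},  # unhashable metadata value
    {"subject": "01"},                             # target missing entirely
]
target = "task"

unique = {
    ents[target]
    for ents in entity_dicts
    # Missing targets fall back to {}, which fails the Hashable check.
    if isinstance(ents.get(target, {}), Hashable)
}
print(sorted(unique))  # ['lucy', 'sherlock']
```

One caveat: `isinstance(value, Hashable)` only checks for a `__hash__` method, so a tuple that contains a list would pass the check and still raise `TypeError` when added to the set.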

Collaborator Author

from collections.abc import Hashable
Check that return_type == 'dir' still works.

@adelavega
Collaborator Author

@oesteban: @effigies and I discussed this, and we think this PR should take care of the issue.

All entities being read in from file names are validated, and thus no invalid values should sneak in. Here, we limit it so that only real entities are queried when return_type='id'. That way, no BIDS invalid entities should be returned (no need to filter other types of queries, I don't think).

The alternative would be to do this upon ingestion of meta-data, and filter out meta-data that looks like an entity (but isn't) from being ingested. However, this causes two problems: 1) it's not illegal in BIDS (yet), and 2) if you call .get_metadata on a TSV file that had an entity-like entry in its JSON sidecar, that entry would be filtered out, which is weird because it's not a faithful representation of the dataset (which again, is technically legal).

I prefer this solution because it keeps .get working but also allows currently legal meta-data to still exist if the user wants to inspect it.

@adelavega
Collaborator Author

Summary of how this should behave: if return_type='id' and the target is a proper entity (one that should be defined in filenames), then call get_entities with metadata=False. Otherwise, rely on frozendicts to make non-hashable metadata (i.e. dicts) hashable. What about lists? Worry about that in a separate PR.
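
For the open question about lists, one possible approach (an assumption for illustration, not what this PR or pybids implements) is to recursively convert containers into hashable equivalents before building the set. The `freeze` helper below is hypothetical:

```python
def freeze(value):
    """Recursively convert dicts, lists, and sets into hashable equivalents."""
    if isinstance(value, dict):
        return frozenset((key, freeze(val)) for key, val in value.items())
    if isinstance(value, (list, tuple)):
        return tuple(freeze(item) for item in value)
    if isinstance(value, set):
        return frozenset(freeze(item) for item in value)
    return value  # assume scalars are already hashable


values = [[1, 2], [1, 2], {"Description": ["a", "b"]}, 2.5]
unique = {freeze(v) for v in values}
print(len(unique))  # 3
```

The trade-off is that the returned values are no longer the original objects (a list comes back as a tuple), so a caller would need to accept that or keep a mapping back to the originals.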

@adelavega adelavega changed the title Only get non-metadata entity values if target is a core entity key RF: Associate "is_metadata" with Tag, not Entity; and only return non-metadata entries for core Entities in get(return_type='id') Jul 21, 2021
@adelavega adelavega changed the title RF: Associate "is_metadata" with Tag, not Entity; and only return non-metadata entries for core Entities in get(return_type='id') FIX: Associate "is_metadata" with Tag, not Entity; and only return non-metadata entries for core Entities in get(return_type='id') Jul 21, 2021
@adelavega
Collaborator Author

adelavega commented Jul 21, 2021

Turns out to make this change possible, I had to change where is_metadata was stored in the db.

Previously, Entity objects had an is_metadata column, but this is set upon the initial creation of the Entity.

This means that an Entity could have Tags that are from both metadata and filename sources, but the Entity could either have is_metadata as true or false depending on the order the Tags were ingested.

For me, this meant that the "Task" Entity said is_metadata=False even though it did return meta-data entries (which caused layout.get_tasks to fail).

To fix this, I moved is_metadata to Tag in the db, and modified all of the queries accordingly.
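
A simplified sketch of the idea, using dataclasses instead of the actual SQLAlchemy models (the class and method names here are stand-ins and do not mirror the real schema). With the flag stored on each `Tag`, an `Entity` can hold values from both sources and still be filtered per value, regardless of ingestion order:

```python
from dataclasses import dataclass, field


@dataclass
class Tag:
    value: object
    is_metadata: bool  # the flag lives on the individual value


@dataclass
class Entity:
    name: str
    tags: list = field(default_factory=list)

    def values(self, metadata=None):
        """Return tag values, optionally filtered by their source."""
        return [
            tag.value
            for tag in self.tags
            if metadata is None or tag.is_metadata == metadata
        ]


task = Entity("task")
# The metadata tag is ingested first; this no longer decides anything
# about the Entity as a whole.
task.tags.append(Tag({"Description": "tasks collected"}, is_metadata=True))
task.tags.append(Tag("sherlock", is_metadata=False))
task.tags.append(Tag("lucy", is_metadata=False))

print(task.values(metadata=False))  # ['sherlock', 'lucy']
```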

In an example where the "task" Entity is defined in both meta-data and filenames, the following would happen:

layout.get_entities(metadata=False)
>>
{'suffix': <bids.layout.models.Entity at 0x7fbd984e4640>,
 'extension': <bids.layout.models.Entity at 0x7fbd984e43d0>,
 'subject': <bids.layout.models.Entity at 0x7fbd931a2c40>,
 'scans': <bids.layout.models.Entity at 0x7fbd931a2e20>,
 'task': <bids.layout.models.Entity at 0x7fbd931a2310>,
 'datatype': <bids.layout.models.Entity at 0x7fbd931a8f70>,
 'run': <bids.layout.models.Entity at 0x7fbd931a8640>}

layout.get_entities(metadata=True)
>>
{'age': <bids.layout.models.Entity at 0x7fbd98539880>,
 'comprehension': <bids.layout.models.Entity at 0x7fbd985399a0>,
 'condition': <bids.layout.models.Entity at 0x7fbd985a96d0>,
 'sex': <bids.layout.models.Entity at 0x7fbd985a9730>,
 'task': <bids.layout.models.Entity at 0x7fbd931a2310>,
...
}

That is, task is an entity which has Tag values both from meta-data and non-metadata sources.
For example:

ent = layout.get_entities(metadata=False)['task']
[t.value for t in ent.tags.values()]
>>>
[ 
   {'Description': 'tasks (story stimuli) collected for participant'},
   'sherlock', 
   'lucy',
   ...
]

Going back to the original problem, the get function now only looks at non-metadata values of the task Entity when you call get_tasks:

layout.get_tasks()
>>>
['milkyway', 'piemanpni', 'shapessocial', 'black', 'bronx', 'forgot', 'sherlock', 'tunnel', 'prettymouth', 'shapesphysical', 'lucy', 'notthefallintact', 'pieman', 'notthefalllongscram', 'schema', '21styear', 'notthefallshortscram', 'slumlordreach', 'merlin']

This result does not include the dict meta-data, as expected.

Collaborator

@effigies effigies left a comment

Looks like some stray middle-clicks.

bids/layout/layout.py (outdated, resolved)
bids/layout/layout.py (outdated, resolved)
Collaborator

@effigies effigies left a comment

This looks good otherwise. Only question: Do we have a way of invalidating old database files?

adelavega and others added 2 commits July 25, 2021 17:34
Co-authored-by: Chris Markiewicz <[email protected]>
Co-authored-by: Chris Markiewicz <[email protected]>
@adelavega
Collaborator Author

Thanks.

No, and it looks like it doesn't crash on initialization, only once you try to access a property that doesn't exist in the old db (e.g. layout.get_subjects()).

The only way I can see us handling this proactively is adding BIDS versions to the db dirs, and throwing a warning if you load a saved layout from a previous version. That would be good for a different PR.
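
A rough sketch of what such a version stamp could look like. Everything here is hypothetical: the `layout_version.json` file name and the `write_version` / `check_version` helpers are invented for illustration and are not part of pybids:

```python
import json
import warnings
from pathlib import Path

VERSION_FILE = "layout_version.json"  # hypothetical file name


def write_version(db_dir, version):
    """Record the version that created the saved database directory."""
    Path(db_dir, VERSION_FILE).write_text(json.dumps({"version": version}))


def check_version(db_dir, current_version):
    """Warn and return False if the saved version is missing or different."""
    path = Path(db_dir, VERSION_FILE)
    saved = json.loads(path.read_text())["version"] if path.exists() else None
    if saved != current_version:
        warnings.warn(
            f"Saved layout was written by version {saved!r}, but the current "
            f"version is {current_version!r}; consider re-indexing."
        )
        return False
    return True
```

A loader could call `check_version` before opening the database, so the mismatch surfaces at load time rather than on the first query that touches a missing column.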

@adelavega
Collaborator Author

Tests passing, merging but opening a new issue for what you mentioned @effigies

@adelavega adelavega merged commit 6520748 into master Jul 26, 2021
@adelavega adelavega deleted the fix/get branch July 26, 2021 20:39