Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH]: exclude common non-BIDS data sources + evade potential named tuple problem #277

Closed
wants to merge 3 commits into from

Conversation

adswa
Copy link
Contributor

@adswa adswa commented Oct 31, 2018

This PR

  • fixes one minor typo,
  • wants to have layout exclude non-bids conform datasources
  • wants to have layout extract only the path (instead of using the large self.layout.get_file(f) object) and evade a potential problem with named tuples

# A set is extended with non-BIDS-defined common suspects
# which are not expected to carry any data of interest
excludes = {"code", "stimuli", "sourcedata", "models", "derivatives",
".git", ".datalad", ".cache", ".workdir", "workdir"}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than hard-coding these, I think it's reasonable to expect users to maintain a .bidsignore file, as these should also cause validator issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

> cat .bidsignore
inputs/
workdir/
eyetrack/
README.md
model-frst_smdl.json
CHANGELOG.md

this is what I have in my .bidsignore, files in workdir were still considered.
@yarikoptic remarks: "Would it ever makes sense to consider .git files?"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree .git should never be considered. I'm not sure .bidsignore reading is implemented. That would be a valuable contribution. But I also think sensible hardcoded defaults (such as these) make sense.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, please update the docstring

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to specify sensible defaults, I'd be inclined to disallow all .* files except .bidsignore.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason these particular values are hardcoded is that they're explicitly mentioned in the spec as places people should put stuff, but we don't want to index them by default. The other directories (e.g., .git) are not part of the spec, so I don't want to give them special handling. I we start adding things like workdir/, there's no limit to what people might want to add.

I agree that .bidsignore support is the right way to handle this. Pybids currently passes (in kwargs) an exclude argument onto grabbit, which one can use to pass arbitrary regexes that (I believe) are matched against every file. So probably the easiest way to implement .bidsignore support would be to add entries in .bidsignore to the exclude list. There may be unintended consequences, however, so this should be investigated.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me, but I think excluding .* folders by default is probably a good idea (when would it be a good idea to scan them in?) Other folders can go into a bidsignore

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to specify sensible defaults, I'd be inclined to disallow all .* files except .bidsignore.

Just curious -- what for to allow/consider .bidsignore? if any function would be needing to treat this this file, it would just test for its presence and open it. It doesn't need it to be a part of the Layout schema/pool of files, since there is nothing to query for in its name.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other directories (e.g., .git) are not part of the spec, so I don't want to give them special handling. I we start adding things like workdir/, there's no limit to what people might want to add.

I agree that .bidsignore support is the right way to handle this.

I agree too! Support of .bidsignore would allow for all those workdir etc. But also I agree with @adelavega -- I do not think in any BIDS spec we would/should allow/be interested in files/directories starting with . for the purpose of paths matching etc.
As I have pointed above, .bidsignore is a special file which is not part of the path layout template matching and actually not a part of the BIDS specification itself! Its existence was instigated within bids-validator, so it somewhat of a tooling level file. So it should IMHO be ok to ignore all . files/dirs in general for pattern matching. I think it might even be worth to "hardcode" such consideration in BIDS specification itself? e.g.

"Note on BIDS dataset tools support/development: No directories or files in BIDS are generally named with leading . and thus should be ignored for the purpose of discovery of relevant data files. A .bidsignore file can be used by dataset users to inform tools of additional directories/files which should also be ignored by the tools. The format of .bidsignore file is .."

Copy link
Collaborator

@tyarkoni tyarkoni Nov 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed—let's add "^\." to the list of default excludes (if I'm not mistaken, they're regexes by default) and remove the other entries (i.e., workdir and anything that starts with .). We can add .bidsignore support in a separate PR, unless someone wants to take a pass at it here. It shouldn't be difficult—unless I'm missing something, I think just adding all the lines in that file to the internally-constructed exclude list that gets passed onto grabbit's Layout initializer should do the trick.

I don't have strong feelings about whether or not the spec should mention . and .bidsignore. I guess I lean towards not for the time being; to my mind the place to put that would be some kind of developer guide or appendix.

@codecov
Copy link

codecov bot commented Oct 31, 2018

Codecov Report

Merging #277 into master will decrease coverage by 0.08%.
The diff coverage is 66.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #277      +/-   ##
==========================================
- Coverage   71.16%   71.07%   -0.09%     
==========================================
  Files          24       24              
  Lines        2507     2510       +3     
  Branches      609      610       +1     
==========================================
  Hits         1784     1784              
- Misses        558      561       +3     
  Partials      165      165
Flag Coverage Δ
#unittests 71.07% <66.66%> (-0.09%) ⬇️
Impacted Files Coverage Δ
bids/layout/layout.py 73.89% <66.66%> (-0.76%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e35ced6...89e41ef. Read the comment docs.

Copy link
Collaborator

@tyarkoni tyarkoni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hard-coded list should only include directories mentioned in the spec. See comments.

@@ -563,18 +566,24 @@ def index_file(self, f, overwrite=False):
an entry already exists.
"""
if isinstance(f, six.string_types):
f = self.layout.get_file(f)
filename = self.layout.get_file(f).path
elif isinstance(f, tuple):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the motivation for this? For methods like this, I'd prefer to minimize the number of valid types that can be passed as input. It's cleaner if the caller has to first extract a string or BIDSFile, I think. Plus, long term, I'd like to get rid of the namedtuple entirely—I think it mostly just adds confusion.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Motivation was pointed to below this line: http://github.com/grabbles/grabbit/pull/84 . In the effort of using fitlins with current pybids we ended up here with the File tuple which has no .path (see that issue for our past discussion). Since our attempts are largely shots in the darkness trying to maneuver between fluid APIs, I did not want to try to change more than minimally necessary and this change was such a minimal change to stay compatible possibly with older and newer pybids API.

Copy link
Collaborator

@tyarkoni tyarkoni Nov 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yarikoptic How would you feel about getting rid of the namedtuple entirely, and just returning the File object by default? The main functional difference is that entity names are attributes in the tuple, but not in the File class. But for the sake of backward compatibility, I'd be okay with hacking getattr to check in the .entities dict before raising an exception. That way I think nobody's existing code would break, and we would reduce code complexity. @effigies @adelavega feel free to chime in also.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see no problems with it, as long as its backwards compatible. It does seem a bit redundant and confusing two have two types of objects that can be returned.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've opened #281 to deal with this.

# A set is extended with non-BIDS-defined common suspects
# which are not expected to carry any data of interest
excludes = {"code", "stimuli", "sourcedata", "models", "derivatives",
".git", ".datalad", ".cache", ".workdir", "workdir"}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason these particular values are hardcoded is that they're explicitly mentioned in the spec as places people should put stuff, but we don't want to index them by default. The other directories (e.g., .git) are not part of the spec, so I don't want to give them special handling. I we start adding things like workdir/, there's no limit to what people might want to add.

I agree that .bidsignore support is the right way to handle this. Pybids currently passes (in kwargs) an exclude argument onto grabbit, which one can use to pass arbitrary regexes that (I believe) are matched against every file. So probably the easiest way to implement .bidsignore support would be to add entries in .bidsignore to the exclude list. There may be unintended consequences, however, so this should be investigated.

@tyarkoni
Copy link
Collaborator

The namedtuple is now deprecated, and a BIDSFile is returned by default. @AdinaWagner, is still relevant, or can we close it?

@tyarkoni
Copy link
Collaborator

Oh, right, there's still the outstanding question of how to handle exclusions. I think my suggestion to add ^\. to the list of default exclusions still makes sense. May be easiest to close this and open a separate PR though.

@yarikoptic
Copy link
Collaborator

I think ignoring ^\. paths sounds like a reasonable idea and also will be consistent with bids-validator. But for anything in addition -- please see/continue at bids-standard/bids-specification#131 . Meanwhile I guess we will just hardcode the ones we need to exclude in the custom fork for now. Alternative could be to implement a very very basic support for .bidsignore, just taking simple globs OR using https://git-scm.com/docs/git-check-ignore but that would make git a hard dependency

@effigies effigies mentioned this pull request Feb 1, 2019
7 tasks
@effigies effigies added this to the 0.8.0 milestone Feb 1, 2019
@tyarkoni
Copy link
Collaborator

I think this PR is no longer necessary given various changes in 0.7 and 0.8, but it would in any event probably be less work to start from scratch than to resolve the merge conflicts. So I'm closing it.

@tyarkoni tyarkoni closed this Mar 21, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants