-
Notifications
You must be signed in to change notification settings - Fork 124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH]: exclude common non-BIDS data sources + evade potential named tuple problem #277
Conversation
# A set is extended with non-BIDS-defined common suspects | ||
# which are not expected to carry any data of interest | ||
excludes = {"code", "stimuli", "sourcedata", "models", "derivatives", | ||
".git", ".datalad", ".cache", ".workdir", "workdir"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than hard-coding these, I think it's reasonable to expect users to maintain a .bidsignore
file, as these should also cause validator issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
> cat .bidsignore
inputs/
workdir/
eyetrack/
README.md
model-frst_smdl.json
CHANGELOG.md
this is what I have in my .bidsignore, files in workdir were still considered.
@yarikoptic remarks: "Would it ever makes sense to consider .git files?"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree .git
should never be considered. I'm not sure .bidsignore
reading is implemented. That would be a valuable contribution. But I also think sensible hardcoded defaults (such as these) make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, please update the docstring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to specify sensible defaults, I'd be inclined to disallow all .*
files except .bidsignore
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason these particular values are hardcoded is that they're explicitly mentioned in the spec as places people should put stuff, but we don't want to index them by default. The other directories (e.g., .git) are not part of the spec, so I don't want to give them special handling. I we start adding things like workdir/
, there's no limit to what people might want to add.
I agree that .bidsignore
support is the right way to handle this. Pybids currently passes (in kwargs
) an exclude
argument onto grabbit, which one can use to pass arbitrary regexes that (I believe) are matched against every file. So probably the easiest way to implement .bidsignore
support would be to add entries in .bidsignore
to the exclude
list. There may be unintended consequences, however, so this should be investigated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me, but I think excluding .*
folders by default is probably a good idea (when would it be a good idea to scan them in?) Other folders can go into a bidsignore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to specify sensible defaults, I'd be inclined to disallow all
.*
files except.bidsignore
.
Just curious -- what for to allow/consider .bidsignore? if any function would be needing to treat this this file, it would just test for its presence and open it. It doesn't need it to be a part of the Layout schema/pool of files, since there is nothing to query for in its name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other directories (e.g., .git) are not part of the spec, so I don't want to give them special handling. I we start adding things like
workdir/
, there's no limit to what people might want to add.I agree that
.bidsignore
support is the right way to handle this.
I agree too! Support of .bidsignore would allow for all those workdir
etc. But also I agree with @adelavega -- I do not think in any BIDS spec we would/should allow/be interested in files/directories starting with .
for the purpose of paths matching etc.
As I have pointed above, .bidsignore
is a special file which is not part of the path layout template matching and actually not a part of the BIDS specification itself! Its existence was instigated within bids-validator, so it somewhat of a tooling level file. So it should IMHO be ok to ignore all .
files/dirs in general for pattern matching. I think it might even be worth to "hardcode" such consideration in BIDS specification itself? e.g.
"Note on BIDS dataset tools support/development: No directories or files in BIDS are generally named with leading .
and thus should be ignored for the purpose of discovery of relevant data files. A .bidsignore
file can be used by dataset users to inform tools of additional directories/files which should also be ignored by the tools. The format of .bidsignore
file is .."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed—let's add "^\."
to the list of default excludes (if I'm not mistaken, they're regexes by default) and remove the other entries (i.e., workdir
and anything that starts with .
). We can add .bidsignore
support in a separate PR, unless someone wants to take a pass at it here. It shouldn't be difficult—unless I'm missing something, I think just adding all the lines in that file to the internally-constructed exclude
list that gets passed onto grabbit's Layout
initializer should do the trick.
I don't have strong feelings about whether or not the spec should mention .
and .bidsignore
. I guess I lean towards not for the time being; to my mind the place to put that would be some kind of developer guide or appendix.
Codecov Report
@@ Coverage Diff @@
## master #277 +/- ##
==========================================
- Coverage 71.16% 71.07% -0.09%
==========================================
Files 24 24
Lines 2507 2510 +3
Branches 609 610 +1
==========================================
Hits 1784 1784
- Misses 558 561 +3
Partials 165 165
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The hard-coded list should only include directories mentioned in the spec. See comments.
@@ -563,18 +566,24 @@ def index_file(self, f, overwrite=False): | |||
an entry already exists. | |||
""" | |||
if isinstance(f, six.string_types): | |||
f = self.layout.get_file(f) | |||
filename = self.layout.get_file(f).path | |||
elif isinstance(f, tuple): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the motivation for this? For methods like this, I'd prefer to minimize the number of valid types that can be passed as input. It's cleaner if the caller has to first extract a string or BIDSFile
, I think. Plus, long term, I'd like to get rid of the namedtuple entirely—I think it mostly just adds confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Motivation was pointed to below this line: http://github.com/grabbles/grabbit/pull/84 . In the effort of using fitlins with current pybids we ended up here with the File
tuple which has no .path
(see that issue for our past discussion). Since our attempts are largely shots in the darkness trying to maneuver between fluid APIs, I did not want to try to change more than minimally necessary and this change was such a minimal change to stay compatible possibly with older and newer pybids API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yarikoptic How would you feel about getting rid of the namedtuple entirely, and just returning the File
object by default? The main functional difference is that entity names are attributes in the tuple, but not in the File
class. But for the sake of backward compatibility, I'd be okay with hacking getattr
to check in the .entities
dict before raising an exception. That way I think nobody's existing code would break, and we would reduce code complexity. @effigies @adelavega feel free to chime in also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see no problems with it, as long as its backwards compatible. It does seem a bit redundant and confusing two have two types of objects that can be returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've opened #281 to deal with this.
# A set is extended with non-BIDS-defined common suspects | ||
# which are not expected to carry any data of interest | ||
excludes = {"code", "stimuli", "sourcedata", "models", "derivatives", | ||
".git", ".datalad", ".cache", ".workdir", "workdir"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason these particular values are hardcoded is that they're explicitly mentioned in the spec as places people should put stuff, but we don't want to index them by default. The other directories (e.g., .git) are not part of the spec, so I don't want to give them special handling. I we start adding things like workdir/
, there's no limit to what people might want to add.
I agree that .bidsignore
support is the right way to handle this. Pybids currently passes (in kwargs
) an exclude
argument onto grabbit, which one can use to pass arbitrary regexes that (I believe) are matched against every file. So probably the easiest way to implement .bidsignore
support would be to add entries in .bidsignore
to the exclude
list. There may be unintended consequences, however, so this should be investigated.
The namedtuple is now deprecated, and a |
Oh, right, there's still the outstanding question of how to handle exclusions. I think my suggestion to add |
I think ignoring |
I think this PR is no longer necessary given various changes in 0.7 and 0.8, but it would in any event probably be less work to start from scratch than to resolve the merge conflicts. So I'm closing it. |
This PR