Add onlyMatching argument to syncFromSynapse to filter downloads #900
Hello @talkdirty! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:
Thanks for your contribution! I made a few comments that should speed up the matching as well as simplify the changes that need to happen.
:param onlyMatching Determines list of regexes to be matched against files.
                    Only if at least one file matches the regex, it will
                    be downloaded.
                    Defaults to None
What do you think of simplifying this argument to a single regex, since multiple patterns can be combined using `|`? My rationale is that we can use `re.compile()` downstream on a single expression to be efficient. I'm guessing evaluating a single pre-compiled regex will be faster than compiling and matching a regex (which is what happens when you use `re.match()`) in a for-loop.
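A minimal sketch of the single-regex idea, assuming the caller starts from a list of patterns. The pattern strings and file names are made up for illustration; they are not taken from the PR.

```python
import re

# Hypothetical patterns a caller might otherwise have passed as a list.
patterns = [r".*\.csv", r".*\.tsv"]

# Join them into one alternation. Each pattern is wrapped in a
# non-capturing group so a "|" inside one pattern stays local to it.
combined = re.compile("|".join(f"(?:{p})" for p in patterns))

# Hypothetical entity names; the compiled pattern is matched once per name.
names = ["data.csv", "notes.txt", "table.tsv"]
matching = [name for name in names if combined.match(name)]
```

Note that the `re` module caches recently compiled patterns, so the gain over calling `re.match()` in a loop may be smaller than expected; the main benefit is likely the simpler call site.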
We could either expect a pre-compiled regex or perform the compilation in `syncFromSynapse()`.
Can we think of additional use cases other than filtering on the entity name? I'm just wondering if we can generalize this parameter to being a dictionary mapping entity properties/annotations (keys) to pre-compiled regular expressions (values)?
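To make the dictionary idea concrete, here is a rough sketch. The function name `entity_matches`, the example keys, and the rule that every listed key must match are all assumptions for discussion, not an existing synapseclient API.

```python
import re
from typing import Dict, Mapping


def entity_matches(properties: Mapping[str, str],
                   filters: Dict[str, re.Pattern]) -> bool:
    """Return True only if every filtered key exists and its value matches.

    Hypothetical helper: "all keys must match" is an assumed rule.
    """
    for key, pattern in filters.items():
        value = properties.get(key)
        if value is None or pattern.match(str(value)) is None:
            return False
    return True


# Hypothetical filters keyed by entity property or annotation name.
filters = {
    "name": re.compile(r".*\.csv"),
    "assay": re.compile(r"rnaSeq|wgs"),
}

keep = entity_matches({"name": "counts.csv", "assay": "rnaSeq"}, filters)
skip = entity_matches({"name": "counts.csv", "assay": "chipSeq"}, filters)
```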
@thomasyu888: What do you think?
Shifting conversation to here: https://sagebionetworks.jira.com/browse/SYNPY-1236. @vpchung will be tackling this when she has the time.
)

file_matches = True
if onlyMatching is not None:
Nit (i.e. optional): This can be simplified to `if onlyMatching:` or perhaps more defensively as `if isinstance(onlyMatching, re.Pattern)` if we expect pre-compiled regex objects.
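One way to combine the defensive check with compiling inside the function is a small normalizing helper. This is only a sketch: `normalize_only_matching` is a hypothetical name, and accepting both strings and compiled patterns is an assumption.

```python
import re
from typing import Optional, Union


def normalize_only_matching(
    only_matching: Union[str, re.Pattern, None]
) -> Optional[re.Pattern]:
    """Return a compiled pattern, or None when no filter was requested."""
    if only_matching is None:
        return None
    if isinstance(only_matching, re.Pattern):
        # Already compiled by the caller; use it as-is.
        return only_matching
    if isinstance(only_matching, str):
        return re.compile(only_matching)
    raise TypeError(
        "onlyMatching must be a str or a compiled regex, "
        f"got {type(only_matching).__name__}"
    )
```

The explicit `is None` check also avoids a subtle difference with `if onlyMatching:`, which would treat an empty string pattern as "no filter".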
for regex in onlyMatching:
    if re.match(regex, entity_meta.name) is not None:
        file_matches = True
As per my above comment, I think this part can be made more efficient if `re.compile()` was used upstream on a single regex (optionally, with multiple patterns separated by `|`) and then matched once here, like:
onlyMatching.match(entity_meta.name)
    if re.match(regex, entity_meta.name) is not None:
        file_matches = True

if file_matches:
I don't think we need the `file_matches` flag. Couldn't we simply short-circuit this function with an early return statement if `onlyMatching` is set and there are no matches? This way, we avoid indenting the `self._syn.get(..., downloadFile=downloadFile)` bit. We would have to run the `parent_folder_sync.update()` statement you have below immediately before the return statement.
I'll happily elaborate if my suggestion isn't clear.
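The early-return shape could look roughly like the sketch below. `sync_file`, `download`, and `update_progress` are hypothetical stand-ins for the real method, the `self._syn.get(...)` call, and the `parent_folder_sync.update()` call, whose actual arguments are not shown in this thread.

```python
import re
from typing import Callable, Optional


def sync_file(name: str,
              only_matching: Optional[re.Pattern],
              download: Callable[[str], object],
              update_progress: Callable[[], None]):
    # Short-circuit: a filter is set and this name does not match it.
    if only_matching is not None and only_matching.match(name) is None:
        # Still record progress so the parent folder's bookkeeping stays right.
        update_progress()
        return None

    # No extra indentation needed for the normal download path.
    entity = download(name)
    update_progress()
    return entity


# Usage with stand-in callables.
downloaded = []
progress = []
pattern = re.compile(r".*\.csv")
for file_name in ["data.csv", "notes.txt"]:
    sync_file(file_name, pattern,
              download=lambda n: downloaded.append(n) or n,
              update_progress=lambda: progress.append(1))
```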
Thanks @talkdirty for your initial contributions. If you have no objections, we have decided to take this on by incorporating Bruno's suggestions. I will be merging this if we don't hear from you by end of next week and completing the feature by adding tests etc.
This is to elaborate on issue #899 with a concrete PoC with desired functionality. I don't know if the approach is the desired / most efficient one!