Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add onlyMatching argument to syncFromSynapse to filter downloads #900

Merged

Conversation

georggrab
Copy link

This is to elaborate on issue #899 with a concrete PoC with desired functionality. I don't know if the approach is the desired / most efficient one!

@pep8speaks
Copy link

Hello @talkdirty! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 246:121: E501 line too long (133 > 120 characters)
Line 264:121: E501 line too long (125 > 120 characters)
Line 272:1: W293 blank line contains whitespace
Line 327:121: E501 line too long (126 > 120 characters)

@thomasyu888 thomasyu888 requested a review from a team March 3, 2022 11:22
Copy link

@BrunoGrandePhD BrunoGrandePhD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution! I made a few comments that should speed up the matching as well as simplify the changes that need to happen.

Comment on lines +72 to +75
:param onlyMatching Determines list of regexes to be matched against files.
Only if at least one file matches the regex, it will
be downloaded.
Defaults to None

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of simplifying this argument to a single regex since multiple patterns can be combined using |? My rationale is that we can use re.compile() downstream on a single expression to be efficient. I'm guessing evaluating a single pre-compiled regex will be faster than compiling and matching a regex (which is what happens when you use re.match()) in a for-loop.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could either expect a pre-compiled regex or perform the compilation in syncFromSynapse().

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we think of additional use cases other than filtering on the entity name? I'm just wondering if we can generalize this parameter to being a dictionary mapping entity properties/annotations (keys) to pre-compiled regular expressions (values)?

@thomasyu888: What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shifting conversation to here: https://sagebionetworks.jira.com/browse/SYNPY-1236. @vpchung will be tackling this when she has the time.

)

file_matches = True
if onlyMatching is not None:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit (i.e. optional): This can be simplified to if onlyMatching: or perhaps more defensively as if isinstance(onlyMatching, re.Pattern) if we expect pre-compiled regex objects.

Comment on lines +280 to +282
for regex in onlyMatching:
if re.match(regex, entity_meta.name) is not None:
file_matches = True

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per my above comment, I think this part can be made more efficient if re.compile() was used upstream on a single regex (optionally, with multiple patterns separated by |) and then matched once here, like:

onlyMatching.match(entity_meta.name)

if re.match(regex, entity_meta.name) is not None:
file_matches = True

if file_matches:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need the file_matches flag. Couldn't we simply short-circuit this function with an early return statement if onlyMatching is set and there are no matches? This way, we avoid indenting the self._syn.get(..., downloadFile=downloadFile) bit. We would have to run the parent_folder_sync.update() statement you have below immediately before the return statement.

I'll happily elaborate if my suggestion isn't clear.

@thomasyu888 thomasyu888 changed the base branch from develop to SYNPY-1236-add-regex-filter August 5, 2022 19:53
@thomasyu888
Copy link
Member

thomasyu888 commented Aug 5, 2022

Thanks @talkdirty for your initial contributions. If you have no objections, we have decided to take this on by incorporating Bruno's suggestions. I will be merging this if we don't hear from you by end of next week and completing the feature by adding tests etc.

@thomasyu888 thomasyu888 merged commit 4aac54e into Sage-Bionetworks:SYNPY-1236-add-regex-filter Aug 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants