Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds sanitizer for preventing certain tags to enter search index based on parameters #2993

Merged
merged 1 commit into from
Mar 9, 2023

Conversation

biswajit-k
Copy link
Contributor

References #2953.
It would prevent certain tags which are only used for referencing some place to enter the search index, as they are not part of the current object also these tags wrongly populate the search index because of which sometimes actual places are pushed to lower rankings(or even removed) in search. For Example: #2652.

@biswajit-k
Copy link
Contributor Author

Hi @lonvia, I have made the required changes, please have a look.

Copy link
Member

@lonvia lonvia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks quite good already.

Your tests do not yet cover the use case, where only some of the arguments are given. For example, if the country_code parameter is not specified, then the filter should not check for country codes at all.

And thanks for fixing all my bad spelling. If you want to get it out of the way while working on the code comments, feel free to split them out in a separate PR. That can be merged quickly then.

As a general comment: you are welcome to rebase the branch for the PR at any time. So if you want to fix up or squash commits, feel free to do so.

if self.filter_kind(tag.kind):

if (obj.place.country_code in self.country_codes and
self._in_rank_addresses(obj.place.rank_address)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two conditions apply to the whole object, not to the individual tags. That means, if they are not fulfilled, none of the tags will filtered. You can therefore have a shortcut and check those up in line 77 already.

""" Return True if the given rank address lies in any of the
rank address intervals. Otherwise, returns False.
"""
return any(l <= rank <= r for l, r in self.rank_address_intervals)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can do the check for rank more efficiently here. You just need to do a bit more preprocessing: we know that there are only ranks 0-30. So during preprocessing of the incoming configuration you can create a set of allowed rank numbers. That means spell out every single number to be accepted. For example, from the default configuration of '0-30' you can create a allowed_ranks = set(range(0, 31)) and then simply check obj.place.rank_address in allowed_ranks.

(Even more advanced and faster: An alternative to a set would be a tuple of 31 booleans, so that eventually the lookup is allowed_ranks[obj.place.rank_address].)

@biswajit-k
Copy link
Contributor Author

biswajit-k commented Mar 6, 2023

Thanks for the review @lonvia. As suggested, I have raised a separate PR which fixes typos here and reverted the commit here.
I have refactored the logic for __call__ method and also for rank address. Furthermore, I have added default parameters to the sanitizer configuration and made changes in the tests accordingly. Please have a look.
Also, I will rebase the branch after no changes are required.

Copy link
Member

@lonvia lonvia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks mostly good now.

One more tiny thing: I did not mean to completely replace the test which has all parameters set but rather add to it. So if you could bring back the test which has all parameters set and checks that things still go well, this PR is good to go.

@biswajit-k
Copy link
Contributor Author

Just to confirm, now we need to add tests where all parameters are set in the sanitizer configuration.

@lonvia
Copy link
Member

lonvia commented Mar 8, 2023

Yes. You now have unit tests that make sure input for each single parameter is handled correctly. That is good. It gives a good code coverage for the tests. However, just because we know the single parts work, doesn't mean that they still do the right thing when combined. Testing all possible combinations of possible parameters is too much. So the second best thing is to have a test with all parameters set, tested against a couple of meaningful data points.

…d on parameters

fix: pylint error

added docs for delete tags sanitizer

fixed typos in docs and code comments

fix: python typechecking error

fixed rank address type

Revert "fixed typos in docs and code comments"

This reverts commit 6839eea.

added default parameters and refactored code

added test for all parameters
@biswajit-k
Copy link
Contributor Author

Hi @lonvia, I have added the tests. Please have a look.

@lonvia lonvia merged commit 9a5f75d into osm-search:master Mar 9, 2023
@lonvia
Copy link
Member

lonvia commented Mar 9, 2023

Thank you!

If you are interested in a bit of a follow-up: The _matches function you have added is very similar to what config.get_filter_kind() does. If you generalise get_filter_kind(), so that it takes the name of the config parameter as an argument, it could be reused for the name and suffix checks and you could get rid of _matches.

@biswajit-k
Copy link
Contributor Author

That's a great idea! I will create a new PR for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants