Adds sanitizer for preventing certain tags to enter search index based on parameters #2993

biswajit-k · 2023-03-02T15:20:26Z

References #2953.
It prevents tags that are only used to reference another place from entering the search index, since they are not part of the current object. These tags wrongly populate the search index, which sometimes pushes the actual places to lower rankings in search results (or removes them entirely). For example: #2652.

biswajit-k · 2023-03-03T12:36:01Z

Hi @lonvia, I have made the required changes, please have a look.

lonvia

This looks quite good already.

Your tests do not yet cover the use case, where only some of the arguments are given. For example, if the country_code parameter is not specified, then the filter should not check for country codes at all.

And thanks for fixing all my bad spelling. If you want to get it out of the way while working on the code comments, feel free to split them out in a separate PR. That can be merged quickly then.

As a general comment: you are welcome to rebase the branch for the PR at any time. So if you want to fix up or squash commits, feel free to do so.

lonvia · 2023-03-06T08:19:59Z

nominatim/tokenizer/sanitizers/delete_tags.py

+            if self.filter_kind(tag.kind):
+
+                if (obj.place.country_code in self.country_codes and
+                    self._in_rank_addresses(obj.place.rank_address)):


These two conditions apply to the whole object, not to the individual tags. That means, if they are not fulfilled, none of the tags will be filtered. You can therefore take a shortcut and check them up in line 77 already.
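
A minimal sketch of that shortcut, using stand-in classes rather than the real Nominatim types. `Place`, `Tag`, `Obj`, `DeleteTags`, and `allowed_ranks` are hypothetical simplifications; only `country_codes`, `filter_kind`, and `rank_address` mirror names from the diff. It assumes that matching tags are removed from a list of names on the object.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Place:
    country_code: str
    rank_address: int


@dataclass
class Tag:
    kind: str
    name: str


@dataclass
class Obj:
    place: Place
    names: List[Tag] = field(default_factory=list)


class DeleteTags:
    """Simplified stand-in for the sanitizer under review (hypothetical)."""

    def __init__(self, country_codes: Set[str], allowed_ranks: Set[int],
                 filter_kind: Callable[[str], bool]) -> None:
        self.country_codes = country_codes
        self.allowed_ranks = allowed_ranks
        self.filter_kind = filter_kind

    def __call__(self, obj: Obj) -> None:
        # Object-level conditions: checked once, before looking at any tag.
        if obj.place.country_code not in self.country_codes:
            return
        if obj.place.rank_address not in self.allowed_ranks:
            return

        # Only the per-tag condition remains inside the loop.
        obj.names = [tag for tag in obj.names
                     if not self.filter_kind(tag.kind)]
```

With this layout an object in the wrong country or rank returns immediately, and its tags are never inspected.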

lonvia · 2023-03-06T08:29:07Z

nominatim/tokenizer/sanitizers/delete_tags.py

+        """ Return True if the given rank address lies in any of the
+            rank address intervals. Otherwise, returns False.
+         """
+        return any(l <= rank <= r for l, r in self.rank_address_intervals)


You can do the check for rank more efficiently here. You just need to do a bit more preprocessing: we know that there are only ranks 0-30. So during preprocessing of the incoming configuration you can create a set of allowed rank numbers. That means spelling out every single number to be accepted. For example, from the default configuration of '0-30' you can create an allowed_ranks = set(range(0, 31)) and then simply check obj.place.rank_address in allowed_ranks.

(Even more advanced and faster: An alternative to a set would be a tuple of 31 booleans, so that eventually the lookup is allowed_ranks[obj.place.rank_address].)
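
A sketch of both variants, assuming the configuration delivers intervals as strings such as '0-30'. The helper names `parse_rank_intervals` and `as_lookup_table`, and the support for a bare number like '30', are assumptions made for illustration and not part of the PR.

```python
from typing import Iterable, Set, Tuple


def parse_rank_intervals(intervals: Iterable[str]) -> Set[int]:
    """Expand strings such as '0-30' or '26' into a set of rank numbers."""
    allowed: Set[int] = set()
    for interval in intervals:
        low, _, high = interval.partition('-')
        start = int(low)
        # A bare number (no '-') is treated as a one-element interval.
        end = int(high) if high else start
        if not 0 <= start <= end <= 30:
            raise ValueError(f"Invalid rank interval: {interval!r}")
        allowed.update(range(start, end + 1))
    return allowed


def as_lookup_table(allowed: Set[int]) -> Tuple[bool, ...]:
    """Turn the set into a tuple of 31 booleans indexed by rank."""
    return tuple(rank in allowed for rank in range(31))
```

The per-object check then becomes either `rank in allowed_ranks` or `lookup[rank]`, with no loop over intervals at call time.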

biswajit-k · 2023-03-06T16:37:09Z

Thanks for the review @lonvia. As suggested, I have raised a separate PR which fixes typos here and reverted the commit here.
I have refactored the logic of the __call__ method and of the rank address check. Furthermore, I have added default parameters to the sanitizer configuration and changed the tests accordingly. Please have a look.
Also, I will rebase the branch once no further changes are required.

lonvia

Looks mostly good now.

One more tiny thing: I did not mean to completely replace the test which has all parameters set but rather add to it. So if you could bring back the test which has all parameters set and checks that things still go well, this PR is good to go.

biswajit-k · 2023-03-08T11:06:18Z

Just to confirm: we now need to add tests where all parameters are set in the sanitizer configuration?

lonvia · 2023-03-08T13:24:36Z

Yes. You now have unit tests that make sure input for each single parameter is handled correctly. That is good. It gives a good code coverage for the tests. However, just because we know the single parts work, doesn't mean that they still do the right thing when combined. Testing all possible combinations of possible parameters is too much. So the second best thing is to have a test with all parameters set, tested against a couple of meaningful data points.
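
As an illustration of such a combined test, here is a small self-contained sketch. `should_delete` is a hypothetical stand-in for the sanitizer's decision, not the Nominatim code, and the configuration keys 'filter-kind', 'country_code', and 'rank_address' as well as the sample values are assumptions.

```python
from typing import Any, Dict, Optional


def should_delete(kind: str, country: Optional[str], rank: int,
                  config: Dict[str, Any]) -> bool:
    """Return True if a tag of `kind` would be dropped for this place."""
    if kind not in config.get('filter-kind', ()):
        return False
    countries = config.get('country_code')
    # An unset country parameter means: do not check countries at all.
    if countries is not None and country not in countries:
        return False
    low, high = config.get('rank_address', (0, 30))
    return low <= rank <= high


# One configuration with every parameter set.
ALL_PARAMS = {'filter-kind': ['ref'],
              'country_code': ['de'],
              'rank_address': (26, 30)}

# A few meaningful data points: one full match, then one miss per parameter.
CASES = [
    ('ref', 'de', 30, True),    # everything matches
    ('name', 'de', 30, False),  # kind is not filtered
    ('ref', 'fr', 30, False),   # country does not match
    ('ref', 'de', 10, False),   # rank outside the interval
]


def test_all_parameters_combined() -> None:
    for kind, country, rank, expected in CASES:
        assert should_delete(kind, country, rank, ALL_PARAMS) is expected
```

Each negative case changes exactly one input relative to the positive case, so a failure points directly at the parameter whose interaction broke.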

…d on parameters
fix: pylint error
added docs for delete tags sanitizer
fixed typos in docs and code comments
fix: python typechecking error
fixed rank address type
Revert "fixed typos in docs and code comments"
This reverts commit 6839eea.
added default parameters and refactored code
added test for all parameters

biswajit-k · 2023-03-09T09:54:24Z

Hi @lonvia, I have added the tests. Please have a look.

lonvia · 2023-03-09T13:35:56Z

Thank you!

If you are interested in a bit of a follow-up: The _matches function you have added is very similar to what config.get_filter_kind() does. If you generalise get_filter_kind(), so that it takes the name of the config parameter as an argument, it could be reused for the name and suffix checks and you could get rid of _matches.
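
A rough sketch of what such a generalisation could look like. This does not reproduce the actual signature or behaviour of config.get_filter_kind(); the function name `get_filter`, the use of a plain mapping as configuration, the regular-expression matching, and the accept-everything default are all assumptions.

```python
import re
from typing import Any, Callable, Mapping, Sequence


def get_filter(config: Mapping[str, Any], param: str,
               default: Sequence[str] = ()) -> Callable[[str], bool]:
    """Build a matcher from the pattern list stored under `param`."""
    patterns = config.get(param, default)
    if isinstance(patterns, str):
        # Allow a single pattern to be given without a list.
        patterns = [patterns]
    if not patterns:
        # Assumption: with nothing configured, every value is accepted.
        return lambda _: True
    regexes = [re.compile(p) for p in patterns]
    return lambda value: any(r.fullmatch(value) for r in regexes)
```

The same helper could then serve the kind, name, and suffix checks, for example `get_filter(config, 'filter-kind')` and `get_filter(config, 'suffix')`, which would make a separate `_matches` unnecessary.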

biswajit-k · 2023-03-10T09:41:35Z

That's a great idea! I will create a new PR for this.

lonvia reviewed Mar 6, 2023

lonvia reviewed Mar 8, 2023

biswajit-k force-pushed the delete-tags branch from ed92f5d to ca149fb · March 9, 2023 08:51

lonvia merged commit 9a5f75d into osm-search:master Mar 9, 2023

lonvia mentioned this pull request Mar 9, 2023

New sanitizer: delete-tags #2953

Closed

biswajit-k deleted the delete-tags branch March 10, 2023 12:13

biswajit-k mentioned this pull request Mar 11, 2023

generalize filter function for sanitizers #3006

Merged
