Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

generalize filter function for sanitizers #3006

Merged
merged 1 commit into from
Apr 11, 2023

Conversation

biswajit-k
Copy link
Contributor

As discussed here filter functions are commonly used by sanitizers to process only selected tags of places.

So, instead of having separate filters for each parameter of sanitizer configuration, we can have a generalized filter function get_filter() which takes the parameter name as one of the arguments and creates a filter function for it.

@biswajit-k
Copy link
Contributor Author

Hi @lonvia, could you please review the PR? Thanks.

or not self.allowed_ranks[obj.place.rank_address]
or self.has_country_code
and obj.place.country_code not in self.country_codes
):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, I have to complain about style here. I usually use a slash for multilines:

        if not tags \
            or not self.allowed_ranks[obj.place.rank_address] \
            or self.has_country_code \
            and obj.place.country_code not in self.country_codes:

We really should add this to the style guide.


assert filt(kind)


@pytest.mark.parametrize('kind', ('de ', '123', '', 'bedece'))
@pytest.mark.parametrize("kind", ("de ", "123", "", "bedece", None))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please go with the existing quoting style.

The convention for quotes is:

  • use single quotes for all simple strings unless the string contains a single string (so, it should be 'London' but "London's subway")
  • use double quotes for format strings (i.e. f"London {foo}")

I know that this convention is not followed everywhere but we should at least get it consistent in new code.

filt = SanitizerConfig({'filter-kind': '.*de'}).get_filter_kind()
filt = SanitizerConfig(
{
"name": "abc",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this additional parameter needed?

@biswajit-k
Copy link
Contributor Author

Hi @lonvia, I have made the required changes. Also, noted the suggestions for the future. Thanks!

Copy link
Member

@lonvia lonvia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, two more small questions.

@@ -289,7 +289,7 @@ def run_sanitizer_on(self, country_code, rank_addr, suffix, **kwargs):

def test_string_arguments_pass(self):
res = self.run_sanitizer_on('de', '25-30', r'[\s\S]*',
name='foo', ref='foo', name_abc='bar', ref_abc='baz')
name_xyz='foo', ref_pqr='foo', name_abc='bar', ref_abc='baz')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please explain why this needed to be changed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because of the functionality of get_filter function. If we want all suffixes to pass, we should ideally not provide the suffix parameter. If we provide a suffix parameter, it will not allow empty suffixes. This is because the function treats none and empty string as different.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. The None case is really hard to wrap your head around. I think for non-programmers it is easier to understand if an absent suffix is handled like the empty string. It would also make the code in get_filter easier because the function can go back to accepting str only.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this. For these scenarios, we can consider None value as an empty string safely. I'll add the changes.

"""
filters = self.get_string_list('filter-kind', default)
values = self.data.get(param, None)
filters = self.get_string_list(param) if values is not None else default
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the motivation to change the semantics of get_string_list()? It just makes the code here a lot more complicated. Without the change, you can simply leave the code here untouched and it works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, here we want to check if both values and default are none so I am explicitly checking that. With the current definition of get_string_list, it should not return none, because type checking would complain.

And in case we allow it to also return none, we would explicitly have to check for none values wherever we use it. Also, I think it would be good idea to let the function only return a list as it would make the function return value be more predictable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The None behaviour was the intention, the type annotation is wrong. It's odd that mypy never caught that error.

I see your point about having to check for None when using the function. However, once this PR is merged, I see only two uses of the function: here and for the country codes in the delete-tags sanitizer. Both would benefit from a None return. For the country_code it could replace the explicit has_country_code variable.

Oh well, up to you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That could be done, but I would prefer to be it like this.

"""
filters = self.get_string_list('filter-kind', default)
values = self.data.get(param, None)
filters = self.get_string_list(param) if values is not None else default
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The None behaviour was the intention, the type annotation is wrong. It's odd that mypy never caught that error.

I see your point about having to check for None when using the function. However, once this PR is merged, I see only two uses of the function: here and for the country codes in the delete-tags sanitizer. Both would benefit from a None return. For the country_code it could replace the explicit has_country_code variable.

Oh well, up to you.


if not filters:
if filters is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes the behaviour when the parameter is present but empty. It's an odd edge case but the idea was to have it behave as if the parameter was not present at all.

Copy link
Contributor Author

@biswajit-k biswajit-k Mar 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. This is less intuitive unless explicitly mentioned. What I can think of is - we can do either of two things-

  1. Go with these changes and let [] keep its meaning i.e allow nothing. Because to allow anything to pass we can just skip the parameter, which ideally makes sense.
  2. We can revert back to the original code and explicitly mention that [] is just like skipping the parameter. Also with that, if I allow get_string_list to take None default and return an empty list for it, we could bring back the get_string_list(param, default) thing. But it will not be usable for a few cases like in clean_housenumber, an empty list for convert-to-name tag actually means no names.

@lonvia please let me know what you think and then I will give it a go.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a while to untangle this. Problem here is that [] appears in two very different cases:

The first case is that somebody has given an empty field in yaml, e.g.

- step: clean-housenumbers
  filter-kind:

This results in param=''´ which is converted to [] by get_String_list. This is the case where I think it should behave as if the parameter was missing.

The second case is that a default of [] is given and the parameter is omitted. In your current implementation this is equivalent to 'let none pass', while default=None results in 'let all pass'.

So the actual fix for the first case (which I meant in my comment) is to change a line further up:

filters = self.get_string_list(param) if values else default

That said, it would be a good idea to be more explicit in the default parameter. Allow default=PASS_ALL, default=FAIL_ALL and default=[<list of regexes>]. Do not allow None or empty lists.

@biswajit-k
Copy link
Contributor Author

Hi @lonvia, I have made the following changes-

  1. Treat suffix as an empty string when it is None
  2. Treat parameter value as missing when it is omitted or its value is missing in yaml
  3. Add explicit values to the default parameter

Request you to please have a look.

@lonvia lonvia merged commit 5751686 into osm-search:master Apr 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants