New sanitizer: delete-tags #2953

Closed
lonvia opened this issue Jan 22, 2023 · 7 comments

Comments

@lonvia
Member

lonvia commented Jan 22, 2023

In order to filter out odd tagging practices like the postbox refs in Germany
(see #2652), we could do with a sanitizer that removes some names and address parts from
the list of indexed names. It should work along these lines:

Sanitizer 'delete-tags'

Removes a name or address part from the list when it matches all properties
given in the parameter list. There are two kinds of properties. Some refer to
the name and some refer to the place being indexed. If you specify only
place properties, then all names and address parts will be deleted and the
place will effectively not be searchable.

Where parameters contain regular expressions, they are always matched against
the full string. Add an exclamation mark (!) in front of the expression to
negate the match, i.e. delete the tag only if the property does not match.

When a parameter contains a list of conditions, a single match is sufficient.
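
The matching rules above (full-string regular expressions, `!` for negation, any entry of a list being sufficient) could be sketched roughly as follows. This is only an illustration: `matches` is a hypothetical helper, not part of Nominatim, and treating a missing property as an empty string is an assumption.

```python
import re
from typing import Iterable, Optional, Union


def matches(patterns: Union[str, Iterable[str]], value: Optional[str]) -> bool:
    """Return True if `value` satisfies at least one of `patterns`.

    Hypothetical helper for illustration only. Each pattern is a regular
    expression matched against the full string; a leading '!' negates it.
    """
    if isinstance(patterns, str):
        patterns = [patterns]
    # Assumption: a missing property (e.g. no suffix) is treated as ''.
    text = value or ''
    for pattern in patterns:
        negate = pattern.startswith('!')
        regex = pattern[1:] if negate else pattern
        found = re.fullmatch(regex, text) is not None
        # A plain pattern succeeds on a match, a negated one on a non-match.
        if found != negate:
            return True
    return False
```

With this sketch, `matches('ref', 'old_ref')` is false because the expression has to cover the whole string, while `matches('!ref', 'name')` is true because of the negation.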

Parameters:

 type - Either 'name' or 'address'.
 kind - A single string or a list of strings containing regular
        expressions that are matched against the 'kind' property of the name.
 suffix - A single string or a list of strings containing regular
          expressions that are matched against the 'suffix' property.
 name - A single string or a list of strings containing regular
        expressions that are matched against the 'name' property of the name.
 country_code - A single string or list of strings containing two-letter
                lower-case country codes. Only places within one of the listed
                countries will be deleted.
 rank_address - A single string or list of strings containing either a single
                number referring to a rank or a range of the form <from>-<to>.
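
As a rough sketch of how the `rank_address` parameter might be evaluated: the helper below accepts either a single entry or a list, where each entry is a single rank or a `<from>-<to>` range. The function name is hypothetical, and treating the range as inclusive on both ends is an assumption not stated above.

```python
from typing import Iterable, Union


def rank_matches(spec: Union[str, int, Iterable[Union[str, int]]], rank: int) -> bool:
    """Return True if `rank` is covered by at least one entry of `spec`.

    Hypothetical helper for illustration only. An entry is either a single
    rank ('30') or a range ('26-30'), assumed to be inclusive on both ends.
    """
    if isinstance(spec, (str, int)):
        spec = [spec]
    for entry in spec:
        # partition() leaves `sep` empty when the entry contains no '-'.
        low, sep, high = str(entry).partition('-')
        if sep:
            if int(low) <= rank <= int(high):
                return True
        elif int(low) == rank:
            return True
    return False
```

For example, `rank_matches(['4', '26-30'], 28)` is true, while the same specification does not cover rank 25.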
@lonvia
Member Author

lonvia commented Jan 22, 2023

If somebody wants to take this up, here are some hints to get you started:

@SainiAditya1

Please assign this to me, I would like to work on this.

@lonvia
Member Author

lonvia commented Jan 30, 2023

We do not assign tasks. Just pick what interests you. Please make sure to read the GSOC instructions.

@biswajit-k
Contributor

Hi @lonvia, I was going through the above issue and couldn't clearly work out what type means. Is it referring to the names and address lists on the ProcessInfo object, or to the type field on the PlaceInfo object (and if so, what do name and address mean there)?

@lonvia
Member Author

lonvia commented Mar 2, 2023

The type indeed is meant to refer to the names and address lists in the ProcessInfo object. Depending on the type the function should clean out one or the other list.
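
To illustrate the point about type, here is a minimal sketch of a function that cleans out one list or the other. The `Name` and `Info` classes are simplified stand-ins invented for this example, not the real Nominatim classes, and `delete_matching` with its `should_delete` callback is likewise hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Name:
    """Stand-in for a name entry with its name, kind and suffix."""
    name: str
    kind: str
    suffix: Optional[str] = None


@dataclass
class Info:
    """Stand-in for the ProcessInfo object with its two lists."""
    names: List[Name] = field(default_factory=list)
    address: List[Name] = field(default_factory=list)


def delete_matching(info: Info, tag_type: str,
                    should_delete: Callable[[Name], bool]) -> None:
    """Drop all entries for which `should_delete` is true from the list
    selected by `tag_type` ('name' or 'address'); the other list is untouched.
    """
    if tag_type == 'name':
        info.names = [n for n in info.names if not should_delete(n)]
    elif tag_type == 'address':
        info.address = [a for a in info.address if not should_delete(a)]
    else:
        raise ValueError(f"type must be 'name' or 'address', got {tag_type!r}")


# Usage: remove all names of kind 'ref', leave the address parts alone.
info = Info(names=[Name('Main Street', 'name'), Name('12345', 'ref')],
            address=[Name('Berlin', 'city')])
delete_matching(info, 'name', lambda n: n.kind == 'ref')
```

If `should_delete` only looks at properties of the place and not of the individual entry, it returns the same answer for every entry, so the selected list ends up empty.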

@biswajit-k
Contributor

Thanks for clarifying, @lonvia. I have opened a PR for this. Could you please have a look?

@lonvia
Member Author

lonvia commented Mar 9, 2023

Implemented in #2993.

lonvia closed this as completed Mar 9, 2023