Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds sanitizer for Japanese addresses to correspond to block address #3122

Merged
merged 7 commits into from
Aug 1, 2023

Conversation

miku0
Copy link
Contributor

@miku0 miku0 commented Jul 26, 2023

This sanitizer is designed for the "Block address system" used in Japan. Previously, certain locations were referred to by streets, but with this change, they will now be referenced by blocks, based on the block address system.

For example, the building at W891898846 used to be referred to as "Minamikase Chūō-dōri" (highway:territory), but with this sanitizer, it will be properly referenced as "Minamikase 3-chome" (place:neighbourhood). This update ensures accurate and consistent location references in the search index.


def reconbine_housenumber(
new_address: List[PlaceName],
tmp_housenumber: str | None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You still need to use the 'old' syntax Optional[str] here, so we can support older Python versions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def convert_kanji_sequence_to_number(sequence: str) -> str:
"""Converts Kanji numbers to Arabic numbers
"""
kanji_map = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a constant which could be reused?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don' t think this is what was meant here. Rather move ' kanji_map' outside the function as a module global variable. Then it is initialised only once. Please note, that the variable name should be then in uppercase letters for being a constant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry. I was mistaken.
I moved and capitalized.

@DB
Feature: Searches in Japan
Test specifically for searches of Japanese addresses and in Japanese language.
Scenario: A block house-number is parented to the neighbourhood
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that your test needs a sanitizer, it is natural that it cannot work with the legacy tokenizer system that does not know about sanitizers. Simply add a @fail-legacy above the scenario.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much. I understood.
I added @fail-legacy.

Copy link
Member

@lonvia lonvia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some comments for smaller cosmetic issues.

What I really do miss are tests for the additional functionality of reconbine_place().

return converted

def create(_: SanitizerConfig) -> Callable[[ProcessInfo], None]:
#def create(config: SanitizerConfig) -> Callable[[ProcessInfo],None]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove all commented-out code.

new_address.append(
PlaceName(
kind='place',
name=f'{tmp_neighbourhood}',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The format string is unnecessary here. Same below.

new_address.append(item)

new_address = reconbine_housenumber(new_address,tmp_housenumber,tmp_blocknumber)
new_address = reconbine_place(new_address,tmp_neighbourhood,tmp_quarter)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please leave spaces after commas.

item.name = convert_kanji_sequence_to_number(item.name)

for item in obj.address:
item.name = convert_kanji_sequence_to_number(item.name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've given that some more thought and doing it for all name and address items might cause trouble because there is no such conversion when searching. Better to do this only when creating the combined place name in line 100.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are cases where all four tags used for combining will contain KANJI, so I adapted the function to all four, not just the 100th line. https://github.com/miku0/Nominatim/blob/4d61cc87cff9ecf5c90741a94c4b93da2f12c5ad/nominatim/tokenizer/sanitizers/tag_japanese.py#L90
Is this appropriate?

current_number = ''
converted += char
converted += current_number
return converted
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have a look at the Python function translate. It does a similar thing, as far as I can see. You just need to be careful that with translate, the mapping table needs unicode numbers as keys. So you need to define the table like this: KANJI_MAP = { ord('零'): '0', ....}.

Comment on lines +46 to +48
def test_quarter_neighbourhood(self):
res = self.run_sanitizer_on('address', quarter='kase', neighbourhood='8')
assert res == [('kase8','place')]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for your comment.
I have added a new test for reconbine_place() after this line.

Comment on lines +1 to +11
- "[𞥐𐒠߀𖭐꤀𖩠𑓐𑑐𑋰𑄶꩐꘠᱀᭐᮰᠐០᥆༠໐꧰႐᪐᪀᧐𑵐꯰᱐𑱐𑜰𑛀𑙐𑇐꧐꣐෦𑁦0𝟶𝟘𝟬𝟎𝟢₀⓿⓪⁰] > 0"
- "[𞥑𐒡߁𖭑꤁𖩡𑓑𑑑𑋱𑄷꩑꘡᱁᭑᮱᠑១᥇༡໑꧱႑᪑᪁᧑𑵑꯱᱑𑱑𑜱𑛁𑙑𑇑꧑꣑෧𑁧1𝟷𝟙𝟭𝟏𝟣₁¹①⑴⒈❶➀➊⓵] > 1"
- "[𞥒𐒢߂𖭒꤂𖩢𑓒𑑒𑋲𑄸꩒꘢᱂᭒᮲᠒២᥈༢໒꧲႒᪒᪂᧒𑵒꯲᱒𑱒𑜲𑛂𑙒𑇒꧒꣒෨𑁨2𝟸𝟚𝟮𝟐𝟤₂²②⑵⒉❷➁➋⓶] > 2"
- "[𞥓𐒣߃𖭓꤃𖩣𑓓𑑓𑋳𑄹꩓꘣᱃᭓᮳᠓៣᥉༣໓꧳႓᪓᪃᧓𑵓꯳᱓𑱓𑜳𑛃𑙓𑇓꧓꣓෩𑁩3𝟹𝟛𝟯𝟑𝟥₃³③⑶⒊❸➂➌⓷] > 3"
- "[𞥔𐒤߄𖭔꤄𖩤𑓔𑑔𑋴𑄺꩔꘤᱄᭔᮴᠔៤᥊༤໔꧴႔᪔᪄᧔𑵔꯴᱔𑱔𑜴𑛄𑙔𑇔꧔꣔෪𑁪4𝟺𝟜𝟰𝟒𝟦₄⁴④⑷⒋❹➃➍⓸] > 4"
- "[𞥕𐒥߅𖭕꤅𖩥𑓕𑑕𑋵𑄻꩕꘥᱅᭕᮵᠕៥᥋༥໕꧵႕᪕᪅᧕𑵕꯵᱕𑱕𑜵𑛅𑙕𑇕꧕꣕෫𑁫5𝟻𝟝𝟱𝟓𝟧₅⁵⑤⑸⒌❺➄➎⓹] > 5"
- "[𞥖𐒦߆𖭖꤆𖩦𑓖𑑖𑋶𑄼꩖꘦᱆᭖᮶᠖៦᥌༦໖꧶႖᪖᪆᧖𑵖꯶᱖𑱖𑜶𑛆𑙖𑇖꧖꣖෬𑁬6𝟼𝟞𝟲𝟔𝟨₆⁶⑥⑹⒍❻➅➏⓺] > 6"
- "[𞥗𐒧߇𖭗꤇𖩧𑓗𑑗𑋷𑄽꩗꘧᱇᭗᮷᠗៧᥍༧໗꧷႗᪗᪇᧗𑵗꯷᱗𑱗𑜷𑛇𑙗𑇗꧗꣗෭𑁭7𝟽𝟟𝟳𝟕𝟩₇⁷⑦⑺⒎❼➆➐⓻] > 7"
- "[𞥘𐒨߈𖭘꤈𖩨𑓘𑑘𑋸𑄾꩘꘨᱈᭘᮸᠘៨᥎༨໘꧸႘᪘᪈᧘𑵘꯸᱘𑱘𑜸𑛈𑙘𑇘꧘꣘෮𑁮8𝟾𝟠𝟴𝟖𝟪₈⁸⑧⑻⒏❽➇➑⓼] > 8"
- "[𞥙𐒩߉𖭙꤉𖩩𑓙𑑙𑋹𑄿꩙꘩᱉᭙᮹᠙៩᥏༩໙꧹႙᪙᪉᧙𑵙꯹᱙𑱙𑜹𑛉𑙙𑇙꧙꣙෯𑁯9𝟿𝟡𝟵𝟗𝟫₉⁹⑨⑼⒐❾➈➒⓽] > 9"
- "[𑜺⑩⑽⒑❿➉➓⓾] > '10'"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed the function to change Japanese Kanji numbers to Arabic numbers here.
"nominatim search --query '(Addresses containing Kanji numbers)' --format debug" This debug command will confirm that the number has been changed.

@lonvia lonvia merged commit 252fe42 into osm-search:master Aug 1, 2023
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants