-
-
Notifications
You must be signed in to change notification settings - Fork 714
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds sanitizer for Japanese addresses to correspond to block address #3122
Conversation
|
||
def reconbine_housenumber( | ||
new_address: List[PlaceName], | ||
tmp_housenumber: str | None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You still need to use the 'old' syntax Optional[str]
here, so we can support older Python versions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/miku0/Nominatim/blob/sanitizer-final/nominatim/tokenizer/sanitizers/tag_japanese.py#L31
I changed it to Optional[str].
def convert_kanji_sequence_to_number(sequence: str) -> str: | ||
"""Converts Kanji numbers to Arabic numbers | ||
""" | ||
kanji_map = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a constant which could be reused?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nominatim/tokenizer/sanitizers/kanji_utils.py
I split it into new files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don' t think this is what was meant here. Rather move ' kanji_map' outside the function as a module global variable. Then it is initialised only once. Please note, that the variable name should be then in uppercase letters for being a constant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry. I was mistaken.
I moved and capitalized.
@DB | ||
Feature: Searches in Japan | ||
Test specifically for searches of Japanese addresses and in Japanese language. | ||
Scenario: A block house-number is parented to the neighbourhood |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that your test needs a sanitizer, it is natural that it cannot work with the legacy tokenizer system that does not know about sanitizers. Simply add a @fail-legacy
above the scenario.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much. I understood.
I added @fail-legacy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added some comments for smaller cosmetic issues.
What I really do miss are tests for the additional functionality of reconbine_place()
.
return converted | ||
|
||
def create(_: SanitizerConfig) -> Callable[[ProcessInfo], None]: | ||
#def create(config: SanitizerConfig) -> Callable[[ProcessInfo],None]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove all commented-out code.
new_address.append( | ||
PlaceName( | ||
kind='place', | ||
name=f'{tmp_neighbourhood}', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The format string is unnecessary here. Same below.
new_address.append(item) | ||
|
||
new_address = reconbine_housenumber(new_address,tmp_housenumber,tmp_blocknumber) | ||
new_address = reconbine_place(new_address,tmp_neighbourhood,tmp_quarter) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please leave spaces after commas.
item.name = convert_kanji_sequence_to_number(item.name) | ||
|
||
for item in obj.address: | ||
item.name = convert_kanji_sequence_to_number(item.name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've given that some more thought and doing it for all name and address items might cause trouble because there is no such conversion when searching. Better to do this only when creating the combined place name in line 100.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are cases where all four tags used for combining will contain KANJI, so I adapted the function to all four, not just the 100th line. https://github.com/miku0/Nominatim/blob/4d61cc87cff9ecf5c90741a94c4b93da2f12c5ad/nominatim/tokenizer/sanitizers/tag_japanese.py#L90
Is this appropriate?
current_number = '' | ||
converted += char | ||
converted += current_number | ||
return converted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have a look at the Python function translate. It does a similar thing, as far as I can see. You just need to be careful that with translate, the mapping table needs unicode numbers as keys. So you need to define the table like this: KANJI_MAP = { ord('零'): '0', ....}
.
def test_quarter_neighbourhood(self): | ||
res = self.run_sanitizer_on('address', quarter='kase', neighbourhood='8') | ||
assert res == [('kase8','place')] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for your comment.
I have added a new test for reconbine_place() after this line.
- "[𞥐𐒠߀𖭐꤀𖩠𑓐𑑐𑋰𑄶꩐꘠᱀᭐᮰᠐០᥆༠໐꧰႐᪐᪀᧐𑵐꯰᱐𑱐𑜰𑛀𑙐𑇐꧐꣐෦𑁦0𝟶𝟘𝟬𝟎𝟢₀⓿⓪⁰零] > 0" | ||
- "[𞥑𐒡߁𖭑꤁𖩡𑓑𑑑𑋱𑄷꩑꘡᱁᭑᮱᠑១᥇༡໑꧱႑᪑᪁᧑𑵑꯱᱑𑱑𑜱𑛁𑙑𑇑꧑꣑෧𑁧1𝟷𝟙𝟭𝟏𝟣₁¹①⑴⒈❶➀➊⓵一] > 1" | ||
- "[𞥒𐒢߂𖭒꤂𖩢𑓒𑑒𑋲𑄸꩒꘢᱂᭒᮲᠒២᥈༢໒꧲႒᪒᪂᧒𑵒꯲᱒𑱒𑜲𑛂𑙒𑇒꧒꣒෨𑁨2𝟸𝟚𝟮𝟐𝟤₂²②⑵⒉❷➁➋⓶二] > 2" | ||
- "[𞥓𐒣߃𖭓꤃𖩣𑓓𑑓𑋳𑄹꩓꘣᱃᭓᮳᠓៣᥉༣໓꧳႓᪓᪃᧓𑵓꯳᱓𑱓𑜳𑛃𑙓𑇓꧓꣓෩𑁩3𝟹𝟛𝟯𝟑𝟥₃³③⑶⒊❸➂➌⓷三] > 3" | ||
- "[𞥔𐒤߄𖭔꤄𖩤𑓔𑑔𑋴𑄺꩔꘤᱄᭔᮴᠔៤᥊༤໔꧴႔᪔᪄᧔𑵔꯴᱔𑱔𑜴𑛄𑙔𑇔꧔꣔෪𑁪4𝟺𝟜𝟰𝟒𝟦₄⁴④⑷⒋❹➃➍⓸四] > 4" | ||
- "[𞥕𐒥߅𖭕꤅𖩥𑓕𑑕𑋵𑄻꩕꘥᱅᭕᮵᠕៥᥋༥໕꧵႕᪕᪅᧕𑵕꯵᱕𑱕𑜵𑛅𑙕𑇕꧕꣕෫𑁫5𝟻𝟝𝟱𝟓𝟧₅⁵⑤⑸⒌❺➄➎⓹五] > 5" | ||
- "[𞥖𐒦߆𖭖꤆𖩦𑓖𑑖𑋶𑄼꩖꘦᱆᭖᮶᠖៦᥌༦໖꧶႖᪖᪆᧖𑵖꯶᱖𑱖𑜶𑛆𑙖𑇖꧖꣖෬𑁬6𝟼𝟞𝟲𝟔𝟨₆⁶⑥⑹⒍❻➅➏⓺六] > 6" | ||
- "[𞥗𐒧߇𖭗꤇𖩧𑓗𑑗𑋷𑄽꩗꘧᱇᭗᮷᠗៧᥍༧໗꧷႗᪗᪇᧗𑵗꯷᱗𑱗𑜷𑛇𑙗𑇗꧗꣗෭𑁭7𝟽𝟟𝟳𝟕𝟩₇⁷⑦⑺⒎❼➆➐⓻七] > 7" | ||
- "[𞥘𐒨߈𖭘꤈𖩨𑓘𑑘𑋸𑄾꩘꘨᱈᭘᮸᠘៨᥎༨໘꧸႘᪘᪈᧘𑵘꯸᱘𑱘𑜸𑛈𑙘𑇘꧘꣘෮𑁮8𝟾𝟠𝟴𝟖𝟪₈⁸⑧⑻⒏❽➇➑⓼八] > 8" | ||
- "[𞥙𐒩߉𖭙꤉𖩩𑓙𑑙𑋹𑄿꩙꘩᱉᭙᮹᠙៩᥏༩໙꧹႙᪙᪉᧙𑵙꯹᱙𑱙𑜹𑛉𑙙𑇙꧙꣙෯𑁯9𝟿𝟡𝟵𝟗𝟫₉⁹⑨⑼⒐❾➈➒⓽九] > 9" | ||
- "[𑜺⑩⑽⒑❿➉➓⓾十] > '10'" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have changed the function to change Japanese Kanji numbers to Arabic numbers here.
"nominatim search --query '(Addresses containing Kanji numbers)' --format debug" This debug command will confirm that the number has been changed.
This sanitizer is designed for the "Block address system" used in Japan. Previously, certain locations were referred to by streets, but with this change, they will now be referenced by blocks, based on the block address system.
For example, the building at W891898846 used to be referred to as "Minamikase Chūō-dōri" (highway:territory), but with this sanitizer, it will be properly referenced as "Minamikase 3-chome" (place:neighbourhood). This update ensures accurate and consistent location references in the search index.