Moved KANJI_MAP to icu-rules
miku0 committed Jul 31, 2023
1 parent 4d61cc8 commit 67e1c7d
Showing 3 changed files with 11 additions and 44 deletions.
29 changes: 0 additions & 29 deletions nominatim/tokenizer/sanitizers/tag_japanese.py
@@ -18,25 +18,6 @@
from nominatim.tokenizer.sanitizers.config import SanitizerConfig
from nominatim.data.place_name import PlaceName

KANJI_MAP = {
    ord('零'): '0',
    ord('一'): '1',
    ord('二'): '2',
    ord('三'): '3',
    ord('四'): '4',
    ord('五'): '5',
    ord('六'): '6',
    ord('七'): '7',
    ord('八'): '8',
    ord('九'): '9'
}

def convert_kanji_sequence_to_number(sequence: str) -> str:
    """Converts Kanji numbers to Arabic numbers
    """
    converted = sequence.translate(KANJI_MAP)
    return converted

def create(_: SanitizerConfig) -> Callable[[ProcessInfo], None]:
    """Set up the sanitizer
    """
@@ -49,11 +30,6 @@ def reconbine_housenumber(
) -> List[PlaceName]:
    """ Recombine the tag of housenumber by using housenumber and blocknumber
    """
    if tmp_blocknumber:
        tmp_blocknumber = convert_kanji_sequence_to_number(tmp_blocknumber)
    if tmp_housenumber:
        tmp_housenumber = convert_kanji_sequence_to_number(tmp_housenumber)

    if tmp_blocknumber and tmp_housenumber:
        new_address.append(
            PlaceName(
@@ -87,11 +63,6 @@ def reconbine_place(
) -> List[PlaceName]:
    """ Recombine the tag of place by using neighbourhood and quarter
    """
    if tmp_neighbourhood:
        tmp_neighbourhood = convert_kanji_sequence_to_number(tmp_neighbourhood)
    if tmp_quarter:
        tmp_quarter = convert_kanji_sequence_to_number(tmp_quarter)

    if tmp_neighbourhood and tmp_quarter:
        new_address.append(
            PlaceName(
22 changes: 11 additions & 11 deletions settings/icu-rules/unicode-digits-to-decimal.yaml
@@ -1,14 +1,14 @@
- "[𞥐𐒠߀𖭐꤀𖩠𑓐𑑐𑋰𑄶꩐꘠᱀᭐᮰᠐០᥆༠໐꧰႐᪐᪀᧐𑵐꯰᱐𑱐𑜰𑛀𑙐𑇐꧐꣐෦𑁦０𝟶𝟘𝟬𝟎𝟢₀⓿⓪⁰] > 0"
- "[𞥑𐒡߁𖭑꤁𖩡𑓑𑑑𑋱𑄷꩑꘡᱁᭑᮱᠑១᥇༡໑꧱႑᪑᪁᧑𑵑꯱᱑𑱑𑜱𑛁𑙑𑇑꧑꣑෧𑁧１𝟷𝟙𝟭𝟏𝟣₁¹①⑴⒈❶➀➊⓵] > 1"
- "[𞥒𐒢߂𖭒꤂𖩢𑓒𑑒𑋲𑄸꩒꘢᱂᭒᮲᠒២᥈༢໒꧲႒᪒᪂᧒𑵒꯲᱒𑱒𑜲𑛂𑙒𑇒꧒꣒෨𑁨２𝟸𝟚𝟮𝟐𝟤₂²②⑵⒉❷➁➋⓶] > 2"
- "[𞥓𐒣߃𖭓꤃𖩣𑓓𑑓𑋳𑄹꩓꘣᱃᭓᮳᠓៣᥉༣໓꧳႓᪓᪃᧓𑵓꯳᱓𑱓𑜳𑛃𑙓𑇓꧓꣓෩𑁩３𝟹𝟛𝟯𝟑𝟥₃³③⑶⒊❸➂➌⓷] > 3"
- "[𞥔𐒤߄𖭔꤄𖩤𑓔𑑔𑋴𑄺꩔꘤᱄᭔᮴᠔៤᥊༤໔꧴႔᪔᪄᧔𑵔꯴᱔𑱔𑜴𑛄𑙔𑇔꧔꣔෪𑁪４𝟺𝟜𝟰𝟒𝟦₄⁴④⑷⒋❹➃➍⓸] > 4"
- "[𞥕𐒥߅𖭕꤅𖩥𑓕𑑕𑋵𑄻꩕꘥᱅᭕᮵᠕៥᥋༥໕꧵႕᪕᪅᧕𑵕꯵᱕𑱕𑜵𑛅𑙕𑇕꧕꣕෫𑁫５𝟻𝟝𝟱𝟓𝟧₅⁵⑤⑸⒌❺➄➎⓹] > 5"
- "[𞥖𐒦߆𖭖꤆𖩦𑓖𑑖𑋶𑄼꩖꘦᱆᭖᮶᠖៦᥌༦໖꧶႖᪖᪆᧖𑵖꯶᱖𑱖𑜶𑛆𑙖𑇖꧖꣖෬𑁬６𝟼𝟞𝟲𝟔𝟨₆⁶⑥⑹⒍❻➅➏⓺] > 6"
- "[𞥗𐒧߇𖭗꤇𖩧𑓗𑑗𑋷𑄽꩗꘧᱇᭗᮷᠗៧᥍༧໗꧷႗᪗᪇᧗𑵗꯷᱗𑱗𑜷𑛇𑙗𑇗꧗꣗෭𑁭７𝟽𝟟𝟳𝟕𝟩₇⁷⑦⑺⒎❼➆➐⓻] > 7"
- "[𞥘𐒨߈𖭘꤈𖩨𑓘𑑘𑋸𑄾꩘꘨᱈᭘᮸᠘៨᥎༨໘꧸႘᪘᪈᧘𑵘꯸᱘𑱘𑜸𑛈𑙘𑇘꧘꣘෮𑁮８𝟾𝟠𝟴𝟖𝟪₈⁸⑧⑻⒏❽➇➑⓼] > 8"
- "[𞥙𐒩߉𖭙꤉𖩩𑓙𑑙𑋹𑄿꩙꘩᱉᭙᮹᠙៩᥏༩໙꧹႙᪙᪉᧙𑵙꯹᱙𑱙𑜹𑛉𑙙𑇙꧙꣙෯𑁯９𝟿𝟡𝟵𝟗𝟫₉⁹⑨⑼⒐❾➈➒⓽] > 9"
- "[𑜺⑩⑽⒑❿➉➓⓾] > '10'"
- "[𞥐𐒠߀𖭐꤀𖩠𑓐𑑐𑋰𑄶꩐꘠᱀᭐᮰᠐០᥆༠໐꧰႐᪐᪀᧐𑵐꯰᱐𑱐𑜰𑛀𑙐𑇐꧐꣐෦𑁦０𝟶𝟘𝟬𝟎𝟢₀⓿⓪⁰零] > 0"
- "[𞥑𐒡߁𖭑꤁𖩡𑓑𑑑𑋱𑄷꩑꘡᱁᭑᮱᠑១᥇༡໑꧱႑᪑᪁᧑𑵑꯱᱑𑱑𑜱𑛁𑙑𑇑꧑꣑෧𑁧１𝟷𝟙𝟭𝟏𝟣₁¹①⑴⒈❶➀➊⓵一] > 1"
- "[𞥒𐒢߂𖭒꤂𖩢𑓒𑑒𑋲𑄸꩒꘢᱂᭒᮲᠒២᥈༢໒꧲႒᪒᪂᧒𑵒꯲᱒𑱒𑜲𑛂𑙒𑇒꧒꣒෨𑁨２𝟸𝟚𝟮𝟐𝟤₂²②⑵⒉❷➁➋⓶二] > 2"
- "[𞥓𐒣߃𖭓꤃𖩣𑓓𑑓𑋳𑄹꩓꘣᱃᭓᮳᠓៣᥉༣໓꧳႓᪓᪃᧓𑵓꯳᱓𑱓𑜳𑛃𑙓𑇓꧓꣓෩𑁩３𝟹𝟛𝟯𝟑𝟥₃³③⑶⒊❸➂➌⓷三] > 3"
- "[𞥔𐒤߄𖭔꤄𖩤𑓔𑑔𑋴𑄺꩔꘤᱄᭔᮴᠔៤᥊༤໔꧴႔᪔᪄᧔𑵔꯴᱔𑱔𑜴𑛄𑙔𑇔꧔꣔෪𑁪４𝟺𝟜𝟰𝟒𝟦₄⁴④⑷⒋❹➃➍⓸四] > 4"
- "[𞥕𐒥߅𖭕꤅𖩥𑓕𑑕𑋵𑄻꩕꘥᱅᭕᮵᠕៥᥋༥໕꧵႕᪕᪅᧕𑵕꯵᱕𑱕𑜵𑛅𑙕𑇕꧕꣕෫𑁫５𝟻𝟝𝟱𝟓𝟧₅⁵⑤⑸⒌❺➄➎⓹五] > 5"
- "[𞥖𐒦߆𖭖꤆𖩦𑓖𑑖𑋶𑄼꩖꘦᱆᭖᮶᠖៦᥌༦໖꧶႖᪖᪆᧖𑵖꯶᱖𑱖𑜶𑛆𑙖𑇖꧖꣖෬𑁬６𝟼𝟞𝟲𝟔𝟨₆⁶⑥⑹⒍❻➅➏⓺六] > 6"
- "[𞥗𐒧߇𖭗꤇𖩧𑓗𑑗𑋷𑄽꩗꘧᱇᭗᮷᠗៧᥍༧໗꧷႗᪗᪇᧗𑵗꯷᱗𑱗𑜷𑛇𑙗𑇗꧗꣗෭𑁭７𝟽𝟟𝟳𝟕𝟩₇⁷⑦⑺⒎❼➆➐⓻七] > 7"
- "[𞥘𐒨߈𖭘꤈𖩨𑓘𑑘𑋸𑄾꩘꘨᱈᭘᮸᠘៨᥎༨໘꧸႘᪘᪈᧘𑵘꯸᱘𑱘𑜸𑛈𑙘𑇘꧘꣘෮𑁮８𝟾𝟠𝟴𝟖𝟪₈⁸⑧⑻⒏❽➇➑⓼八] > 8"
- "[𞥙𐒩߉𖭙꤉𖩩𑓙𑑙𑋹𑄿꩙꘩᱉᭙᮹᠙៩᥏༩໙꧹႙᪙᪉᧙𑵙꯹᱙𑱙𑜹𑛉𑙙𑇙꧙꣙෯𑁯９𝟿𝟡𝟵𝟗𝟫₉⁹⑨⑼⒐❾➈➒⓽九] > 9"
- "[𑜺⑩⑽⒑❿➉➓⓾十] > '10'"
- "[⑪⑾⒒⓫] > '11'"
- "[⑫⑿⒓⓬] > '12'"
- "[⑬⒀⒔⓭] > '13'"
4 changes: 0 additions & 4 deletions test/python/tokenizer/sanitizers/test_tag_japanese.py
@@ -78,7 +78,3 @@ def test_housenumber_quarter(self):
    def test_housenumber_blocknumber_neighbourhood_quarter(self):
        res = self.run_sanitizer_on('address', block_number='6', housenumber='2', quarter='kase', neighbourhood='8')
        assert res == [('6-2','housenumber'),('kase8','place')]

    def test_KANJI_MAP(self):
        res = self.run_sanitizer_on('address', block_number='六', housenumber='二', quarter='kase', neighbourhood='八')
        assert res == [('6-2','housenumber'),('kase8','place')]
