Some character mappings that may need to be handled differently when searching #1970
Comments
And GitHub has a character limit on the issue text, so here's the second part: Case Alternatives (multiple codepoints)
Other Mappings Currently in Use
Punctuation and Symbols
|
And the third: Other Unsearchable Characters
Ignored Intentionally
|
Those two characters break the alternating pattern of uppercase-lowercase pairs. See issue Tatoeba#1970, section "Other Mappings Currently in Use"
The amount of work you've put into researching this is truly stunning. Are you planning to implement any changes regarding these suggestions? |
It's about time I started fixing issues rather than just piling them up. Fortunately, this one only affects a well-delineated part of the code base, so I can work on it without having to figure out how the rest of it fits together. Well, except for everything involving multiple codepoints, which will require Unicode normalization (either NFC or NFKC) to happen at some point. |
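As a rough illustration of the normalization step mentioned above, the following sketch uses Python's standard `unicodedata` module to show what NFC does to a few of the duplicate encodings listed in this issue. The sample labels and the helper names `nfc` and `codepoints` are made up for this example; it is not code from the Tatoeba code base.

```python
import unicodedata


def nfc(text: str) -> str:
    """Return the NFC (canonical composition) form of text."""
    return unicodedata.normalize("NFC", text)


def codepoints(text: str) -> str:
    """Render a string as a space-separated list of U+XXXX codepoints."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Illustrative samples taken from the kinds of characters listed in the issue.
samples = {
    "Greek alpha with oxia": "\u1F71",        # singleton, maps to U+03AC
    "CJK compatibility ideograph": "\uF967",  # singleton, maps to a unified ideograph
    "a + combining grave": "a\u0300",         # composes to U+00E0
    "Devanagari qa": "\u0958",                # composition exclusion, stays decomposed
}

for label, raw in samples.items():
    print(f"{label}: {codepoints(raw)} -> {codepoints(nfc(raw))}")
```

Note that NFC does not always produce a single codepoint: characters such as U+0958 are excluded from composition, so their normalized form is a base letter plus a combining mark. That is why the multi-codepoint cases need handling beyond a one-to-one character mapping.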
It's great that you're planning to do this work yourself. If you were going to ask someone else to do it, you would probably have to break it up and/or scale it down. |
The characters from U+31F0 ㇰ to U+31FF ㇿ are used to write Ainu. Unicode Block: https://www.unicode.org/charts/PDF/U31F0.pdf See issue Tatoeba#1970, section "Other Unsearchable Characters".
The characters from U+A000 ꀀ to U+A48C ꒌ are used to write Yi languages like Nuosu (iii). No Yi language has been added to Tatoeba yet, and the person who added the sentences using Yi syllables has not responded to [my attempt at making contact](https://tatoeba.org/eng/sentences/show/8191359#comment-1126914) so far. Adding the script anyway probably won't hurt. Unicode Block: http://unicode.org/charts/PDF/UA000.pdf See issue Tatoeba#1970, section "Other Unsearchable Characters".
These characters were historically used to write Javanese (jav). There are a few punctuation marks, which I have excluded. Unicode Block: https://www.unicode.org/charts/PDF/UA980.pdf See issue Tatoeba#1970, section "Other Unsearchable Characters".
A handful of characters seem to have been missed when the Lao script was added. Unicode Block: https://www.unicode.org/charts/PDF/U0E80.pdf See issue Tatoeba#1970, section "Other Unsearchable Characters".
Cuneiform is used to write Sumerian (sux). There are three Unicode blocks: - Cuneiform: https://www.unicode.org/charts/PDF/U12000.pdf - Cuneiform Numbers and Punctuation: https://www.unicode.org/charts/PDF/U12400.pdf - Early Dynastic Cuneiform: https://www.unicode.org/charts/PDF/U12480.pdf I omitted the punctuation. See issue Tatoeba#1970, section "Other Unsearchable Characters".
These characters are used to write Gothic (got). The existing Gothic sentences seem to use spaces between words, so using charset_table is likely appropriate. Unicode Block: https://www.unicode.org/charts/PDF/U10330.pdf See issue Tatoeba#1970, section "Other Unsearchable Characters".
These characters are used to write Old Turkish (otk). The existing Old Turkish sentences separate words with a colon, so using charset_table is likely appropriate. Writing direction is right-to-left. Unicode Block: https://www.unicode.org/charts/PDF/U10C00.pdf See issue Tatoeba#1970, section "Other Unsearchable Characters".
Warang Citi is used by some to write Ho (hoc). According to Wikipedia, Latin punctuation including spaces is used, so using charset_table is likely appropriate. The script has a range of uppercase characters, which I mapped to the lowercase ones. Unicode Block: https://www.unicode.org/charts/PDF/U118A0.pdf See issue Tatoeba#1970, sections "Case Alternatives" and "Other Unsearchable Characters".
Mongolian script is used to write Mongolian (mon) and Manchu (mnc). The existing Mongolian sentences all seem to use Cyrillic, though. Words are separated with spaces, so using charset_table is likely appropriate. There are a few punctuation marks, which I have excluded. Traditional writing direction is vertical, but supporting that would really mess with the layout. Unicode Block: https://www.unicode.org/charts/PDF/U1800.pdf See issue Tatoeba#1970, section "Other Unsearchable Characters".
Some non-punctuation characters used to write Malayalam (mal) were missing from the end of the Unicode range. Probably a typo. Unicode Block: https://www.unicode.org/charts/PDF/U0D00.pdf See issue Tatoeba#1970, section "Other Unsearchable Characters".
Of the Spacing Modifier Letters already present in the Tatoeba corpus, - ʻ (U+2BB MODIFIER LETTER TURNED COMMA) is used to write the Hawaiian Okina, so I added it. - ʼ (U+2BC MODIFIER LETTER APOSTROPHE) is used as a stand-in for a "regular" apostrophe, so I left it out. - ʿ (U+2BF MODIFIER LETTER LEFT HALF RING) is used to transcribe the Arabic Ayin, so I added it. - ˀ (U+2C0 MODIFIER LETTER GLOTTAL STOP) is used to write the glottal stop in Cayuga, so I added it. - ˈ (U+2C8 MODIFIER LETTER VERTICAL LINE) and - ˌ (U+2CC MODIFIER LETTER LOW VERTICAL LINE) are used in IPA transcriptions to mark primary and secondary stress, which I think means they're better left out. - ː (U+2D0 MODIFIER LETTER TRIANGULAR COLON) is used to mark vowel length in Ngeq, so I added it. Unicode Block: https://www.unicode.org/charts/PDF/U02B0.pdf See issue Tatoeba#1970, section "Other Unsearchable Characters".
There are five Cyrillic Unicode blocks: - Cyrillic (some combining characters in this range were missing): https://www.unicode.org/charts/PDF/U0400.pdf - Cyrillic Supplement: https://www.unicode.org/charts/PDF/U0500.pdf - Cyrillic Extended-A: https://www.unicode.org/charts/PDF/U2DE0.pdf - Cyrillic Extended-B: https://www.unicode.org/charts/PDF/UA640.pdf - Cyrillic Extended-C: https://www.unicode.org/charts/PDF/U1C80.pdf See issue Tatoeba#1970, section "Other Unsearchable Characters".
The Greek Extended character set contains accented characters used to write Ancient Greek (grc) with polytonic orthography. Unicode Block: https://www.unicode.org/charts/PDF/U1F00.pdf There were also some misalignments to fix in the Greek and Coptic block. See issue Tatoeba#1970, sections "Duplicate Encodings", "Near Duplicates", "Case Alternatives", "Other Mappings Currently in Use" and "Other Unsearchable Characters".
Most Unicode blocks containing Latin characters were already covered, but with some missing characters here and there. There are also two new Unicode blocks: - IPA Extensions: https://www.unicode.org/charts/PDF/U0250.pdf - Phonetic Extensions: https://www.unicode.org/charts/PDF/U1D00.pdf See issue Tatoeba#1970, section "Other Unsearchable Characters".
Lowercase Cherokee characters were added to Unicode in 2015. There are two Unicode blocks now: - Cherokee: https://www.unicode.org/charts/PDF/U13A0.pdf - Cherokee Supplement: https://www.unicode.org/charts/PDF/UAB70.pdf See issue Tatoeba#1970, section "Case Alternatives".
The Armenian block was previously listed as a single range. However, it contains both uppercase and lowercase characters, as well as some punctuation (which I have excluded in the updated definition). Unicode block: https://www.unicode.org/charts/PDF/U0530.pdf See issue Tatoeba#1970, section "Case Alternatives".
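To make the uppercase-to-lowercase range mappings described in these commit messages more concrete, here is a small sketch that derives a `charset_table`-style range mapping from Python's own case data. The function name `case_range_mapping` is hypothetical, and the output is only meant to illustrate the `U+A..U+B->U+C..U+D` range syntax; it is not the definition that was actually committed.

```python
def case_range_mapping(first: int, last: int) -> str:
    """Build a 'U+A..U+B->U+C..U+D' mapping for a contiguous uppercase range.

    Raises ValueError if the lowercase forms are not single codepoints at one
    constant offset from the uppercase forms, because a simple range mapping
    cannot express that.
    """
    offsets = set()
    for cp in range(first, last + 1):
        lower = chr(cp).lower()
        if len(lower) != 1 or lower == chr(cp):
            raise ValueError(f"U+{cp:X} has no single-codepoint lowercase form")
        offsets.add(ord(lower) - cp)
    if len(offsets) != 1:
        raise ValueError("lowercase forms are not at a constant offset")
    offset = offsets.pop()
    return f"U+{first:X}..U+{last:X}->U+{first + offset:X}..U+{last + offset:X}"


# Armenian capital letters U+0531..U+0556 map onto U+0561..U+0586.
print(case_range_mapping(0x531, 0x556))
```

Deriving the mapping from case data rather than typing the ranges by hand makes it harder to include punctuation or unassigned codepoints by accident, since those fail the lowercase check.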
@Yorwba What’s the status of this issue? Are there some remaining mappings that you want to implement? Is the status of checkboxes relevant? By the way, Andreas already Unicode-normalized the sentences, but I think we still need to normalize the query text that is sent to Manticore when searching. |
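The point about normalizing the query text can be sketched as follows. This assumes the stored sentences are in NFC, as described above; the function `normalize_query` and the toy dictionary standing in for the search index are hypothetical and only show why both sides need the same normalization form.

```python
import unicodedata


def normalize_query(query: str) -> str:
    """Normalize user-supplied query text to NFC before it is sent to search."""
    return unicodedata.normalize("NFC", query)


# Toy stand-in for an index whose keys were normalized to NFC at import time.
index = {unicodedata.normalize("NFC", "caf\u00e9"): [42]}

# The same word typed with a combining acute accent (decomposed form).
raw_query = "cafe\u0301"

print(index.get(raw_query))                   # lookup without normalization
print(index.get(normalize_query(raw_query)))  # lookup after normalization
```

Without the normalization step the decomposed query is a different string from the indexed key, so an exact lookup misses even though both render identically.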
The checkboxes reflect where I've created a PR or decided that the current behavior probably doesn't need to be changed.

I'd been working my way up the list of "Other Unsearchable Characters", because those were mostly scripts that were missing entirely. Once I hit the missing characters from the Arabic script, things got a bit complicated. I asked a few speakers of affected languages which behavior they'd prefer and got feedback regarding Arabic, Persian and Ottoman Turkish. In the case of Arabic vowel marks, the ideal behavior would be that a word with vowel marks matches one without, but not another word with different vowel marks. But that's not a transitive relation, so it can't be implemented with a simple index lookup. Also, the set of characters that are considered equivalent differs across languages. Parts of this would probably best be handled by a stemmer. Since we now have stemming for Arabic, the situation should have improved a bit, but I need to check. See also issues #1595 (Arabic) and #1880 (Ottoman Turkish).

Unicode normalization to NFC would take care of all duplicate encodings in one fell swoop, but it seems like the cleaning function hasn't been applied to sentences already in the database. E.g. the

Then there's the issue of near duplicates that are only the same in NFKC. Those are considered compatibility equivalents by Unicode, but may have different appearance and are sometimes used differently. Some of them also normalize to multiple codepoints, e.g. the Dutch

Some of the other parts aren't that technically difficult, but I'm not sure what the best option is. E.g. punctuation is less problematic for languages that use the

I do plan to eventually get this issue fully cleaned up, but I don't have a specific timeline planned. |
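The non-transitivity of the desired Arabic matching can be shown with a small sketch. It assumes that "vowel marks" means the codepoints U+064B to U+0652 and uses a hypothetical `matches` predicate; a real solution would likely live in the search engine or a stemmer rather than in code like this.

```python
# Assumption: treat U+064B (fathatan) through U+0652 (sukun) as vowel marks.
HARAKAT = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_marks(word: str) -> str:
    """Remove Arabic vowel marks from a word."""
    return "".join(ch for ch in word if ch not in HARAKAT)


def matches(query: str, indexed: str) -> bool:
    """Marked words match their unmarked form, but not differently marked words."""
    if query == indexed:
        return True
    bare_query, bare_indexed = strip_marks(query), strip_marks(indexed)
    if bare_query != bare_indexed:
        return False
    # Same consonant skeleton: accept only if at least one side carries no marks.
    return query == bare_query or indexed == bare_indexed


bare = "\u0643\u062A\u0628"                      # k-t-b without marks
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"  # with fatha marks
kutub = "\u0643\u064F\u062A\u064F\u0628"         # with damma marks

print(matches(kataba, bare))   # marked vs. unmarked
print(matches(bare, kutub))    # unmarked vs. marked
print(matches(kataba, kutub))  # two different markings
```

Because the first two comparisons succeed and the third fails, no single normalized key can represent the relation: any key shared by `kataba` and `bare`, and by `bare` and `kutub`, would also be shared by `kataba` and `kutub`.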
Unicode Block: https://www.unicode.org/charts/PDF/U1700.pdf See issue Tatoeba#1970, section "Other Unsearchable Characters".
Unicode Block: https://www.unicode.org/charts/PDF/U13000.pdf See issue Tatoeba#1970, section "Other Unsearchable Characters".
I just started working with the data and stumbled upon this Unicode normalization problem. Along the way I created a simple script that detects duplicate sentences. I hope this is helpful somehow.
#!/usr/bin/env bash
set -Eeuo pipefail # https://vaneyckt.io/posts/safer_bash_scripts_with_set_euxo_pipefail/#:~:text=set%20%2Du,is%20often%20highly%20desirable%20behavior.
set -x # print all commands
shopt -s expand_aliases
export LC_ALL=en_US.UTF-8
# https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization
# https://www.effectiveperlprogramming.com/2011/09/normalize-your-perl-source/
alias nfd="perl -MUnicode::Normalize -CS -ne 'print NFD(\$_)'"
# Normalize different unicode space characters to the same space
# https://stackoverflow.com/a/43640405
alias normalize_spaces="perl -CSDA -plE 's/[^\\S\\t]/ /g'"
function normalize_unicode() {
cat - | normalize_spaces | nfd
}
OUT="out"
TRANS_OUT="$OUT/translations"
mkdir -p "$TRANS_OUT"
# https://tatoeba.org (Multilingual collaborative sentence translation database)
# https://tatoeba.org/eng/downloads
(cd "$TRANS_OUT"; wget --no-verbose --show-progress --timestamping "https://downloads.tatoeba.org/exports/sentences.tar.bz2")
(cd "$TRANS_OUT"; wget --no-verbose --show-progress --timestamping "https://downloads.tatoeba.org/exports/links.tar.bz2")
[ -s "$TRANS_OUT/sentences.tsv" ] || (tar xOjf "$TRANS_OUT/sentences.tar.bz2" sentences.csv | normalize_unicode > "$TRANS_OUT/sentences.tsv")
[ -s "$TRANS_OUT/links.tsv" ] || (tar xOjf "$TRANS_OUT/links.tar.bz2" links.csv | normalize_unicode > "$TRANS_OUT/links.tsv")
SQLITEDB="$TRANS_OUT/translations.sqlite"
if [ ! -s "$TRANS_OUT/translations.sqlite" ]; then
# some sentences referenced by links might be invalid. That's ok, because some sentences were deduplicated, for example https://tatoeba.org/eng/sentences/show/3094
rm -f "$SQLITEDB"
cat << EOF | sqlite3 -batch "$SQLITEDB"
.bail on
PRAGMA foreign_keys = ON;
SELECT "importing all sentences...";
CREATE TABLE sentences(
sentenceid INTEGER NOT NULL PRIMARY KEY,
lang TEXT NOT NULL,
sentence TEXT NOT NULL
);
CREATE INDEX sentences_sentence_lang ON sentences (lang);
CREATE INDEX sentences_sentence_sentence ON sentences (sentence);
.mode ascii
.separator "\t" "\n"
.import '$TRANS_OUT/sentences.tsv' sentences
SELECT "importing all links...";
CREATE TABLE links(
sentenceid INTEGER NOT NULL,
translationid INTEGER NOT NULL,
PRIMARY KEY (sentenceid, translationid)
FOREIGN KEY (sentenceid)
REFERENCES sentences (sentenceid)
ON UPDATE CASCADE
ON DELETE CASCADE
FOREIGN KEY (translationid)
REFERENCES sentences (sentenceid)
ON UPDATE CASCADE
ON DELETE CASCADE
);
.mode ascii
.separator "\t" "\n"
.import '$TRANS_OUT/links.tsv' links
.headers off
SELECT "vacuum...";
VACUUM;
SELECT "checking database integrity...";
PRAGMA integrity_check;
EOF
fi
# translate single sentence
# sqlite> select * from sentences s JOIN links l ON s.sentenceid = l.sentenceid JOIN sentences s2 ON s2.sentenceid = l.translationid where s.sentence='D''accord.' and s2.lang = 'deu' limit 10;
# finds duplicate sentences
sqlite3 -batch "$SQLITEDB" "select sentence, GROUP_CONCAT(sentenceid) from sentences GROUP BY sentence,lang HAVING COUNT(sentenceid) > 1 LIMIT 50"
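The grouping idea behind the final query of the script above can also be sketched in a few lines of Python, which makes it easier to see why normalization has to happen before grouping. The rows below are made-up sample data, not real Tatoeba sentences, and `find_duplicates` is a hypothetical helper. The sketch uses NFC where the script uses NFD; either works as long as it is applied consistently.

```python
import unicodedata
from collections import defaultdict


def find_duplicates(rows):
    """Group (id, lang, text) rows by language and normalized text.

    Returns a dict mapping (lang, normalized_text) to the list of sentence ids
    for every group that contains more than one sentence.
    """
    groups = defaultdict(list)
    for sentence_id, lang, text in rows:
        key = (lang, unicodedata.normalize("NFC", text))
        groups[key].append(sentence_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Made-up sample rows: ids 2 and 3 differ only in how the accent is encoded.
rows = [
    (1, "fra", "D'accord."),
    (2, "fra", "Caf\u00e9"),
    (3, "fra", "Cafe\u0301"),
    (4, "eng", "Caf\u00e9"),
]

for (lang, text), ids in find_duplicates(rows).items():
    print(lang, text, ids)
```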
Wall thread: https://tatoeba.org/eng/wall/show_message/33106#message_33106
Alan suggested I create a GitHub ticket and add more information, so here it is. I used spoilers to hide most of the gruesome details by default and added checkboxes to each group of characters so we have some chance of keeping track of the progress that will hopefully be made.
I'd like to apologize to anyone who receives an email notification about this in a client that doesn't support the spoiler tags.
EDIT: Since GitHub doesn't like it when people post huge amounts of text in the issue description, I had to abbreviate a bit. (ex:6674905,uses:16) refers to a character appearing in 16 different sentences, one of which is 6674905.
Duplicate Encodings a.k.a. Unicode NFC
ά → ά έ → έ ή → ή ί → ί ό → ό ύ → ύ ώ → ώ Affects: Ancient Greek [grc]
不 → 不 粒 → 粒 行 → 行 Affects: Literary Chinese [lzh], Cantonese [yue]
Duplicate Encodings (multiple codepoints)
à → à á → á â → â ã → ã ä → ä ả → ả å → å ạ → ạ ć → ć ĉ → ĉ ç → ç è → è é → é ê → ê ẹ → ẹ ę → ę ĝ → ĝ ḥ → ḥ ì → ì í → í ỉ → ỉ ị → ị ĵ → ĵ ň → ň ò → ò ó → ó õ → õ ö → ö ỏ → ỏ ọ → ọ ǫ → ǫ ṛ → ṛ ŝ → ŝ ṣ → ṣ ş → ş ṭ → ṭ ù → ù ú → ú ũ → ũ ŭ → ŭ ü → ü ủ → ủ ụ → ụ ý → ý ẓ → ẓ ầ → ầ ấ → ấ ẫ → ẫ ậ → ậ ề → ề ế → ế ễ → ễ ệ → ệ ố → ố ỗ → ỗ ổ → ổ ằ → ằ ắ → ắ ẳ → ẳ ặ → ặ ờ → ờ ớ → ớ ở → ở ợ → ợ ừ → ừ ứ → ứ ữ → ữ ử → ử ự → ự Affects: Finnish [fin], Interlingue [ile], Spanish [spa], Turkmen [tuk], Russian [rus], Esperanto [epo], Swedish [swe], Yoruba [yor], Tatar [tat], Shuswap [shs], Hungarian [hun], Italian [ita], Lingala [lin], Cayuga [cay], French [fra], Vietnamese [vie], Berber [ber], Navajo [nav], Serbian [srp], Kabyle [kab], Turkish [tur]
й → й Affects: Bashkir [bak]
آ → آ أ → أ ؤ → ؤ Affects: Arabic [ara], Urdu [urd], Persian [pes]
ऱ → ऱ क़ → क़ ख़ → ख़ ग़ → ग़ ज़ → ज़ ड़ → ड़ ढ़ → ढ़ फ़ → फ़ Affects: Marathi [mar], Hindi [hin], Garhwali [gbm]
ড় → ড় ঢ় → ঢ় য় → য় Affects: Bengali [ben], Assamese [asm]
ਸ਼ → ਸ਼ ਖ਼ → ਖ਼ ਗ਼ → ਗ਼ ਜ਼ → ਜ਼ ਫ਼ → ਫ਼ Affects: Punjabi (Eastern) [pan]
ோ → ோ Affects: Tamil [tam]
ೀ → ೀ ೊ → ೊ ೋ → ೋ ೇ → ೇ Affects: Kannada [kan]
ോ → ോ Affects: Malayalam [mal]
יִ → יִ ײַ → ײַ שׂ → שׂ אַ → אַ אָ → אָ וּ → וּ כּ → כּ פּ → פּ תּ → תּ בֿ → בֿ כֿ → כֿ פֿ → פֿ Affects: Hebrew [heb], Yiddish [yid]
ָֹ → ָֹ ְּ → ְּ ֳּ → ֳּ ִּ → ִּ ֵּ → ֵּ ֶּ → ֶּ ַּ → ַּ ָּ → ָּ ֹּ → ֹּ ֻּ → ֻּ ְׁ → ְׁ ִׁ → ִׁ ֶׁ → ֶׁ ַׁ → ַׁ ָׁ → ָׁ ֹׁ → ֹׁ ֻׁ → ֻׁ ְׂ → ְׂ ִׂ → ִׂ ֵׂ → ֵׂ ָׂ → ָׂ ֹׂ → ֹׂ َّ → َّ ُّ → ُّ ِّ → ِّ ़् → ़् ့် → ့် Affects: Arabic [ara], Persian [pes], North Levantine Arabic [apc], Hindi [hin], Yiddish [yid], Hebrew [heb], Algerian Arabic [arq], Burmese [mya]
Near Duplicates a.k.a. Unicode NFKC
ª → a º → o Affects: Finnish [fin], Esperanto [epo], Lingua Franca Nova [lfn], German [deu], English [eng], Japanese [jpn], French [fra], Italian [ita], Turkish [tur], Danish [dan], Ukrainian [ukr], Spanish [spa], Interlingua [ina], Portuguese [por], Russian [rus]
⁰ → 0 ⁸ → 8 ⁿ → n Affects: Danish [dan], Russian [rus], Portuguese [por], French [fra], German [deu], Finnish [fin], Esperanto [epo], Ukrainian [ukr], Japanese [jpn], English [eng], Choctaw [cho]
₁ → 1 ₂ → 2 ₃ → 3 ₄ → 4 ₈ → 8 ₙ → n Affects: Danish [dan], Thai [tha], Esperanto [epo], Macedonian [mkd], Hungarian [hun], French [fra], Turkish [tur], Italian [ita], Czech [ces], Japanese [jpn], Dutch [nld], Finnish [fin], English [eng], Marathi [mar], Spanish [spa], Russian [rus], Kabyle [kab], Interlingua [ina], Portuguese [por], Welsh [cym], German [deu], Basque [eus], Ukrainian [ukr], Vietnamese [vie]
① → 1 ② → 2 Affects: Japanese [jpn]
𝑎 → a 𝑏 → b 𝑐 → c 𝑒 → e 𝑖 → i 𝑘 → k 𝑚 → m 𝑛 → n 𝑟 → r 𝑥 → x 𝑦 → y 𝘨 → g 𝜀 → ε 𝜋 → π Affects: Spanish [spa], Esperanto [epo], Russian [rus], German [deu]
ℎ → h Affects: German [deu]
ℵ → א Affects: German [deu]
ʰ → h ʷ → w ⵯ → ⵡ Affects: Kabyle [kab], Waray [war], Berber [ber], English [eng], Khmer [khm], Ngeq [ngt]
ſ → s Affects: Middle French [frm]
ﮐ → ک ﺋ → ئ ﺎ → ا ﺣ → ح ﺹ → ص ﻊ → ع ﻋ → ع ﻞ → ل ﻠ → ل ﻣ → م ﻪ → ه Affects: Ottoman Turkish [ota]
⺟ → 母 ⼀ → 一 ⾯ → 面 ⾷ → 食 Affects: Min Nan Chinese [nan]
µ → μ Affects: Greek [ell]
Near Duplicates (multiple codepoints)
ij → ij և → եւ fi → fi ﻹ → لإ ﻻ → لا ﻼ → لا Affects: Arabic [ara], Armenian [hye], Ottoman Turkish [ota], Dutch [nld], Irish [gle]
㌔ → キロ ㌘ → グラム Affects: Japanese [jpn]
ำ → ํา Affects: Thai [tha]
ໜ → ຫນ ໝ → ຫມ Affects: Lao [lao]
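The difference between the NFC and NFKC groups above, including the cases that expand to multiple codepoints, can be checked with Python's standard `unicodedata` module. This is only a sketch with a handful of characters taken from the lists; the helper name `nfkc` is made up for the example.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Return the NFKC (compatibility composition) form of text."""
    return unicodedata.normalize("NFKC", text)


# A few near duplicates from the lists above, written as escapes for clarity.
samples = [
    "\u00B5",  # MICRO SIGN
    "\u017F",  # LATIN SMALL LETTER LONG S
    "\u0133",  # LATIN SMALL LIGATURE IJ
    "\uFB01",  # LATIN SMALL LIGATURE FI
    "\u3314",  # SQUARE KIRO
]

for ch in samples:
    folded = nfkc(ch)
    unchanged_by_nfc = unicodedata.normalize("NFC", ch) == ch
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
        f"NFKC -> {folded!r} ({len(folded)} codepoint(s)), "
        f"unchanged by NFC: {unchanged_by_nfc}"
    )
```

Characters whose NFKC form has more than one codepoint cannot be expressed as a single-character mapping, so they would need normalization on both the indexed text and the query rather than a table entry.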
Case Alternatives a.k.a. fixed points under iterative application of Unicode NFKC, uppercasing and lowercasing using ICU
H → h I → ı J → j U → u W → w Á → á Â → â Ä → ä Å → å É → é Ú → ú Ā → ā Č → č Ē → ē Ġ → ġ Ĥ → ĥ Ī → ī İ → i ı → i Ĵ → ĵ Ļ → ļ Ľ → ľ Ł → ł Ņ → ņ Ŝ → ŝ Ū → ū ℂ → c ℃ → c ℕ → n ℝ → r Ꞌ → ꞌ 𝐴 → a 𝐵 → b 𝐾 → k 𝑁 → n 𝑋 → x Affects: Polish [pol], Finnish [fin], Ottoman Turkish [ota], English [eng], Japanese [jpn], Kashmiri [kas], Ido [ido], Dutch [nld], Danish [dan], Spanish [spa], Lojban [jbo], Portuguese [por], Russian [rus], Turkmen [tuk], Bashkir [bak], Esperanto [epo], Old East Slavic [orv], Latvian [lvs], Croatian [hrv], Talysh [tly], Latin [lat], Tatar [tat], Hungarian [hun], Unknown Language, Italian [ita], Lower Sorbian [dsb], Greek [ell], Chamorro [cha], Zaza [zza], German [deu], French [fra], Kashubian [csb], Czech [ces], Berber [ber], Slovak [slk], Navajo [nav], Upper Sorbian [hsb], Azerbaijani [aze], Turkish [tur], Crimean Tatar [crh], Chuvash [chv]
Ԑ → ԑ Affects: Kabyle [kab]
¨ → ̈ ´ → ́ ˙ → ̇ ˚ → ̊ Affects: Finnish [fin], Guarani [grn], Low German (Low Saxon) [nds], English [eng], Dutch [nld], Spanish [spa], Portuguese [por], Esperanto [epo], Old Tupi [tpw], Ukrainian [ukr], Italian [ita], Catalan [cat], Greek [ell], Mandarin Chinese [cmn], German [deu], French [fra], Czech [ces], Berber [ber], Slovak [slk], Ancient Greek [grc], Turkish [tur], Occitan [oci]
𑢩 → 𑣉 𑢮 → 𑣎 𑢯 → 𑣏 Affects: Ho [hoc]
ͅ → ι ΄ → ́ Ά → α Έ → ε Ή → ή Ί → ι Ό → ο Ύ → υ Ώ → ω ΐ → ϊ ά → α έ → ε ί → ι ς → σ ό → ο ύ → υ ώ → ω ἀ → α ἁ → α ἄ → α Ἀ → ἀ Ἄ → α Ἄ → ἄ Ἆ → ἆ ἐ → ε ἔ → ε ἕ → ε Ἐ → ἐ Ἑ → ἑ Ἓ → ε Ἓ → ἓ Ἔ → ε Ἔ → ἔ ἠ → η ἡ → η ἦ → ή Ἡ → ἡ Ἢ → ἢ Ἥ → ἥ Ἦ → ἦ ἰ → ι ἱ → ι ἶ → ι Ἰ → ἰ Ἱ → ἱ ὁ → ο ὅ → ο Ὀ → ὀ Ὁ → ο Ὁ → ὁ Ὃ → ὃ Ὄ → ὄ Ὅ → ὅ ὐ → υ ὔ → υ Ὑ → ὑ Ὕ → ὕ ὠ → ω ὡ → ω Ὡ → ὡ Ὤ → ὤ Ὦ → ὦ ὰ → α ὲ → ε έ → ε ὴ → ή ὶ → ι ί → ι ὸ → ο ὺ → υ ὼ → ω ώ → ω ᾶ → α ᾽ → ̓ ᾿ → ̓ ῆ → ή ῖ → ι ῦ → υ ῶ → ω ῾ → ̔ Affects: Ancient Greek [grc], Greek [ell], Portuguese [por]
Ա → ա Բ → բ Գ → գ Դ → դ Ե → ե Զ → զ Է → է Ը → ը Թ → թ Ժ → ժ Ի → ի Լ → լ Խ → խ Ծ → ծ Կ → կ Հ → հ Ձ → ձ Ղ → ղ Ճ → ճ Մ → մ Յ → յ Ն → ն Շ → շ Ո → ո Չ → չ Պ → պ Ջ → ջ Ս → ս Վ → վ Տ → տ Ց → ց Ւ → ւ Փ → փ Ք → ք Օ → օ Ֆ → ֆ Affects: Armenian [hye]
Ꭰ → ꭰ Ꭱ → ꭱ Ꭴ → ꭴ Ꭶ → ꭶ Ꭷ → ꭷ Ꭸ → ꭸ Ꭹ → ꭹ Ꭺ → ꭺ Ꭼ → ꭼ Ꭽ → ꭽ Ꭿ → ꭿ Ꮂ → ꮂ Ꮃ → ꮃ Ꮅ → ꮅ Ꮆ → ꮆ Ꮈ → ꮈ Ꮎ → ꮎ Ꮑ → ꮑ Ꮒ → ꮒ Ꮓ → ꮓ Ꮕ → ꮕ Ꮖ → ꮖ Ꮗ → ꮗ Ꮙ → ꮙ Ꮛ → ꮛ Ꮜ → ꮜ Ꮝ → ꮝ Ꮟ → ꮟ Ꮡ → ꮡ Ꮢ → ꮢ Ꮣ → ꮣ Ꮤ → ꮤ Ꮥ → ꮥ Ꮧ → ꮧ Ꮨ → ꮨ Ꮩ → ꮩ Ꮪ → ꮪ Ꮭ → ꮭ Ꮰ → ꮰ Ꮱ → ꮱ Ꮲ → ꮲ Ꮳ → ꮳ Ꮵ → ꮵ Ꮷ → ꮷ Ꮸ → ꮸ Ꮹ → ꮹ Ꮺ → ꮺ Ꮻ → ꮻ Ꮼ → ꮼ Ꮿ → ꮿ Ᏸ → ᏸ Ᏹ → ᏹ Ᏺ → ᏺ Ᏼ → ᏼ Affects: Cherokee [chr]
゜ → ゚ Affects: Japanese [jpn]
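The "fixed point under iterative application" idea behind the Case Alternatives lists can be sketched like this. Note the assumptions: the sketch uses Python's built-in `str.upper` and `str.lower` instead of ICU, so individual results may differ from the lists above (for example for the Turkish dotted and dotless i), and the function name `fold_to_fixed_point` and the iteration limit are made up for the example.

```python
import unicodedata


def fold_to_fixed_point(text: str, max_rounds: int = 10) -> str:
    """Apply NFKC, uppercasing and lowercasing repeatedly until nothing changes.

    Uses Python's casing rules as a stand-in for ICU. Raises RuntimeError if no
    fixed point is reached within max_rounds.
    """
    current = text
    for _ in range(max_rounds):
        step = unicodedata.normalize("NFKC", current)
        step = unicodedata.normalize("NFKC", step.upper())
        step = unicodedata.normalize("NFKC", step.lower())
        if step == current:
            return current
        current = step
    raise RuntimeError(f"no fixed point reached for {text!r}")


for ch in ["H", "\u2102", "\u0531", "\u13A0"]:
    folded = fold_to_fixed_point(ch)
    print(f"U+{ord(ch):04X} -> " + " ".join(f"U+{ord(c):04X}" for c in folded))
```

Iterating matters because a single pass is not always stable: a compatibility character may first normalize to an uppercase letter, which only reaches its lowercase representative after the casing steps are applied.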