divvun-normaliser #44
The following works fine without
echo 'Man vuoras: 23' | hfst-tokenise -g tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst \
| vislcg3 -g tools/tokenisers/mwe-dis.bin | cg-mwesplit
"<Man>"
"Man" N Prop Sem/Plc Sg Nom <W:0.0>
"Man" N Prop Sem/Sur Sg Nom <W:0.0>
"man" Adv <W:0.0>
"mij" Pron Interr Sg Gen <W:0.0>
"mij" Pron Interr Sg Ill Attr <W:0.0>
"mij" Pron Interr Sg Ine Attr <W:0.0>
"mij" Pron Rel Sg Gen <W:0.0>
"mij" Pron Rel Sg Ill Attr <W:0.0>
"mij" Pron Rel Sg Ine Attr <W:0.0>
:
"<vuoras>"
"vuoras" A Attr <W:0.0>
"vuoras" A Sg Nom <W:0.0>
"vuoras" Err/Orth A Attr <W:0.0>
"vuoras" Err/Orth A Sg Nom <W:0.0>
"vuorrat" Ex/V IV Der/st V Ind Prs Err/Orth Sg3 <W:0.0>
"vuorrat" Ex/V IV Der/st V Ind Prs Sg3 <W:0.0>
"<:>"
":" CLB <W:0.0>
:
"<23>"
"23" A Arab Ord Attr CLBfinal <W:0.0>
"23" Num Arab Sg Ela Attr <W:0.0>
"23" Num Arab Sg Gen <W:0.0>
"23" Num Arab Sg Ill Attr <W:0.0>
"23" Num Arab Sg Ine Attr <W:0.0>
"23" Num Arab Sg Nom <W:0.0>
"23" Num Sem/ID <W:0.0>
:\n
But with
echo 'Man vuoras: 23' | hfst-tokenise -g tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst \
| vislcg3 -g tools/tokenisers/mwe-dis.bin \
| cg-mwesplit \
| divvun-normaliser -a src/analyser-gt-desc.hfst -n tools/tts/transcriptor-gt-desc.hfst -g src/generator-gt-norm.hfst
libdivvun: ERROR: HfstException.
"<Man>"
:
"<vuoras>"
"<:>"
:
"<23>"
:\n
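The failing run above keeps every wordform line but drops all readings. A minimal Python sketch for spotting that symptom in a saved CG stream could look like the following. It is only a diagnostic aid, not part of libdivvun; the function name `cohorts_without_readings` is made up for illustration, and it assumes that wordform lines look like `"<...>"` and that any other line starting with a double quote (after optional indentation) is a reading.

```python
import re

# A cohort starts with a wordform line such as "<vuoras>".
WORDFORM = re.compile(r'^"<(.*)>"\s*$')


def cohorts_without_readings(cg_text):
    """Return the wordforms of cohorts that have no reading lines."""
    empty = []
    current = None        # wordform of the cohort currently being read
    has_reading = False
    for line in cg_text.splitlines():
        match = WORDFORM.match(line)
        if match:
            # A new cohort begins; report the previous one if it was empty.
            if current is not None and not has_reading:
                empty.append(current)
            current, has_reading = match.group(1), False
        elif line.lstrip().startswith('"'):
            # Any other quoted line is treated as a reading or sub-reading.
            has_reading = True
        # Blank-marker lines (starting with ":") are ignored.
    if current is not None and not has_reading:
        empty.append(current)
    return empty


if __name__ == "__main__":
    failing = '"<Man>"\n:\n"<vuoras>"\n"<:>"\n:\n"<23>"\n:\\n\n'
    print(cohorts_without_readings(failing))
    # ['Man', 'vuoras', ':', '23']
```

Run on the output of the working pipeline, the same function should return an empty list, which makes it easy to compare the two runs.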
It seems I didn't manage to set the default for
Pushed a bit more debugging; it seems we need hfstol's to lookup_fd:
Nice progress 🙂 @unhammer are there any CG syntax restrictions on the transcribed string,
A case we haven't considered: dynamic compounds, i.e. cohorts with sub-readings. There are two considerations:
echo 1800-lågon | ./tools/tts/modes/smj-txt2ipa.mode
"<1800-lågon>"
	"lågos" N Sem/Dummytag Ess <W:0.0> @HNOUN #1->0 "1800-lɔkon"phon
		"1800" Num Cmp/Hyph Cmp <W:0.0> #1->0 "1800-lɔkon"phon
	"låhko" N Sem/Amount Sg Ine <W:0.0> @HNOUN #1->0 "1800-lɔkon"phon
		"1800" Num Cmp/Hyph Cmp <W:0.0> #1->0 "1800-lɔkon"phon
	"lågos" N Sem/Dummytag Ess <W:0.0> @HNOUN #1->0 "1800-lɔkon"phon
		"1800" Num Cmp/OblHyph Cmp <W:0.0> #1->0 "1800-lɔkon"phon
	"låhko" N Sem/Amount Sg Ine <W:0.0> @HNOUN #1->0 "1800-lɔkon"phon
		"1800" Num Cmp/OblHyph Cmp <W:0.0> #1->0 "1800-lɔkon"phon
:\n
If we could normalize
Perhaps the best solution would be to not change the basic cohort structure at all, i.e. that we do NOT add the original lemma as a subreading. Instead I suggest that we store the original in a tag string along the lines of the
@flammie could you have a look at this? I added the new tasks to the task list in the initial comment.
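To make the tag-string suggestion concrete, here is a rough Python sketch of what "keep the cohort structure, store the original in a tag" could mean for a single reading line. It is not how divvun-normaliser does it: the function name and the `origlemma` tag suffix are hypothetical placeholders (the tag actually proposed is not shown in this thread), and `NORMALISED` stands in for whatever lemma the normaliser would produce.

```python
import re

# Optional indentation, a quoted lemma, then the rest of the reading line.
READING = re.compile(r'^(\s*)"([^"]*)"(.*)$')


def replace_lemma_keep_original(reading_line, new_lemma, tag_suffix="origlemma"):
    """Swap the lemma of one reading and append the old lemma as a tag.

    The indentation is left untouched, so a sub-reading stays a sub-reading
    and no extra line is added to the cohort.
    NOTE: tag_suffix is a hypothetical name chosen for this sketch.
    """
    match = READING.match(reading_line)
    if match is None:
        raise ValueError("not a reading line: %r" % reading_line)
    indent, old_lemma, rest = match.groups()
    if old_lemma.startswith("<") and old_lemma.endswith(">") and not rest.strip():
        raise ValueError("wordform line, not a reading: %r" % reading_line)
    # Same shape as the existing "..."phon tags: a quoted string plus a suffix.
    return '%s"%s"%s "%s"%s' % (indent, new_lemma, rest, old_lemma, tag_suffix)


if __name__ == "__main__":
    line = '\t\t"1800" Num Cmp/Hyph Cmp <W:0.0>'
    print(replace_lemma_keep_original(line, "NORMALISED"))
```

Because the line count and indentation are unchanged, anything downstream that relies on the reading and sub-reading layout would see the same cohort shape as before, with one extra tag per rewritten reading.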
Draft specification here.
Tasks: