divvun-normaliser #44

snomos · 2021-03-04T19:04:27Z

Draft specification here.

Tasks:

Add support for analyser
Add support for generator
Add support for normaliser
Add support for tag filtering
Proper output formatting
Store the original lemma in a tag string in the same reading, replacing it with the normalized lemma
Normalize each CG sub-reading separately, like phonemisation #58

snomos · 2021-03-04T19:07:19Z

The following works fine without divvun-normaliser:

echo 'Man vuoras: 23' | hfst-tokenise -g tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst \
| vislcg3 -g tools/tokenisers/mwe-dis.bin | cg-mwesplit 
"<Man>"
	"Man" N Prop Sem/Plc Sg Nom <W:0.0>
	"Man" N Prop Sem/Sur Sg Nom <W:0.0>
	"man" Adv <W:0.0>
	"mij" Pron Interr Sg Gen <W:0.0>
	"mij" Pron Interr Sg Ill Attr <W:0.0>
	"mij" Pron Interr Sg Ine Attr <W:0.0>
	"mij" Pron Rel Sg Gen <W:0.0>
	"mij" Pron Rel Sg Ill Attr <W:0.0>
	"mij" Pron Rel Sg Ine Attr <W:0.0>
: 
"<vuoras>"
	"vuoras" A Attr <W:0.0>
	"vuoras" A Sg Nom <W:0.0>
	"vuoras" Err/Orth A Attr <W:0.0>
	"vuoras" Err/Orth A Sg Nom <W:0.0>
	"vuorrat" Ex/V IV Der/st V Ind Prs Err/Orth Sg3 <W:0.0>
	"vuorrat" Ex/V IV Der/st V Ind Prs Sg3 <W:0.0>
"<:>"
	":" CLB <W:0.0>
: 
"<23>"
	"23" A Arab Ord Attr CLBfinal <W:0.0>
	"23" Num Arab Sg Ela Attr <W:0.0>
	"23" Num Arab Sg Gen <W:0.0>
	"23" Num Arab Sg Ill Attr <W:0.0>
	"23" Num Arab Sg Ine Attr <W:0.0>
	"23" Num Arab Sg Nom <W:0.0>
	"23" Num Sem/ID <W:0.0>
:\n

But with divvun-normaliser I get a libdivvun error (and not the expected output format):

echo 'Man vuoras: 23' | hfst-tokenise -g tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst \
| vislcg3 -g tools/tokenisers/mwe-dis.bin \
| cg-mwesplit \
| divvun-normaliser -a src/analyser-gt-desc.hfst -n tools/tts/transcriptor-gt-desc.hfst -g src/generator-gt-norm.hfst 
libdivvun: ERROR: HfstException.
"<Man>"
: 
"<vuoras>"
"<:>"
: 
"<23>"
:\n

flammie · 2021-03-05T03:35:35Z

It seems I didn't manage to set the default for the -t tags, so it printed nothing. Now it should copy the input if no tags are set to be expanded.
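
The pass-through behaviour described here can be pictured with a small Python sketch. This is not the libdivvun code: `expand_stream` and the `expand` callback are hypothetical names used only for illustration, and the whitespace split is a simplification of real CG tag parsing.

```python
from typing import Callable, Iterable, Iterator


def expand_stream(
    lines: Iterable[str],
    expand_tags: set[str],
    expand: Callable[[str], Iterable[str]],
) -> Iterator[str]:
    """Yield CG stream lines, expanding only readings that carry a selected tag."""
    for line in lines:
        is_reading = line.startswith("\t")
        # Naive whitespace split; good enough here because the tags we
        # filter on (e.g. Arab) contain no spaces or quotes.
        if is_reading and expand_tags & set(line.split()):
            yield from expand(line)
        else:
            # Wordforms, blanks and unselected readings pass through unchanged.
            yield line
```

With an empty `expand_tags` set the intersection is always empty, so every line is copied to the output as-is.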

flammie · 2021-03-05T14:32:47Z

Pushed a few more debugging messages; it seems we need hfstol's for lookup_fd:

echo 'Man vuoras: 23' | hfst-tokenise -g ~/github/giellalt/lang-smj/tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst | vislcg3 -g ~/github/giellalt/lang-smj/tools/tokenisers/mwe-dis.bin | cg-mwesplit | src/divvun-normaliser -a ~/github/giellalt/lang-smj/src/analyser-gt-desc.hfstol -n ~/github/giellalt/lang-smj/tools/tts/transcriptor-gt-desc.hfstol -g ~/github/giellalt/lang-smj/src/generator-gt-norm.hfstol --tags Arab -v
libdivvun: ERROR: HfstException: Exception: NotTransducerStreamException: transducer type not recognised in file: HfstInputStream.cc on line: 1088
Read /home/flammie/github/giellalt/lang-smj/tools/tts/transcriptor-gt-desc.hfstol, /home/flammie/github/giellalt/lang-smj/src/generator-gt-norm.hfstol, /home/flammie/github/giellalt/lang-smj/src/analyser-gt-desc.hfstol
"<Man>"
	"Man" N Prop Sem/Plc Sg Nom <W:0.0>
	"Man" N Prop Sem/Sur Sg Nom <W:0.0>
	"man" Adv <W:0.0>
	"mij" Pron Interr Sg Gen <W:0.0>
	"mij" Pron Interr Sg Ill Attr <W:0.0>
	"mij" Pron Interr Sg Ine Attr <W:0.0>
	"mij" Pron Rel Sg Gen <W:0.0>
	"mij" Pron Rel Sg Ill Attr <W:0.0>
	"mij" Pron Rel Sg Ine Attr <W:0.0>
: 
"<vuoras>"
	"vuoras" A Attr <W:0.0>
	"vuoras" A Sg Nom <W:0.0>
	"vuoras" Err/Orth A Attr <W:0.0>
	"vuoras" Err/Orth A Sg Nom <W:0.0>
	"vuorrat" Ex/V IV Der/st V Ind Prs Err/Orth Sg3 <W:0.0>
	"vuorrat" Ex/V IV Der/st V Ind Prs Sg3 <W:0.0>
"<:>"
	":" CLB <W:0.0>
: 
"<23>"
	"guaktalåkgålmmå" <W:0.0> "guaktalåkgålmmå"phon
		"23" A Arab Ord Attr CLBfinal <W:0.0>
	"guaktalåkgålmmå" <W:0.0> "guaktalåkgålmmå"phon
		"23" Num Arab Sg Ela Attr <W:0.0>
	"guaktalåkgålmmå" <W:0.0> "guaktalåkgålmmå"phon
		"23" Num Arab Sg Gen <W:0.0>
	"guaktalåkgålmmå" <W:0.0> "guaktalåkgålmmå"phon
		"23" Num Arab Sg Ill Attr <W:0.0>
	"guaktalåkgålmmå" <W:0.0> "guaktalåkgålmmå"phon
		"23" Num Arab Sg Ine Attr <W:0.0>
	"guaktalåkgålmmå" <W:0.0> "guaktalåkgålmmå"phon
		"23" Num Arab Sg Nom <W:0.0>
	"23" Num Sem/ID <W:0.0>
:\n

snomos · 2021-03-05T14:50:19Z

Nice progress 🙂

@unhammer are there any CG syntax restrictions on the transcribed string, "guaktalåkgålmmå"phon in the test case above? We modelled it after the divvun-cgspell output, but that one has only one letter after the actual string. Just asking to avoid major changes later 🙂

TinoDidriksen · 2021-03-05T18:23:19Z

"guaktalåkgålmmå"phon is a valid CG tag, though it is not considered a textual tag - not that I think that matters for you. The rule is that if it starts with " then include anything to next " and from there include to next whitespace. This avoids much unnecessary escaping.

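The quoting rule above can be written down as a few lines of Python. This is only a sketch of the rule as stated in the comment, not the vislcg3 parser: `split_cg_tags` is a hypothetical helper, it ignores any escaping, and the handling of an unterminated quote is an assumption.

```python
def split_cg_tags(reading: str) -> list[str]:
    """Split one CG reading line into tags.

    A tag that starts with a double quote runs to the next double quote
    and from there to the next whitespace; any other tag simply runs to
    the next whitespace.
    """
    tags = []
    i, n = 0, len(reading)
    while i < n:
        if reading[i].isspace():
            i += 1
            continue
        start = i
        if reading[i] == '"':
            close = reading.find('"', i + 1)
            # Assumption: an unterminated quote swallows the rest of the line.
            i = n if close == -1 else close + 1
        # Continue to the next whitespace (picks up suffixes such as "phon").
        while i < n and not reading[i].isspace():
            i += 1
        tags.append(reading[start:i])
    return tags
```

Applied to the reading from the test case, the three tags come out as `"guaktalåkgålmmå"`, `<W:0.0>` and `"guaktalåkgålmmå"phon`; whitespace inside the quotes would stay part of the tag.
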
snomos · 2022-04-14T06:04:43Z

A case we haven't considered: dynamic compounds, i.e. cohorts with sub-readings. There are two considerations:

We create sub-readings out of the original: the normalized reading is the main reading, and the original is stored in a sub-reading.
In dynamic compounds, we may want to normalize each part separately, as in:

echo 1800-lågon | ./tools/tts/modes/smj-txt2ipa.mode 
"<1800-lågon>"
	"lågos" N Sem/Dummytag Ess <W:0.0> @HNOUN #1->0 "1800-lɔkon"phon
		"1800" Num Cmp/Hyph Cmp <W:0.0> #1->0 "1800-lɔkon"phon
	"låhko" N Sem/Amount Sg Ine <W:0.0> @HNOUN #1->0 "1800-lɔkon"phon
		"1800" Num Cmp/Hyph Cmp <W:0.0> #1->0 "1800-lɔkon"phon
	"lågos" N Sem/Dummytag Ess <W:0.0> @HNOUN #1->0 "1800-lɔkon"phon
		"1800" Num Cmp/OblHyph Cmp <W:0.0> #1->0 "1800-lɔkon"phon
	"låhko" N Sem/Amount Sg Ine <W:0.0> @HNOUN #1->0 "1800-lɔkon"phon
		"1800" Num Cmp/OblHyph Cmp <W:0.0> #1->0 "1800-lɔkon"phon
:\n

If we could normalize 1800- independently of the rest of the compound, we would solve a lot of corner cases.
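
To make "each part separately" concrete: in the CG stream a sub-reading is indented one tab deeper than the reading it belongs to, so the parts of a compound can be recovered from the indentation alone. The Python sketch below groups the reading lines of one cohort that way; `group_subreadings` is a hypothetical helper for illustration, not part of divvun-normaliser.

```python
def group_subreadings(cohort_lines: list[str]) -> list[list[str]]:
    """Group the lines of one cohort into chains of main reading + sub-readings."""
    chains: list[list[str]] = []
    for line in cohort_lines:
        depth = len(line) - len(line.lstrip("\t"))
        if depth == 0:
            continue  # wordform line such as "<1800-lågon>", not a reading
        if depth == 1 or not chains:
            chains.append([line.strip()])  # a new main reading starts a chain
        else:
            chains[-1].append(line.strip())  # deeper indent: sub-reading
    return chains
```

Each chain could then be handed to the normaliser one element at a time, so that the "1800" part is treated independently of "lågos" or "låhko".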

Perhaps the best solution would be to not change the basic cohort structure at all, i.e. that we do NOT add the original lemma as a sub-reading. Instead, I suggest that we store the original in a tag string along the lines of the "abc"phon string, for example "1800-"orig or "1800-"olemma. The main purpose of retaining the original lemma is debugging, and changing the cohort structure seems to cost too much.
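
A minimal Python sketch of this proposal, assuming the `orig` suffix is the one chosen (the name is still open) and that the reading line has no trailing newline. `replace_lemma` is a hypothetical helper, not an existing libdivvun function.

```python
def replace_lemma(reading: str, normalised: str, suffix: str = "orig") -> str:
    """Swap in the normalised lemma and keep the original as a trailing tag."""
    body = reading.lstrip("\t")
    indent = reading[: len(reading) - len(body)]
    if not body.startswith('"'):
        return reading  # not a reading line; leave untouched
    end = body.find('"', 1)
    if end == -1:
        return reading  # malformed lemma; leave untouched
    lemma = body[1:end]
    rest = body[end + 1:]
    # Same indentation, hence same cohort structure; only the tags change.
    return f'{indent}"{normalised}"{rest} "{lemma}"{suffix}'
```

For the reading `"23" Num Arab Sg Nom <W:0.0>` and the normalised form guaktalåkgålmmå this gives `"guaktalåkgålmmå" Num Arab Sg Nom <W:0.0> "23"orig` at the same indentation level, so no sub-reading is added.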

@flammie could you have a look at this? I added the new tasks to the task list in the initial comment.

snomos assigned flammie Mar 4, 2021

snomos mentioned this issue Mar 6, 2021

Transcriptor gjev tekst som ikkje kan analyserast ("Transcriptor produces text that cannot be analysed") giellalt/lang-smj#5

Closed

snomos mentioned this issue Sep 12, 2023

Normalize each CG sub-reading separately, like phonemisation #58

Open
