TTS: 200 as text is not generated in the accusative #35

Closed
snomos opened this issue Oct 23, 2023 · 22 comments

snomos commented Oct 23, 2023

In this sentence:

Dát máhtto de mak jåvsåj Finnmárko sámijda suláj 200 jage maŋŋela Kristusa riegádime.

200 is disambiguated to accusative:

echo 'Dát máhtto de mak jåvsåj Finnmárko sámijda suláj 200 jage maŋŋela Kristusa riegádime.' | \
  ./tools/tts/modes/trace-smj-normaliser8-cg.mode
[...]
"<200>"
        "200" Num Arab Err/Orth Sg Acc "200>"MIDTAPE <W:0.0> SELECT:1387:Arab SELECT:2936 MAP:1357:>nNum @>N #9->10 SETPARENT:866:SetModToN
;       "200" Num Arab Err/Orth Ess "200>"MIDTAPE <W:0.0> SELECT:1387:Arab SELECT:2936
;       "200" Num Arab Err/Orth Sg Com "200>"MIDTAPE <W:0.0> SELECT:1387:Arab SELECT:2936
;       "200" Num Arab Sg Ela Attr "200"MIDTAPE <W:0.0> SELECT:1387:Arab IFF:3194
;       "200" Num Arab Sg Gen "200>"MIDTAPE <W:0.0> SELECT:1387:Arab SELECT:2936
;       "200" Num Arab Sg Ill Attr "200"MIDTAPE <W:0.0> SELECT:1387:Arab IFF:3194
;       "200" Num Arab Sg Ine Attr "200"MIDTAPE <W:0.0> SELECT:1387:Arab IFF:3194
;       "200" Num Arab Sg Nom "200>"MIDTAPE <W:0.0> SELECT:1387:Arab REMOVE:3396
;       "200" Num Sem/ID "200"MIDTAPE <W:0.0> SELECT:1387:Arab

But the generated word form is not in the accusative; in the phon element it is in the nominative:

"<200>"
	"guoktatjuodát" A Ord Sg Nom "guoktatjuodát"phon
		"200" Num Arab Err/Orth Sg Acc "200>"MIDTAPE <W:0.0> @>N #9->10
	"guoktatjuohte" Num Sg Nom "guoktatjuohte"phon
		"200" Num Arab Err/Orth Sg Acc "200>"MIDTAPE <W:0.0> @>N #9->10

The accusative is:

echo guoktatjuodát+A+Ord+Sg+Acc | hfst-lookup -q tools/tts/generator-gt-norm.hfstol
guoktatjuodát+A+Ord+Sg+Acc	guoktjuodádav	0.000000
guoktatjuodát+A+Ord+Sg+Acc	guoktatjuodádav	0.000000
guoktatjuodát+A+Ord+Sg+Acc	guoktetjuodádav	0.000000

As described here, the idea is that, as the last step of the normalisation, we generate the correct form based on the tags of the original word. Are we doing that, or are there other problems?

snomos added the bug label Oct 23, 2023
snomos changed the title from "200 as text is not generated in the accusative" to "TTS: 200 as text is not generated in the accusative" Oct 23, 2023

flammie commented Oct 23, 2023

As the regeneration step stands now, it tries to generate guoktatjuodát+Num+Arab+Sg+Acc; it lacks some non-trivial logic for getting Adj+Ord from Num+Arab?

snomos commented Oct 23, 2023

Ah, I had missed that it was not the same part of speech. That is of course a different story. Before we go on: @ilm024, what is the correct word form in this context? What should we generate? Is it one of these forms?

echo guoktatjuohte+Num+Sg+Acc | hfst-lookup -q tools/tts/generator-gt-norm.hfstol                                            
guoktatjuohte+Num+Sg+Acc	guoktjuodev	0.000000
guoktatjuohte+Num+Sg+Acc	guoktatjuodev	0.000000
guoktatjuohte+Num+Sg+Acc	guoktetjuodev	0.000000
guoktatjuohte+Num+Sg+Acc	guovtetjuodev	0.000000

?

snomos commented Oct 23, 2023

Looking at this example once more, it is nevertheless straightforward under the algorithm we have adopted. The algorithm is (copied from the document I linked to above):

  1. Generate a new lemma using the normaliser FST.
  2. Take the original analysis, and remove every prefixed tag (prefixed tags are those of the form Abcd/xxx, where Abcd/ is the tag prefix) + the target tag (ABBR in this case):
    Area/NO N Sem/Hum ABBR Gram/TAbbr Sg Acc → N Sg Acc
  3. Use the new lemma and the new analysis string to generate the corresponding surface form:
    dåktår N Sg Acc → dåktårav
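
Step 2 of the algorithm can be sketched in a few lines of Python. This is only an illustration of the tag filtering described in the list, not the libdivvun implementation; the function name `strip_tags` is made up for the example, and it assumes that a prefixed tag is any tag containing a `/`.

```python
def strip_tags(analysis: str, target_tags: set[str]) -> list[str]:
    """Drop prefixed tags (assumed: any tag containing '/') and the target tags."""
    return [
        tag
        for tag in analysis.split()
        if "/" not in tag and tag not in target_tags
    ]


# The ABBR example from the algorithm description.
print(strip_tags("Area/NO N Sem/Hum ABBR Gram/TAbbr Sg Acc", {"ABBR"}))
# The 200 example from this issue, with Arab as the target tag.
print(strip_tags("Num Arab Err/Orth Sg Acc", {"Arab"}))
```

The remaining tags, joined with `+` after the new lemma, give the string that is sent to the generator in step 3.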

In the example with 200 it goes like this:

"<200>"
	"guoktatjuodát" A Ord Sg Nom "guoktatjuodát"phon
		"200" Num Arab Err/Orth Sg Acc "200>"MIDTAPE <W:0.0> @>N #9->10
	"guoktatjuohte" Num Sg Nom "guoktatjuohte"phon
		"200" Num Arab Err/Orth Sg Acc "200>"MIDTAPE <W:0.0> @>N #9->10

Of these two analyses the first is irrelevant because the part of speech does not match: we have Num in and want Num out, so we can disregard the A Ord analysis. We then remove the target tag Arab and all prefixed tags from the original analysis (Num Arab Err/Orth Sg Acc), which leaves Num Sg Acc. This is exactly what we need to generate the word form we want (if it is the one we want, which @ilm024 will have to answer 🙂).
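
A minimal Python sketch of this reasoning, assuming the part of speech is the first tag once prefixed tags have been removed. The helper names `matches_pos` and `lookup_string` are hypothetical and do not come from libdivvun.

```python
def matches_pos(candidate_tags: list[str], original_tags: list[str]) -> bool:
    """True when both analyses start with the same part-of-speech tag."""
    if not candidate_tags or not original_tags:
        return False
    return candidate_tags[0] == original_tags[0]


def lookup_string(lemma: str, tags: list[str]) -> str:
    """Build a generator lookup string such as lemma+Num+Sg+Acc."""
    return "+".join([lemma, *tags])


# Original analysis after removing Arab and the prefixed tag Err/Orth.
original = ["Num", "Sg", "Acc"]

# The two normalised candidates for 200, with their own analyses.
candidates = {
    "guoktatjuodát": ["A", "Ord", "Sg", "Nom"],
    "guoktatjuohte": ["Num", "Sg", "Nom"],
}

kept = [lemma for lemma, tags in candidates.items() if matches_pos(tags, original)]
print([lookup_string(lemma, original) for lemma in kept])
```

Only guoktatjuohte survives the part-of-speech check, so the string sent to the generator would be guoktatjuohte+Num+Sg+Acc.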

flammie commented Oct 25, 2023

Yes, well, as it is now Num was used as the tag for normalisation, but if it were Arab, and we can always remove it, that could work.

snomos commented Oct 25, 2023

Hm, we cannot use Num as the trigger for normalisation: Num is a tag that is also used for numerals written out as text, which therefore need no normalisation. I will change it to Arab, so that we can remove Arab because Arab was the tag that triggered the normalisation. The logic must be that the tag that triggers normalisation is the one we want removed after normalisation.

snomos commented Oct 25, 2023

I have now changed Num and Ord to Arab in pipespec.xml.in 🙂

lynnda-hill commented

I fixed the disambiguator so that Err/Orth is no longer selected. The result is now as shown below (i.e. Gen, since jage is also disambiguated to Gen). Can I close the bug?

"<200>"
        "200" Num Arab Sg Gen "200>"MIDTAPE <W:0.0> SELECT:1387:Arab
;       "200" Num Arab Err/Orth Ess "200>"MIDTAPE <W:0.0> SELECT:1387:Arab REMOVE:4021:errsub
;       "200" Num Arab Err/Orth Sg Acc "200>"MIDTAPE <W:0.0> SELECT:1387:Arab REMOVE:4021:errsub
;       "200" Num Arab Err/Orth Sg Com "200>"MIDTAPE <W:0.0> SELECT:1387:Arab REMOVE:4021:errsub
;       "200" Num Arab Sg Ela Attr "200"MIDTAPE <W:0.0> SELECT:1387:Arab IFF:3194
;       "200" Num Arab Sg Ill Attr "200"MIDTAPE <W:0.0> SELECT:1387:Arab IFF:3194
;       "200" Num Arab Sg Ine Attr "200"MIDTAPE <W:0.0> SELECT:1387:Arab IFF:3194
;       "200" Num Arab Sg Nom "200>"MIDTAPE <W:0.0> SELECT:1387:Arab REMOVE:3396
;       "200" Num Sem/ID "200"MIDTAPE <W:0.0> SELECT:1387:Arab
: 
"<jage>"
        "jahke" N <smj> Sem/Time Sg Gen "jahke>Q1"MIDTAPE <W:0.0> SELECT:2523 SUBSTITUTE:4028
;       "jahke" N Sem/Time Pl Nom "jahke>Q1"MIDTAPE <W:0.0> SELECT:2523
: 
"<maŋŋela>"
        "maŋŋel" N <smj> Sem/Dummytag Sg Gen "maŋŋela"MIDTAPE <W:0.0> SUBSTITUTE:4028
        "maŋŋela" Adv <smj> "maŋŋela>"MIDTAPE <W:0.0> SUBSTITUTE:4029
        "maŋŋela" Po <smj> "maŋŋela>"MIDTAPE <W:0.0> SUBSTITUTE:4033
        "maŋŋela" Pr <smj> "maŋŋela>"MIDTAPE <W:0.0> SUBSTITUTE:4034

snomos commented Oct 31, 2023

The disambiguation is now in order, but there is still a problem with the normalisation. What goes into the normaliser is this, in the genitive, as Linda says (using the command echo 'Dát máhtto de mak jåvsåj Finnmárko sámijda suláj 200 jage maŋŋela Kristusa riegádime.' | ./tools/tts/modes/trace-smj-normaliser8-cg.mode):

"<200>"
	"200" Num Arab Sg Gen "200>"MIDTAPE <W:0.0> SELECT:1388:Arab MAP:1357:>nNum @>N #9->10 SETPARENT:866:SetModToN
;	"200" Num Arab Err/Orth Ess "200>"MIDTAPE <W:0.0> SELECT:1388:Arab REMOVE:4032:errsub
;	"200" Num Arab Err/Orth Sg Acc "200>"MIDTAPE <W:0.0> SELECT:1388:Arab REMOVE:4032:errsub
;	"200" Num Arab Err/Orth Sg Com "200>"MIDTAPE <W:0.0> SELECT:1388:Arab REMOVE:4032:errsub
;	"200" Num Arab Sg Ela Attr "200"MIDTAPE <W:0.0> SELECT:1388:Arab IFF:3195
;	"200" Num Arab Sg Ill Attr "200"MIDTAPE <W:0.0> SELECT:1388:Arab IFF:3195
;	"200" Num Arab Sg Ine Attr "200"MIDTAPE <W:0.0> SELECT:1388:Arab IFF:3195
;	"200" Num Arab Sg Nom "200>"MIDTAPE <W:0.0> SELECT:1388:Arab REMOVE:3397
;	"200" Num Sem/ID "200"MIDTAPE <W:0.0> SELECT:1388:Arab

But after the normalisation it is still nominative, and we have an extra A Ord analysis:

"<200>"
	"guoktatjuodát" A Ord Sg Nom "guoktatjuodát"phon "200"oldlemma
	"guoktatjuohte" Num Sg Nom "guoktatjuohte"phon "200"oldlemma
;	"200" Num Arab Err/Orth Ess "200>"MIDTAPE <W:0.0> SELECT:1388:Arab REMOVE:4032:errsub
;	"200" Num Arab Err/Orth Sg Acc "200>"MIDTAPE <W:0.0> SELECT:1388:Arab REMOVE:4032:errsub
;	"200" Num Arab Err/Orth Sg Com "200>"MIDTAPE <W:0.0> SELECT:1388:Arab REMOVE:4032:errsub
;	"200" Num Arab Sg Ela Attr "200"MIDTAPE <W:0.0> SELECT:1388:Arab IFF:3195
;	"200" Num Arab Sg Ill Attr "200"MIDTAPE <W:0.0> SELECT:1388:Arab IFF:3195
;	"200" Num Arab Sg Ine Attr "200"MIDTAPE <W:0.0> SELECT:1388:Arab IFF:3195
;	"200" Num Arab Sg Nom "200>"MIDTAPE <W:0.0> SELECT:1388:Arab REMOVE:3397
;	"200" Num Sem/ID "200"MIDTAPE <W:0.0> SELECT:1388:Arab

flammie commented Oct 31, 2023

It has become a bit complicated, so I printed all the versions with full trace in today's version:

echo 'Dát máhtto de mak jåvsåj Finnmárko sámijda suláj 200 jage maŋŋela Kristusa riegádime.' | ~/github/giellalt/lang-smj/tools/tts/modes/smj-normaliser8-cg.mode |  ~/github/divvun/libdivvun/src/divvun-normaliser -a  '/home/flammie/github/giellalt/lang-smj/tools/tts/analyser-gt-norm.hfstol' -g  '/home/flammie/github/giellalt/lang-smj/tools/tts/generator-gt-norm.hfstol' -n  '/home/flammie/github/giellalt/lang-smj/tools/tts/transcriptor-gt-desc.hfstol' -t ABBR -t Arab -t Ord -t Symbol -v
Being verbose.
Surface analyser set to: /home/flammie/github/giellalt/lang-smj/tools/tts/analyser-gt-norm.hfstol
Normaliser set to: /home/flammie/github/giellalt/lang-smj/tools/tts/transcriptor-gt-desc.hfstol
Generator set to: /home/flammie/github/giellalt/lang-smj/tools/tts/generator-gt-norm.hfstol
Deep analyser set to: 
Tags set to: ABBR Arab Ord Symbol 
Reading files: 
* /home/flammie/github/giellalt/lang-smj/tools/tts/transcriptor-gt-desc.hfstol
* /home/flammie/github/giellalt/lang-smj/tools/tts/generator-gt-norm.hfstol
* /home/flammie/github/giellalt/lang-smj/tools/tts/analyser-gt-norm.hfstol
* 
expanding tags: 
New surface form: Dát
"<Dát>"
Using lemma: dát
No expansion tags in
	"dát" Pron Dem Sg Ela Attr <W:0.0> @>N #1->2
Probably not cg formatted stuff: 
: 
New surface form: máhtto
"<máhtto>"
Using lemma: máhtto
No expansion tags in
	"máhtto" N Sem/Prod-cogn Sg Nom <W:0.0> @SUBJ> #2->0
Probably not cg formatted stuff: 
: 
New surface form: de
"<de>"
Using lemma: de
No expansion tags in
	"de" Adv <W:0.0> @ADVL> #3->5
Probably not cg formatted stuff: 
: 
New surface form: mak
"<mak>"
Using lemma: mak
No expansion tags in
	"mak" Adv <W:0.0> @ADVL> #4->5
Probably not cg formatted stuff: 
: 
New surface form: jåvsåj
"<jåvsåj>"
Using lemma: jåksåt
No expansion tags in
	"jåksåt" <mv> V TV Ind Prt Sg3 <W:0.0> @FMV #5->0
Probably not cg formatted stuff: 
: 
New surface form: Finnmárko
"<Finnmárko>"
Using lemma: Finnmárkko
No expansion tags in
	"Finnmárkko" OLang/NOB N Prop Sem/Plc Sg Gen <W:0.0> @>N #6->7
Probably not cg formatted stuff: 
: 
New surface form: sámijda
"<sámijda>"
Using lemma: sábme
No expansion tags in
	"sábme" N Sem/Hum_Lang Pl Ill <W:0.0> @<ADVL #7->5
Probably not cg formatted stuff: 
: 
New surface form: suláj
"<suláj>"
Using lemma: sulla
No expansion tags in
	"sulla" N Sem/Dummytag Pl Com <W:0.0> @<ADVL #8->5
Probably not cg formatted stuff: 
: 
New surface form: 200
"<200>"
Expanding because of Arab
Using lemma: 200
1. looking up normaliser
2.a Using normalised form: guoktatjuodát
2.b regenerating lookup: guoktatjuodát+Num+Sg+Gen
3. Couldn't regenerate, reanalysing lemma: guoktatjuodát
	"guoktatjuodát" A Ord Sg Nom "guoktatjuodát"phon "200"oldlemma
2.a Using normalised form: guoktatjuohte
2.b regenerating lookup: guoktatjuohte+Num+Sg+Gen
3. reanalysing: guoktjuode
	"guoktatjuohte" Num Pl Nom "guoktjuode"phon "200"oldlemma
	"guoktatjuohte" Num Sg Gen "guoktjuode"phon "200"oldlemma
	"guoktatjuohte" Num Sg Ill Attr "guoktjuode"phon "200"oldlemma
3. reanalysing: guoktatjuode
	"guoktatjuohte" Num Pl Nom "guoktatjuode"phon "200"oldlemma
	"guoktatjuohte" Num Sg Gen "guoktatjuode"phon "200"oldlemma
	"guoktatjuohte" Num Sg Ill Attr "guoktatjuode"phon "200"oldlemma
3. reanalysing: guoktetjuode
	"guoktatjuohte" Num Pl Nom "guoktetjuode"phon "200"oldlemma
	"guoktatjuohte" Num Sg Gen "guoktetjuode"phon "200"oldlemma
	"guoktatjuohte" Num Sg Ill Attr "guoktetjuode"phon "200"oldlemma
3. reanalysing: guovtetjuode
	"guoktatjuohte" Num Attr "guovtetjuode"phon "200"oldlemma
	"guoktatjuohte" Num Sg Gen "guovtetjuode"phon "200"oldlemma
Probably not cg formatted stuff: 
: 
New surface form: jage
"<jage>"
Using lemma: jahke
No expansion tags in
	"jahke" N Sem/Time Sg Gen <W:0.0> @<ADVL #10->5
Probably not cg formatted stuff: 
: 
New surface form: maŋŋela
"<maŋŋela>"
Using lemma: maŋŋel
No expansion tags in
	"maŋŋel" N Sem/Dummytag Sg Gen <W:0.0> @>N #11->12
Probably not cg formatted stuff: 
: 
New surface form: Kristusa
"<Kristusa>"
Using lemma: Kristus
No expansion tags in
	"Kristus" OLang/UND N Prop Sem/Mal Sg Gen <W:0.0> @P< #12->12
Probably not cg formatted stuff: 
: 
New surface form: riegádime
"<riegádime>"
Using lemma: riegádibme
No expansion tags in
	"riegádibme" N Sem/Dummytag Gram/NomAct Pl Nom <W:0.0> @<SUBJ #13->5
New surface form: .
"<.>"
Using lemma: .
No expansion tags in
	"." CLB <W:0.0> #14->2
Probably not cg formatted stuff: 
:\n
Probably not cg formatted stuff: 

or, which is almost the same thing:

$ echo 200 | hfst-lookup ~/github/giellalt/lang-smj/tools/tts/transcriptor-gt-desc.hfstol -q
200	guoktatjuodát	0,000000
200	guoktatjuohte	0,000000
$ echo guoktatjuodát+Num+Sg+Gen | hfst-lookup ~/github/giellalt/lang-smj/tools/tts/generator-gt-norm.hfstol -q
guoktatjuodát+Num+Sg+Gen	guoktatjuodát+Num+Sg+Gen+?	inf
$ echo guoktatjuodát | hfst-lookup ~/github/giellalt/lang-smj/tools/tts/analyser-gt-norm.hfstol -q
guoktatjuodát	guoktatjuodát+A+Ord+Attr	0,000000
guoktatjuodát	guoktatjuodát+A+Ord+Sg+Nom	0,000000
$ echo guoktatjuohte+Num+Sg+Gen  | hfst-lookup ~/github/giellalt/lang-smj/tools/tts/generator-gt-norm.hfstol -q
guoktatjuohte+Num+Sg+Gen	guoktjuode	0,000000
guoktatjuohte+Num+Sg+Gen	guoktatjuode	0,000000
guoktatjuohte+Num+Sg+Gen	guoktetjuode	0,000000
guoktatjuohte+Num+Sg+Gen	guovtetjuode	0,000000
$ echo guoktjuode | hfst-lookup ~/github/giellalt/lang-smj/tools/tts/analyser-gt-norm.hfstol -q
guoktjuode	guoktatjuohte+Num+Pl+Nom	0,000000
guoktjuode	guoktatjuohte+Num+Sg+Gen	0,000000
guoktjuode	guoktatjuohte+Num+Sg+Ill+Attr	0,000000

etc.

snomos commented Nov 1, 2023

So in the debug version it all looks good, except we could throw away some stuff, and we need to restrict the normaliser a bit, to not generate four variants of the same morphosyntactic form. I have commented the relevant parts below:

"<200>"
Expanding because of Arab
Using lemma: 200
1. looking up normaliser
2.a Using normalised form: guoktatjuodát
2.b regenerating lookup: guoktatjuodát+Num+Sg+Gen
3. Couldn't regenerate, reanalysing lemma: guoktatjuodát
	"guoktatjuodát" A Ord Sg Nom "guoktatjuodát"phon "200"oldlemma

guoktatjuodát should be thrown away, since 'A' does not match 'Num'. The following normalised string is what we want:

2.a Using normalised form: guoktatjuohte
2.b regenerating lookup: guoktatjuohte+Num+Sg+Gen
3. reanalysing: guoktjuode
	"guoktatjuohte" Num Pl Nom "guoktjuode"phon "200"oldlemma
	"guoktatjuohte" Num Sg Gen "guoktjuode"phon "200"oldlemma
	"guoktatjuohte" Num Sg Ill Attr "guoktjuode"phon "200"oldlemma
3. reanalysing: guoktatjuode
	"guoktatjuohte" Num Pl Nom "guoktatjuode"phon "200"oldlemma
	"guoktatjuohte" Num Sg Gen "guoktatjuode"phon "200"oldlemma
	"guoktatjuohte" Num Sg Ill Attr "guoktatjuode"phon "200"oldlemma
3. reanalysing: guoktetjuode
	"guoktatjuohte" Num Pl Nom "guoktetjuode"phon "200"oldlemma
	"guoktatjuohte" Num Sg Gen "guoktetjuode"phon "200"oldlemma
	"guoktatjuohte" Num Sg Ill Attr "guoktetjuode"phon "200"oldlemma
3. reanalysing: guovtetjuode
	"guoktatjuohte" Num Attr "guovtetjuode"phon "200"oldlemma
	"guoktatjuohte" Num Sg Gen "guovtetjuode"phon "200"oldlemma

These forms are all good, but we need to restrict the normaliser so that it only generates the form we want (we need @ilm024 to decide which one). Alternatively, we generate all of them but give them variant tags in the reanalysis, with enough information to select among them using CG in the next step. That way we can generate forms that fit the rest of the text, provided the rest of the text contains clues as to which version or style to pick. If we go this route, we still need to designate one variant as the default, probably tagged v1, and select that one if no other information is given.

We need to end up with one variant only, but that does not need to happen in the normaliser step. The normaliser should probably give a warning, though, when there are several alternative outputs with exactly the same analysis. That is, in the example above we should get a warning for the four Num Sg Gen variants, since there are no variant tags to differentiate them, and the normaliser should return only the first one in this case. If the normaliser returns several forms with different tags in the reanalysis, it should return them all.
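
The proposed behaviour (warn when several forms share exactly the same analysis, and keep only the first) could look roughly like the Python sketch below. It is an illustration of the proposal, not existing normaliser code; `pick_variants` and the representation of a reading as a `(tags, form)` pair are assumptions made for the example.

```python
import warnings


def pick_variants(readings):
    """Keep the first form per distinct analysis; warn about the rest.

    `readings` is a list of (tags, form) pairs, where `tags` is a tuple of
    tag strings. Forms whose analyses differ are all kept.
    """
    chosen = {}
    for tags, form in readings:
        if tags in chosen:
            warnings.warn(
                f"several forms for {' '.join(tags)}: "
                f"keeping {chosen[tags]!r}, dropping {form!r}"
            )
            continue
        chosen[tags] = form
    return list(chosen.items())


readings = [
    (("Num", "Sg", "Gen"), "guoktjuode"),
    (("Num", "Sg", "Gen"), "guoktatjuode"),
    (("Num", "Sg", "Gen"), "guoktetjuode"),
    (("Num", "Sg", "Gen"), "guovtetjuode"),
]
print(pick_variants(readings))
```

Note that "first" here simply means first in generator output order, which is not necessarily the preferred form. That is why a default variant tag would still be needed.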

What I do not understand is why we end up with:

"<200>"
	"guoktatjuohte" Num Sg Nom "guoktatjuohte"phon "200"oldlemma

i.e. guoktatjuohte and Num Sg Nom, when the reanalysed forms clearly say, for example, guoktatjuode and Num Sg Gen, as in:

3. reanalysing: guoktjuode
	"guoktatjuohte" Num Sg Gen "guoktjuode"phon "200"oldlemma

So although everything is correct, we end up with the wrong form. That looks like a bug somewhere.

The other forms, like Num Pl Nom and Num Sg Ill Attr, should be filtered out because of tag mismatch with the input tags.
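
As a rough Python sketch of that filtering, under the assumption that a reanalysis is kept only when its tag sequence is identical to the tags sent to the generator (the function name `filter_reanalyses` is hypothetical):

```python
def filter_reanalyses(reanalyses, expected_tags):
    """Keep only the reanalyses whose tags equal the tags we generated from."""
    return [(tags, form) for tags, form in reanalyses if tags == expected_tags]


# The three reanalyses of guoktjuode from the trace above.
reanalyses = [
    (["Num", "Pl", "Nom"], "guoktjuode"),
    (["Num", "Sg", "Gen"], "guoktjuode"),
    (["Num", "Sg", "Ill", "Attr"], "guoktjuode"),
]
print(filter_reanalyses(reanalyses, ["Num", "Sg", "Gen"]))
```

With the input tags Num Sg Gen, only the Num Sg Gen reading of guoktjuode is left.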

flammie commented Nov 1, 2023

Current version should throw away all tag strings that don't match (with ; in debug mode).

snomos commented Nov 2, 2023

Thanks, I just tested it. This is what I get:

echo 'Dát máhtto de mak jåvsåj Finnmárko sámijda suláj 200 jage maŋŋela Kristusa riegádime.' \
| ./tools/tts/modes/trace-smj-normaliser.mode
"<200>"
	"200" Num Arab Sg Gen "200>"MIDTAPE <W:0.0> SELECT:1388:Arab MAP:1357:>nNum @>N #9->10 SETPARENT:866:SetModToN
;	"200" Num Arab Err/Orth Ess "200>"MIDTAPE <W:0.0> SELECT:1388:Arab REMOVE:4032:errsub
;	"200" Num Arab Err/Orth Sg Acc "200>"MIDTAPE <W:0.0> SELECT:1388:Arab REMOVE:4032:errsub
;	"200" Num Arab Err/Orth Sg Com "200>"MIDTAPE <W:0.0> SELECT:1388:Arab REMOVE:4032:errsub
;	"200" Num Arab Sg Ela Attr "200"MIDTAPE <W:0.0> SELECT:1388:Arab IFF:3195
;	"200" Num Arab Sg Ill Attr "200"MIDTAPE <W:0.0> SELECT:1388:Arab IFF:3195
;	"200" Num Arab Sg Ine Attr "200"MIDTAPE <W:0.0> SELECT:1388:Arab IFF:3195
;	"200" Num Arab Sg Nom "200>"MIDTAPE <W:0.0> SELECT:1388:Arab REMOVE:3397
;	"200" Num Sem/ID "200"MIDTAPE <W:0.0> SELECT:1388:Arab

So no conversion anymore. I then run it on just 200, with more details:

echo 200 | hfst-tokenise -g tools/tts/tokeniser-tts-cggt-desc.pmhfst | \
  egrep '(^"| Gen )' | \
divvun-normaliser -v -a tools/tts/analyser-gt-norm.hfstol \
-g tools/tts/generator-gt-norm.hfstol -n tools/tts/transcriptor-gt-desc.hfstol -t Arab
Being verbose.
Surface analyser set to: tools/tts/analyser-gt-norm.hfstol
Normaliser set to: tools/tts/transcriptor-gt-desc.hfstol
Generator set to: tools/tts/generator-gt-norm.hfstol
Deep analyser set to: 
Tags set to: Arab 
Reading files: 
* tools/tts/transcriptor-gt-desc.hfstol
* tools/tts/generator-gt-norm.hfstol
* tools/tts/analyser-gt-norm.hfstol
* 
expanding tags: 
New surface form: 200
"<200>"
Expanding because of Arab
Using lemma: 200
1. looking up normaliser
2.a Using normalised form: guoktatjuodát
2.b regenerating lookup: guoktatjuodát+Num+Sg+Gen+MIDTAPE
3. Couldn't regenerate, reanalysing lemma: guoktatjuodát
;	"guoktatjuodát" A Ord Attr "guoktatjuodát"phon "200"oldlemma NORMALISER_REMOVE:notgenerated
;	"guoktatjuodát" A Ord Sg Nom "guoktatjuodát"phon "200"oldlemma NORMALISER_REMOVE:notgenerated
2.a Using normalised form: guoktatjuohte
2.b regenerating lookup: guoktatjuohte+Num+Sg+Gen+MIDTAPE
3. Couldn't regenerate, reanalysing lemma: guoktatjuohte
;	"guoktatjuohte" Num Sg Nom "guoktatjuohte"phon "200"oldlemma NORMALISER_REMOVE:notgenerated
	"200" Num Arab Sg Gen "200>"MIDTAPE <W:0.0>

For whatever reason it is not able to generate. I then try the analysis and generation steps with the FSTs used by the normaliser:

echo guoktatjuohte | hfst-lookup -q tools/tts/analyser-gt-norm.hfstol                                                                                         
guoktatjuohte	guoktatjuohte+Num+Sg+Nom	0.000000

echo guoktatjuohte+Num+Sg+Nom | hfst-lookup -q tools/tts/generator-gt-norm.hfstol                                                                                                                               
guoktatjuohte+Num+Sg+Nom	guoktjuohte	0.000000
guoktatjuohte+Num+Sg+Nom	guoktatjuohte	0.000000
guoktatjuohte+Num+Sg+Nom	guoktetjuohte	0.000000

echo guoktatjuohte+Num+Sg+Gen | hfst-lookup -q tools/tts/generator-gt-norm.hfstol                                                                                                                               
guoktatjuohte+Num+Sg+Gen	guoktjuode	0.000000
guoktatjuohte+Num+Sg+Gen	guoktatjuode	0.000000
guoktatjuohte+Num+Sg+Gen	guoktetjuode	0.000000
guoktatjuohte+Num+Sg+Gen	guovtetjuode	0.000000

No problems whatsoever. So the question is: why can't the generator generate when used in the normaliser, when there are no problems when it is used directly on the command line?

snomos commented Nov 2, 2023

It looks as if MIDTAPE is getting mixed into the generation. Is the right string being sent to the generator? Cf.:

2.b regenerating lookup: guoktatjuohte+Num+Sg+Gen+MIDTAPE

snomos commented Nov 2, 2023

If so, that would explain why the generation does not go through 🙂
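
To illustrate the suspected problem: a reading line in the CG stream carries more than morphological tags, and anything that is not filtered out ends up in the generator string. The Python sketch below is a hypothetical, simplified filter, not the parser in libdivvun or VislCG3. It assumes that quoted material (such as "200>"MIDTAPE), tokens starting with `<`, `@` or `#`, and tokens containing `:` (rule traces) are not morphological tags, and that quoted strings contain no spaces.

```python
import re

# Assumed shape of a reading line: optional ';', whitespace, "lemma", then tags.
READING = re.compile(r'^;?\s*"(?P<lemma>[^"]*)"\s*(?P<rest>.*)$')


def morph_tags(line: str) -> list[str]:
    """Return only the morphological tags of a CG reading line."""
    match = READING.match(line)
    if match is None:
        return []
    tags = []
    for token in match.group("rest").split():
        if token.startswith('"'):              # quoted material, e.g. "200>"MIDTAPE
            continue
        if token.startswith(("<", "@", "#")):  # weights, syntactic tags, dependencies
            continue
        if ":" in token:                       # rule traces, e.g. SELECT:1388:Arab
            continue
        tags.append(token)
    return tags


line = ('\t"200" Num Arab Sg Gen "200>"MIDTAPE <W:0.0> SELECT:1388:Arab '
        'MAP:1357:>nNum @>N #9->10 SETPARENT:866:SetModToN')
tags = [tag for tag in morph_tags(line) if tag != "Arab"]
print("+".join(["guoktatjuohte", *tags]))
```

With the quoted MIDTAPE element skipped, the lookup string becomes guoktatjuohte+Num+Sg+Gen, the same string that generates without problems on the command line.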

flammie commented Nov 2, 2023

Ah yes, that's true, but it touches on the topic we talked about last week, that there are different CG parsers all over the codebase; this one was not particularly good at interpreting tags.

snomos commented Nov 2, 2023

OK. Perhaps we should have just one codebase for parsing CG, or possibly use code from the VislCG3 code?

snomos commented Nov 3, 2023

Fixed in divvun/libdivvun@d0647bf:

echo 'Dát máhtto de mak jåvsåj Finnmárko sámijda suláj 200 jage maŋŋela Kristusa riegádime.' | \
./tools/tts/modes/trace-smj-normaliser.mode
...
"<200>"
	"guoktatjuohte" Num Sg Gen "guoktjuode"phon "200"oldlemma
	"guoktatjuohte" Num Sg Gen "guoktatjuode"phon "200"oldlemma
	"guoktatjuohte" Num Sg Gen "guoktetjuode"phon "200"oldlemma
	"guoktatjuohte" Num Sg Gen "guovtetjuode"phon "200"oldlemma
;	"200" Num Arab Err/Orth Ess "200>"MIDTAPE <W:0.0> SELECT:1388:Arab REMOVE:4032:errsub
;	"200" Num Arab Err/Orth Sg Acc "200>"MIDTAPE <W:0.0> SELECT:1388:Arab REMOVE:4032:errsub
;	"200" Num Arab Err/Orth Sg Com "200>"MIDTAPE <W:0.0> SELECT:1388:Arab REMOVE:4032:errsub
;	"200" Num Arab Sg Ela Attr "200"MIDTAPE <W:0.0> SELECT:1388:Arab IFF:3195
;	"200" Num Arab Sg Ill Attr "200"MIDTAPE <W:0.0> SELECT:1388:Arab IFF:3195
;	"200" Num Arab Sg Ine Attr "200"MIDTAPE <W:0.0> SELECT:1388:Arab IFF:3195
;	"200" Num Arab Sg Nom "200>"MIDTAPE <W:0.0> SELECT:1388:Arab REMOVE:3397
;	"200" Num Sem/ID "200"MIDTAPE <W:0.0> SELECT:1388:Arab

Great! We can move on to the next bug 🙂

ilm024 commented Nov 10, 2023

We want Gen as "guoktatjuode". Guoktjuode does not work, since there is no compound part in front of it. This is on my to-do list, but I need help, as I cannot manage it myself. "Guokte" is already tagged with Use/Marg; can't these be avoided automatically?

snomos commented Nov 10, 2023

Yes, that should happen automatically. It requires a bit of reorganisation, but it will be sorted out.

snomos commented Nov 17, 2023

It now happens automatically, based on the tags you have added, @ilm024 🙂

echo 'Dát máhtto de mak jåvsåj Finnmárko sámijda suláj 200 jage maŋŋela Kristusa riegádime.' | \
./tools/tts/modes/trace-smj-normaliser.mode
...
"<sámijda>"
	"sábme" N Sem/Hum_Lang Pl Ill "sábme>Q1jda"MIDTAPE <W:0.0> @<ADVL #7->5
: 
"<suláj>"
	"sulla" N Sem/Dummytag Pl Com "sulla>Q1j"MIDTAPE <W:0.0> @<ADVL #8->5
: 
"<200>"
	"guoktatjuodát" A Ord Attr "guoktatjuodát"phon "200"oldlemma
	"guoktatjuodát" A Ord Sg Nom "guoktatjuodát"phon "200"oldlemma
	"guoktatjuohte" Num Sg Gen "guoktatjuode"phon "200"oldlemma
: 
"<jage>"
	"jahke" N Sem/Time Sg Gen "jahke>Q1"MIDTAPE <W:0.0> @<ADVL #10->5
: 
"<maŋŋela>"
	"maŋŋel" N Sem/Dummytag Sg Gen "maŋŋela"MIDTAPE <W:0.0> @>N #11->12

@flammie, regarding the ordinal form that shows up: it looks like a step backwards (a regression). You had solved that problem earlier, hadn't you? That is, generated forms whose POS does not match the part of speech we send in are ignored, isn't that right? So what is happening here?

snomos reopened this Nov 17, 2023

flammie commented Nov 17, 2023

I think they were previously thrown away because of the generation failure; as a fix for #36 we now use the lemma form from the transcriptor regardless. Maybe I can just match the tags for this form as well...

snomos commented Nov 18, 2023

Having tested with the newest build of libdivvun, I can confirm that things work as they should:

echo 'Dát máhtto de mak jåvsåj Finnmárko sámijda suláj 200 jage maŋŋela Kristusa riegádime.' | \
./tools/tts/modes/smj-normaliser.mode
[...]
"<suláj>"
	"sulla" N Sem/Dummytag Pl Com "sulla>Q1j"MIDTAPE <W:0.0> @<ADVL #8->5
: 
"<200>"
	"guoktatjuohte" Num Sg Gen "guoktatjuode"phon "200"oldlemma
: 
"<jage>"
	"jahke" N Sem/Time Sg Gen "jahke>Q1"MIDTAPE <W:0.0> @<ADVL #10->5
: 
"<maŋŋela>"
	"maŋŋel" N Sem/Dummytag Sg Gen "maŋŋela"MIDTAPE <W:0.0> @>N #11->12

No more ambiguous forms, just the one we want 😄
