wrong tokenization of date ranges #32

Closed
lynnda-hill opened this issue Nov 1, 2021 · 1 comment

@lynnda-hill
Contributor

In the following example we want to tokenize "7.-11.5." as a date (range).

Ságastallan lea rabas neahtas 7.-11.5., muhto dárbbu mielde ságastallanáiggi sáhttá guhkidit.

Instead it is tokenized in the following way:

"<7.-11.5>"
	"7.-11.5" Num Sem/ID <W:0.0> #5->5
"<.>"
	"." CLB <W:0.0> &no-space-after-punct-mark #6->6 ID:6 R:RIGHT:8 ADD:9780:no-space-after-punct ADD:9780:no-space-after-punct
no-space-after-punct-mark
	"." CLB <W:0.0> ". ,"S &no-space-after-punct-mark &SUGGESTWF #6->6 ID:6 R:RIGHT:8 ADD:9780:no-space-after-punct COPY:9797:no-space-after-punct-sugg
no-space-after-punct-mark

"<,>"
	"," CLB <W:0.0> &no-space-after-punct-mark #1->1 ID:8 ADD:9790:no-space-after-punct-link ADD:9790:no-space-after-punct-link
no-space-after-punct-mark
	"," CLB <W:0.0> &LINK #1->1 ID:8 ADD:9790:no-space-after-punct-link ADDRELATION(RIGHT):9795:no-space-after-punct-rel ADD:9790:no-space-after-punct-link
:

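The problem is that the final "." is not treated as part of the date, so it is split off as its own token and tokenized as a sentence boundary (CLB). That stray "." then triggers the no-space-after-punct-mark error on the following ",".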
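
A quick way to see the split is to list only the surface tokens from the trace. The sketch below is a minimal illustration, not part of the project's tooling: `wordforms` is a hypothetical helper, and the sample is abbreviated from the output above (the rule trace after the dependency tag is cut). It assumes surface tokens are the lines of the form `"<token>"`.

```python
def wordforms(cg_stream):
    """Return the surface tokens of a CG-style stream.

    Assumption: a surface token is a line of the form "<token>";
    every other line (readings, blanks, ":") is ignored.
    """
    tokens = []
    for line in cg_stream.splitlines():
        line = line.strip()
        if line.startswith('"<') and line.endswith('>"'):
            tokens.append(line[2:-2])
    return tokens


# Abbreviated from the trace in this issue; rule-trace details are omitted.
reported = '''"<7.-11.5>"
"7.-11.5" Num Sem/ID <W:0.0> #5->5
"<.>"
"." CLB <W:0.0> &no-space-after-punct-mark #6->6
"<,>"
"," CLB <W:0.0> &no-space-after-punct-mark #1->1
'''

tokens = wordforms(reported)
print(tokens)                # ['7.-11.5', '.', ',']
print("7.-11.5." in tokens)  # False: the date range was not kept as one token
```

The desired result is a single `7.-11.5.` token followed by `,`.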

@flammie
Contributor

flammie commented Jan 6, 2022

echo 'Ságastallan lea rabas neahtas 7.-11.5., muhto dárbbu mielde ságastallanáiggi sáhttá guhkidit.' | tools/grammarcheckers/modes/smegramrelease-dev.mode 
"<Ságastallan>"
	"ságastallan" N <NomGenSg> Sem/Act Sg Nom <W:0.0> <firstCohort> @SUBJ> #1->1
	"ságastallat" Ex/V TV Der/NomAct N <NomGenSg> Sg Nom <W:0.0> <firstCohort> @SUBJ> #1->1
	"ságastit" Ex/V TV Der/alla Ex/V Der/NomAct N <NomGenSg> Sg Nom <W:0.0> <firstCohort> @SUBJ> #1->1
: 
"<lea>"
	"leat" <mv> V <copula> <TH-Nom-Any> <mielde> <OR-Loc-HumGroup> <OR-eret-Plc> <dušše><TH-Inf> <árvvus> <LO-Loc-johtu><DE-Ill-Plc> <AT-Loc-Mat> <AT-Abe-Any> <AT-Nom-Any> <AT-Nom-Adj><EX-Ill-Ani> <PO-Loc-Hum> <PO-Gen-Hum> <MA-mielde-Any> <MA-Adv-Manner> <XT-Gen-Measr> <LO-maŋŋil-Time> <LO-Acc-Time> <LO-Loc-Time> <CO-Com-Ani> <ID-Nom-Any> <TH-Nom-Any><RO-Ess-Any><EX-Ill-Any> <EX-Ill-Ani><TH-Nom-Adj> <EX-Ill-Ani> <TH-Nom-Obj><RE-Ill-Ani> <LO-Loc-Any> <AktioEss> <BE-Ill-Ani><PU-Ess-Any> <RO-Ess-Any><PU-Ill-Act> <RO-Ess-Any> <Inf> IV Ind Prs Sg3 <W:0.0> @+FMAINV #2->2
: 
"<rabas>"
	"rabas" A Sem/Hum Attr <W:0.0> @>N #3->3
	"rabas" Adv <W:0.0> @<ADVL #3->3
: 
"<neahtas>"
	"neahtta" N Sem/Dummytag Sg Loc <W:0.0> @<ADVL #4->4
: 
"<7.-11.5.>"
	"7.-11.5" Num Sem/Date Sg Gen <W:0.0> <NoSpaceAfterPunctMark> @>N #5->5
"<,>"
	"," CLB <W:0.0> #6->6
: 
"<muhto>"
	"muhto" CC <W:0.0> @CVP #7->7
: 
"<dárbbu>"
	"dárbu" N <TH-Inf> <TH-Ill-Any> Sem/Perc-phys Sg Gen <W:0.0> @>P #8->9
: 
"<mielde>"
	"mielde" Po <W:0.0> @ADVL> #9->9
: 
"<ságastallanáiggi>"
	"ságastallanáigi" N Sem/Time Sg Acc <W:0.0> <cohort-with-dynamic-compound> @ADVL> #10->10
	"ságastallanáigi" N Sem/Time Sg Gen <W:0.0> <cohort-with-dynamic-compound> @ADVL> #10->10
: 
"<sáhttá>"
	"sáhttit" <aux> V <TH-Acc-Obj><XT-Acc-Measure> <DE-Ill-Plc> <Inf> IV Ind Prs Sg3 <W:0.0> @+FAUXV #11->11
: 
"<guhkidit>"
	"guhkidit" <mv> V <TH-Acc-Any><SO-Loc-Any><DE-Ill-Any> <TH-Acc-Any><DE-Ill-*Ani> <PA-Acc-Any><XT-Com-Measure> <PA-Acc-Any> TV Inf <W:0.0> @-FMAINV #12->12
"<.>"
	"." CLB <W:0.0> <LastCohort> #13->13
:\n

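The issue was that the date regexp required a leading 0 for numbers below 10 in dates, so "7.-11.5." did not match as a date. Accepting single-digit numbers without the leading zero fixes this case, but it may cause some extra ambiguity.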
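
The effect of an obligatory leading zero can be sketched with ordinary Python regular expressions. The patterns below are hypothetical and written only for illustration. The actual rule in giellalt/shared-smi is not shown in this thread and may be expressed quite differently. The sketch only covers ranges of the form `D.-D.M.`.

```python
import re

# Hypothetical patterns, not the project's real date rule.
STRICT_DAY = r"(?:0[1-9]|[12][0-9]|3[01])"     # leading zero required below 10
STRICT_MONTH = r"(?:0[1-9]|1[0-2])"
LENIENT_DAY = r"(?:0?[1-9]|[12][0-9]|3[01])"   # leading zero optional
LENIENT_MONTH = r"(?:0?[1-9]|1[0-2])"


def date_range_pattern(day, month):
    """Build a pattern for ranges written as 'day.-day.month.'."""
    return re.compile(rf"{day}\.-{day}\.{month}\.")


STRICT = date_range_pattern(STRICT_DAY, STRICT_MONTH)
LENIENT = date_range_pattern(LENIENT_DAY, LENIENT_MONTH)


def is_date_range(pattern, token):
    """True if the whole token, including the final '.', matches the pattern."""
    return pattern.fullmatch(token) is not None


print(is_date_range(STRICT, "7.-11.5."))     # False
print(is_date_range(LENIENT, "7.-11.5."))    # True
print(is_date_range(STRICT, "07.-11.05."))   # True
```

With the strict variant the token from the example is rejected, which is consistent with the final "." being left over as a separate token. The lenient variant accepts more strings as dates, and that is one plausible source of the extra ambiguity mentioned above.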

flammie added a commit to giellalt/shared-smi that referenced this issue May 10, 2022
snomos pushed a commit to giellalt/shared-smi that referenced this issue May 24, 2022
snomos pushed a commit to giellalt/shared-smi that referenced this issue May 24, 2022
snomos pushed a commit to giellalt/shared-smi that referenced this issue May 25, 2022
snomos pushed a commit to giellalt/shared-smi that referenced this issue May 25, 2022
snomos pushed a commit to giellalt/shared-smi that referenced this issue Jun 3, 2022