wrong tokenization of date ranges #32
Comments
The issue was that the date regexp required a leading 0 for dates < 10... this may cause some extra ambiguity.
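To illustrate the diagnosis in this comment, here is a minimal sketch in Python. It is not the actual rule from giellalt/shared-smi, which is not written in Python; the pattern names `DAY_STRICT`, `DAY_RELAXED`, `MONTH_STRICT`, `MONTH_RELAXED` and the helper `date_range` are hypothetical. It only shows how an obligatory leading zero rejects "7.-11.5." while an optional one accepts it.

```python
import re

# Hypothetical patterns for illustration only; the real date rule lives in
# the giellalt/shared-smi sources and is not a Python regex.
DAY_STRICT = r"(?:0[1-9]|[12][0-9]|3[01])"     # leading zero obligatory
DAY_RELAXED = r"(?:0?[1-9]|[12][0-9]|3[01])"   # leading zero optional
MONTH_STRICT = r"(?:0[1-9]|1[0-2])"
MONTH_RELAXED = r"(?:0?[1-9]|1[0-2])"


def date_range(day: str, month: str) -> re.Pattern:
    """Build a pattern for 'D.-D.M.': start day, end day and month,
    each followed by a period."""
    return re.compile(rf"{day}\.-{day}\.{month}\.")


STRICT = date_range(DAY_STRICT, MONTH_STRICT)
RELAXED = date_range(DAY_RELAXED, MONTH_RELAXED)

for text in ["07.-11.05.", "7.-11.5."]:
    print(text,
          "strict:", bool(STRICT.fullmatch(text)),
          "relaxed:", bool(RELAXED.fullmatch(text)))
```

The relaxed pattern also accepts short strings such as "1.-2.3.", which is one plausible reading of the extra ambiguity mentioned above.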
flammie added a commit to giellalt/shared-smi that referenced this issue May 10, 2022
snomos pushed a commit to giellalt/shared-smi that referenced this issue May 24, 2022
snomos pushed a commit to giellalt/shared-smi that referenced this issue May 24, 2022
snomos pushed a commit to giellalt/shared-smi that referenced this issue May 25, 2022
snomos pushed a commit to giellalt/shared-smi that referenced this issue May 25, 2022
snomos pushed a commit to giellalt/shared-smi that referenced this issue Jun 3, 2022
In the following example we want to tokenize "7.-11.5." as a date (range).
Ságastallan lea rabas neahtas 7.-11.5., muhto dárbbu mielde ságastallanáiggi sáhttá guhkidit.
Instead it is tokenized in the following way:
"<7.-11.5>"
"7.-11.5" Num Sem/ID <W:0.0> #5->5
"<.>"
"." CLB <W:0.0> &no-space-after-punct-mark #6->6 ID:6 R:RIGHT:8 ADD:9780:no-space-after-punct ADD:9780:no-space-after-punct
no-space-after-punct-mark
"." CLB <W:0.0> ". ,"S &no-space-after-punct-mark &SUGGESTWF #6->6 ID:6 R:RIGHT:8 ADD:9780:no-space-after-punct COPY:9797:no-space-after-punct-sugg
no-space-after-punct-mark
"<,>"
"," CLB <W:0.0> &no-space-after-punct-mark #1->1 ID:8 ADD:9790:no-space-after-punct-link ADD:9790:no-space-after-punct-link
no-space-after-punct-mark
"," CLB <W:0.0> &LINK #1->1 ID:8 ADD:9790:no-space-after-punct-link ADDRELATION(RIGHT):9795:no-space-after-punct-rel ADD:9790:no-space-after-punct-link
:
The problem is that the final "." is not treated as part of the date, so it is tokenized as a sentence boundary.
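The effect described here can be reproduced with a toy tokenizer. This is a sketch under stated assumptions, not the project's tokenizer: the token classes `DATE_RANGE`, `NUM_ID`, `WORD`, `PUNCT` and the function `tokenize` are hypothetical, and the classes are simply tried in priority order. When no date-range rule applies, a generic numeric pattern consumes "7.-11.5" and leaves the final "." as a separate punctuation token.

```python
import re

# Hypothetical token classes, tried in the order listed.
DATE_RANGE = r"\d{1,2}\.-\d{1,2}\.\d{1,2}\."   # e.g. 7.-11.5. (final period included)
NUM_ID = r"\d+(?:[.\-]+\d+)*"                  # digit groups joined by '.' or '-'
WORD = r"\w+"
PUNCT = r"[^\w\s]"


def tokenize(text: str, with_date_rule: bool) -> list[str]:
    """Split text into tokens using priority-ordered alternation."""
    parts = ([DATE_RANGE] if with_date_rule else []) + [NUM_ID, WORD, PUNCT]
    return re.findall("|".join(parts), text)


snippet = "neahtas 7.-11.5., muhto"

# Without a matching date rule the period is split off:
print(tokenize(snippet, with_date_rule=False))
# ['neahtas', '7.-11.5', '.', ',', 'muhto']

# With a date rule that matches, the period stays inside the date token:
print(tokenize(snippet, with_date_rule=True))
# ['neahtas', '7.-11.5.', ',', 'muhto']
```

The first result mirrors the tokens "<7.-11.5>", "<.>" and "<,>" in the output above; the second is the tokenization the issue asks for.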