Ability to specify token breaking zones when calling tokenizer #5043

Closed
reckart opened this issue Sep 7, 2024 · 0 comments

@reckart
Member

reckart commented Sep 7, 2024

Is your feature request related to a problem? Please describe.
The tokenizer implicitly respects sentence boundaries. However, if a token-breaking event occurs inside a sentence, there is currently no way to communicate it to the tokenizer. Such an event could be, for example, a <br/> milestone element in an XML/HTML file.

Describe the solution you'd like
Allow zone boundaries to be passed to the tokenizer call, just as we already do for the sentence splitter call.
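
To illustrate the idea, here is a minimal sketch of a tokenizer that accepts extra break offsets. It is not the INCEpTION API: the class name `ZonedTokenizerSketch`, the method `tokenize`, and its `zoneBoundaries` parameter are hypothetical, and the JDK's `java.text.BreakIterator` stands in for whatever tokenizer is actually used. The text is cut at every given offset and each zone is tokenized separately, so no token can span a boundary.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical sketch, not the actual INCEpTION implementation.
class ZonedTokenizerSketch {

    /**
     * Returns [begin, end) character offsets of tokens in the text.
     * No returned token spans any of the given zone boundaries.
     */
    static List<int[]> tokenize(String text, int... zoneBoundaries) {
        // Collect cut points; start and end of the text are always included,
        // so zero boundaries or a single boundary work the same way.
        TreeSet<Integer> cuts = new TreeSet<>();
        cuts.add(0);
        cuts.add(text.length());
        for (int b : zoneBoundaries) {
            if (b > 0 && b < text.length()) {
                cuts.add(b);
            }
        }

        List<int[]> tokens = new ArrayList<>();
        BreakIterator bi = BreakIterator.getWordInstance(Locale.ROOT);
        Integer zoneBegin = null;
        for (int zoneEnd : cuts) {
            if (zoneBegin != null) {
                // Tokenize each zone on its own and shift offsets back
                String zone = text.substring(zoneBegin, zoneEnd);
                bi.setText(zone);
                int begin = bi.first();
                for (int end = bi.next(); end != BreakIterator.DONE;
                        begin = end, end = bi.next()) {
                    // Skip segments that consist only of whitespace
                    if (!zone.substring(begin, end).trim().isEmpty()) {
                        tokens.add(new int[] { zoneBegin + begin, zoneBegin + end });
                    }
                }
            }
            zoneBegin = zoneEnd;
        }
        return tokens;
    }
}
```

With this sketch, a caller would translate each `<br/>` position into a character offset and pass it as a boundary. For the text `foobar baz`, a boundary at offset 3 splits `foobar` into `foo` and `bar`, whereas a call without boundaries keeps it as one token.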

@reckart reckart added the ⭐️ Enhancement New feature or request label Sep 7, 2024
@reckart reckart added this to the 34.0 milestone Sep 7, 2024
@reckart reckart self-assigned this Sep 7, 2024
@reckart reckart added this to Kanban Sep 7, 2024
@github-project-automation github-project-automation bot moved this to 🔖 To do in Kanban Sep 7, 2024
reckart added a commit that referenced this issue Sep 7, 2024
- Added new signature to the tokenizer call
- Added test
- Consolidated existing code
reckart added a commit that referenced this issue Sep 7, 2024
…to-specify-token-breaking-zones-when-calling-tokenizer

#5043 - Ability to specify token breaking zones when calling tokenizer
@reckart reckart closed this as completed Sep 7, 2024
@github-project-automation github-project-automation bot moved this from 🔖 To do to 🍹 Done in Kanban Sep 7, 2024
reckart added a commit that referenced this issue Sep 8, 2024
* main: (114 commits)
  #5047 - Clean up layer detail UI a bit
  #4949 - Showing the start and end points of relations in left side bar
  No issue: Minor cleaning up
  #5043 - Ability to specify token breaking zones when calling tokenizer
  [maven-release-plugin] prepare for next development iteration
  [maven-release-plugin] prepare release inception-34.0-beta-6
  #5009 - Better handling of stacked annotations with link features in curation
  #5009 - Better handling of stacked annotations with link features in curation
  [maven-release-plugin] prepare for next development iteration
  [maven-release-plugin] prepare release inception-34.0-beta-6
  [maven-release-plugin] prepare for next development iteration
  [maven-release-plugin] prepare release inception-33.6
  #5040 - Improve feature form tab navigation
  #5037 - Show fraction of annotators that chose a certain label in curation sidebar mode
  #5035 - NBSPs should not be treated as tokens
  #5031 - ChatGPT recommender fails because format is not a supported parameter
  #5029 - Duplicate lines on the about page
  #5027 - Add more CSP configurations
  #4753 - Entity linker should skip already linked concepts
  #5007 - Lazy details on suggestions for multi-value concept features fail rendering
  ...

% Conflicts:
%	pom.xml
reckart added a commit that referenced this issue Sep 8, 2024
…rs-interactively-on-the-annotation-page

* main:
  #5047 - Clean up layer detail UI a bit
  #4949 - Showing the start and end points of relations in left side bar
  No issue: Minor cleaning up
  #5043 - Ability to specify token breaking zones when calling tokenizer
  [maven-release-plugin] prepare for next development iteration
  [maven-release-plugin] prepare release inception-34.0-beta-6
  #5009 - Better handling of stacked annotations with link features in curation
  #5009 - Better handling of stacked annotations with link features in curation
  [maven-release-plugin] prepare for next development iteration
  [maven-release-plugin] prepare release inception-34.0-beta-6
  [maven-release-plugin] prepare for next development iteration
  [maven-release-plugin] prepare release inception-33.6
  #5040 - Improve feature form tab navigation
  #5037 - Show fraction of annotators that chose a certain label in curation sidebar mode
  #5035 - NBSPs should not be treated as tokens
  #4904 - Upgrade to RDF4J 5.x
reckart added a commit that referenced this issue Sep 8, 2024
…code

* main: (42 commits)
  #5047 - Clean up layer detail UI a bit
  #4949 - Showing the start and end points of relations in left side bar
  No issue: Minor cleaning up
  #5043 - Ability to specify token breaking zones when calling tokenizer
  [maven-release-plugin] prepare for next development iteration
  [maven-release-plugin] prepare release inception-34.0-beta-6
  #5009 - Better handling of stacked annotations with link features in curation
  #5009 - Better handling of stacked annotations with link features in curation
  [maven-release-plugin] prepare for next development iteration
  [maven-release-plugin] prepare release inception-34.0-beta-6
  [maven-release-plugin] prepare for next development iteration
  [maven-release-plugin] prepare release inception-33.6
  #5040 - Improve feature form tab navigation
  #5037 - Show fraction of annotators that chose a certain label in curation sidebar mode
  #5035 - NBSPs should not be treated as tokens
  #5031 - ChatGPT recommender fails because format is not a supported parameter
  #5029 - Duplicate lines on the about page
  #5027 - Add more CSP configurations
  #4753 - Entity linker should skip already linked concepts
  #5007 - Lazy details on suggestions for multi-value concept features fail rendering
  ...
reckart added a commit that referenced this issue Sep 8, 2024
* main:
  #5047 - Clean up layer detail UI a bit
  #4949 - Showing the start and end points of relations in left side bar
  No issue: Minor cleaning up
  #5043 - Ability to specify token breaking zones when calling tokenizer
  #4904 - Upgrade to RDF4J 5.x
reckart added a commit that referenced this issue Sep 8, 2024
* main: (359 commits)
  #5047 - Clean up layer detail UI a bit
  #4949 - Showing the start and end points of relations in left side bar
  No issue: Minor cleaning up
  #5043 - Ability to specify token breaking zones when calling tokenizer
  [maven-release-plugin] prepare for next development iteration
  [maven-release-plugin] prepare release inception-34.0-beta-6
  #5009 - Better handling of stacked annotations with link features in curation
  #5009 - Better handling of stacked annotations with link features in curation
  [maven-release-plugin] prepare for next development iteration
  [maven-release-plugin] prepare release inception-34.0-beta-6
  [maven-release-plugin] prepare for next development iteration
  [maven-release-plugin] prepare release inception-33.6
  #5040 - Improve feature form tab navigation
  #5037 - Show fraction of annotators that chose a certain label in curation sidebar mode
  #5035 - NBSPs should not be treated as tokens
  #5031 - ChatGPT recommender fails because format is not a supported parameter
  #5029 - Duplicate lines on the about page
  #5027 - Add more CSP configurations
  #4753 - Entity linker should skip already linked concepts
  #5007 - Lazy details on suggestions for multi-value concept features fail rendering
  ...
@reckart reckart reopened this Sep 29, 2024
@github-project-automation github-project-automation bot moved this from 🍹 Done to 🏃‍♀️ In progress in Kanban Sep 29, 2024
reckart added a commit that referenced this issue Sep 29, 2024
- Fix case where there is only a single boundary given
reckart added a commit that referenced this issue Sep 30, 2024
- Fix case where there is only a single boundary given
reckart added a commit that referenced this issue Sep 30, 2024
- Fix case where there is only a single boundary given
reckart added a commit that referenced this issue Sep 30, 2024
…to-specify-token-breaking-zones-when-calling-tokenizer

#5043 - Ability to specify token breaking zones when calling tokenizer
@reckart reckart closed this as completed Sep 30, 2024
@github-project-automation github-project-automation bot moved this from 🏃‍♀️ In progress to 🍹 Done in Kanban Sep 30, 2024
reckart added a commit that referenced this issue Sep 30, 2024
* release/34.x:
  #5043 - Ability to specify token breaking zones when calling tokenizer
  #5064 - Project template for PHI annotation
  #5071 - Better document which layers are supported by which formats
  #5064 - Project template for PHI annotation
reckart added a commit that referenced this issue Oct 1, 2024
…ases

* main: (110 commits)
  [maven-release-plugin] prepare for next development iteration
  [maven-release-plugin] prepare release inception-34.0
  No issue: Mention MS SQL and PostgreSQL as experimental DB options
  No issue: Fix versions after merge
  #5043 - Ability to specify token breaking zones when calling tokenizer
  #5064 - Project template for PHI annotation
  #5071 - Better document which layers are supported by which formats
  #5061 - Multiple synchronous recommenders only the last one wins
  #5064 - Project template for PHI annotation
  #5068 - Show annotation sidebar by default
  #5033 - Ability to configure recommenders interactively on the annotation page
  #5066 - Increase default number of rows for brat editors
  #5056 - Ability to configure additional languages for knowledge bases
  #4909 - Upgrade dependencies
  #5056 - Ability to configure additional languages for knowledge bases
  #4909 - Upgrade dependencies
  No issue: Actually, document-metadata doesn't seem to be experimental after all...
  No issue: Set version to 35.0-SNAPSHOT
  [maven-release-plugin] prepare for next development iteration
  [maven-release-plugin] prepare release inception-33.7
  ...
reckart added a commit that referenced this issue Oct 1, 2024
…-JSON

* main: (308 commits)
  [maven-release-plugin] prepare for next development iteration
  [maven-release-plugin] prepare release inception-34.0
  No issue: Mention MS SQL and PostgreSQL as experimental DB options
  No issue: Fix versions after merge
  #5043 - Ability to specify token breaking zones when calling tokenizer
  #5064 - Project template for PHI annotation
  #5071 - Better document which layers are supported by which formats
  #5061 - Multiple synchronous recommenders only the last one wins
  #5064 - Project template for PHI annotation
  #5068 - Show annotation sidebar by default
  #5033 - Ability to configure recommenders interactively on the annotation page
  #5066 - Increase default number of rows for brat editors
  #5056 - Ability to configure additional languages for knowledge bases
  #4909 - Upgrade dependencies
  #5056 - Ability to configure additional languages for knowledge bases
  #4909 - Upgrade dependencies
  No issue: Actually, document-metadata doesn't seem to be experimental after all...
  No issue: Set version to 35.0-SNAPSHOT
  [maven-release-plugin] prepare for next development iteration
  [maven-release-plugin] prepare release inception-33.7
  ...

% Conflicts:
%	inception/inception-schema/src/main/java/de/tudarmstadt/ukp/inception/schema/exporters/LayerExporter.java
%	inception/inception-schema/src/main/java/de/tudarmstadt/ukp/inception/schema/exporters/TagSetExporter.java
%	inception/inception-ui-project/src/main/java/de/tudarmstadt/ukp/clarin/webanno/ui/project/layers/LayerDetailForm.java
%	inception/inception-ui-project/src/main/java/de/tudarmstadt/ukp/clarin/webanno/ui/project/layers/ProjectLayersPanel.java
reckart added a commit that referenced this issue Oct 1, 2024
* main: (30 commits)
  [maven-release-plugin] prepare for next development iteration
  [maven-release-plugin] prepare release inception-34.0
  No issue: Mention MS SQL and PostgreSQL as experimental DB options
  No issue: Fix versions after merge
  #5043 - Ability to specify token breaking zones when calling tokenizer
  #5064 - Project template for PHI annotation
  #5071 - Better document which layers are supported by which formats
  #5061 - Multiple synchronous recommenders only the last one wins
  #5064 - Project template for PHI annotation
  #5068 - Show annotation sidebar by default
  #5033 - Ability to configure recommenders interactively on the annotation page
  #5066 - Increase default number of rows for brat editors
  #5056 - Ability to configure additional languages for knowledge bases
  #4909 - Upgrade dependencies
  #5056 - Ability to configure additional languages for knowledge bases
  #4909 - Upgrade dependencies
  No issue: Actually, document-metadata doesn't seem to be experimental after all...
  No issue: Set version to 35.0-SNAPSHOT
  [maven-release-plugin] prepare for next development iteration
  [maven-release-plugin] prepare release inception-33.7
  ...
reckart added a commit that referenced this issue Oct 1, 2024
…he-Annotator-editor

* main: (330 commits)
  [maven-release-plugin] prepare for next development iteration
  [maven-release-plugin] prepare release inception-34.0
  No issue: Mention MS SQL and PostgreSQL as experimental DB options
  No issue: Fix versions after merge
  #5043 - Ability to specify token breaking zones when calling tokenizer
  #5064 - Project template for PHI annotation
  #5071 - Better document which layers are supported by which formats
  #5061 - Multiple synchronous recommenders only the last one wins
  #5064 - Project template for PHI annotation
  #5068 - Show annotation sidebar by default
  #5033 - Ability to configure recommenders interactively on the annotation page
  #5066 - Increase default number of rows for brat editors
  #5056 - Ability to configure additional languages for knowledge bases
  #4909 - Upgrade dependencies
  #5056 - Ability to configure additional languages for knowledge bases
  #4909 - Upgrade dependencies
  No issue: Actually, document-metadata doesn't seem to be experimental after all...
  No issue: Set version to 35.0-SNAPSHOT
  [maven-release-plugin] prepare for next development iteration
  [maven-release-plugin] prepare release inception-33.7
  ...
reckart added a commit that referenced this issue Oct 1, 2024
…causes-problems-in-search

* main: (32 commits)
  [maven-release-plugin] prepare for next development iteration
  [maven-release-plugin] prepare release inception-34.0
  No issue: Mention MS SQL and PostgreSQL as experimental DB options
  No issue: Fix versions after merge
  #5043 - Ability to specify token breaking zones when calling tokenizer
  #5064 - Project template for PHI annotation
  #5071 - Better document which layers are supported by which formats
  #5061 - Multiple synchronous recommenders only the last one wins
  #5064 - Project template for PHI annotation
  #5068 - Show annotation sidebar by default
  #5033 - Ability to configure recommenders interactively on the annotation page
  #5066 - Increase default number of rows for brat editors
  #5056 - Ability to configure additional languages for knowledge bases
  #4909 - Upgrade dependencies
  #5056 - Ability to configure additional languages for knowledge bases
  #4909 - Upgrade dependencies
  No issue: Actually, document-metadata doesn't seem to be experimental after all...
  No issue: Set version to 35.0-SNAPSHOT
  [maven-release-plugin] prepare for next development iteration
  [maven-release-plugin] prepare release inception-33.7
  ...

% Conflicts:
%	inception/inception-ui-project/src/main/java/de/tudarmstadt/ukp/clarin/webanno/ui/project/layers/ProjectLayersPanel.html