Correct Tokenization #1707

Sachin04011974 · 2020-05-18T16:27:49Z

I am running INCEpTION to extract named entities, and it does a really good job of IOB2-tagging them. It says it uses StanfordNLP tokenization, but the tokenization of special characters and symbols could be better.

For example:
Token - As Is --> To Be

Date - '06', '/', '01', '/', '2019' --> 6/1/2019
Email Address - 'xyz', '@', 'mydomain.com' --> xyz@mydomain.com*
US format Telephone Numbers - '123', '-', '456', '-', '7890' --> 123-456-7890*

*True values masked

We tokenized the same text with the StanfordNLP tokenizer, and it gives the correct tokens (as listed under "To Be").

Is there a setting through which we can achieve the correct tokenization, or is this a feature that is not currently supported? Please let us know; that would be a great help.

~Sachin

jcklie · 2020-05-26T12:51:20Z

INCEpTION internally uses the Java BreakIterator when you import documents as plain text. Where did you find information about us using StanfordNLP tokenization? If you need your own tokenization, you have to convert your documents into a tokenized file format, e.g. WebAnno TSV or CoNLL 2002. You can find a list of the supported file formats and their descriptions in the INCEpTION documentation.
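
As a rough illustration of the conversion suggested above, here is a minimal Python sketch that tokenizes text with a custom regular expression and writes it in a CoNLL 2002-style layout (one token and one tag per line separated by a space, blank line between sentences). The pattern, the function names `tokenize` and `to_conll2002`, and the placeholder tag `O` are assumptions made for this example; they are not part of INCEpTION or of any Stanford tool, and the pattern only covers the three cases from the question.

```python
import re

# Hypothetical pattern: alternatives are tried in order, so the
# "keep together" cases come before the generic word/symbol rules.
TOKEN_RE = re.compile(
    r"""
    [\w.+-]+@[\w-]+(?:\.[\w-]+)+   # email address, e.g. xyz@mydomain.com
  | \d{1,2}/\d{1,2}/\d{2,4}        # date, e.g. 06/01/2019
  | \d{3}-\d{3}-\d{4}              # US phone number, e.g. 123-456-7890
  | \w+                            # ordinary word or number
  | [^\w\s]                        # any other single symbol
    """,
    re.VERBOSE,
)


def tokenize(sentence):
    """Split one sentence into tokens using the pattern above."""
    return TOKEN_RE.findall(sentence)


def to_conll2002(sentences):
    """Render sentences as 'token tag' lines, blank line between sentences.

    Every token gets the placeholder tag 'O' (outside any entity),
    because the text has not been annotated yet.
    """
    blocks = []
    for sentence in sentences:
        lines = [f"{token} O" for token in tokenize(sentence)]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    text = ["Call 123-456-7890 or mail xyz@mydomain.com on 06/01/2019."]
    print(to_conll2002(text), end="")
```

The output can be saved to a file and imported with a pre-tokenized format instead of plain text, so that the token boundaries are taken from the file rather than computed on import. Sentence splitting is left to the caller here; the sketch expects one string per sentence.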

reckart · 2020-07-28T09:15:26Z

@jcklie Let's close this one and open a separate one for actually making tokenization editable in the brat view?

jcklie · 2020-07-28T09:23:19Z

@reckart Ok. I created #1778

- Fixed layout if arc labels fall back to layer names
- Changed color of arc labels from the arc color to black to make them better readable, in particular when selected


* commit '608c37c619cb677952908154d813f61ac2b34a1e':
  [maven-release-plugin] prepare for next development iteration
  [maven-release-plugin] prepare release webanno-4.0.0-beta-14
  Bump spring.security.version from 5.2.3.RELEASE to 5.3.3.RELEASE
  #1472 - Upgrade dependencies (4.0.0)
  #1472 - Upgrade dependencies (4.0.0)
  #1472 - Upgrade dependencies (4.0.0)
  #1707 - Layout for relations not having a label is broken
  #1712 Auto logout not succeeding

% Conflicts:
% pom.xml

reckart added the "Support request" label (User has a problem and needs help) May 26, 2020

jcklie pushed a commit that referenced this issue Feb 2, 2021

Merge pull request #1708 from webanno/bugfix/1707-Layout-for-relation…

14af146

…s-not-having-a-label-is-broken #1707 - Layout for relations not having a label is broken

jcklie pushed a commit that referenced this issue Feb 2, 2021

Merge branch '3.6.x'

bb4824a

* 3.6.x: #1707 - Layout for relations not having a label is broken

jcklie pushed a commit that referenced this issue Feb 2, 2021

Merge branch 'master' into feature/1703-Improve-brat-rendering

3fdd3b0

* master: #1707 - Layout for relations not having a label is broken #1668 - Upgrade dependencies (3.6.6)

reckart closed this as completed Feb 4, 2021
