Correct Tokenization #1707

Closed
Sachin04011974 opened this issue May 18, 2020 · 3 comments
Labels
Support request (User has a problem and needs help)

Comments


Sachin04011974 commented May 18, 2020

I am running INCEpTION to extract named entities, and it does a really good job of IOB2 tagging them. It says it uses StanfordNLP tokenization, but the tokenization of special characters and symbols could be better.

For example (tokens as-is --> to-be):

  1. Date - '06', '/', '01', '/', '2019' --> 6/1/2019
  2. Email Address - 'xyz', '@', 'mydomain.com' --> xyz@mydomain.com*
  3. US format Telephone Numbers - '123', '-', '456', '-', '7890' --> 123-456-7890*

*True values masked

We tokenized the same text using the StanfordNLP tokenizer, and it gives the correct tokens (as listed under "to-be").

Is there a setting through which we can get the correct tokenization, or is this feature not currently supported? Please let us know. That would be a great help.

~Sachin

reckart added the Support request label May 26, 2020
Contributor

jcklie commented May 26, 2020

INCEpTION internally uses the Java BreakIterator when you import documents as plain text. Where did you find information about us using StanfordNLP tokenization? If you need your own tokenization, you need to convert your document into a tokenized file format, e.g. WebAnno TSV or CoNLL 2002. You can find a list of file formats and their descriptions in the INCEpTION documentation.
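
As a rough illustration of the pre-tokenization route, here is a minimal Python sketch that keeps dates, email addresses and US-style phone numbers as single tokens and writes them out in a CoNLL 2002-like layout. Everything in it is an assumption for illustration: the regex patterns, the names `tokenize` and `to_conll2002`, and the layout itself (one token per line followed by a space and an `O` tag, with a blank line between sentences). It is not how INCEpTION or StanfordNLP tokenizes; check the format description in the INCEpTION documentation before importing the output.

```python
import re

# Hypothetical patterns: specific shapes first, generic fallbacks last.
TOKEN_RE = re.compile(
    r"""
    [\w.+-]+@[\w-]+(?:\.[\w-]+)+    # email address, e.g. xyz@mydomain.com
  | \d{1,2}/\d{1,2}/\d{2,4}         # date, e.g. 06/01/2019
  | \d{3}-\d{3}-\d{4}               # US phone number, e.g. 123-456-7890
  | \w+                             # plain word or number
  | [^\w\s]                         # any other single non-space character
    """,
    re.VERBOSE,
)


def tokenize(sentence):
    """Split one sentence into tokens, keeping the special shapes whole."""
    return TOKEN_RE.findall(sentence)


def to_conll2002(sentences):
    """Render pre-split sentences as 'token O' lines.

    Assumed layout: token and tag separated by a single space,
    sentences separated by a blank line. Every token gets the
    placeholder tag 'O' (no entity), to be annotated later.
    """
    blocks = []
    for sentence in sentences:
        lines = [f"{token} O" for token in tokenize(sentence)]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    text = [
        "Mail xyz@mydomain.com by 06/01/2019.",
        "Or call 123-456-7890.",
    ]
    print(to_conll2002(text), end="")
```

The sketch expects the caller to supply the text already split into sentences, and the patterns are deliberately narrow, so other date or phone formats would still fall apart into separate tokens.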

Member

reckart commented Jul 28, 2020

@jcklie Let's close this one and open a separate one for actually making tokenization editable in the brat view?

Contributor

jcklie commented Jul 28, 2020

@reckart Ok. I created #1778

jcklie pushed a commit that referenced this issue Feb 2, 2021
- Fixed layout if arc labels fall back to layer names
- Changed color of arc labels from the arc color to black to make them better readable, in particular when selected
jcklie pushed a commit that referenced this issue Feb 2, 2021
…s-not-having-a-label-is-broken

#1707 - Layout for relations not having a label is broken
jcklie pushed a commit that referenced this issue Feb 2, 2021
* 3.6.x:
  #1707 - Layout for relations not having a label is broken
jcklie pushed a commit that referenced this issue Feb 2, 2021
* master:
  #1707 - Layout for relations not having a label is broken
  #1668 - Upgrade dependencies (3.6.6)
jcklie pushed a commit that referenced this issue Feb 2, 2021
- Fixed layout if arc labels fall back to layer names
- Changed color of arc labels from the arc color to black to make them better readable, in particular when selected
jcklie pushed a commit that referenced this issue Feb 2, 2021
* commit '608c37c619cb677952908154d813f61ac2b34a1e':
  [maven-release-plugin] prepare for next development iteration
  [maven-release-plugin] prepare release webanno-4.0.0-beta-14
  Bump spring.security.version from 5.2.3.RELEASE to 5.3.3.RELEASE
  #1472 - Upgrade dependencies (4.0.0)
  #1472 - Upgrade dependencies (4.0.0)
  #1472 - Upgrade dependencies (4.0.0)
  #1707 - Layout for relations not having a label is broken
  #1712 Auto logout not succeeding

% Conflicts:
%	pom.xml
reckart closed this as completed Feb 4, 2021